[ruby-core:93464] [Ruby master Feature#15940] Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
From: jean.boussier@...
Date: 2019-07-02 00:49:39 UTC
List: ruby-core #93464
Issue #15940 has been updated by byroot (Jean Boussier).
> would it help if Symbol.to_s or Module.name would return a shared string?
It's not really about the returned string, it's about the internal frozen string that is kept in the symbol table.
> or why not return a frozen string?
I already proposed it, but it was rejected over backward compatibility concerns: https://bugs.ruby-lang.org/issues/15836
And again, it's kind of another topic entirely.
> I mean performance of String operations on a UTF-8 vs a US-ASCII String.
Right. What I was trying to say is that most of the time you compare the symbols directly, which doesn't involve string comparisons.
However it's true that performance might be impacted for operations done on the strings returned by `Symbol#to_s`.
I wonder whether the coderange could be set eagerly, since in this case we do know it's 7-bit. I suppose so, but I need to dig into that part of strings.
> Does this PR also addresses module and method names?
Yes, I think it does.
> it seems not to be a compatibility problem.
That doesn't surprise me one bit. I bet the vast majority of the strings returned by `Symbol#to_s` and `Module#name` end up converted to UTF-8 because they are concatenated with string literals which are UTF-8.
----------------------------------------
Feature #15940: Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
https://bugs.ruby-lang.org/issues/15940#change-79014
* Author: byroot (Jean Boussier)
* Status: Open
* Priority: Normal
* Assignee:
* Target version:
----------------------------------------
Patch: https://github.com/ruby/ruby/pull/2242
It's not uncommon for symbols to have literal string counterparts, e.g.
```ruby
class User
  attr_accessor :name

  def as_json
    { 'name' => name }
  end
end
```
Since the default source encoding is UTF-8 and symbols coerce their internal fstring to ASCII when possible, the above snippet actually keeps two instances of `"name"` in the fstring registry: one in ASCII, the other in UTF-8.
Since UTF-8 is a strict superset of ASCII, storing the symbols' fstrings as UTF-8 instead makes no significant difference, but in most cases it allows the equivalent string literals to be reused.
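To make the point above concrete, here is a minimal sketch that compares a UTF-8 string with a US-ASCII-tagged copy of the same bytes. The helper name `compare_encodings` is hypothetical and only exists for this illustration; the last entry uses `String#-@` to ask whether the two deduplicated strings are the same object, and its result may depend on the Ruby version.

```ruby
# Hypothetical helper (not part of the patch): compare a UTF-8 string
# with a copy of the same bytes tagged as US-ASCII.
def compare_encodings(text)
  utf8  = text.dup.force_encoding(Encoding::UTF_8)
  ascii = text.dup.force_encoding(Encoding::US_ASCII)

  {
    same_bytes: utf8.bytes == ascii.bytes,          # byte-for-byte identical?
    equal: utf8 == ascii,                           # String#== result
    same_encoding: utf8.encoding == ascii.encoding, # encoding tags match?
    # String#-@ returns a deduplicated frozen string; this checks whether
    # both encodings end up sharing one object.
    same_interned_object: (-utf8).equal?(-ascii)
  }
end

p compare_encodings("name")
```

For an ASCII-only string the bytes and the `==` comparison agree, and only the encoding tag differs.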
The only notable behavioral change is `Symbol#to_s`.
Previously `:name.to_s.encoding` would be `#<Encoding:US-ASCII>`.
After this patch it's `#<Encoding:UTF-8>`. I can't foresee any significant compatibility impact of this change on existing code.
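A quick way to check which behavior a given interpreter has is to print the encodings directly. This is only a sketch: the method name `report_encodings` is made up for the example, and the values printed depend on the Ruby build and on the source encoding of the file it runs from.

```ruby
# Hypothetical helper: collect the encodings discussed in this issue.
def report_encodings
  {
    literal: "name".encoding,         # follows the source encoding
    symbol_to_s: :name.to_s.encoding, # what Symbol#to_s returns
    module_name: String.name.encoding # what Module#name returns
  }
end

report_encodings.each { |label, encoding| puts "#{label}: #{encoding}" }
```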
There are, however, several ruby/spec examples asserting the current US-ASCII behavior, and I don't know whether they can be changed: https://github.com/ruby/spec/commit/a73a1c11f13590dccb975ba4348a04423c009453
If this specification is impossible to change, then we could consider changing the encoding of the String returned by `Symbol#to_s`, e.g. in Ruby pseudocode:
```ruby
def to_s
  str = fstr.dup
  str.force_encoding(Encoding::ASCII) if str.ascii_only?
  str
end
```
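Since `fstr` is internal to the VM, the pseudocode cannot run as is. The same idea can be tried from plain Ruby with a standalone method that takes the string as an argument; `ascii_preserving_to_s` is a hypothetical name used only for this sketch, not something the patch adds.

```ruby
# Hypothetical stand-in for the pseudocode above: `fstr` is passed in
# explicitly because the real internal fstring is not reachable from Ruby.
def ascii_preserving_to_s(fstr)
  str = fstr.dup # never mutate the shared (frozen) string
  # Retag as US-ASCII only when every byte is 7-bit; otherwise keep UTF-8.
  str.force_encoding(Encoding::US_ASCII) if str.ascii_only?
  str
end

puts ascii_preserving_to_s(-"name").encoding
puts ascii_preserving_to_s(-"caf\u00e9").encoding
```

The extra `ascii_only?` check and `force_encoding` call run on every `to_s`, which is the cost of keeping the old return encoding.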
--
https://bugs.ruby-lang.org/
Unsubscribe: <mailto:ruby-core-request@ruby-lang.org?subject=unsubscribe>
<http://lists.ruby-lang.org/cgi-bin/mailman/options/ruby-core>