[ruby-core:93454] [Ruby master Feature#15940] Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals

From: jean.boussier@...
Date: 2019-07-01 14:48:34 UTC
List: ruby-core #93454
Issue #15940 has been updated by byroot (Jean Boussier).


> US-ASCII is the natural subset for 7-bit characters, so it makes perfect sense to me that it's used for 7-bit symbols. UTF-8 is not, and is less precise than US-ASCII for that matter.

I don't disagree with this, but my point is that UTF-8 is a superset of US-ASCII, and much more likely to be the encoding of the various frozen string literals.
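
To make the superset relationship concrete, here is a minimal sketch. The helper name `same_bytes_in_both?` is made up for illustration and is not part of Ruby or of the patch. It checks that relabeling a string between UTF-8 and US-ASCII leaves the bytes untouched and that the result is valid under both labels, which should only hold for 7-bit content:

```ruby
# Hypothetical helper for illustration only: relabels a copy of the string
# as UTF-8 and as US-ASCII, then checks that the bytes are identical and
# that both labels are valid for those bytes.
def same_bytes_in_both?(str)
  utf8  = str.dup.force_encoding(Encoding::UTF_8)
  ascii = str.dup.force_encoding(Encoding::US_ASCII)
  utf8.bytes == ascii.bytes && utf8.valid_encoding? && ascii.valid_encoding?
end

puts same_bytes_in_both?("name") # 7-bit characters only
puts same_bytes_in_both?("olé")  # contains a non-ASCII character
```

The second call is expected to fail the check because the bytes of `é` are not valid US-ASCII, which is why only 7-bit symbols are affected by the proposal.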

> At least performance-wise it shouldn't matter too much 

What do you mean by performance? String comparisons? If so, it doesn't really matter much for symbols AFAIK.
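
As a rough illustration of why comparisons are not a concern, the sketch below relies on symbols being interned: the same name yields the same object however it was produced, so comparing two symbols is an identity check and never looks at the underlying bytes or their encoding.

```ruby
# Symbols are interned: the same name always maps to the same object.
a = :name
b = "name".to_sym
c = ("na" + "me").to_sym # built at runtime from a different String object

puts a.equal?(b) # identity check, not a byte-by-byte comparison
puts a.equal?(c)
```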

> I'm unsure, it seems a bit arbitrary to give "ascii" symbols a UTF-8 encoding.

IMO there are two arguments here:

  - Consistency / Least surprise: UTF-8 is now the default source file encoding, so it would make sense for the symbols created from these files (not just `Symbol` instances, but also module names, method names, etc.) to be UTF-8 as well.
  - Memory usage: as explained in the original issue description, it saves some memory.


Honestly, what is surprising to me is this:

```ruby
'foo'.encoding # => UTF-8
:foo.to_s.encoding # => US-ASCII
module Foo; end
Foo.name.encoding # => US-ASCII
Foo.method(:name).name.encoding # => US-ASCII
:"olé".to_s.encoding # => UTF-8
```

> Sharing char* is a more general optimization, and could apply to more cases (e.g., frozen Strings with identical bytes but different encodings).

The problem is that the differing encoding has to be kept somewhere. So you end up with the original string plus some form of shared string that points to the original one and holds the different encoding.

So unless that string is too big to be embedded (rarely the case for symbols), you haven't actually saved anything.
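
A rough way to see the embedding effect on CRuby is `ObjectSpace.memsize_of` from the `objspace` standard library. This is only a sketch: the numbers are implementation- and version-specific, and the variable names are illustrative. The point is that a short string such as a typical symbol name is stored inside its object slot, so there is no separate buffer left to share, whereas a large string carries a heap buffer on top of the slot.

```ruby
require 'objspace'

short = "name"       # short enough to be stored inline on CRuby
long  = "x" * 10_000 # far too large to be embedded in the object slot

# Reported sizes are CRuby-specific and vary between versions and platforms.
puts ObjectSpace.memsize_of(short)
puts ObjectSpace.memsize_of(long)
```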

----------------------------------------
Feature #15940: Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
https://bugs.ruby-lang.org/issues/15940#change-79001

* Author: byroot (Jean Boussier)
* Status: Open
* Priority: Normal
* Assignee: 
* Target version: 
----------------------------------------
Patch: https://github.com/ruby/ruby/pull/2242

It's not uncommon for symbols to have literal string counterparts, e.g.

```ruby
class User
  attr_accessor :name

  def as_json
    { 'name' => name }
  end
end
```

Since the default source encoding is UTF-8 and symbols coerce their internal fstring to ASCII when possible, the above snippet will actually keep two instances of `"name"` in the fstring registry: one in ASCII, the other in UTF-8.

Considering that UTF-8 is a strict superset of ASCII, storing the symbols' fstrings as UTF-8 instead makes no significant difference, but in most cases allows the equivalent string literals to be reused.
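
The duplication can be approximated from Ruby code with `String#-@`, which returns a deduplicated frozen copy of a string. This is a sketch under the assumption that `String#-@` is a fair stand-in for the internal fstring registry on CRuby; the helper name `interned` is made up for illustration.

```ruby
# Hypothetical helper: returns the deduplicated frozen copy of `bytes`
# labeled with the given encoding.
def interned(bytes, encoding)
  -(bytes.dup.force_encoding(encoding))
end

utf8_a = interned("name", Encoding::UTF_8)
utf8_b = interned("name", Encoding::UTF_8)
ascii  = interned("name", Encoding::US_ASCII)

puts utf8_a.equal?(utf8_b) # same bytes, same encoding: one shared object
puts utf8_a.equal?(ascii)  # same bytes, different encoding: separate objects
puts utf8_a == ascii       # the contents still compare equal
```

Since a String object carries exactly one encoding, the US-ASCII and UTF-8 copies can never be the same object, even though they hold identical bytes.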

The only notable behavioral change is `Symbol#to_s`.

Previously `:name.to_s.encoding` would be `#<Encoding:US-ASCII>`.
After this patch it's `#<Encoding:UTF-8>`. I can't foresee any significant compatibility impact of this change on existing code.

However, several Ruby specs assert this behavior, and I don't know whether they can be changed: https://github.com/ruby/spec/commit/a73a1c11f13590dccb975ba4348a04423c009453

If this specification is impossible to change, then we could consider changing the encoding of the String returned by `Symbol#to_s`, e.g. in Ruby pseudocode:

```ruby
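def to_s
  str = fstr.dup # fstr: the symbol's internal (now UTF-8) fstring
  # Relabel 7-bit strings so existing callers still see US-ASCII.
  str.force_encoding(Encoding::US_ASCII) if str.ascii_only?
  str
end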
```
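
For anyone who wants to try the idea without patching the interpreter, the same logic can be sketched as a standalone method. The name `compatible_to_s` and its `fstr` parameter are hypothetical; `fstr` simply plays the role of the symbol's internal UTF-8 fstring, passed in here as a frozen String.

```ruby
# Hypothetical stand-in for the proposed Symbol#to_s fallback.
# `fstr` represents the symbol's internal UTF-8 fstring.
def compatible_to_s(fstr)
  str = fstr.dup # unfrozen copy, as Symbol#to_s returns a new String
  # Only 7-bit content can be relabeled as US-ASCII without becoming invalid.
  str.force_encoding(Encoding::US_ASCII) if str.ascii_only?
  str
end

puts compatible_to_s("name".freeze).encoding # 7-bit name
puts compatible_to_s("olé".freeze).encoding  # non-ASCII name keeps UTF-8
```

The cost of this fallback is one `ascii_only?` scan per call, in exchange for keeping the existing specs unchanged.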





-- 
https://bugs.ruby-lang.org/