From: jean.boussier@...
Date: 2019-07-12T13:27:35+00:00
Subject: [ruby-core:93721] [Ruby master Feature#15940] Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals

Issue #15940 has been updated by byroot (Jean Boussier).


> First of all, this pull-request itself breaks non UTF-8 programs.

Could you elaborate on this? I don't understand what breaks in non UTF-8 programs. I ran some tests with `# encoding: EUC-JP` and couldn't find anything that breaks.

However, there was indeed a bug that broke the build, which I just fixed.

> It should be the source encoding instead of direct UTF-8.

Like @eregon, I don't understand the rationale. Currently, regardless of the file encoding, ASCII-only symbols are coerced to US-ASCII encoding.

> 4% of fstring table is only a fraction of total memory consumption. I am not sure how much effective.

Yes, I know it's not a big saving. I only submitted it because I couldn't see any real drawback to it, so a small gain for a small effort seemed worth it.

I can already tell approximately how much it saves based on the Redmine benchmark: 3,686 string instances saved. The vast majority of them are embedded, so `40 B` each, which gives `147_440 B` (`147 kB`).

Compare that to `ObjectSpace.memsize_of_all` in the same process, which gives `48_149_329 B`. In relative terms it's a `0.3%` saving overall (or even less, because `ObjectSpace.memsize_of_all` isn't perfectly accurate), which indeed isn't impressive at all.

That being said, on our internal app, which has a ~10x bigger fstring table, the duplication ratio is similar, so the saving would be over 1MB, which, while still small in relative terms, is significant in absolute terms.
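For anyone who wants to double-check the arithmetic, here is a minimal sketch using only the figures quoted above. The variable names are mine, and it assumes every saved string is embedded (`40 B`), which is only approximately true.

```ruby
# Figures quoted above from the Redmine benchmark.
saved_strings = 3_686       # duplicate fstring instances avoided
embedded_size = 40          # bytes per embedded String (assumed for all of them)
total_memsize = 48_149_329  # ObjectSpace.memsize_of_all in the same process

saved_bytes   = saved_strings * embedded_size
saved_percent = saved_bytes.to_f / total_memsize * 100

puts "saved: #{saved_bytes} B, about #{saved_percent.round(1)}% of memsize_of_all"
```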


So if you are sure this change would cause issues, then I'd rather close it now because I know the savings it brings won't ever justify it.


----------------------------------------
Feature #15940: Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
https://bugs.ruby-lang.org/issues/15940#change-79354

* Author: byroot (Jean Boussier)
* Status: Open
* Priority: Normal
* Assignee: 
* Target version: 
----------------------------------------
Patch: https://github.com/ruby/ruby/pull/2242

It's not uncommon for symbols to have literal string counterparts, e.g.

```ruby
class User
  attr_accessor :name

  def as_json
    { 'name' => name }
  end
end
```

Since the default source encoding is UTF-8, and symbols coerce their internal fstring to ASCII when possible, the above snippet will actually keep two instances of `"name"` in the fstring registry: one in ASCII, the other in UTF-8.

Considering that UTF-8 is a strict superset of ASCII, storing the symbols' fstrings as UTF-8 instead makes no significant difference, but allows reusing the equivalent string literals in most cases.
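A minimal sketch of why the two copies cannot be shared today. It uses `String#-@` as a stand-in for the fstring registry, which is an assumption about how CRuby deduplicates strings, not something a script can observe for symbols directly. The variable names are illustrative.

```ruby
# The same four bytes tagged with two different encodings.
utf8  = "name".dup.force_encoding(Encoding::UTF_8)
ascii = "name".dup.force_encoding(Encoding::US_ASCII)
other = utf8.dup

same_bytes    = utf8.bytes == ascii.bytes  # identical byte content
equal_strings = utf8 == ascii              # String#== holds for ASCII-only content

# String#-@ returns a deduplicated frozen string (assumed to be backed by the
# fstring table). Dedup happens per encoding, so the two cannot be one object.
shared_same_encoding    = (-utf8).equal?(-other)
shared_across_encodings = (-utf8).equal?(-ascii)

puts "same bytes: #{same_bytes}, ==: #{equal_strings}"
puts "shared (same encoding): #{shared_same_encoding}"
puts "shared (UTF-8 vs US-ASCII): #{shared_across_encodings}"
```

The two strings are equal and byte-identical, yet they remain separate objects because a String carries exactly one encoding.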

The only notable behavioral change is `Symbol#to_s`.

Previously `:name.to_s.encoding` would be `#<Encoding:US-ASCII>`.
After this patch it's `#<Encoding:UTF-8>`. I can't foresee any significant compatibility impact of this change on existing code.

However, several ruby/spec examples assert the current behavior, and I don't know whether they can be changed: https://github.com/ruby/spec/commit/a73a1c11f13590dccb975ba4348a04423c009453

If this specification is impossible to change, then we could consider changing the encoding of the String returned by `Symbol#to_s`, e.g. in Ruby pseudocode:

```ruby
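def to_s
  # `fstr` stands for the symbol's internal fstring (UTF-8 after this patch).
  str = fstr.dup
  # Restore the previous US-ASCII encoding for ASCII-only symbols.
  str.force_encoding(Encoding::US_ASCII) if str.ascii_only?
  str
end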
```
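Since `fstr` is not reachable from plain Ruby, here is a runnable sketch of the same idea with the internal string passed in as an argument. `compat_to_s` is a hypothetical helper for illustration only, not part of the patch.

```ruby
# Hypothetical stand-in for Symbol#to_s under the patch: `fstr` plays the role
# of the symbol's internal UTF-8 fstring.
def compat_to_s(fstr)
  str = fstr.dup
  # ASCII-only content is re-tagged as US-ASCII, matching the current behavior.
  str.force_encoding(Encoding::US_ASCII) if str.ascii_only?
  str
end

ascii_only = compat_to_s("name".dup.force_encoding(Encoding::UTF_8).freeze)
non_ascii  = compat_to_s("caf\u00E9".freeze) # \u escapes always yield UTF-8

puts ascii_only.encoding
puts non_ascii.encoding
```

The cost of this fallback is one `force_encoding` call on the freshly duplicated string; the shared fstring itself stays in UTF-8.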

