From: eregontp@...
Date: 2019-06-28T09:26:16+00:00
Subject: [ruby-core:93402] [Ruby trunk Feature#15940] Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals

Issue #15940 has been updated by Eregon (Benoit Daloze).


duerst (Martin Dürst) wrote:
> If I understand this correctly, the proposal is to change the encoding of Symbols from ASCII to UTF-8. So if such a symbol is converted to a String (which in itself may not be that frequent), and then an Integer is 'shifted' into that String with `<<`, then the only incompatibility that we get is that until now, it was an error to do that with a number > 127.
> So the overall consequence is that something that produced an error up to now doesn't produce an error anymore. I guess that's an incompatibility that we should be able to tolerate. It's much more of a problem if something that worked until now stops to work, or if something that worked one way suddenly works another way.

It's not raising an error:
```
$ ruby -ve 's=:abc.to_s; s<<233; p s; p s.encoding'
ruby 2.6.2p47 (2019-03-13 revision 67232) [x86_64-linux]
"abc\xE9"
#<Encoding:ASCII-8BIT>

$ ruby -ve 's=:abc.to_s.force_encoding("UTF-8"); s<<233; p s; p s.encoding'
ruby 2.6.2p47 (2019-03-13 revision 67232) [x86_64-linux]
"abcé"
#<Encoding:UTF-8>
```

I'm a bit concerned about compatibility. I think we should evaluate this with a few gems, and check how much of test-all and the specs fail with this change.

In general I agree that having a consistent encoding for Symbol literals seems simpler semantically.

TruffleRuby reuses the underlying memory (byte[], aka char*) for interned Strings of different encodings, so only the metadata (encoding, coderange, etc.) is duplicated, not the actual bytes.
MRI could probably do the same, which would be transparent and would not need any change in semantics.

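The difference between the two commands above can also be checked in a script. This is only a minimal sketch: `append_int` is a hypothetical helper name, and it assumes `Symbol#to_s` returns a US-ASCII String for an ASCII-only symbol, as it does on 2.6.

```ruby
# Hypothetical helper (not part of Ruby): append an Integer to a copy of
# the given String, so the caller's String is left untouched.
def append_int(str, int)
  copy = str.dup
  copy << int
  copy
end

# US-ASCII receiver: 233 is appended as the single byte 0xE9.
ascii = append_int(:abc.to_s, 233)

# UTF-8 receiver: 233 is treated as the code point U+00E9 and encoded
# as the two bytes 0xC3 0xA9.
utf8 = append_int(:abc.to_s.dup.force_encoding(Encoding::UTF_8), 233)

p ascii.encoding == Encoding::ASCII_8BIT # the String was switched to binary
p ascii.bytes                            # [97, 98, 99, 233]
p utf8.encoding == Encoding::UTF_8
p utf8.bytes                             # [97, 98, 99, 195, 169]
```

So code that relies on `<<` appending a raw byte to the result of `Symbol#to_s` would start getting a two-byte UTF-8 character instead, which is the kind of case worth looking for in gems.
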
----------------------------------------
Feature #15940: Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
https://bugs.ruby-lang.org/issues/15940#change-78944

* Author: byroot (Jean Boussier)
* Status: Open
* Priority: Normal
* Assignee:
* Target version:
----------------------------------------
Patch: https://github.com/ruby/ruby/pull/2242

It's not uncommon for symbols to have literal string counterparts, e.g.

```ruby
class User
  attr_accessor :name

  def as_json
    { 'name' => name }
  end
end
```

Since the default source encoding is UTF-8, and symbols coerce their internal fstring to ASCII when possible, the above snippet will actually keep two instances of `"name"` in the fstring registry: one in ASCII, the other in UTF-8.

Considering that UTF-8 is a strict superset of ASCII, storing the symbols' fstrings as UTF-8 instead makes no significant difference, but in most cases allows reusing the equivalent string literals.

The only notable behavioral change is `Symbol#to_s`. Previously `:name.to_s.encoding` would be `#<Encoding:US-ASCII>`. After this patch it's `#<Encoding:UTF-8>`. I can't foresee any significant compatibility impact of this change on existing code.

However, there are several ruby specs asserting this behavior, and I don't know whether they can be changed or not: https://github.com/ruby/spec/commit/a73a1c11f13590dccb975ba4348a04423c009453

If this specification is impossible to change, then we could consider changing the encoding of the String returned by `Symbol#to_s`, e.g. in Ruby pseudo code:

```ruby
def to_s
  # fstr stands for the symbol's internal fstring; it is not a real Ruby method
  str = fstr.dup
  str.force_encoding(Encoding::ASCII) if str.ascii_only?
  str
end
```

-- 
https://bugs.ruby-lang.org/