From: duerst@...
Date: 2019-06-27T09:41:16+00:00
Subject: [ruby-core:93385] [Ruby trunk Feature#15940] Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals

Issue #15940 has been updated by duerst (Martin Dürst).

naruse (Yui NARUSE) wrote:
> Note that an incompatibility which is caused by the change of string encoding is `String#<<(integer)`.
>
> Maybe String#<<(n) should be deprecated if n > 127 and explicitly specify the encoding argument.

If I understand this correctly, the proposal is to change the encoding of Symbols from ASCII to UTF-8. So if such a symbol is converted to a String (which in itself may not be that frequent), and then an Integer is 'shifted' into that String with `<<`, then the only incompatibility that we get is that until now, it was an error to do that with a number > 127.

So the overall consequence is that something that produced an error up to now doesn't produce an error anymore. I guess that's an incompatibility that we should be able to tolerate. It's much more of a problem if something that worked until now stops working, or if something that worked one way suddenly works another way.

As for explicitly specifying an encoding argument for `String#<<`, I understand that it may be the conceptually correct thing to do (we are using the Integer as a character number, so we had better know what encoding this character number is expressed in). But the encoding is already available from the string, and in most cases it will be the source encoding anyway, which is usually UTF-8. Also, because `<<` is a binary operator, it would be difficult to add additional parameters.
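To check the `String#<<(integer)` behavior discussed above on a particular Ruby, a small script along these lines may help. It is only a sketch: `append_codepoint` is a hypothetical helper, not part of Ruby, and re-tagging a copy of the string as US-ASCII or UTF-8 merely stands in for the "before" and "after" of the proposed change. The script reports what the interpreter does for each case rather than assuming it, since the outcome for values between 128 and 255 on a US-ASCII receiver is worth verifying directly.

```ruby
# Hypothetical helper: append an Integer code point to an unfrozen copy of
# +base+ tagged with +encoding+. Returns the resulting String, or the
# RangeError if Ruby rejects the code point for that encoding.
def append_codepoint(base, encoding, codepoint)
  str = String.new(base, encoding: encoding)
  str << codepoint
rescue RangeError => e
  e
end

base = :name.to_s

[Encoding::US_ASCII, Encoding::UTF_8].each do |enc|
  # One plain ASCII value, one value in 128..255, one value above 255.
  [0x41, 0xE9, 0x3042].each do |cp|
    result = append_codepoint(base, enc, cp)
    outcome =
      if result.is_a?(Exception)
        "#{result.class}: #{result.message}"
      else
        "#{result.encoding} #{result.bytes.inspect}"
      end
    puts format("%-8s << 0x%04X -> %s", enc, cp, outcome)
  end
end
```

Printing the resulting encoding and bytes, instead of only success or failure, also shows whether a successful append kept the receiver's encoding.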
----------------------------------------
Feature #15940: Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
https://bugs.ruby-lang.org/issues/15940#change-78910

* Author: byroot (Jean Boussier)
* Status: Open
* Priority: Normal
* Assignee:
* Target version:
----------------------------------------
Patch: https://github.com/ruby/ruby/pull/2242

It's not uncommon for symbols to have literal string counterparts, e.g.

```ruby
class User
  attr_accessor :name

  def as_json
    { 'name' => name }
  end
end
```

Since the default source encoding is UTF-8, and symbols coerce their internal fstring to ASCII when possible, the above snippet will actually keep two instances of `"name"` in the fstring registry: one in ASCII, the other in UTF-8.

Considering that UTF-8 is a strict superset of ASCII, storing the symbols' fstrings as UTF-8 instead makes no significant difference, but in most cases allows the equivalent string literals to be reused.

The only notable behavioral change is `Symbol#to_s`. Previously `:name.to_s.encoding` would be `#<Encoding:US-ASCII>`. After this patch it's `#<Encoding:UTF-8>`. I can't foresee any significant compatibility impact of this change on existing code.

However, there are several ruby specs asserting this behavior, and I don't know whether they can be changed:
https://github.com/ruby/spec/commit/a73a1c11f13590dccb975ba4348a04423c009453

If this specification is impossible to change, then we could consider changing the encoding of the String returned by `Symbol#to_s`, e.g. in ruby pseudo code:

```ruby
def to_s
  # fstr: the symbol's internal (now UTF-8) fstring
  str = fstr.dup
  # keep returning ASCII for ASCII-only symbols, as the specs expect
  str.force_encoding(Encoding::ASCII) if str.ascii_only?
  str
end
```
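The `Symbol#to_s` fallback sketched in the pseudo code can be tried out on ordinary strings. The following is a minimal illustration, not the patch itself: `symbol_to_s_compat` is a hypothetical name, and a frozen UTF-8 String stands in for the symbol's internal fstring, which is not reachable from Ruby code.

```ruby
# Hypothetical stand-in for the proposed fallback: given a frozen UTF-8
# string playing the role of the symbol's fstring, return an unfrozen copy
# that is tagged US-ASCII whenever its content is ASCII-only.
def symbol_to_s_compat(fstr)
  str = fstr.dup
  str.force_encoding(Encoding::US_ASCII) if str.ascii_only?
  str
end

ascii_name = String.new("name", encoding: Encoding::UTF_8).freeze
wide_name  = "na\u00EFve".freeze # contains a non-ASCII character

puts symbol_to_s_compat(ascii_name).encoding # prints US-ASCII
puts symbol_to_s_compat(wide_name).encoding  # prints UTF-8

# For comparison, what the running interpreter reports today:
puts :name.to_s.encoding
puts "name".encoding
```

The cost of this variant is an extra `ascii_only?` check and re-tagging on every `to_s` call, in exchange for keeping the encoding that the existing specs assert.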