From: jean.boussier@...
Date: 2019-07-01T11:36:04+00:00
Subject: [ruby-core:93449] [Ruby master Feature#15940] Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals

Issue #15940 has been updated by byroot (Jean Boussier).

Sorry for the late reply, somehow I can't make email notifications work on Redmine...

> Specs can always be changed, along with ruby_version_is guards to specify which behavior on which version

Thanks for letting me know. I updated the PR. I expect it to pass CI, but will do further updates if it doesn't.

> If we change this, the encoding of Symbol literals should be the same as String literals, i.e., use the file's magic encoding comment or UTF-8 if there isn't one.

Yes and no. First, it's kind of already the case, and it stays that way: if the symbol name can't be expressed as pure `ASCII`, it will have the string's encoding, hence the file encoding.

However, one of the reasons the encoding is coerced is the following situation, with two source files in different encodings:

```ruby
# encoding: iso-8859-1
ISO_SYMBOL = :foo

# encoding: utf-8
UTF_SYMBOL = :foo
```

You do want both constants to reference the same symbol. From what I gathered, that was the whole reason behind the ASCII coercion.

> I'm sure the current behavior was maintained when non-ASCII-only Symbols were introduced for a reason.

I believe it's the reason I described above.

> If the solution then is to convert the String's encoding when calling Symbol#to_s

Yeah, that was just a suggestion to retain `to_s` backward compatibility, but I really don't think it's a good idea.
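To make the symbol-sharing point concrete, here is a minimal sketch. It assumes strings re-encoded at runtime with `String#encode` behave like literals from two differently encoded files; `symbol_for` is a hypothetical helper used only for this illustration.

```ruby
# Hypothetical helper: build a symbol from text transcoded to a given
# encoding, standing in for a symbol literal in a file with that encoding.
def symbol_for(text, encoding)
  text.encode(encoding).to_sym
end

# ASCII-only names resolve to one and the same symbol, whatever the
# encoding of the string they were created from.
iso_foo = symbol_for("foo", Encoding::ISO_8859_1)
utf_foo = symbol_for("foo", Encoding::UTF_8)
puts iso_foo.equal?(utf_foo) # => true

# Non-ASCII names keep the encoding of their source string, so the same
# character in two encodings yields two distinct symbols.
iso_accented = symbol_for("\u00e9", Encoding::ISO_8859_1)
utf_accented = symbol_for("\u00e9", Encoding::UTF_8)
puts iso_accented == utf_accented # => false
puts iso_accented.encoding.name   # => ISO-8859-1
```

Coercing ASCII-only names to a single encoding is what lets the first pair collapse into one symbol; the proposal only changes which encoding that is.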
----------------------------------------
Feature #15940: Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
https://bugs.ruby-lang.org/issues/15940#change-78997

* Author: byroot (Jean Boussier)
* Status: Open
* Priority: Normal
* Assignee:
* Target version:
----------------------------------------
Patch: https://github.com/ruby/ruby/pull/2242

It's not uncommon for symbols to have literal string counterparts, e.g.

```ruby
class User
  attr_accessor :name

  def as_json
    { 'name' => name }
  end
end
```

Since the default source encoding is UTF-8 and symbols coerce their internal fstring to ASCII when possible, the above snippet will actually keep two instances of `"name"` in the fstring registry: one in ASCII, the other in UTF-8.

Considering that UTF-8 is a strict superset of ASCII, storing the symbols' fstrings as UTF-8 instead makes no significant difference, but in most cases allows reusing the equivalent string literals.

The only notable behavioral change is `Symbol#to_s`. Previously `:name.to_s.encoding` would be `#<Encoding:US-ASCII>`. After this patch it's `#<Encoding:UTF-8>`. I can't foresee any significant compatibility impact of this change on existing code.

However, there are several ruby specs asserting this behavior, and I don't know whether they can be changed: https://github.com/ruby/spec/commit/a73a1c11f13590dccb975ba4348a04423c009453

If this specification is impossible to change, then we could consider changing the encoding of the String returned by `Symbol#to_s`, e.g. in Ruby pseudocode:

```ruby
def to_s
  # fstr is the symbol's internal (frozen, deduplicated) string
  str = fstr.dup
  # Report ASCII-only names as US-ASCII, as before the patch
  str.force_encoding(Encoding::ASCII) if str.ascii_only?
  str
end
```
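For illustration, the `to_s` fallback can be tried out as a standalone sketch. `compat_symbol_to_s` is a hypothetical helper, not part of the patch, and it takes a plain String standing in for the symbol's internal fstring, which is not reachable from Ruby code.

```ruby
# Hypothetical stand-in for the proposed Symbol#to_s fallback.
# `fstr` plays the role of the symbol's internal UTF-8 fstring.
def compat_symbol_to_s(fstr)
  str = fstr.dup # never mutate the shared string
  # ASCII-only names are relabelled as US-ASCII; the bytes are untouched.
  str.force_encoding(Encoding::US_ASCII) if str.ascii_only?
  str
end

puts compat_symbol_to_s("name").encoding.name   # => US-ASCII
puts compat_symbol_to_s("\u00e9").encoding.name # => UTF-8
```

The relabelling is safe because every ASCII-only UTF-8 string is byte-for-byte valid US-ASCII. The cost is that `to_s` can no longer hand out a string sharing the fstring's encoding.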