From: ruby@...
Date: 2019-06-28T16:33:51+00:00
Subject: [ruby-core:93413] [Ruby trunk Feature#15940] Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals

Issue #15940 has been updated by nirvdrum (Kevin Menard).

I generally like the idea, but really from a semantics perspective rather than a memory-savings one. It's confusing to implementers and end users alike that Symbols take on a different encoding from Strings if they happen to be ASCII-only. The other nice benefit of the change is that `String#{intern,to_sym}` can be made much cheaper.

Having said all of that, I'm sure the current behavior was maintained for a reason when non-ASCII-only Symbols were introduced. I think it'd be good to look back and see what the rationale was.

If the solution then is to convert the String's encoding when calling `Symbol#to_s`, if the Symbol is ASCII-only, then I think you're going to need to investigate knock-on effects. E.g., `String#force_encoding` currently unconditionally clears the String's code range. That's metadata you really don't want to lose. But, by setting the encoding to US-ASCII, you may be okay most of the time, because there are code paths that just check whether the encoding uses single-byte characters without doing a full code range scan. Likewise, if you do decide to skip the `US-ASCII` conversion, you could have the inverse problem: now you have a UTF-8 string, and if it doesn't have its code range set, you've turned some O(1) operations into O(n).

Please note, I haven't really traced all the String and Symbol code. These were potential pitfalls that stood out to me when reviewing the PR and looking briefly at the CRuby source. My general point is that even if things come out correct, you could still alter the execution profile in such a way as to introduce a performance regression, either by changing from a fixed-width to a variable-width encoding or by not taking proper care of the code range value. None of that is insurmountable, of course.

----------------------------------------
Feature #15940: Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
https://bugs.ruby-lang.org/issues/15940#change-78955

* Author: byroot (Jean Boussier)
* Status: Open
* Priority: Normal
* Assignee: 
* Target version: 
----------------------------------------
Patch: https://github.com/ruby/ruby/pull/2242

It's not uncommon for symbols to have literal string counterparts, e.g.

```ruby
class User
  attr_accessor :name

  def as_json
    { 'name' => name }
  end
end
```

Since the default source encoding is UTF-8, and symbols coerce their internal fstring to ASCII when possible, the above snippet actually keeps two instances of `"name"` in the fstring registry: one in US-ASCII, the other in UTF-8.

Considering that UTF-8 is a strict superset of ASCII, storing the symbols' fstrings as UTF-8 instead makes no significant difference, but in most cases it allows reusing the equivalent string literals.

The only notable behavioral change is `Symbol#to_s`. Previously `:name.to_s.encoding` would be `#<Encoding:US-ASCII>`. After this patch it's `#<Encoding:UTF-8>`. I can't foresee any significant compatibility impact of this change on existing code.
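To make both points concrete (the duplicate fstring entries and the `Symbol#to_s` encoding change), here is a minimal irb check; the results shown are assumed for a Ruby where symbols still coerce their fstring to US-ASCII:

```ruby
# Before the patch: the symbol's backing fstring is US-ASCII while the
# source literal is UTF-8, so the fstring table holds two entries for "name".
:name.to_s.encoding   # => #<Encoding:US-ASCII>
"name".encoding       # => #<Encoding:UTF-8>

# String#-@ interns through the same fstring table, and deduplication only
# applies when content *and* encoding match, so the UTF-8 literal cannot
# reuse the symbol's US-ASCII entry.
(-"name").equal?(-"name")  # => true
```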
However, there are several ruby specs asserting this behavior, and I don't know whether they can be changed or not: https://github.com/ruby/spec/commit/a73a1c11f13590dccb975ba4348a04423c009453

If this specification turns out to be impossible to change, then we could consider changing the encoding of the String returned by `Symbol#to_s`, e.g. in Ruby pseudo-code:

```ruby
def to_s
  # fstr stands for the symbol's interned backing string (pseudo-code)
  str = fstr.dup
  str.force_encoding(Encoding::ASCII) if str.ascii_only?
  str
end
```

--
https://bugs.ruby-lang.org/
Unsubscribe: