From: jean.boussier@...
Date: 2019-06-19T15:55:05+00:00
Subject: [ruby-core:93251] [Ruby trunk Feature#15940] Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals

Issue #15940 has been updated by byroot (Jean Boussier).

In order to provide some data, I counted the duplicates in a Redmine heap dump (`ObjectSpace.dump_all`).

Here is the counting code:

```ruby
#!/usr/bin/env ruby
# frozen_string_literal: true
require 'json'

fstrings = []
STDIN.each do |line|
  object = JSON.parse(line)
  fstrings << object if object['fstring']
end

counts = {}
fstrings.each do |str|
  counts[str['value']] ||= 0
  counts[str['value']] += 1
end

duplicates = counts.select { |k, v| v > 1 }.map(&:first)
puts "total fstrings: #{fstrings.size}"
puts "dups: #{duplicates.size}"
puts "sample:"
puts duplicates.first(20)
```

And the results for Redmine:

```
total fstrings: 84678
dups: 3686
sample:
changes
absent
part
EVENTS
RANGE
OBJECT
Silent
EXCEPTION
Settings
DATE
Index
Graph
COMPLEX
Definition
fcntl
inline
lockfile
update
gemfile
oth
```

That's about 4% of the fstring table being duplicates.

I also ran the script against a much bigger private app. The duplicate ratio was similar, but the table was an order of magnitude bigger.

----------------------------------------
Feature #15940: Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
https://bugs.ruby-lang.org/issues/15940#change-78701

* Author: byroot (Jean Boussier)
* Status: Open
* Priority: Normal
* Assignee:
* Target version:
----------------------------------------
Patch: https://github.com/ruby/ruby/pull/2242

It's not uncommon for symbols to have literal string counterparts, e.g.
```ruby
class User
  attr_accessor :name

  def as_json
    { 'name' => name }
  end
end
```

Since the default source encoding is UTF-8 and symbols coerce their internal fstring to ASCII when possible, the above snippet will actually keep two instances of `"name"` in the fstring registry: one in ASCII, the other in UTF-8.

Considering that UTF-8 is a strict superset of ASCII, storing the symbols' fstrings as UTF-8 instead makes no significant difference, but in most cases allows the equivalent string literals to be reused.

The only notable behavioral change is `Symbol#to_s`. Previously `:name.to_s.encoding` would be `#<Encoding:US-ASCII>`. After this patch it's `#<Encoding:UTF-8>`. I can't foresee any significant compatibility impact of this change on existing code.

However, there are several ruby specs asserting this behavior, and I don't know whether they can be changed: https://github.com/ruby/spec/commit/a73a1c11f13590dccb975ba4348a04423c009453

If this specification is impossible to change, then we could consider changing the encoding of the String returned by `Symbol#to_s`, e.g. in ruby pseudo code:

```ruby
def to_s
  # fstr stands for the symbol's internal fstring (UTF-8 after this patch)
  str = fstr.dup
  str.force_encoding(Encoding::ASCII) if str.ascii_only?
  str
end
```
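The fallback above can be tried outside the VM with a minimal sketch. It assumes a hypothetical helper, `symbol_to_s_compat`, that takes the fstring as an argument (the real `fstr` is internal to the symbol and not reachable from Ruby code), and it builds the UTF-8 input explicitly so the result does not depend on the source or locale encoding:

```ruby
# frozen_string_literal: true

# Hypothetical stand-in for the proposed Symbol#to_s fallback.
# `fstr` plays the role of the symbol's internal UTF-8 fstring.
def symbol_to_s_compat(fstr)
  str = fstr.dup
  # Only relabel the copy when every byte is 7-bit ASCII, so the bytes stay valid.
  str.force_encoding(Encoding::US_ASCII) if str.ascii_only?
  str
end

# Simulated UTF-8 fstrings (explicitly tagged as UTF-8, then frozen).
ascii_only_fstr = 'name'.dup.force_encoding(Encoding::UTF_8).freeze
non_ascii_fstr  = "n\u00e9".dup.force_encoding(Encoding::UTF_8).freeze

ascii_result     = symbol_to_s_compat(ascii_only_fstr)
non_ascii_result = symbol_to_s_compat(non_ascii_fstr)

puts ascii_result.encoding            # relabelled to US-ASCII
puts non_ascii_result.encoding        # left as UTF-8
puts ascii_result == ascii_only_fstr  # content still compares equal
puts ascii_only_fstr.encoding         # the shared fstring itself is untouched
```

Because the helper works on a `dup`, the shared fstring keeps its UTF-8 encoding and only the returned String is relabelled, which is what would let the existing specs keep passing. The cost is that `to_s` still allocates a copy, as it does today.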