From: usa@...
Date: 2014-07-02T06:40:50+00:00
Subject: [ruby-core:63490] [ruby-trunk - Bug #9676] String#gsub shouldn't allocate so many Strings in its loop

Issue #9676 has been updated by Usaku NAKAMURA.

Backport changed from 2.0.0: UNKNOWN, 2.1: UNKNOWN to 2.0.0: WONTFIX, 2.1: UNKNOWN

----------------------------------------
Bug #9676: String#gsub shouldn't allocate so many Strings in its loop
https://bugs.ruby-lang.org/issues/9676#change-47533

* Author: Sam Rawlins
* Status: Closed
* Priority: Normal
* Assignee:
* Category:
* Target version:
* ruby -v: trunk
* Backport: 2.0.0: WONTFIX, 2.1: UNKNOWN
----------------------------------------
`rb_reg_search()` allocates (dups) a String to attach to the backreference object (`RMATCH(match)->str = rb_str_new4(str);`). If #gsub has been passed 2 arguments (not the Enumerator form) and the second argument is a String, then it shouldn't make these allocations when calling `rb_reg_search()` inside its loop.

Here's an example:

    # gsub-allocates-too-much.rb
    require File.join(__dir__, "lib", "allocation_stats")

    def puts_object_list(name, stats)
      objects = stats.allocations.group_by(:sourcefile, :sourceline, :class).all.
        values.flatten.map(&:object).
        map {|o| o.is_a?(String) ? "#{o.inspect}<#{o.encoding.to_s}>" : o.inspect }
      puts "#{name} #{objects.flatten.size} new objects:"
      objects.group_by(&:hash).values.each { |ary| puts "#{ary.join(", ")}" }
    end

    slash = '/'; underscore = '_'; colon = ':' # allocate before the trace
    str = "12:34:45:67"
    stats = AllocationStats.trace { str.gsub(colon, underscore) }
    puts '> "12:34:45:67".gsub(":", "_")'
    puts_object_list("gsub substitutes 3x times:", stats)

    $ ruby ../allocation_stats/gsub-allocates-too-much.rb
    > "12:34:45:67".gsub(":", "_")
    gsub substitutes 3x times: 12 new objects:
    "12:34:45:67", "12:34:45:67", "12:34:45:67", "12:34:45:67"
    "12_34_45_67"
    ":", ":"
    ":", ":"
    #
    #
    /:/

The Strings that are copies of the original String are all unnecessary (except one, the last).
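
For anyone without the allocation_stats gem, a rough cross-check should be possible with the standard library alone. The sketch below assumes Ruby 2.2 or later (for the `:total_allocated_objects` key of `GC.stat`), and `count_allocations` is a hypothetical helper name, not part of any library. It reports only a total object count for the block, not the per-object breakdown above, and the exact number will vary between Ruby versions.

```ruby
# Hypothetical helper: count objects allocated while the block runs.
# Assumes Ruby 2.2+, where GC.stat accepts :total_allocated_objects.
# The counter is cumulative, so the difference is the number of
# objects allocated in between.
def count_allocations
  before = GC.stat(:total_allocated_objects)
  result = yield
  after = GC.stat(:total_allocated_objects)
  [result, after - before]
end

colon = ':'; underscore = '_' # allocate before the measurement
str = "12:34:45:67"

result, allocated = count_allocations { str.gsub(colon, underscore) }

puts "result: #{result.inspect}"
puts "objects allocated during gsub: #{allocated}"
```

Running this before and after applying a patch gives a quick sense of whether the number of allocations per #gsub call has dropped.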

I have a fix (attached and at [1]) that allocates the str attribute of the backreference object only when necessary. To do this without changing the signature of `rb_reg_search()`, the patch turns `rb_reg_search()` into a wrapper around a new function, `rb_reg_search0()`. No existing calls to `rb_reg_search()` need to change, and `str_gsub()` changes two calls to use `rb_reg_search0()` to avoid the allocations. (I believe String#split suffers from the same extra allocations and could make a similar call to `rb_reg_search0()`.)

The impact of this fix is primarily faster garbage collection. I have two "real world" examples:

* ActiveRecord sqlite3 specs: total time in GC reduced from 11.2s to 10.4s (7% savings).
* Mail gem specs: total time in GC reduced from 0.220s to 0.215s (2% savings).

These numbers bounced around a lot, though. I'm open to better benchmarking suggestions. I used ActiveRecord and Mail as real-world examples of #gsub, where realistic Strings are gsubbed.

[1] https://github.com/ruby/ruby/pull/578

---Files--------------------------------
ruby-578.diff (2.56 KB)

--
https://bugs.ruby-lang.org/