From: sam.rawlins@...
Date: 2014-03-27T02:02:48+00:00
Subject: [ruby-core:61710] [ruby-trunk - Bug #9680] String#sub and siblings should not use regex when String pattern is passed

Issue #9680 has been updated by Sam Rawlins.

I think the speedup in this patch comes almost entirely from skipping the regex engine, not from the GC savings. Preserving `$&` (and `$~` and friends) without firing up the regex engine might be possible by constructing the basic backref, with no subgroups, by hand. That code would be very, very ugly, though: an `RMatch` has an `RRegexp` and an `rmatch`, which has a `re_registers`, etc. It might only be a ~20 line function, but it feels so dirty...

I think an improvement (or replacement) to `reg_cache` would also be welcome.

----------------------------------------
Bug #9680: String#sub and siblings should not use regex when String pattern is passed
https://bugs.ruby-lang.org/issues/9680#change-45955

* Author: Sam Rawlins
* Status: Open
* Priority: Normal
* Assignee:
* Category:
* Target version:
* ruby -v: trunk
* Backport: 2.0.0: UNKNOWN, 2.1: UNKNOWN
----------------------------------------
Currently `String#sub`, `#sub!`, `#gsub`, and `#gsub!` all accept a String pattern, but immediately create a Regexp from it and use the regex engine to search for the pattern. This is not performant. For example, `"123:456".gsub(":", "_")` creates the following objects, most of which are immediately up for GC:

* dup of the original String
* result String
* 2x `":"`
* 2x `":"`
* Regexp from pattern: `/:/`
* `#`
* `#`

I have a solution which is not too complicated, at https://github.com/ruby/ruby/pull/579 and attached. Calls to `rb_reg_search()` are replaced with calls to a new function, `rb_pat_search()`, which calls either `rb_reg_search()` or `rb_str_index()`, depending on whether the pattern is a String. Calculating the substring that needs to be replaced is also different when the pattern is a String.
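
As a rough Ruby-level illustration of the dispatch described above: the actual patch is C code inside the interpreter, so this is only a sketch of the idea. The names `pat_search` and `sub_first` are hypothetical and do not come from the patch. A String pattern goes through a plain substring search, a Regexp pattern goes through the regex engine, and the length of the matched span is computed differently in each case.

```ruby
# Hypothetical Ruby analogue of the C-level rb_pat_search() idea.
# Returns [start, length] of the first match at or after pos, or nil.
def pat_search(str, pat, pos = 0)
  if pat.is_a?(String)
    # String pattern: plain substring search via String#index.
    # The matched span is always exactly as long as the pattern.
    start = str.index(pat, pos)
    start && [start, pat.length]
  else
    # Regexp pattern: use the regex engine; the span comes from the MatchData.
    m = pat.match(str, pos)
    m && [m.begin(0), m[0].length]
  end
end

# Hypothetical helper: replace only the first occurrence of pat with repl.
def sub_first(str, pat, repl)
  found = pat_search(str, pat)
  return str.dup unless found

  start, len = found
  str[0, start] + repl + str[(start + len)..-1]
end
```

With this sketch, `sub_first("123:456", ":", "_")` takes the String branch, and a pattern such as `"."` is matched literally rather than as a regex wildcard.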

Runtime of each method is dramatically reduced:

    require 'benchmark'

    n = 4_000_000
    Benchmark.bm(7) do |bm|
      str1 = "123:456"; str2 = "123_456"; colon = ":"; underscore = "_"
      # each benchmark runs the substring method twice so that the bang methods can
      # perform the same number of substitutions to str1 each go around.
      bm.report("sub")   { n.times { str1.sub(colon, underscore);   str2.sub(underscore, colon) } }
      bm.report("sub!")  { n.times { str1.sub!(colon, underscore);  str1.sub!(underscore, colon) } }
      bm.report("gsub")  { n.times { str1.gsub(colon, underscore);  str2.gsub(underscore, colon) } }
      bm.report("gsub!") { n.times { str1.gsub!(colon, underscore); str1.gsub!(underscore, colon) } }
    end

    # trunk
                   user     system      total        real
    sub       40.450000   0.580000  41.030000 ( 41.209658)
    sub!      39.780000   0.580000  40.360000 ( 40.656789)
    gsub      58.500000   0.820000  59.320000 ( 59.603923)
    gsub!     59.400000   0.770000  60.170000 ( 60.435687)

    # this patch
                   user     system      total        real
    sub        3.060000   0.010000   3.070000 (  3.091920)
    sub!       2.380000   0.010000   2.390000 (  2.390769)
    gsub       7.130000   0.130000   7.260000 (  7.299139)
    gsub!      7.660000   0.150000   7.810000 (  7.846190)

When using a String pattern, runtime is reduced by 87% to 94%.

There is only one incompatibility that I am aware of: `$&` will not be set after using a sub method with a String pattern. (Subgroups (`$1`, ...) will not be available either, but they weren't before, since String patterns are escaped before being used.)

Looking ahead, only 3 more methods use `get_pat()`, the function that creates a Regexp from the String pattern: `#split`, `#scan`, and `#match`. I think this fix could be applied to these as well.

---Files--------------------------------
ruby-579.diff (5.12 KB)

--
https://bugs.ruby-lang.org/