From: "peterzhu2118 (Peter Zhu) via ruby-core" Date: 2024-09-02T16:49:43+00:00 Subject: [ruby-core:119015] [Ruby master Bug#20710] Reducing Hash allocation introduces large performance degradation (probably related to VWA) Issue #20710 has been updated by peterzhu2118 (Peter Zhu). Thank you for the detailed report. I had a discussion about this with @mame and @byroot last year in the Ruby core slack: https://ruby.slack.com/archives/C02A3SL0S/p1702910003614609 One of the big differences is that VWA eagerly allocates the AR/ST table for the hash, rather than lazily allocate it. This means that for empty hashes, there is a performance penalty for empty hashes, but the penalty can be reclaimed when we have elements inside of the hash. Another issue with microbenchmarks is that some of the size pools may be small since there aren't many objects, so it may cause GC to trigger more frequently. We don't see this kind of issues with macrobenchmarks since they keep more objects alive after bootup. I can see that you've observed this issue in your benchmark as well. In the following microbenchmark, we can see that with hashes <= 8 elements (i.e. AR hashes), the performance in Ruby 3.3 is basically on-par with Ruby 3.2, but when we switch to ST tables (9 elements), we can see that VWA is significantly faster: ``` ruby 3.2.4 (2024-04-23 revision af471c0e01) [arm64-darwin23] user system total real Hash with 0 elements 2.421699 0.013686 2.435385 ( 2.436581) Hash with 1 elements 2.955857 0.029542 2.985399 ( 3.014737) Hash with 2 elements 2.891668 0.019301 2.910969 ( 2.921928) Hash with 3 elements 2.900170 0.015396 2.915566 ( 2.916644) Hash with 4 elements 2.889895 0.014969 2.904864 ( 2.905188) Hash with 5 elements 2.895059 0.017253 2.912312 ( 2.912845) Hash with 6 elements 2.869016 0.014351 2.883367 ( 2.883618) Hash with 7 elements 2.907134 0.016862 2.923996 ( 2.924871) Hash with 8 elements 2.926749 0.020445 2.947194 ( 2.956753) Hash with 9 elements 19.932546 0.551577 20.484123 ( 20.498173) ruby 3.3.3 (2024-06-12 revision f1c7b6f435) [arm64-darwin23] user system total real Hash with 0 elements 2.591444 0.023060 2.614504 ( 2.616658) Hash with 1 elements 3.052488 0.030433 3.082921 ( 3.102709) Hash with 2 elements 3.064385 0.027627 3.092012 ( 3.106096) Hash with 3 elements 3.036935 0.023353 3.060288 ( 3.063819) Hash with 4 elements 3.020218 0.022274 3.042492 ( 3.043182) Hash with 5 elements 3.053680 0.025551 3.079231 ( 3.083070) Hash with 6 elements 2.991555 0.023347 3.014902 ( 3.017601) Hash with 7 elements 3.011856 0.026142 3.037998 ( 3.041611) Hash with 8 elements 3.044671 0.033276 3.077947 ( 3.109949) Hash with 9 elements 14.873814 0.400856 15.274670 ( 15.309215) ``` ```ruby require "benchmark" TIMES = 100_000_000 Benchmark.bm do |x| x.report("Hash with 0 elements") do TIMES.times { {} } end x.report("Hash with 1 elements") do TIMES.times { { a: 0 } } end x.report("Hash with 2 elements") do TIMES.times { { a: 0, b: 0 } } end x.report("Hash with 3 elements") do TIMES.times { { a: 0, b: 0, c: 0 } } end x.report("Hash with 4 elements") do TIMES.times { { a: 0, b: 0, c: 0, d: 0 } } end x.report("Hash with 5 elements") do TIMES.times { { a: 0, b: 0, c: 0, d: 0, e: 0 } } end x.report("Hash with 6 elements") do TIMES.times { { a: 0, b: 0, c: 0, d: 0, e: 0, f: 0 } } end x.report("Hash with 7 elements") do TIMES.times { { a: 0, b: 0, c: 0, d: 0, e: 0, f: 0, g: 0 } } end x.report("Hash with 8 elements") do TIMES.times { { a: 0, b: 0, c: 0, d: 0, e: 0, f: 0, g: 0, h: 0 } } end x.report("Hash with 9 elements") do TIMES.times { { a: 0, b: 0, c: 0, d: 0, e: 0, f: 0, g: 0, h: 0, i: 0 } } end end ``` ---------------------------------------- Bug #20710: Reducing Hash allocation introduces large performance degradation (probably related to VWA) https://bugs.ruby-lang.org/issues/20710#change-109590 * Author: pocke (Masataka Kuwabara) * Status: Open * ruby -v: ruby 3.3.4 (2024-07-09 revision be1089c8ec) [arm64-darwin21] * Backport: 3.1: UNKNOWN, 3.2: UNKNOWN, 3.3: UNKNOWN ---------------------------------------- I found a surprising performance degradation while developing RBS. In short, I tried to remove unnecessary Hash allocations for RBS. Then, it made the execution time 2x slower. VWA for Hash probably causes this degradation. I'd be happy if we could mitigate the impact by updating the memory management strategy. ## Reproduce You can reproduce this problem on a PR in pocke/rbs repository. https://github.com/pocke/rbs/pull/2 This PR dedups empty Hash objects. 1. `git clone` and checkout 1. `bundle install` 1. `bundle exec rake compile` for C-ext 1. `bundle ruby benchmark/benchmark_new_env.rb` The "before" commit is https://github.com/pocke/rbs/commit/2c356c060286429cfdb034f88a74a6f94420fd21. The "after" commit is https://github.com/pocke/rbs/commit/bfb2c367c7d3b7f93720392252d3a3980d7bf335. The benchmark results are the following: ``` # Before $ bundle exec ruby benchmark/benchmark_new_env.rb (snip) new_env 6.426 (�15.6%) i/s - 64.000 in 10.125442s new_rails_env 0.968 (� 0.0%) i/s - 10.000 in 10.355738s # After $ bundle exec ruby benchmark/benchmark_new_env.rb (snip) new_env 4.371 (�22.9%) i/s - 43.000 in 10.150192s new_rails_env 0.360 (� 0.0%) i/s - 4.000 in 11.313158s ``` The IPS decreased 1.47x for `new_env` case (parsing small RBS env), and 2.69x for `new_rails_env` (parsing large RBS env). ## Investigation ### GC.stat `GC.stat` indicates the number of minor GCs increases. ```ruby # In the RBS repository require_relative './benchmark/utils' tmpdir = prepare_collection! new_rails_env(tmpdir) pp GC.stat ``` ``` # before {:count=>126, :time=>541, :marking_time=>496, :sweeping_time=>45, :heap_allocated_pages=>702, :heap_sorted_length=>984, :heap_allocatable_pages=>282, :heap_available_slots=>793270, :heap_live_slots=>787407, :heap_free_slots=>5863, :heap_final_slots=>0, :heap_marked_slots=>757744, :heap_eden_pages=>702, :heap_tomb_pages=>0, :total_allocated_pages=>702, :total_freed_pages=>0, :total_allocated_objects=>2220605, :total_freed_objects=>1433198, :malloc_increase_bytes=>5872, :malloc_increase_bytes_limit=>16777216, :minor_gc_count=>112, :major_gc_count=>14, :compact_count=>0, :read_barrier_faults=>0, :total_moved_objects=>0, :remembered_wb_unprotected_objects=>0, :remembered_wb_unprotected_objects_limit=>4779, :old_objects=>615704, :old_objects_limit=>955872, :oldmalloc_increase_bytes=>210912, :oldmalloc_increase_bytes_limit=>16777216} # after {:count=>255, :time=>1551, :marking_time=>1496, :sweeping_time=>55, :heap_allocated_pages=>570, :heap_sorted_length=>1038, :heap_allocatable_pages=>468, :heap_available_slots=>735520, :heap_live_slots=>731712, :heap_free_slots=>3808, :heap_final_slots=>0, :heap_marked_slots=>728727, :heap_eden_pages=>570, :heap_tomb_pages=>0, :total_allocated_pages=>570, :total_freed_pages=>0, :total_allocated_objects=>2183278, :total_freed_objects=>1451566, :malloc_increase_bytes=>1200, :malloc_increase_bytes_limit=>16777216, :minor_gc_count=>242, :major_gc_count=>13, :compact_count=>0, :read_barrier_faults=>0, :total_moved_objects=>0, :remembered_wb_unprotected_objects=>0, :remembered_wb_unprotected_objects_limit=>5915, :old_objects=>600594, :old_objects_limit=>1183070, :oldmalloc_increase_bytes=>8128, :oldmalloc_increase_bytes_limit=>16777216} ``` ### Warming up Hashes The following patch, which creates unnecessary Hash objects before the benchmark, improves the execution time. ```diff diff --git a/benchmark/benchmark_new_env.rb b/benchmark/benchmark_new_env.rb index 6dd2b73f..a8da61c6 100644 --- a/benchmark/benchmark_new_env.rb +++ b/benchmark/benchmark_new_env.rb @@ -4,6 +4,8 @@ require 'benchmark/ips' tmpdir = prepare_collection! +(0..30_000_000).map { {} } + Benchmark.ips do |x| x.time = 10 ``` The results are the following: ``` # Before Calculating ------------------------------------- new_env 10.354 (� 9.7%) i/s - 103.000 in 10.013834s new_rails_env 1.661 (� 0.0%) i/s - 17.000 in 10.282490s # After Calculating ------------------------------------- new_env 10.771 (� 9.3%) i/s - 107.000 in 10.010446s new_rails_env 1.584 (� 0.0%) i/s - 16.000 in 10.178984s ``` ### `RUBY_GC_HEAP_FREE_SLOTS_MIN_RATIO` The `RUBY_GC_HEAP_FREE_SLOTS_MIN_RATIO` env var also mitigates the performance impact. In this example, I set `RUBY_GC_HEAP_FREE_SLOTS_MIN_RATIO=0.6` (default: 0.20) ```console # Before Calculating ------------------------------------- new_env 10.271 (� 9.7%) i/s - 102.000 in 10.087191s new_rails_env 1.529 (� 0.0%) i/s - 16.000 in 10.538043s # After $ env RUBY_GC_HEAP_FREE_SLOTS_MIN_RATIO=0.6 bundle exec ruby benchmark/benchmark_new_env.rb Calculating ------------------------------------- new_env 11.003 (� 9.1%) i/s - 110.000 in 10.068428s new_rails_env 1.347 (� 0.0%) i/s - 14.000 in 11.117665s ``` ## Additional Information * I applied the same change to Array. But it does not cause this problem. * I guess the cause is the difference of the Size Pool. An empty Array uses 40 bytes like the ordinal Ruby object, but an empty Hash uses 160 bytes. * The Size Pool for 160 bytes objects has fewer objects than the 40 bytes one. So, reducing allocation affects the performance sensitively. * I tried it on Ruby 3.2. This change on Ruby 3.2 does not degrade the execution time. * VWA for Hash is introduced since Ruby 3.3. https://github.com/ruby/ruby/blob/73c39a5f93d3ad4514a06158e2bb7622496372b9/doc/NEWS/NEWS-3.3.0.md#gc--memory-management ## Acknowledgement @mame, @ko1, and @soutaro helped the investigation. I would like to thank them. -- https://bugs.ruby-lang.org/ ______________________________________________ ruby-core mailing list -- ruby-core@ml.ruby-lang.org To unsubscribe send an email to ruby-core-leave@ml.ruby-lang.org ruby-core info -- https://ml.ruby-lang.org/mailman3/lists/ruby-core.ml.ruby-lang.org/