From: "jpl-coconut (Jacob Lacouture) via ruby-core"
Date: 2025-11-13T23:34:00+00:00
Subject: [ruby-core:123795] [Ruby Bug#21685] Unnecessary context-switching, especially bad on multi-core machines.

Issue #21685 has been reported by jpl-coconut (Jacob Lacouture).

----------------------------------------
Bug #21685: Unnecessary context-switching, especially bad on multi-core machines.
https://bugs.ruby-lang.org/issues/21685

* Author: jpl-coconut (Jacob Lacouture)
* Status: Open
* ruby -v: ruby 3.4.7 (2025-10-08 revision 7a5688e2a2) +PRISM [aarch64-linux]
* Backport: 3.2: UNKNOWN, 3.3: UNKNOWN, 3.4: UNKNOWN
----------------------------------------
While debugging a performance issue in a large Rails application, I wrote a minimal microbenchmark that reproduces the issue: [[here]](https://gist.github.com/jpl-coconut/cb3679ce885eb578e1071c4b3a525d5c)

I was surprised to see that the benchmark takes ~3.6sec on a single-core machine, and ~36sec **(10x slower) on a machine with 2 or more cores**. Initially I thought this was a bug in the implementation of Thread::Queue, but I soon realized it relates to how Ruby reschedules threads around system calls.

I prepared a fix in [[this branch]](https://github.com/jpl-coconut/ruby/tree/deferred_thread_wait), which is based on ruby 3.4.7. I can apply the fix to a different branch or to master if that's helpful.

The fix simply defers suspending the thread until the syscall has been running for some short interval. I chose 100usec initially, but this could easily be made configurable.

I pasted raw benchmark results below from a single run (though I did many runs and the results are stable). My CPU is an Apple M4.

After the fix:
- Single-core performance improves by 55%, from 3.6sec to 2sec.
- Adding cores leaves performance flat (at 2sec), rather than making it 10x slower.
- Multi-core context-switch count reduces by 99.995%, from 1.4 million to ~80
- system_time/user_time ratio drops from (1.2 - 1.6) to 0.65

Here are the benchmark results before my change:

```
# time taskset --cpu-list 1 ./ruby qtest_simple.rb
voluntary_ctxt_switches: 1140773
nonvoluntary_ctxt_switches: 9487

real 0m3.619s
user 0m1.653s
sys 0m1.950s

# time taskset --cpu-list 1,2 ./ruby qtest_simple.rb
voluntary_ctxt_switches: 1400110
nonvoluntary_ctxt_switches: 3

real 0m36.223s
user 0m9.380s
sys 0m14.927s
```

And after:

```
# time taskset --cpu-list 1 ./ruby qtest_simple.rb
voluntary_ctxt_switches: 88
nonvoluntary_ctxt_switches: 899

real 0m2.031s
user 0m1.209s
sys 0m0.743s

# time taskset --cpu-list 1,2 ./ruby qtest_simple.rb
voluntary_ctxt_switches: 75
nonvoluntary_ctxt_switches: 8

real 0m2.062s
user 0m1.279s
sys 0m0.783s
```

I was concerned these results might still be reflective of a bug in Thread::Queue, so I also came up with a repro that doesn't rely on it. That one is [[here]](https://gist.github.com/jpl-coconut/aa14e59354abf98f808daaf39baa9a72).

Results summary:
- Single-core performance improves (this time by only 30%)
- Multi-core penalty drops from 4x to 0.
- No change to context-switching rates.
- system_time/user_time ratio drops from (0.5-1) to 0.15

Before fix:

```
# time taskset --cpu-list 1 ./ruby mbenchmark.rb
voluntary_ctxt_switches: 60

real 0m0.336s
user 0m0.211s
sys 0m0.118s

# time taskset --cpu-list 1,2 ./ruby mbenchmark.rb
voluntary_ctxt_switches: 60

real 0m1.424s
user 0m0.468s
sys 0m0.496s
```

After fix:

```
# time taskset --cpu-list 1 ./ruby mbenchmark.rb
voluntary_ctxt_switches: 59

real 0m0.241s
user 0m0.202s
sys 0m0.032s

# time taskset --cpu-list 1,2 ./ruby mbenchmark.rb
voluntary_ctxt_switches: 60

real 0m0.238s
user 0m0.195s
sys 0m0.035s
```
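
For readers who cannot open the gists, here is a minimal sketch of the kind of workload described in this report. It is not the code from either gist. It assumes a simple ping-pong between two threads over Thread::Queue, and it assumes the `voluntary_ctxt_switches` / `nonvoluntary_ctxt_switches` figures can be read from `/proc/self/status` on Linux. The names `ITERATIONS`, `ctxt_switches`, and `ping_pong` are hypothetical and chosen only for illustration.

```ruby
# Hypothetical sketch, NOT the benchmark from the gists: two threads hand a
# token back and forth through Thread::Queue, so every pop may block.
ITERATIONS = 10_000 # assumed value, picked only to keep the run short

# Return the *_ctxt_switches counters listed in /proc/self/status.
# Assumption: Linux procfs is available; elsewhere this returns {}.
def ctxt_switches
  path = "/proc/self/status"
  return {} unless File.readable?(path)

  File.foreach(path).each_with_object({}) do |line, counts|
    key, value = line.split(":", 2)
    counts[key] = value.strip.to_i if value && key.end_with?("ctxt_switches")
  end
end

# Send `iterations` tokens to an echo thread and count the ones that come back.
def ping_pong(iterations)
  ping = Thread::Queue.new
  pong = Thread::Queue.new
  echo = Thread.new { iterations.times { pong << ping.pop } }

  received = 0
  iterations.times do |i|
    ping << i
    received += 1 if pong.pop == i
  end
  echo.join
  received
end

before = ctxt_switches
started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
received = ping_pong(ITERATIONS)
elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
after = ctxt_switches

puts format("round trips: %d in %.3fs", received, elapsed)
after.each { |key, value| puts "#{key}: #{value - before.fetch(key, 0)}" }
```

Running a script like this under `time taskset --cpu-list 1` and then `time taskset --cpu-list 1,2`, as in the results above, should make it possible to compare wall-clock time and the sys/user ratio between the two settings. Whether it shows the same slowdown as the original benchmarks is not something this sketch guarantees.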