From: "ko1 (Koichi Sasada) via ruby-core"
Date: 2025-12-09T18:07:28+00:00
Subject: [ruby-core:124112] [Ruby Bug#21685] Unnecessary context-switching, especially bad on multi-core machines.

Issue #21685 has been updated by ko1 (Koichi Sasada).

The previous machine was on WSL. On another vanilla Linux machine (Ubuntu 24.04 / Linux 6.8.0-87-generic / i7-13700H, nproc: 20):

```
        all CPU                              1 CPU
MN=0    real: 0m8.919s                       real: 0m7.874s
        voluntary_ctxt_switches: 1399816     voluntary_ctxt_switches: 1657676
        nonvoluntary_ctxt_switches: 35       nonvoluntary_ctxt_switches: 48541

MN=1    real: 0m10.118s                      real: 0m6.915s
        voluntary_ctxt_switches: 1399058     voluntary_ctxt_switches: 1203349
        nonvoluntary_ctxt_switches: 16       nonvoluntary_ctxt_switches: 20080

MN=1/tasks run on child threads (qstress(Queue) is on a child thread)
        real: 0m6.692s                       real: 0m5.672s
        voluntary_ctxt_switches: 1           voluntary_ctxt_switches: 1
        nonvoluntary_ctxt_switches: 1        nonvoluntary_ctxt_switches: 3
```

There are differences, but they are not as significant as those in the description. It seems to depend on the CPU and Linux version (scheduler).

BTW, the single-thread overhead is:

```
ko1@ruby-sp3:~/ruby/build/master$ time make run
./miniruby -I../../src/master/lib -I. -I.ext/common -r./x86_64-linux-fake ../../src/master/test.rb

real    0m3.436s
user    0m2.239s
sys     0m1.197s
```

with the following script:

```ruby
def qstress_single qclass
  q = qclass.new
  messages = []

  # worker
  (0..1000000).each{|i|
    q.push i
    File.write('/dev/null', '0')

    # master
    m = q.pop
    messages << m
    if messages.count >= 5
      messages.clear
      File.write('/dev/null', '0')
    end

    case m
    when :end
      break
    end
  }
  q.push :end
end

# The class argument is missing from the script as posted;
# Thread::Queue is assumed here from the context of the issue.
qstress_single Thread::Queue
```

----------------------------------------
Bug #21685: Unnecessary context-switching, especially bad on multi-core machines.
https://bugs.ruby-lang.org/issues/21685#change-115557

* Author: jpl-coconut (Jacob Lacouture)
* Status: Open
* ruby -v: ruby 3.4.7 (2025-10-08 revision 7a5688e2a2) +PRISM [aarch64-linux]
* Backport: 3.2: UNKNOWN, 3.3: UNKNOWN, 3.4: UNKNOWN
----------------------------------------
While debugging a performance issue in a large Rails application, I wrote a minimal microbenchmark that reproduces the issue. [[here]](https://gist.github.com/jpl-coconut/cb3679ce885eb578e1071c4b3a525d5c)

I was surprised to see that the benchmark takes ~3.6sec on a single-core machine, and ~36sec **(10x slower) on a machine with 2 or more cores**.

Initially I thought this was a bug in the implementation of Thread::Queue, but I soon realized it relates to how Ruby reschedules threads around system calls.

I prepared a fix in [[this branch]](https://github.com/jpl-coconut/ruby/tree/deferred_thread_wait), which is based on ruby 3.4.7. I can apply the fix to a different branch or to master if that's helpful.

The fix simply defers suspending the thread until the syscall has been running for some short interval. I chose 100usec initially, but this could easily be made configurable.

I pasted raw benchmark results below from a single run (though I did many runs and the results are stable). My CPU is an Apple M4.

After the fix:
- Single-core run time drops by about 44%, from 3.6sec to 2sec.
- Adding cores causes performance to be flat (at 2sec), rather than getting 10x slower.
- Multi-core context-switch count reduces by 99.995%, from 1.4 million to ~80
- system_time/user_time ratio drops from (1.2 - 1.6) to 0.65

Here are the benchmark results before my change:

```
# time taskset --cpu-list 1 ./ruby qtest_simple.rb
voluntary_ctxt_switches:     1140773
nonvoluntary_ctxt_switches:  9487

real    0m3.619s
user    0m1.653s
sys     0m1.950s

# time taskset --cpu-list 1,2 ./ruby qtest_simple.rb
voluntary_ctxt_switches:     1400110
nonvoluntary_ctxt_switches:  3

real    0m36.223s
user    0m9.380s
sys     0m14.927s
```

And after:

```
# time taskset --cpu-list 1 ./ruby qtest_simple.rb
voluntary_ctxt_switches:     88
nonvoluntary_ctxt_switches:  899

real    0m2.031s
user    0m1.209s
sys     0m0.743s

# time taskset --cpu-list 1,2 ./ruby qtest_simple.rb
voluntary_ctxt_switches:     75
nonvoluntary_ctxt_switches:  8

real    0m2.062s
user    0m1.279s
sys     0m0.783s
```

I was concerned these results might still be reflective of a bug in Thread::Queue, so I also came up with a repro that doesn't rely on it. That one is [[here]](https://gist.github.com/jpl-coconut/aa14e59354abf98f808daaf39baa9a72).

Results summary:
- Single-core performance improves (this time by only 30%)
- Multi-core penalty drops from 4x to 0.
- No change to context-switching rates.
- system_time/user_time ratio drops from (0.5-1) to 0.15

Before fix:

```
# time taskset --cpu-list 1 ./ruby mbenchmark.rb
voluntary_ctxt_switches:  60

real    0m0.336s
user    0m0.211s
sys     0m0.118s

# time taskset --cpu-list 1,2 ./ruby mbenchmark.rb
voluntary_ctxt_switches:  60

real    0m1.424s
user    0m0.468s
sys     0m0.496s
```

After fix:

```
# time taskset --cpu-list 1 ./ruby mbenchmark.rb
voluntary_ctxt_switches:  59

real    0m0.241s
user    0m0.202s
sys     0m0.032s

# time taskset --cpu-list 1,2 ./ruby mbenchmark.rb
voluntary_ctxt_switches:  60

real    0m0.238s
user    0m0.195s
sys     0m0.035s
```

-- 
https://bugs.ruby-lang.org/
______________________________________________
ruby-core mailing list -- ruby-core@ml.ruby-lang.org
To unsubscribe send an email to ruby-core-leave@ml.ruby-lang.org
ruby-core info -- https://ml.ruby-lang.org/mailman3/lists/ruby-core.ml.ruby-lang.org/
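
The `voluntary_ctxt_switches` and `nonvoluntary_ctxt_switches` figures quoted in the results above use the field names that Linux exposes in `/proc/<pid>/status`. The reporting code in the linked gists is not reproduced in this thread, so the following is only a minimal sketch of how such counters might be collected from Ruby, not the benchmark's actual method. The helper names `parse_ctxt_switches` and `ctxt_switches` are hypothetical, and the sketch assumes a Linux system with `/proc` mounted.

```ruby
# Hypothetical helper: extract the context-switch counters from text in the
# format of /proc/<pid>/status. Returns a Hash mapping field name to count,
# e.g. {"voluntary_ctxt_switches" => 88, "nonvoluntary_ctxt_switches" => 899}.
def parse_ctxt_switches(status_text)
  status_text.each_line.with_object({}) do |line, counts|
    key, value = line.split(":", 2)
    # Skip every line that is not one of the *_ctxt_switches fields.
    next unless key&.end_with?("ctxt_switches") && value
    counts[key.strip] = Integer(value.strip)
  end
end

# Hypothetical helper: read the counters for the current process.
# Assumes Linux; the default path does not exist on other platforms.
def ctxt_switches(path = "/proc/self/status")
  parse_ctxt_switches(File.read(path))
end

if __FILE__ == $PROGRAM_NAME && File.readable?("/proc/self/status")
  before = ctxt_switches
  # Small stand-in workload: repeated short write syscalls.
  1000.times { File.write("/dev/null", "0") }
  after = ctxt_switches
  # Print how much each counter grew while the workload ran.
  after.each { |name, count| puts "#{name}: #{count - before.fetch(name, 0)}" }
end
```

Taking the difference between two readings, rather than a single reading at exit, keeps interpreter start-up out of the measurement.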