From: "luke-gru (Luke Gruber) via ruby-core" <ruby-core@...>
Date: 2025-11-20T23:38:05+00:00
Subject: [ruby-core:123874] [Ruby Bug#21685] Unnecessary context-switching, especially bad on multi-core machines.

Issue #21685 has been updated by luke-gru (Luke Gruber).


Thanks for taking a look at this and coming up with an implementation. This is great.

I haven't really played around with it much but I did read the code and I have a few thoughts:

* There's 1 deferred wait thread per ractor, which isn't ideal but I understand why you did it like that. It does look like it would work well for programs that don't use ractors. I was envisioning a thread that would deal with all ractors at once and instead of sleeping for a fixed time, would loop over a registered list (maybe a min-heap?) and sleep for a variable number of microseconds depending on the first thread's registration time.

* The deferred wait thread gets joined at ractor free time, which isn't great. It would be better (more predictable, safer) to join the thread at ractor termination. Since we only want 1 of these threads though, it could have a lifecycle similar to the timer thread (created on startup and on fork).

* There's an ABA issue with the thread pointer: in the unlikely scenario where the thread is freed while the deferred wait thread still holds a pointer to it, and another thread is then created at the same memory address, the deferred wait thread would try to schedule that new thread. It would have to hold on to the thread's id instead, which could be a monotonically increasing 64-bit uint. Similarly, deferred_wait_thread_count should be a 64-bit uint.

* deferred_wait_th_dummy is leaking :)

* 50 microseconds seems reasonable but I haven't played around with it.
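
To make the "one thread for all ractors, variable sleep" idea and the ABA point above a bit more concrete, here is a rough Ruby model (the real thing would be C inside the VM). Everything in it is hypothetical: `DeferredWaitRegistry`, `register`, `cancel`, `next_sleep_us`, `pop_expired` and the `delay_us` parameter are names made up for illustration, and a sorted array stands in for the min-heap. Entries are keyed by a never-reused serial number rather than a thread pointer, so a recycled address can't be mistaken for the original thread.

```ruby
# Hypothetical model only; none of these names exist in CRuby.
class DeferredWaitRegistry
  Entry = Struct.new(:deadline_us, :serial)

  def initialize(delay_us: 50)
    @delay_us = delay_us
    @next_serial = 0   # monotonically increasing, never reused
    @queue = []        # kept sorted by [deadline_us, serial]; stand-in for a min-heap
    @live = {}         # serial => true while that thread is still in its blocking call
  end

  # A thread enters a blocking call at time now_us; returns its serial.
  def register(now_us)
    serial = (@next_serial += 1)
    entry = Entry.new(now_us + @delay_us, serial)
    key = [entry.deadline_us, entry.serial]
    idx = @queue.bsearch_index { |e| ([e.deadline_us, e.serial] <=> key) > 0 } || @queue.size
    @queue.insert(idx, entry)
    @live[serial] = true
    serial
  end

  # The blocking call returned before its deadline: nothing to suspend.
  # Returns true if the serial was still registered, false otherwise.
  def cancel(serial)
    !@live.delete(serial).nil?
  end

  # Microseconds the single scheduler thread should sleep before the
  # earliest live deadline; nil when nothing is registered.
  def next_sleep_us(now_us)
    drop_cancelled
    head = @queue.first
    head && [head.deadline_us - now_us, 0].max
  end

  # Serials whose deadline has passed and which are still blocked.
  def pop_expired(now_us)
    expired = []
    while (head = @queue.first) && head.deadline_us <= now_us
      @queue.shift
      expired << head.serial if @live.delete(head.serial)
    end
    expired
  end

  private

  # Discard cancelled entries sitting at the front of the queue.
  def drop_cancelled
    @queue.shift while @queue.first && !@live.key?(@queue.first.serial)
  end
end
```

With a fixed delay the deadlines arrive already ordered, so a plain FIFO might be enough in practice; a heap only starts to matter if per-thread delays can differ.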

Overall great job, and our team might look at this in the new year. If you want to work on it further or collaborate, don't hesitate to reach out.

Thanks.

----------------------------------------
Bug #21685: Unnecessary context-switching, especially bad on multi-core machines.
https://bugs.ruby-lang.org/issues/21685#change-115276

* Author: jpl-coconut (Jacob Lacouture)
* Status: Open
* ruby -v: ruby 3.4.7 (2025-10-08 revision 7a5688e2a2) +PRISM [aarch64-linux]
* Backport: 3.2: UNKNOWN, 3.3: UNKNOWN, 3.4: UNKNOWN
----------------------------------------
While debugging a performance issue in a large Rails application, I wrote a minimal microbenchmark that reproduces the issue ([[here]](https://gist.github.com/jpl-coconut/cb3679ce885eb578e1071c4b3a525d5c)). I was surprised to see that the benchmark takes ~3.6sec on a single-core machine, and ~36sec **(10x slower) on a machine with 2 or more cores**. Initially I thought this was a bug in the implementation of Thread::Queue, but soon realized it relates to how Ruby reschedules threads around system calls.

I prepared a fix in [[this branch]](https://github.com/jpl-coconut/ruby/tree/deferred_thread_wait), which is based on Ruby 3.4.7. I can apply the fix to a different branch or to master if that's helpful. The fix simply defers suspending the thread until the syscall has been running for some short interval. I chose 100usec initially, but this could easily be made configurable.

I pasted raw benchmark results below from a single run (though I did many runs and the results are stable). My CPU is an Apple M4.
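
The `voluntary_ctxt_switches` / `nonvoluntary_ctxt_switches` lines in the results have the same shape as the fields of `/proc/self/status` on Linux. As a minimal sketch, assuming that file is the source (the linked gists are not reproduced here, and the helper name `ctxt_switches` is made up for illustration), a script could collect the counters like this:

```ruby
# Hypothetical helper: extract the context-switch counters from text in
# /proc/<pid>/status format ("key:<tab>value" per line).
def ctxt_switches(status_text)
  status_text.each_line.with_object({}) do |line, counts|
    key, value = line.split(":", 2)
    next unless key&.end_with?("ctxt_switches") && value
    counts[key] = Integer(value.strip)
  end
end

# Linux only: print the counters for the current process, if available.
if File.readable?("/proc/self/status")
  ctxt_switches(File.read("/proc/self/status")).each do |name, count|
    puts "#{name}:\t#{count}"
  end
end
```

Calling something like this at the end of a run gives totals for the whole process lifetime, which is what makes the before/after comparison meaningful.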

After the fix:
- Single-core runtime drops by about 44%, from 3.6sec to 2sec (roughly a 1.8x speedup).
- Adding cores leaves the runtime flat (at 2sec), rather than making it 10x slower.
- Multi-core context-switch count drops by 99.995%, from 1.4 million to ~80.
- system_time/user_time ratio drops from (1.2 - 1.6) to 0.65


Here are the benchmark results before my change:

```
# time taskset --cpu-list 1 ./ruby qtest_simple.rb
voluntary_ctxt_switches:	1140773
nonvoluntary_ctxt_switches:	9487
real	0m3.619s
user	0m1.653s
sys	0m1.950s

# time taskset --cpu-list 1,2 ./ruby qtest_simple.rb
voluntary_ctxt_switches:	1400110
nonvoluntary_ctxt_switches:	3
real	0m36.223s
user	0m9.380s
sys	0m14.927s
```


And after:
```
# time taskset --cpu-list 1 ./ruby qtest_simple.rb
voluntary_ctxt_switches:	88
nonvoluntary_ctxt_switches:	899
real	0m2.031s
user	0m1.209s
sys	0m0.743s

# time taskset --cpu-list 1,2 ./ruby qtest_simple.rb
voluntary_ctxt_switches:	75
nonvoluntary_ctxt_switches:	8
real	0m2.062s
user	0m1.279s
sys	0m0.783s
```
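
For anyone who wants to re-derive the summary figures from the raw output, here is a small sketch that recomputes them. The numbers are copied from the `time` output and counters above; the constant and method names (`BEFORE`, `AFTER`, `percent_reduction`, `sys_user_ratio`) are just illustrative.

```ruby
# Raw figures copied from the benchmark output (seconds and switch counts).
BEFORE = {
  "1 core"  => { real: 3.619,  user: 1.653, sys: 1.950,  voluntary: 1_140_773 },
  "2 cores" => { real: 36.223, user: 9.380, sys: 14.927, voluntary: 1_400_110 },
}
AFTER = {
  "1 core"  => { real: 2.031, user: 1.209, sys: 0.743, voluntary: 88 },
  "2 cores" => { real: 2.062, user: 1.279, sys: 0.783, voluntary: 75 },
}

# Percentage by which new_value is smaller than old_value.
def percent_reduction(old_value, new_value)
  (1.0 - new_value.to_f / old_value) * 100.0
end

# Ratio of kernel time to user time for one run.
def sys_user_ratio(run)
  run[:sys] / run[:user]
end

BEFORE.each_key do |config|
  b = BEFORE[config]
  a = AFTER[config]
  puts format("%s: %.1fx faster, wall-clock -%.1f%%, voluntary switches -%.3f%%, sys/user %.2f -> %.2f",
              config,
              b[:real] / a[:real],
              percent_reduction(b[:real], a[:real]),
              percent_reduction(b[:voluntary], a[:voluntary]),
              sys_user_ratio(b),
              sys_user_ratio(a))
end
```

On this single run the one-core wall-clock time falls by roughly 44% (about a 1.8x speedup), and the two-core run comes out around 17.6x faster.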

I was concerned these results might still be reflective of a bug in Thread::Queue, so I also came up with a repro that doesn't rely on it. That one is [[here]](https://gist.github.com/jpl-coconut/aa14e59354abf98f808daaf39baa9a72).

Results summary:
- Single-core performance improves (this time by only 30%)
- Multi-core penalty drops from 4x to 0.
- No change to context-switching rates.
- system_time/user_time ratio drops from (0.5-1) to 0.15

Before fix:
```
# time taskset --cpu-list 1 ./ruby mbenchmark.rb
voluntary_ctxt_switches:	60
real	0m0.336s
user	0m0.211s
sys	0m0.118s

# time taskset --cpu-list 1,2 ./ruby mbenchmark.rb
voluntary_ctxt_switches:	60
real	0m1.424s
user	0m0.468s
sys	0m0.496s
```

After fix:
```
# time taskset --cpu-list 1 ./ruby mbenchmark.rb
voluntary_ctxt_switches:	59
real	0m0.241s
user	0m0.202s
sys	0m0.032s

# time taskset --cpu-list 1,2 ./ruby mbenchmark.rb
voluntary_ctxt_switches:	60
real	0m0.238s
user	0m0.195s
sys	0m0.035s
```

