From: "kjtsanaktsidis (KJ Tsanaktsidis) via ruby-core" Date: 2023-06-11T12:33:06+00:00 Subject: [ruby-core:113867] [Ruby master Bug#19717] `ConditionVariable#signal` is not fair when the wakeup is consistently spurious. Issue #19717 has been updated by kjtsanaktsidis (KJ Tsanaktsidis). I spent a bit of time looking into this issue. I think I understand _why_ it's happening now, but I don't quite know what, if anything, we should do about it. Firstly, I fixed up the reproduction issue in a couple of ways - after these changes, both Ruby implementations, as well as the pure-C++ reproduction, show the same issue. Fixes: https://github.com/KJTsanaktsidis/ruby-condition-variable-timeout Then, I also patched the Ruby interpreter to add a few more USDT tracepoints to diagnose the issue: https://github.com/ruby/ruby/compare/master...KJTsanaktsidis:ruby:ktsanaktsidis/add_thread_probes ---- So, what's happening, in short, is a combination of a few things: 1. If a single thread has a mutex, signals a condition variable, unlocks the mutex, and then locks it again, it's very likely that the thread will be successful in acquiring the mutex again the second time immediately. This is because whilst the condvar signal makes another thread _eligible_ to run, the original thread still holds the GVL and will do so until it sleeps or does blocking I/O of some kind (or finishes its timeslice, but that's not relevant here). 2. Ruby condition variables are fair (as a byproduct of Ruby mutexes being fair). We maintain a list of threads that are waiting to be signaled, and wake them up from oldest to newest. 3. Ruby of course doesn't understand whether or not a "resource" is available when using a condition variable. It will wake up a thread when the condvar is signaled, but it's up the user code to decide if it can actually make forward progress at that point (i.e. there are spurious wakeups, from an application perspective, although not spurious wakeups from Ruby's point of view - something really did signal the condvar). 4. Because of the combination of these factors, when you have one thread performing the actions from point 1, it's going to be doing a spurious wakeup on the condvar; whatever thread which wakes up won't be able to make progress because the first thread acquired the resource again. 5. With the right combination of threads, concecutive release-acquire operations on a resource, and predictable delays, this can lead to starvation. In the reproduction example above, you have three threads and two consecutive release-acquire operations. That means that half of the condvar wakeups will be spurious; the wakeups which are immediately followed by the same thread re-acquiring the resource don't help maintain forward progress. However, it _does_ put the spuriously-woken-up thread to the back of the waitq. The resource ends up ping-ponging between two of the threads whilst the third is eternally starved out. 6.I wrote up a detailed log of which thread does what in what order here: https://gist.github.com/KJTsanaktsidis/ddfbccc7b48d90277ec81ebf16591cb8 --- This is not a Ruby problem per-se; the C++ example in @ioquatix's repo exhibits the same issue (after I fixed it). I believe glibc's pthread condition variables have fair queues like in Ruby (although notably glibc's _mutexes_ are not fair). Ruby perhaps makes the issue _more likely_ because the GVL adds more determinsim though. The best idea I have right now to "fix" this is to have `ConditionVariable#wait` take an optional kwarg that lets the caller report whether their last wakeup was spurious or not; if set, we could prepend the caller to the waitq rather than appending them. It would be much better if we could solve this problem without adding any API surface though IMO. I feel like this must be a problem with a lot of literature around it but perhaps I don't know the right words to look for. ---------------------------------------- Bug #19717: `ConditionVariable#signal` is not fair when the wakeup is consistently spurious. https://bugs.ruby-lang.org/issues/19717#change-103515 * Author: ioquatix (Samuel Williams) * Status: Open * Priority: Normal * Backport: 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN ---------------------------------------- For background, see this issue . It looks like `ConditionVariable#signal` is not fair, if the calling thread immediately reacquires the resource. I've given a detailed reproduction here as it's non-trivial: . Because the spurious wakeup occurs, the thread is pushed to the back of the waitq, which means any other waiting thread will acquire the resource, and that thread will perpetually be at the back of the queue. I believe the solution is to change `ConditionVarialbe#signal` should only remove the thread from the waitq if it's possible to acquire the lock. Otherwise it should be left in place, so that the order is retained, this should result in fair scheduling. -- https://bugs.ruby-lang.org/ ______________________________________________ ruby-core mailing list -- ruby-core@ml.ruby-lang.org To unsubscribe send an email to ruby-core-leave@ml.ruby-lang.org ruby-core info -- https://ml.ruby-lang.org/mailman3/postorius/lists/ruby-core.ml.ruby-lang.org/