From: "luke-gru (Luke Gruber) via ruby-core"
Date: 2025-09-19T18:51:51+00:00
Subject: [ruby-core:123309] [Ruby Bug#21612] Make sure we never context switch while holding the VM lock

Issue #21612 has been updated by luke-gru (Luke Gruber).

Well, I'm not sure if it should be allowed. The reason I said it should be is that currently, `EC_JUMP_TAG` is supported. If that is supported, then checking interrupts should be supported too because jumping can cause interrupts to be fired. We could avoid the context switch in that case. However, arbitrary Ruby code can context switch in other ways like calling `sleep` or any blocking operation.

----------------------------------------
Bug #21612: Make sure we never context switch while holding the VM lock
https://bugs.ruby-lang.org/issues/21612#change-114674

* Author: luke-gru (Luke Gruber)
* Status: Open
* Target version: 3.5
* Backport: 3.2: UNKNOWN, 3.3: UNKNOWN, 3.4: UNKNOWN
----------------------------------------
## The Problem

We're seeing errors in our application that uses ractors. The errors look like:

```
[BUG] unexpected situation - recordd:1 current:0
error.c:1097 rb_bug_without_die_internal
vm_sync.c:275 disallow_reentry
eval_intern.h:136 rb_ec_vm_lock_rec_check
eval_intern.h:147 rb_ec_tag_state
vm.c:2619 rb_vm_exec
vm.c:1702 rb_yield
eval.c:1173 rb_ensure
```

We concluded that there was context switching going on while a thread held the VM lock. During the investigation into the issue, we added assertions in the code that we never yield to another thread with the VM lock held. We enabled these VM lock assertions even in single ractor mode. These assertions were failing in a few places, but most notably in finalizers. Finalizers are running with the VM lock held, and they were context switching and causing this issue.

## Why Is This Bad?

There are a few reasons we shouldn't be able to context switch while holding the VM lock.

In single-ractor mode with threads A and B:

1) Anything in this critical section should be thought of as a transaction related to the memory that's changed inside. If A has the lock, manipulates some global memory, and yields to B with the lock still taken and the memory updates unfinished, and B then takes the lock and starts writing to the same memory, the state of this global memory could be corrupted. Currently we don't actually take the VM lock in single-ractor mode, but that doesn't mean these issues can't happen. Yielding to another thread in the middle of manipulating global memory *can* still happen and it causes similar issues.

In multi-ractor mode with ractors A and B:

1) We get the same issues as in single-ractor mode.
2) We can also get deadlocks if A has the lock, yields to B and B is blocked waiting on the lock.

Unfortunately, many things can cause context switching in Ruby, so what is safe to call when the VM lock is taken?

## Guidelines

I've come up with some guidelines.

With the VM lock held, you should be able to:

* Create Ruby objects, call `ruby_xmalloc`, etc.
* Jump using `EC_JUMP_TAG`. The lock will automatically be unlocked depending on how far up the call stack you locked it and where you're jumping to.
* Check Ruby interrupts. Since jumping can pop Ruby frames and popping frames checks interrupts, you are allowed to do this. Checking interrupts should never context switch with the VM lock held, even if the Ruby thread's quantum is up.

You shouldn't be able to:

* Call any Ruby method or enter Ruby's VM loop. For example, `rb_funcall` is not allowed, nor is `rb_warn` (it can call Ruby code). `rb_sprintf` is not allowed because it can call `rb_inspect`.
* Call `rb_nogvl`.
* Enter any blocking operation managed by Ruby.
* Call a Ruby-level mechanism that can context switch, like `rb_mutex_lock`.

## The Fix

Of course, unlocking during finalizers is the main fix but there are other places that also need unlocking.
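
As a rough illustration of the unlock-around-finalizers idea, here is a toy model in Ruby. It is not CRuby's C implementation: `ToyVMLock`, `unlocked`, `context_switch_point!` and `run_finalizers` are hypothetical names invented for this sketch, and the real VM lock differs in many details. The sketch only shows the shape of the rule: a finalizer is arbitrary Ruby code, so the lock is dropped before calling it, and an assertion fires if a would-be context switch is reached with the lock still held.

```ruby
# Toy model only: every name here is hypothetical, not a CRuby internal.
class ToyVMLock
  class HeldAcrossSwitch < StandardError; end

  def initialize
    @depth = 0 # recursion depth, as for a re-entrant lock
  end

  def held?
    @depth > 0
  end

  # Take the (toy) lock for the duration of the block.
  def synchronize
    @depth += 1
    yield
  ensure
    @depth -= 1
  end

  # Drop every recursion level around the block, then restore the depth.
  def unlocked
    saved = @depth
    @depth = 0
    yield
  ensure
    @depth = saved
  end

  # Stand-in for any point where the scheduler could switch threads.
  def context_switch_point!
    raise HeldAcrossSwitch, "context switch with VM lock held" if held?
  end
end

# Runs finalizers from inside a locked section, optionally unlocking first.
def run_finalizers(lock, finalizers, unlock_first:)
  lock.synchronize do
    finalizers.each do |finalizer|
      if unlock_first
        lock.unlocked { finalizer.call(lock) }
      else
        finalizer.call(lock)
      end
    end
  end
end

lock = ToyVMLock.new
# Models a finalizer that calls `sleep` or another blocking operation.
sleepy_finalizer = ->(l) { l.context_switch_point! }

run_finalizers(lock, [sleepy_finalizer], unlock_first: true) # no error

begin
  run_finalizers(lock, [sleepy_finalizer], unlock_first: false)
rescue ToyVMLock::HeldAcrossSwitch => e
  puts "caught: #{e.message}" # prints "caught: context switch with VM lock held"
end
```

The `ensure` clauses restore the depth even when the block raises, so the toy lock is never left held after a failed assertion.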

I think adding assertions that the VM lock is not held will be important in finding these bugs and in avoiding regressions in the future. We don't have to add lots of these; a few places are enough. These assertions, which only run in debug mode, should also run when in single-ractor mode.

## Future Work

I think some documentation on what is and isn't allowed while holding the VM lock and other locks in the CRuby source would be helpful. I am currently working on a `Concurrency Guide` for CRuby developers that includes this info. It will not go over every lock, just the VM lock and the "all other locks" category.
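
Returning to the assertions proposed under "The Fix": the point that they should run even in single-ractor mode, where the VM lock is not actually taken, can be sketched with another toy Ruby model. `TrackedVMLock`, `assert_not_held!` and `failures_for` are hypothetical names, and the `debug:` flag merely stands in for a debug build. The assumption illustrated is that the lock depth is tracked in every mode, so the check works whether or not a real lock is acquired.

```ruby
require "monitor"

# Toy, single-threaded model only: all names here are hypothetical.
class TrackedVMLock
  class AssertionFailed < StandardError; end

  def initialize(multi_ractor:, debug: true)
    @multi_ractor = multi_ractor
    @debug = debug
    @monitor = Monitor.new # re-entrant, standing in for a recursive lock
    @depth = 0             # tracked even when no real lock is taken
  end

  def synchronize
    if @multi_ractor
      @monitor.synchronize { track { yield } }
    else
      track { yield } # single-ractor mode: no real lock, depth still counted
    end
  end

  # Debug-only check meant for places that may yield to another thread.
  def assert_not_held!(where)
    return unless @debug
    raise AssertionFailed, "VM lock held at #{where}" if @depth > 0
  end

  private

  def track
    @depth += 1
    yield
  ensure
    @depth -= 1
  end
end

# Returns 1 if the assertion fired inside the locked section, else 0.
def failures_for(lock)
  lock.synchronize { lock.assert_not_held!("thread yield") }
  0
rescue TrackedVMLock::AssertionFailed
  1
end

puts failures_for(TrackedVMLock.new(multi_ractor: false))               # prints 1
puts failures_for(TrackedVMLock.new(multi_ractor: true))                # prints 1
puts failures_for(TrackedVMLock.new(multi_ractor: false, debug: false)) # prints 0
```

The first line of output is the interesting one: the assertion fires in single-ractor mode even though no real lock was acquired.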