From: Eric Wong Date: 2018-08-19T09:57:26+00:00 Subject: [ruby-core:88552] Re: [Ruby trunk Bug#14999] ConditionVariable doesn't reacquire the Mutex if Thread#kill-ed eregontp@gmail.com wrote: > I think we need to give some thinking on this, > and I don't want to stress you to fix it to fix the build. > (although this should be solved soon IMHO, latest before the next preview/RC) It needs to be fixed ASAP. r64467 seems to make CI happy, at least (along with r64398, which was an absolutely necessary fix). The key difference with r64467 fix is I figured out I needed to reproduce the failure reliably by using a single CPU (schedtool -a 0x1 -e ...) > So I propose to "tag" the spec so it won't run in CI, so you can focus on it when there is time. > I'll do that tomorrow if you agree. Please don't. CI is happy at the moment, at least. > It would be also be interesting to see if Ruby <= 2.5 actually > ensured the semantics I described above. I was able to reproduce the problem with the old timer-thread using "schedtool -a 0x1". "git bisect" points to r63498 ("thread_pthread.c: enable thread cache by default"), which is HIGHLY unexpected. I'm not seeing how thread caching can be the cause of the bug, but rather it makes an existing bug more apparent by being faster and hitting the race sooner. The thread cache only operates at the pthreads level, so there's no way any Ruby-level interrupt could hit across Ruby Thread lifetimes. pthread_kill is never used in the Mutex/ConditionVariable code paths; and different pthreads condvars are used for sleeping/wakeups. So it is a mystery as to how thread caching ends up affecting Ruby-level interrupt handling... What thread-caching does do is make the MSPECOPT "-R" option noticeably faster Fwiw, I can still reproduce the failures with low timeouts: ``` diff --git a/thread_pthread.c b/thread_pthread.c index 2fd60ddd4a..e8da3ee9c2 100644 --- a/thread_pthread.c +++ b/thread_pthread.c @@ -187,7 +187,7 @@ do_gvl_timer(rb_vm_t *vm, rb_thread_t *th) if (vm->gvl.timer_err == ETIMEDOUT) { ts.tv_sec = 0; - ts.tv_nsec = TIME_QUANTUM_NSEC; + ts.tv_nsec = 1; ts = native_cond_timeout(&nd->cond.gvlq, ts); } vm->gvl.timer = th; ``` So there's definitely more to investigate until I'm satisified with the reliability of this (but I'm too tired, now). Unsubscribe: