From: "kosaki (Motohiro KOSAKI)" Date: 2012-08-05T05:28:13+09:00 Subject: [ruby-core:46994] [ruby-trunk - Bug #6278] in `join': deadlock detected (fatal) Issue #6278 has been updated by kosaki (Motohiro KOSAKI). Good catch! For the record, I'd like to write a detail race. This race need three threads at least. 1) thread-A call rb_mutex_lock() and got mutex. 2) thread-B call rb_mutex_lock() and became sleeping in native_cond_wait(). 3) thread-A call rb_mutex_unlock() and call native_cond_signal(). 4) thread-A call rb_mutex_lock() and became sleeping in native_cond_wait(). 5) thread-C call rb_mutex_lock() and became sleeping in native_cond_timedwait(). i'm last one. 6) thread-B wake up by (3). 7) thread-B call rb_mutex_unlock() and call native_cond_signal(). 8) thread-B call rb_mutex_lock() and became sleeping in native_cond_timedwait(). i'm last one too. 9) thread-A and thread-B wake up. and try to get GVL at the same time. (btw, any thread combinations can make this race) Now, check_deadlock() from thread-A detect thread-B don't have a mutex and mutex->cond_waiting==0. However, the attached patch doesn't work correctly. When unlocking GVL, we can't touch th object. It's a part of ruby world. How is this? diff --git a/thread.c b/thread.c index d68fead..d0074a2 100644 --- a/thread.c +++ b/thread.c @@ -3913,6 +3913,13 @@ lock_interrupt(void *ptr) } /* + * At maximum, only one thread can use cond_timedwait and watch deadlock + * periodically. Multiple patrol thread (e.g. concurrent deadlock check) + * introduces new race conditions. + */ +rb_thread_t *patrol_thread = NULL; + +/* * call-seq: * mutex.lock -> self * @@ -3949,14 +3956,20 @@ rb_mutex_lock(VALUE self) * vm->sleepr is unstable value. we have to avoid both deadlock * and busy loop. */ - if (vm_living_thread_num(th->vm) == th->vm->sleeper) { + if ((vm_living_thread_num(th->vm) == th->vm->sleeper) && + !patrol_thread) { timeout_ms = 100; + patrol_thread = th; } + GVL_UNLOCK_BEGIN(); interrupted = lock_func(th, mutex, timeout_ms); native_mutex_unlock(&mutex->lock); GVL_UNLOCK_END(); + if (patrol_thread == th) + patrol_thread = NULL; + reset_unblock_function(th, &oldubf); th->locking_mutex = Qfalse; @@ -5199,6 +5212,7 @@ rb_check_deadlock(rb_vm_t *vm) if (vm_living_thread_num(vm) > vm->sleeper) return; if (vm_living_thread_num(vm) < vm->sleeper) rb_bug("sleeper must not be more than vm_living_threa + if (patrol_thread && patrol_thread != GET_THREAD()) return; ---------------------------------------- Bug #6278: in `join': deadlock detected (fatal) https://bugs.ruby-lang.org/issues/6278#change-28650 Author: kurt@intricatesoftware.com (Kurt Miller) Status: Assigned Priority: Normal Assignee: ko1 (Koichi Sasada) Category: Target version: ruby -v: ruby 1.9.3p125 (2012-02-16 revision 34643) [i386-openbsd] There is a race condition between native_sleep(), thread_start_func_2() where thread_start_func_2() holds the gvl lock, then in this bit of code: 514 /* wake up joining threads */ 515 join_th = th->join_list_head; 516 while (join_th) { 517 if (join_th == main_th) errinfo = Qnil; 518 rb_threadptr_interrupt(join_th); 519 switch (join_th->status) { 520 case THREAD_STOPPED: case THREAD_STOPPED_FOREVER: 521 printf("%p %d\n", join_th, join_th->status); 522 join_th->status = THREAD_RUNNABLE; 523 default: break; 524 } 525 join_th = join_th->join_list_next; 526 } 527 528 rb_threadptr_unlock_all_locking_mutexes(th); 529 if (th != main_th) rb_check_deadlock(th->vm); The other thread in native_sleep() is interrupted via the rb_threadptr_interrupt(join_th) call above. Then the join_th->status is set to THREAD_RUNNABLE above. The other thread blocks trying to get the gvl lock in native_sleep(): 914 GVL_UNLOCK_END(); Then the thread in thread_start_func_2() calls rb_check_deadlock(): 529 if (th != main_th) rb_check_deadlock(th->vm); which thinks the thread in native_sleep is deadlocked since it is not in the correct state. Here are the backtraces for the two threads, I added abort() in check_deadlock_i() when it decides there is a deadlock for a thread: (gdb) thread 1 [Switching to thread 1 (process 930810973)]#0 0x0d9b248d in kill () from /usr/lib/libc.so.63.0 (gdb) bt #0 0x0d9b248d in kill () from /usr/lib/libc.so.63.0 #1 0x0da1c925 in abort () at /usr/src/lib/libc/stdlib/abort.c:68 #2 0x0ddd5095 in check_deadlock_i (key=2081393040, val=695888864, found=0x7d300f38) at thread.c:4738 #3 0x0dd6b9ce in st_foreach (table=0x7c162d20, func=0xddd503f , arg=2100301624) at st.c:787 #4 0x0ddd526a in rb_check_deadlock (vm=0x7c034a00) at thread.c:4785 #5 0x0ddce8d1 in thread_start_func_2 (th=0x7c64be00, stack_start=0x7d301000) at thread.c:529 #6 0x0ddcd51e in thread_start_func_1 (th_ptr=0x7c64be00) at thread_pthread.c:653 #7 0x097a948e in _rthread_start (v=0x7c60f800) at rthread.c:113 #8 0x0d99ae81 in __tfork_thread () from /usr/lib/libc.so.63.0 (gdb) frame 5 #5 0x0ddce8d1 in thread_start_func_2 (th=0x7c64be00, stack_start=0x7d301000) at thread.c:529 529 if (th != main_th) rb_check_deadlock(th->vm); (gdb) thread 2 [Switching to thread 2 (process 207424605)]#0 0x0d9a9715 in _thread_sys___thrsleep () (gdb) bt #0 0x0d9a9715 in _thread_sys___thrsleep () from /usr/lib/libc.so.63.0 #1 0x097a7190 in pthread_cond_wait (condp=0x7c034a10, mutexp=0x7c034a08) at rthread_sync.c:473 #2 0x0ddcd0b4 in native_cond_wait (cond=0x7c034a10, mutex=0x7c034a08) at thread_pthread.c:304 #3 0x0ddccb15 in gvl_acquire_common (vm=0x7c034a00) at thread_pthread.c:61 #4 0x0ddccb7f in gvl_acquire (vm=0x7c034a00, th=0x7c034200) at thread_pthread.c:79 #5 0x0ddcd848 in native_sleep (th=0x7c034200, timeout_tv=0x0) at thread_pthread.c:914 #6 0x0ddcf2ea in sleep_forever (th=0x7c034200, deadlockable=1) at thread.c:856 #7 0x0ddcef97 in thread_join_sleep (arg=3485240628) at thread.c:689 #8 0x0dcbdda7 in rb_ensure (b_proc=0xddcef52 , data1=3485240628, e_proc=0xddceee6 , data2=3485240628) at eval.c:744 #9 0x0ddcf096 in thread_join (target_th=0x7c393600, delay=1e+30) at thread.c:722 #10 0x0ddcf191 in thread_join_m (argc=0, argv=0x7c05003c, self=2087355560) at thread.c:803 #11 0x0ddb7125 in call_cfunc (func=0xddcf113 , recv=2087355560, len=Variable "len" is not available. ) at vm_insnhelper.c:317 #12 0x0ddbf1bc in vm_call_method (th=0x7c034200, cfp=0x7c0cfef8, num=0, blockptr=0x1, flag=0, id=4736, me=0x7c166220, recv=2087355560) at vm_insnhelper.c:404 #13 0x0ddc2a60 in vm_exec_core (th=0x7c034200, initial=Variable "initial" is not available. ) at insns.def:1015 #14 0x0ddc9401 in vm_exec (th=0x7c034200) at vm.c:1220 #15 0x0ddc9bc0 in invoke_block_from_c (th=0x7c034200, block=0x7c0cff90, self=2081428100, argc=1, argv=0xcfbc9700, blockptr=0x0, cref=0x0) at vm.c:624 #16 0x0ddca167 in rb_yield (val=2087355560) at vm.c:654 #17 0x0dc8cad2 in rb_ary_each (array=2087355600) at array.c:1478 #18 0x0ddb710c in call_cfunc (func=0xdc8ca80 , recv=2087355600, len=Variable "len" is not available. ) at vm_insnhelper.c:320 #19 0x0ddbf1bc in vm_call_method (th=0x7c034200, cfp=0x7c0cff7c, num=0, blockptr=0x7c0cff91, flag=0, id=424, me=0x7c135940, recv=2087355600) at vm_insnhelper.c:404 #20 0x0ddc2a60 in vm_exec_core (th=0x7c034200, initial=Variable "initial" is not available. ) at insns.def:1015 #21 0x0ddc9401 in vm_exec (th=0x7c034200) at vm.c:1220 #22 0x0ddc95b7 in rb_iseq_eval_main (iseqval=2082861760) at vm.c:1461 #23 0x0dcbea29 in ruby_exec_internal (n=0x7c25f2c0) at eval.c:204 #24 0x0dcbea55 in ruby_exec_node (n=0x7c25f2c0) at eval.c:251 #25 0x0dcbfe2e in ruby_run_node (n=0x7c25f2c0) at eval.c:244 #26 0x1c000984 in main (argc=4, argv=0xcfbca128) at main.c:38 (gdb) frame 5 #5 0x0ddcd848 in native_sleep (th=0x7c034200, timeout_tv=0x0) at thread_pthread.c:914 914 GVL_UNLOCK_END(); $ ruby19 -v ruby 1.9.3p125 (2012-02-16 revision 34643) [i386-openbsd] To reproduce install ruby-sequel from http://sequel.rubyforge.org/ $ cat cpu_bench.rb require 'sequel' def a 1000.times{|j| Sequel.connect(:adapter=>:mock){}} end (0..10).map{Thread.new{a}}.each{|t| t.join} $ cat doit until [ $? -ne 0 ]; do ruby19 -I sequel/lib cpu_bench.rb done echo $? -- http://bugs.ruby-lang.org/