From: takashikkbn@... Date: 2018-10-27T05:10:57+00:00 Subject: [ruby-core:89567] [Ruby trunk Bug#14867][Assigned] Process.wait can wait for MJIT compiler process Issue #14867 has been updated by k0kubun (Takashi Kokubun). Status changed from Closed to Assigned Recently we see 2 types of deadlocks on CI with --jit or --jit-wait. ## 1. waitpid on #system, #`, or Process.waitpid http://ci.rvm.jp/results/trunk-mjit-wait@silicon-docker/1431775 http://ci.rvm.jp/results/trunk-mjit-wait@silicon-docker/1431394 http://ci.rvm.jp/results/trunk-mjit-wait@silicon-docker/1430875 The stack trace on deadlock is like this: ~~~ 0x00007fbd6585ecf6 in __GI_ppoll (fds=fds@entry=0x7ffd30485118, nfds=nfds@entry=1, timeout=, sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:39 >>> Machine level backtrace Thread 3 (Thread 0x7fbd66c90700 (LWP 30964)): #0 0x00007fbd667af384 in __libc_read (fd=6, buf=0x7fbd58047c70, nbytes=1024) at ../sysdeps/unix/sysv/linux/read.c:27 #1 0x000055ac93a5c433 in rb_thread_io_blocking_region (func=func@entry=0x55ac93925110 , data1=data1@entry=0x7fbd66c8f4c0, fd=) at /home/ko1/ruby/src/trunk-mjit-wait/thread.c:1529 #2 0x000055ac93924ebe in rb_read_internal (count=, buf=, fd=) at /home/ko1/ruby/src/trunk-mjit-wait/io.c:987 #3 read_internal_call (arg=arg@entry=140451449992880) at /home/ko1/ruby/src/trunk-mjit-wait/io.c:2732 #4 0x000055ac938fa0d3 in rb_ensure (b_proc=0x55ac93924e80 , data1=140451449992880, e_proc=e_proc@entry=0x55ac93a286a0 , data2=) at /home/ko1/ruby/src/trunk-mjit-wait/eval.c:1052 #5 0x000055ac93a3c490 in rb_str_locktmp_ensure (str=, func=func@entry=0x55ac93924e80 , arg=arg@entry=140451449992880) at /home/ko1/ruby/src/trunk-mjit-wait/string.c:2654 #6 0x000055ac9393b99b in io_getpartial (argc=, argv=, io=, opts=opts@entry=8, nonblock=nonblock@entry=0) at /home/ko1/ruby/src/trunk-mjit-wait/io.c:2782 #7 0x000055ac9393bf51 in io_readpartial (argc=, argv=, io=) at /home/ko1/ruby/src/trunk-mjit-wait/io.c:2872 #8 0x000055ac93a90455 in vm_call_cfunc_with_frame (ci=0x55ac95418c40, cc=, calling=, reg_cfp=0x7fbd5f34bfa0, ec=0x55ac97772258) at /home/ko1/ruby/src/trunk-mjit-wait/vm_insnhelper.c:1914 ... (snip) ... #16 rb_vm_invoke_proc (ec=, proc=proc@entry=0x55ac99d5f2e0, argc=, argv=, passed_block_handler=passed_block_handler@entry=0) at /home/ko1/ruby/src/trunk-mjit-wait/vm.c:1191 #17 0x000055ac93a59615 in thread_do_start (args=94199785452720, th=0x55ac97a8d040) at /home/ko1/ruby/src/trunk-mjit-wait/thread.c:667 #18 thread_start_func_2 (th=th@entry=0x55ac97a8d040, stack_start=) at /home/ko1/ruby/src/trunk-mjit-wait/thread.c:706 #19 0x000055ac93a59acb in thread_start_func_1 (th_ptr=) at /home/ko1/ruby/src/trunk-mjit-wait/thread_pthread.c:1017 #20 0x00007fbd667a56db in start_thread (arg=0x7fbd66c90700) at pthread_create.c:463 #21 0x00007fbd6586b88f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95 Thread 2 (Thread 0x7fbd65749700 (LWP 30944)): #0 0x00007fbd667ab9f3 in futex_wait_cancelable (private=, expected=0, futex_word=0x55ac93e39428 ) at ../sysdeps/unix/sysv/linux/futex-internal.h:88 #1 __pthread_cond_wait_common (abstime=0x0, mutex=0x55ac93e394c0 , cond=0x55ac93e39400 ) at pthread_cond_wait.c:502 #2 __pthread_cond_wait (cond=cond@entry=0x55ac93e39400 , mutex=mutex@entry=0x55ac93e394c0 ) at pthread_cond_wait.c:655 #3 0x000055ac93a566a9 in rb_native_cond_wait (cond=cond@entry=0x55ac93e39400 , mutex=mutex@entry=0x55ac93e394c0 ) at /home/ko1/ruby/src/trunk-mjit-wait/thread_pthread.c:496 #4 0x000055ac9395cc19 in copy_cache_from_main_thread (job=0x7fbd65748e60) at /home/ko1/ruby/src/trunk-mjit-wait/mjit_worker.c:1189 #5 mjit_worker () at /home/ko1/ruby/src/trunk-mjit-wait/mjit_worker.c:1247 #6 0x000055ac93a51d0a in mjit_worker (arg=0x55ac9395c770 ) at /home/ko1/ruby/src/trunk-mjit-wait/thread_pthread.c:1866 #7 0x00007fbd667a56db in start_thread (arg=0x7fbd65749700) at pthread_create.c:463 #8 0x00007fbd6586b88f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95 Thread 1 (Thread 0x7fbd66df3740 (LWP 30930)): #0 0x00007fbd6585ecf6 in __GI_ppoll (fds=fds@entry=0x7ffd30485118, nfds=nfds@entry=1, timeout=, sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:39 #1 0x000055ac93a58f5d in native_ppoll_sleep (th=th@entry=0x55ac952bde60, rel=rel@entry=0x0) at /home/ko1/ruby/src/trunk-mjit-wait/thread_pthread.c:2015 #2 0x000055ac93a5b22a in native_sleep (th=th@entry=0x55ac952bde60, rel=0x0) at /home/ko1/ruby/src/trunk-mjit-wait/thread_pthread.c:2051 #3 0x000055ac93a5baeb in rb_thread_sleep_interruptible () at /home/ko1/ruby/src/trunk-mjit-wait/thread.c:1270 #4 0x000055ac939bb4b5 in waitpid_sleep (x=x@entry=140725413499744) at /home/ko1/ruby/src/trunk-mjit-wait/process.c:1099 #5 0x000055ac938fa0d3 in rb_ensure (b_proc=b_proc@entry=0x55ac939bb4a0 , data1=data1@entry=140725413499744, e_proc=e_proc@entry=0x55ac939bb450 , data2=data2@entry=140725413499744) at /home/ko1/ruby/src/trunk-mjit-wait/eval.c:1052 #6 0x000055ac939c2cf9 in waitpid_wait (w=0x7ffd30485360) at /home/ko1/ruby/src/trunk-mjit-wait/process.c:1158 #7 rb_waitpid (pid=, st=0x7ffd304853c4, flags=) at /home/ko1/ruby/src/trunk-mjit-wait/process.c:1197 #8 0x000055ac939c4e6e in rb_f_system (argc=3, argv=0x7fbd66cf2338) at /home/ko1/ruby/src/trunk-mjit-wait/process.c:4371 #9 0x000055ac93a90455 in vm_call_cfunc_with_frame (ci=0x55ac98d41800, cc=, calling=, reg_cfp=0x7fbd66df1cc8, ec=0x55ac952be2e8) at /home/ko1/ruby/src/trunk-mjit-wait/vm_insnhelper.c:1914 ... (snip) ... #35 0x000055ac938f90fb in ruby_exec_node (n=) at /home/ko1/ruby/src/trunk-mjit-wait/eval.c:325 #36 ruby_run_node (n=) at /home/ko1/ruby/src/trunk-mjit-wait/eval.c:317 #37 0x000055ac938f41df in main (argc=, argv=) at /home/ko1/ruby/src/trunk-mjit-wait/main.c:42 >>> Dump Ruby level backtrace th: 0x55ac952bde60, native_id: 0x7fbd66df3740 -- Control frame information ----------------------------------------------- c:0016 p:---- s:0107 e:000106 CFUNC :system c:0015 p:0024 s:0100 e:000099 METHOD /home/ko1/ruby/src/trunk-mjit-wait/test/fileutils/test_fileutils.rb:148 c:0014 p:0098 s:0095 e:000094 METHOD /home/ko1/ruby/src/trunk-mjit-wait/test/fileutils/test_fileutils.rb:167 c:0013 p:0044 s:0090 e:000089 METHOD /home/ko1/ruby/src/trunk-mjit-wait/test/lib/minitest/unit.rb:1289 c:0012 p:0015 s:0081 e:000080 METHOD /home/ko1/ruby/src/trunk-mjit-wait/test/lib/test/unit/testcase.rb:18 c:0011 p:0082 s:0076 e:000075 BLOCK /home/ko1/ruby/src/trunk-mjit-wait/test/lib/minitest/unit.rb:952 [FINISH] c:0010 p:---- s:0069 e:000068 CFUNC :map c:0009 p:0133 s:0065 e:000064 METHOD /home/ko1/ruby/src/trunk-mjit-wait/test/lib/minitest/unit.rb:945 c:0008 p:0045 s:0053 e:000052 METHOD /home/ko1/ruby/src/trunk-mjit-wait/test/lib/test/unit.rb:1044 c:0007 p:0095 s:0046 E:0020b0 METHOD /home/ko1/ruby/src/trunk-mjit-wait/test/lib/test/unit/parallel.rb:50 c:0006 p:0009 s:0030 e:000029 BLOCK /home/ko1/ruby/src/trunk-mjit-wait/test/lib/test/unit/parallel.rb:25 [FINISH] c:0005 p:---- s:0026 e:000025 CFUNC :map c:0004 p:0006 s:0022 e:000021 METHOD /home/ko1/ruby/src/trunk-mjit-wait/test/lib/test/unit/parallel.rb:24 c:0003 p:0263 s:0016 E:001aa0 METHOD /home/ko1/ruby/src/trunk-mjit-wait/test/lib/test/unit/parallel.rb:122 c:0002 p:0128 s:0006 E:0021f0 EVAL /home/ko1/ruby/src/trunk-mjit-wait/test/lib/test/unit/parallel.rb:207 [FINISH] c:0001 p:0000 s:0003 E:0021f0 (none) [FINISH] th: 0x55ac97a8d040, native_id: 0x7fbd66c90700 -- Control frame information ----------------------------------------------- c:0003 p:---- s:0012 e:000011 CFUNC :readpartial c:0002 p:0036 s:0007 e:000006 BLOCK /home/ko1/ruby/src/trunk-mjit-wait/test/lib/test/unit/parallel.rb:39 [FINISH] c:0001 p:---- s:0003 e:000002 (none) [FINISH] ~~~ In this case, 3 threads are blocking in: 1. `rb_thread_io_blocking_region` called from `rb_read_internal` called from `io_readpartial` 2. `native_ppoll_sleep` called inside `rb_waitpid` 3. (MJIT worker) `rb_native_cond_wait` called from `copy_cache_from_main_thread` I think 3's lock is completely independent from blocking in 1 and 2, and I have no idea why 1 and 2 are blocking in that place forever. ## 2. in ruby_cleanup http://ci.rvm.jp/results/trunk-mjit@silicon-docker/1428895 http://ci.rvm.jp/results/trunk-mjit-wait@silicon-docker/1429135 The stack trace on deadlock is like this: ~~~ 0x00005587e954031c in verbose (level=3, format=format@entry=0x5587e973f692 "Unlocked %s", level=3) at /home/ko1/ruby/src/trunk-mjit/mjit_worker.c:289 289 verbose(int level, const char *format, ...) >>> Machine level backtrace Thread 3 (Thread 0x7fd7f439d700 (LWP 11123)): #0 0x00007fd7f3e57ed9 in futex_reltimed_wait_cancelable (private=, reltime=0x7fd7f439cd70, expected=0, futex_word=0x7fd7f439ce88) at ../sysdeps/unix/sysv/linux/futex-internal.h:142 #1 __pthread_cond_wait_common (abstime=0x7fd7f439ce50, mutex=0x5587e9a1f660 , cond=0x7fd7f439ce60) at pthread_cond_wait.c:533 #2 __pthread_cond_timedwait (cond=cond@entry=0x7fd7f439ce60, mutex=mutex@entry=0x5587e9a1f660 , abstime=abstime@entry=0x7fd7f439ce50) at pthread_cond_wait.c:667 #3 0x00005587e963eca8 in native_cond_timedwait (abs=, mutex=0x5587e9a1f660 , cond=0x7fd7f439ce60) at /home/ko1/ruby/src/trunk-mjit/thread_pthread.c:516 #4 register_cached_thread_and_wait (altstack=0x7fd7e0000b20) at /home/ko1/ruby/src/trunk-mjit/thread_pthread.c:1079 #5 thread_start_func_1 (th_ptr=) at /home/ko1/ruby/src/trunk-mjit/thread_pthread.c:1024 #6 0x00007fd7f3e516db in start_thread (arg=0x7fd7f439d700) at pthread_create.c:463 #7 0x00007fd7f2f1788f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95 Thread 2 (Thread 0x7fd7f2df5700 (LWP 17017)): #0 0x00007fd7f2f0acf6 in __GI_ppoll (fds=fds@entry=0x7fd7f2def8e8, nfds=nfds@entry=1, timeout=, sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:39 #1 0x00005587e963c757 in rb_sigwait_sleep (th=th@entry=0x0, sigwait_fd=sigwait_fd@entry=3, rel=rel@entry=0x0) at /home/ko1/ruby/src/trunk-mjit/thread_pthread.c:1959 #2 0x00005587e95a7a7c in ruby_waitpid_locked (vm=vm@entry=0x5587ea98a950, pid=pid@entry=11124, status=status@entry=0x7fd7f2def9dc, options=options@entry=0, cond=cond@entry=0x7fd7f2def9e0) at /home/ko1/ruby/src/trunk-mjit/process.c:1070 #3 0x00005587e953fd02 in exec_process (path=, argv=argv@entry=0x7fd7ec01ced0) at /home/ko1/ruby/src/trunk-mjit/mjit_worker.c:657 #4 0x00005587e9541276 in compile_c_to_o (o_file=0x7fd7f2defa80 "/tmp/_ruby_mjit_p17016u76.o", c_file=0x7fd7f2defb70 "/tmp/_ruby_mjit_p17016u76.c") at /home/ko1/ruby/src/trunk-mjit/mjit_worker.c:842 #5 convert_unit_to_func (unit=0x5587eaf601f0, cc_entries=, is_entries=) at /home/ko1/ruby/src/trunk-mjit/mjit_worker.c:1122 #6 0x00005587e9541af6 in mjit_worker () at /home/ko1/ruby/src/trunk-mjit/mjit_worker.c:1253 #7 0x00005587e9636dfa in mjit_worker (arg=0x5587e9541830 ) at /home/ko1/ruby/src/trunk-mjit/thread_pthread.c:1866 #8 0x00007fd7f3e516db in start_thread (arg=0x7fd7f2df5700) at pthread_create.c:463 #9 0x00007fd7f2f1788f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95 Thread 1 (Thread 0x7fd7f449f740 (LWP 17016)): #0 0x00005587e954031c in verbose (level=3, format=format@entry=0x5587e973f692 "Unlocked %s", level=3) at /home/ko1/ruby/src/trunk-mjit/mjit_worker.c:289 #1 0x00005587e95437a3 in CRITICAL_SECTION_FINISH (msg=0x5587e973f8e8 "in stop_worker", level=3) at /home/ko1/ruby/src/trunk-mjit/mjit_worker.c:448 #2 stop_worker () at /home/ko1/ruby/src/trunk-mjit/mjit.c:659 #3 mjit_finish () at /home/ko1/ruby/src/trunk-mjit/mjit.c:734 #4 0x00005587e94dde81 in ruby_cleanup (ex=) at /home/ko1/ruby/src/trunk-mjit/eval.c:236 #5 0x00005587e94de102 in ruby_run_node (n=) at /home/ko1/ruby/src/trunk-mjit/eval.c:317 #6 0x00005587e94d91df in main (argc=, argv=) at /home/ko1/ruby/src/trunk-mjit/main.c:42 >>> Dump Ruby level backtrace th: 0x5587ea98ae60, native_id: 0x7fd7f449f740 -- Control frame information ----------------------------------------------- c:0001 p:0000 s:0003 E:002360 (none) [FINISH] ~~~ In this case, 3 threads are blocking in: 1. `native_cond_timedwait` called from `register_cached_thread_and_wait` 2. (MJIT worker) `rb_sigwait_sleep` called from `ruby_waitpid_locked` called from `compile_c_to_o` 3. (main thread) looping inside `stop_worker` called from `ruby_cleanup` 1 looks innocent and ignoreable. In 2, somehow it seems to have lost the process to wait, or locked with VM's lock. If the situation is the former, sometimes this CI machine is overloaded and thus it may happen on such an environment. And if the situation is the latter, I have no idea why it's locked. @normalperson While I'm taking a look at this by myself, it would be helpful if you have any insight about this, since `waitpid` thing is involved in both issues. For now I have no approach to fix this because it's not reproductive on my local machine... ---------------------------------------- Bug #14867: Process.wait can wait for MJIT compiler process https://bugs.ruby-lang.org/issues/14867#change-74615 * Author: k0kubun (Takashi Kokubun) * Status: Assigned * Priority: Normal * Assignee: normalperson (Eric Wong) * Target version: * ruby -v: * Backport: 2.3: UNKNOWN, 2.4: UNKNOWN, 2.5: UNKNOWN ---------------------------------------- If Ruby tries to wait for any child process, MJIT's gcc/clang process could be caught by the method call. It's not convenient for both Ruby's user and MJIT worker thread, so Process.wait and its families should somehow avoid waiting for it. ---Files-------------------------------- 0001-hijack-SIGCHLD-handler-for-internal-use.patch (13.8 KB) JIT-test-all.log (39.9 KB) mjit_test-all_63796.log (40.4 KB) config_ruby-loco_mingw.log (27 KB) test_jit_results.txt (41.2 KB) -- https://bugs.ruby-lang.org/ Unsubscribe: