From: Eric Wong
Date: 2018-10-27T08:51:17+00:00
Subject: [ruby-core:89571] Re: [Ruby trunk Bug#14867][Assigned] Process.wait can wait for MJIT compiler process

takashikkbn@gmail.com wrote:
> In this case, 3 threads are blocking in:
>
> 1. `rb_thread_io_blocking_region` called from `rb_read_internal` called from `io_readpartial`
> 2. `native_ppoll_sleep` called inside `rb_waitpid`
> 3. (MJIT worker) `rb_native_cond_wait` called from `copy_cache_from_main_thread`

rb_postponed_job_register only sets a flag, but doesn't wake up
the sleeping thread in 1. or 2. by calling ubf.func (via
rb_threadptr_interrupt).

This is a tricky situation...  Calling ubf.func is NOT
async-signal-safe, so rb_postponed_job_register may not use it
by default, either.

Also, the ruby_current_execution_context_ptr variable is unstable
between setting ec->interrupt_flag (via
RUBY_VM_SET_POSTPONED_JOB_INTERRUPT) and the ubf.func calls, since
we make them without GVL.

This is a similar situation to [Bug #14939] r64062

> I think 3's lock is completely independent from blocking in 1
> and 2, and I have no idea why 1 and 2 are blocking in that
> place forever.

I'm not sure if rb_postponed_job_register is the right tool in
a multi-threaded situation.  It seems like the "postponed" part
is a bad fit for MJIT anyways.

Anyways, the above issue is pretty straightforward, I think.

> ## 2. in ruby_cleanup
>
> In this case, 3 threads are blocking in:

Not sure about this one, yet:

> 1. `native_cond_timedwait` called from `register_cached_thread_and_wait`
> 2. (MJIT worker) `rb_sigwait_sleep` called from `ruby_waitpid_locked` called from `compile_c_to_o`
> 3. (main thread) looping inside `stop_worker` called from `ruby_cleanup`
>
> 1 looks innocent and ignoreable.

Not sure, this is a timeout situation?  THREAD_CACHE_TIME is only
3 seconds, so I think the cache entry would've timed out if a
whole test hits timeout.

> In 2, somehow it seems to have lost the process to wait, or
> locked with VM's lock. If the situation is the former,
> sometimes this CI machine is overloaded and thus it may happen
> on such an environment. And if the situation is the latter, I
> have no idea why it's locked.

Can you tell if the process 2. is waiting on is still a zombie?

To debug, maybe always return &busy_wait from sigwait_sleep_time
and check the contents of vm->waiting_pids periodically.  You
may also periodically kill(pid, 0) to see if the process is
killable.
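
A rough sketch of the kill(pid, 0) check, in case it helps.  This is
standalone C, not code from the Ruby tree: probe_pid, probe_demo and
the enum are made-up names for illustration.  It relies on the POSIX
behaviour that signal 0 only performs the existence/permission check,
and that an exited but unreaped (zombie) child still counts as
existing until waitpid collects it.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper for debugging, not part of Ruby. */
enum probe_result { PROBE_EXISTS, PROBE_GONE, PROBE_ERROR };

/* Ask the kernel whether `pid` still names a process.
 * Signal 0 delivers nothing; only the error checking is performed. */
static enum probe_result probe_pid(pid_t pid)
{
    if (kill(pid, 0) == 0)
        return PROBE_EXISTS;   /* running, or a zombie not yet reaped */
    if (errno == EPERM)
        return PROBE_EXISTS;   /* exists, but we may not signal it */
    if (errno == ESRCH)
        return PROBE_GONE;     /* already reaped (or never existed) */
    return PROBE_ERROR;
}

/* Demonstration: the child is visible until the parent reaps it. */
static void probe_demo(void)
{
    pid_t pid = fork();
    assert(pid >= 0);
    if (pid == 0)
        _exit(0);                            /* child exits immediately */

    /* Running or zombie, the pid is still probe-able here. */
    assert(probe_pid(pid) == PROBE_EXISTS);

    /* Once waitpid reaps it, the pid is gone. */
    assert(waitpid(pid, NULL, 0) == pid);
    assert(probe_pid(pid) == PROBE_GONE);
}
```

Note that PROBE_EXISTS alone can't distinguish a live process from a
zombie; if the pid is still probe-able while the worker sleeps
forever, that would at least suggest nobody has reaped it yet.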