From: xkernigh@...
Date: 2017-09-30T03:21:49+00:00
Subject: [ruby-core:83064] [Ruby trunk Bug#13794] Infinite loop of sched_yield

Issue #13794 has been updated by kernigh (George Koehler).


Gregory Potamianos wrote:
> `while true; do nice -n19 ruby sched_yield_loop.rb; [ $? -ne 0 ] && break; done`

With your script, it is easy to reproduce the bug.
I shortened your shell loop to
`while nice -n19 ruby sched_yield_loop.rb; do :; done`
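
For anyone who prefers to drive the reproduction from Ruby instead of the shell, here is a minimal sketch of the same loop. The helper name `run_until_failure` and the `max_runs` cap are my own inventions for illustration; they are not part of Gregory's script.

```ruby
# Hypothetical helper (not from the bug report): rerun a command until it
# exits non-zero, and return how many runs that took.
def run_until_failure(cmd, max_runs: 1000)
  1.upto(max_runs) do |n|
    # Kernel#system returns true only when the command exits with status 0.
    return n unless system(*cmd)
  end
  nil # the command never failed within max_runs attempts
end

# Intended use, mirroring the shell loop above:
#   run_until_failure(%w[nice -n19 ruby sched_yield_loop.rb])
```

The cap only keeps the loop from running forever on a machine where the bug never shows up.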

I tested both patches by Eric Wong,
the weaker version with PID check
> https://80x24.org/spew/20170809232533.14932-1-e@80x24.org/raw

and the stronger version with PID check and zeroing .writing
> https://80x24.org/spew/20170828232657.GA22848@dcvr/raw

I used this Ruby:
```
$ ruby -v
ruby 2.5.0dev (2017-09-30 trunk 60064) [x86_64-openbsd6.1]
```

The shell loop running sched_yield_loop.rb can run for a few minutes before the bug happens. It happens when sched_yield_loop.rb raises a timeout error; then I find a child Ruby spinning the CPU, as Gregory Potamianos described. Gregory, running Debian, reported that the weaker patch seems to fix the bug. I, running OpenBSD, observe that neither patch fixes the bug. I still get the timeout error and the spinning child whether Ruby is unpatched, has the weaker patch, or has the stronger patch.

But I might have found a different bug. I sent SIGABRT (`kill -ABRT`) to a spinning child and loaded the core dump into gdb; both threads appeared to be stuck inside OpenBSD's thread library. The main thread was stuck in pthread_join(), and the other thread was stuck in _rthread_tls_destructors(). I did not find any thread stuck in the loop `while (ATOMIC_CAS(timer_thread_pipe.writing, (rb_atomic_t)0, 0))` identified by Charlie Smurthwaite in the original bug report.

Anyone can use Gregory's sched_yield_loop.rb to check for the bug. If the weaker patch from Eric Wong fixes the bug on Linux, I suggest putting the weaker patch in trunk and backporting it to older Ruby versions.

----------------------------------------
Bug #13794: Infinite loop of sched_yield
https://bugs.ruby-lang.org/issues/13794#change-67001

* Author: catphish (Charlie Smurthwaite)
* Status: Open
* Priority: Normal
* Assignee: 
* Target version: 
* ruby -v: ruby 2.3.4p301 (2017-03-30 revision 58214) [x86_64-linux]
* Backport: 2.2: UNKNOWN, 2.3: UNKNOWN, 2.4: UNKNOWN
----------------------------------------
I have been encountering an issue with processes hanging in an infinite loop of calling sched_yield(). The looping code can be found at https://github.com/ruby/ruby/blob/v2_3_4/thread_pthread.c#L1663

while (ATOMIC_CAS(timer_thread_pipe.writing, (rb_atomic_t)0, 0)) {
  native_thread_yield();
}

It is my belief that, by some mechanism I have not been able to identify, timer_thread_pipe.writing is incremented but never decremented, causing this loop to run forever.

I am not able to create a reproducible test case; however, this issue occurs regularly in my production application. I have attached backtraces and thread lists from 2 processes exhibiting this behaviour. gdb confirms that timer_thread_pipe.writing = 1 in these processes.

I believe one possible cause is that rb_thread_wakeup_timer_thread() or rb_thread_wakeup_timer_thread_low() is called and, before it returns, another thread calls fork(). The child inherits the incremented value of timer_thread_pipe.writing, but not the thread that would normally decrement it.
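
The suspected race can be imitated in plain Ruby. This is only an analogy and not the interpreter's C code: the local variable `writing` stands in for timer_thread_pipe.writing, and the method name is made up for illustration. A thread increments the counter and pauses before decrementing it; the main thread forks during that pause, and the child reports the value it inherited.

```ruby
# Analogy only: `writing` plays the role of timer_thread_pipe.writing.
def writing_seen_by_forked_child
  writing = 0
  entered = Queue.new
  release = Queue.new

  writer = Thread.new do
    writing += 1       # like the increment made before writing to the pipe
    entered << true    # signal that we are now "mid-write"
    release.pop        # stay there until the main thread lets us go
    writing -= 1       # the decrement that never happens in the child
  end

  entered.pop          # wait until writing == 1
  rd, wr = IO.pipe
  pid = fork do
    rd.close
    # Only the thread that called fork exists in the child, so nobody
    # is left to run the decrement above.
    sleep 0.1
    wr.write(writing.to_s)
    wr.close
    exit!(0)
  end
  wr.close
  child_view = rd.read
  rd.close
  Process.wait(pid)

  release << true      # the parent's writer finishes normally
  writer.join
  [child_view, writing]
end
```

The parent ends with the counter back at 0, while the child still sees the incremented value it was forked with.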

If this is correct, one solution would be to reset timer_thread_pipe.writing to 0 in native_reset_timer_thread() immediately after a fork.
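
As a rough sketch of why such a reset would help, the Ruby below models the idea under the same assumptions as the hypothesis above; it is not the actual patch, and `child_spin_result` and `reset_after_fork` are hypothetical names. The child runs a bounded version of the wait loop, yielding with Thread.pass, and reports whether the loop ended because the counter reached 0.

```ruby
# Analogy only: models "reset the counter in the child right after fork".
def child_spin_result(reset_after_fork:)
  writing = 0
  entered = Queue.new
  release = Queue.new
  writer = Thread.new do
    writing += 1
    entered << true
    release.pop
    writing -= 1
  end
  entered.pop                           # fork while writing == 1

  rd, wr = IO.pipe
  pid = fork do
    rd.close
    writing = 0 if reset_after_fork     # the proposed reset
    spins = 0
    # Bounded stand-in for the wait loop; the real loop has no limit.
    while writing != 0 && spins < 1000
      Thread.pass
      spins += 1
    end
    wr.write(writing.zero? ? "done" : "stuck")
    wr.close
    exit!(0)
  end
  wr.close
  result = rd.read
  rd.close
  Process.wait(pid)

  release << true
  writer.join
  result
end
```

Without the reset the child exhausts its spin budget, which corresponds to the endless sched_yield() loop; with the reset it leaves the loop immediately.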

Other examples of similar bugs being reported:
https://github.com/resque/resque/issues/578
https://github.com/zk-ruby/zk/issues/50

---Files--------------------------------
backtrace_1.txt (14 KB)
backtrace_2.txt (10.9 KB)
sched_yield_1.patch (738 Bytes)
sched_yield_loop.rb (212 Bytes)


-- 
https://bugs.ruby-lang.org/