ruby-core

Issue #21571 has been updated by byroot (Jean Boussier).

Status changed from Open to Rejected

> there is only one thread for most of the processes I inspect.

In the child. But the most likely explanation here is that there are multiple threads in the parent that forks the children. And I highly suspect one of these threads occasionally makes HTTPS requests or uses OpenSSL in some other way.

If you happen to fork at the wrong time, when one of these threads holds a global mutex in OpenSSL, the child might deadlock if it tries to acquire that same mutex, because the thread that held it does not exist in the child, so the mutex is never released there.

In other words, this isn't a Ruby bug, but an application one.

A few suggestions though:

A quick and dirty workaround is to exit your child with `exit!`, so that exit handlers aren't run. That should "fix" your issue at hand, but could have other adverse effects, since `at_exit` hooks in the child are skipped too.
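
As a rough illustration of that workaround, here is a minimal sketch. The helper name `run_job_in_child` is hypothetical and not part of delayed_job or Rails, and the block stands in for whatever the child needs to do (for example the `establish_connection` and `execute_job` calls from the report). It assumes it is acceptable for the child to skip every exit handler, including any `at_exit` hooks.

```ruby
# Hypothetical helper: runs the given block in a forked child and ends
# the child with exit!, so no exit handlers (Ruby at_exit hooks or
# C-level atexit handlers such as OpenSSL's cleanup) run there.
def run_job_in_child(&job)
  pid = Process.fork do
    status =
      begin
        job.call
        0
      rescue Exception => e
        # Catch everything (including SystemExit) so that the child
        # always reaches exit! below instead of the normal exit path.
        warn "job failed: #{e.class}: #{e.message}"
        1
      end
    # exit! does not flush Ruby's buffered IO, so flush explicitly.
    $stdout.flush
    $stderr.flush
    Process.exit!(status)
  end
  _pid, process_status = Process.wait2(pid)
  process_status
end
```

Calling `run_job_in_child { execute_job }` returns a `Process::Status`, so the parent can still tell whether the job succeeded.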

A cleaner fix is to find that background thread in the parent, and synchronize it to ensure it's at a safepoint when you fork your children, or simply to eliminate it.
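
One possible way to get such a safepoint is sketched below, assuming you control the background thread's code. `FORK_LOCK`, `with_fork_lock` and `fork_at_safepoint` are hypothetical names for illustration. The idea is that the background thread wraps its HTTPS/OpenSSL work in the lock, and the parent takes the same lock around `fork`, so a fork can never land in the middle of that work.

```ruby
# Hypothetical process-wide lock shared by the forking code and the
# background thread.
FORK_LOCK = Mutex.new

# The background thread wraps its HTTPS/OpenSSL work in this helper.
def with_fork_lock
  FORK_LOCK.synchronize { yield }
end

# Forks only while no thread is inside with_fork_lock, then waits for
# the child and returns its Process::Status.
def fork_at_safepoint(&job)
  pid = FORK_LOCK.synchronize do
    Process.fork do
      # The child starts with the lock held by its only surviving
      # thread; release it so job code may use with_fork_lock too.
      FORK_LOCK.unlock if FORK_LOCK.owned?
      job.call
    end
  end
  _pid, status = Process.wait2(pid)
  status
end
```

This only helps if every use of OpenSSL by the background thread happens inside `with_fork_lock`. Locks that a library takes outside that section are not covered.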

However, note that:

> we implemented this a long time ago because Ruby never gives up any memory that it takes

Isn't true. Ruby will free pages that are fully empty. It is true that fragmentation can sometimes mean more pages than you'd like remain held, but it's not that terrible. Also, this ephemeral forking means the VM never has a chance to warm up, and the same goes for YJIT. So I'd really suggest reconsidering that choice.
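
If you want to check this on your own workload, a small sketch along these lines may help. The helper `heap_page_stats` is hypothetical, and the two `GC.stat` keys are the ones I'd expect on current CRuby; keys missing on a given version are simply left out of the result. How many pages actually get released depends on fragmentation and GC tuning, so the numbers will vary.

```ruby
# Hypothetical helper: picks the page counters of interest out of
# GC.stat. Hash#slice silently skips keys this Ruby doesn't report.
def heap_page_stats
  GC.stat.slice(:heap_allocated_pages, :total_freed_pages)
end

before = heap_page_stats

# Allocate a burst of short-lived objects, then drop the reference.
junk = Array.new(200_000) { |i| "string #{i}" }
peak = heap_page_stats
junk = nil

# Force a full GC so that fully empty pages become eligible for release.
GC.start
after = heap_page_stats

puts "before: #{before}"
puts "peak:   #{peak}"
puts "after:  #{after}"
```

Comparing `total_freed_pages` before and after gives a rough idea of how many pages were handed back.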




----------------------------------------
Bug #21571: Ruby forked process sporadically hanging on exit
https://bugs.ruby-lang.org/issues/21571#change-114555

* Author: dmorner (Daniel Orner)
* Status: Rejected
* ruby -v: ruby 3.4.5 (2025-07-16 revision 20cda200d3) +YJIT +PRISM [x86_64-linux]
* Backport: 3.2: UNKNOWN, 3.3: UNKNOWN, 3.4: UNKNOWN
----------------------------------------
This is my first bug report, so please let me know if there's anything I can do to improve it.

We have a production-grade Rails app that's been running for many years. We recently moved to EKS and upgraded it to the latest Ruby and Rails. We have a number of delayed_job processes that fork on every job that comes in so that the OS can reclaim the memory used in executing it (we implemented this a long time ago because Ruby never gives up any memory that it takes, and some jobs use way more memory than others).

In the last couple of weeks, we've noticed a rare occurrence where the delayed job hangs when exiting. The code looks like this:

<pre>
    Process.fork do
      ActiveRecord::Base.establish_connection
      execute_job
    end
    Process.wait
</pre>

The forked child process doesn't exit when this bug occurs; it's just stuck forever, doing nothing.

Obviously I don't have a way to reproduce this because it happens maybe once every few thousand jobs, and it happens across all job types.

If I run gdb on the child process, I always see something that looks like this (note: I am a total gdb newbie):

<pre>
#0  __futex_abstimed_wait_common
    (futex_word=futex_word@entry=0x7fb6af41400c, expected=expected@entry=3, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=<optimized out>, cancel=cancel@entry=false) at ./nptl/futex-internal.c:103
#1  0x00007fb6d5677f68 in __GI___futex_abstimed_wait64
    (futex_word=futex_word@entry=0x7fb6af41400c, expected=expected@entry=3, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=<optimized out>) at ./nptl/futex-internal.c:128
#2  0x00007fb6d568138c in __pthread_rwlock_wrlock_full64 (abstime=0x0, clockid=0, rwlock=0x7fb6af414000) at ./nptl/pthread_rwlock_common.c:730
#3  ___pthread_rwlock_wrlock (rwlock=0x7fb6af414000) at ./nptl/pthread_rwlock_wrlock.c:26
#4  0x00007fb6aee22989 in CRYPTO_THREAD_write_lock () at /lib/x86_64-linux-gnu/libcrypto.so.3
#5  0x00007fb6aee15c6a in  () at /lib/x86_64-linux-gnu/libcrypto.so.3
#6  0x00007fb6aee15fa9 in OPENSSL_thread_stop () at /lib/x86_64-linux-gnu/libcrypto.so.3
#7  0x00007fb6aee153b5 in OPENSSL_cleanup () at /lib/x86_64-linux-gnu/libcrypto.so.3
#8  0x00007fb6d563055d in __run_exit_handlers
    (status=0, listp=0x7fb6d57c5820 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true)
    at ./stdlib/exit.c:116
#9  0x00007fb6d563069a in __GI_exit (status=<optimized out>) at ./stdlib/exit.c:146
#10 0x00007fb6d5ad3a80 in ruby_stop (ex=<optimized out>) at eval.c:290
#11 0x00007fb6d5bc47b4 in rb_f_fork (obj=<optimized out>) at process.c:4388
#12 rb_f_fork (obj=<optimized out>) at process.c:4378
#13 0x00007fb6d5cad5cc in vm_call_cfunc_with_frame_
    (stack_bottom=<optimized out>, argv=<optimized out>, argc=0, calling=<optimized out>, reg_cfp=0x7fb6d4f68280, ec=0x7fb6d4e4d550)
    at /usr/src/ruby/vm_insnhelper.c:3794
#14 vm_call_cfunc_with_frame (ec=0x7fb6d4e4d550, reg_cfp=0x7fb6d4f68280, calling=<optimized out>) at /usr/src/ruby/vm_insnhelper.c:3840
#15 0x00007fb6d5cb3fef in vm_sendish
    (ec=0x7fb6d4e4d550, reg_cfp=0x7fb6d4f68280, cd=0x7fb69fb17650, block_handler=<optimized out>, method_explorer=mexp_search_method)
    at /usr/src/ruby/vm_callinfo.h:415
#16 0x00007fb6d5cc1e59 in vm_exec_core (ec=0x7fb6af41400c, ec@entry=0x7fb6d4e4d550) at /usr/src/ruby/insns.def:851
#17 0x00007fb6d5cc7ba9 in rb_vm_exec (ec=0x7fb6d4e4d550) at vm.c:2595
#18 0x00007fb6b13e73b9 in  ()
#19 0x00007fb6d4f68328 in  ()
...etc, I can paste more if needed
</pre>

I can't seem to get `call rb_backtrace()` working in gdb, it never prints anything.

This seems to indicate that there's some kind of thread lock when OpenSSL is shutting down. The crazy thing is that **there is only one thread** for most of the processes I inspect.

Any help would be greatly appreciated!



-- 
https://bugs.ruby-lang.org/
______________________________________________
 ruby-core mailing list -- ruby-core@ml.ruby-lang.org
 To unsubscribe send an email to ruby-core-leave@ml.ruby-lang.org
 ruby-core info -- https://ml.ruby-lang.org/mailman3/lists/ruby-core.ml.ruby-lang.org/
