From: ko1@... Date: 2021-01-29T09:20:50+00:00 Subject: [ruby-core:102295] [Ruby master Bug#17573] Crashes in profiling tools when signals arrive in non-Ruby threads Issue #17573 has been updated by ko1 (Koichi Sasada). Assignee set to ko1 (Koichi Sasada) ---------------------------------------- Bug #17573: Crashes in profiling tools when signals arrive in non-Ruby threads https://bugs.ruby-lang.org/issues/17573#change-90150 * Author: jhawthorn (John Hawthorn) * Status: Open * Priority: Normal * Assignee: ko1 (Koichi Sasada) * ruby -v: ruby 3.0.0p0 (2020-12-25 revision 95aff21468) [x86_64-darwin19] * Backport: 2.5: UNKNOWN, 2.6: UNKNOWN, 2.7: UNKNOWN, 3.0: UNKNOWN ---------------------------------------- Stackprof (and likely similar tools) works by setting up a timer to sends it a unix signal on an interval. From that signal handler it does a small amount of internal bookkeeping and calls `rb_postponed_job_register_one`. This is a problem because unix signals arrive on an arbitrary thread, and as of Ruby 3.0 the execution context (which `rb_postponed_job_register_one` relies on) is stored as a thread-local. This reproduction crashes reliably for me on macos. It doesn't seem to on linux, maybe because the timer thread is different or the kernel has a different "arbitrary" choice. It feels like this is just one of the circumstances this crash could happen. ```ruby require "stackprof" StackProf.run(interval: 100) do 1000.times do GC.start end end ``` ``` $ ruby crash_stackprof.rb [BUG] Segmentation fault at 0x0000000000000038 ruby 3.0.0p0 (2020-12-25 revision 95aff21468) [x86_64-darwin19] -- Crash Report log information -------------------------------------------- See Crash Report log file under the one of following: * ~/Library/Logs/DiagnosticReports * /Library/Logs/DiagnosticReports for more details. Don't forget to include the above Crash Report log file in bug reports. -- Machine register context ------------------------------------------------ rax: 0x0000000000000000 rbx: 0x0000000107fbb780 rcx: 0x0000000000000000 rdx: 0x0000000000000000 rdi: 0x0000000106982c28 rsi: 0x0000000107fbb780 rbp: 0x000070000eb47a10 rsp: 0x000070000eb479f0 r8: 0x000070000eb47eb0 r9: 0xd44931e7344c235f r10: 0x00007fff6ef49501 r11: 0x0000000000000202 r12: 0xd44931e7344c235f r13: 0x00000000ffffffff r14: 0x0000000000000000 r15: 0x0000000000000000 rip: 0x00000001068c85fd rfl: 0x0000000000010202 -- C level backtrace information ------------------------------------------- /Users/jhawthorn/.rubies/ruby-3.0.0/bin/ruby(rb_vm_bugreport+0x6cf) [0x1068c2d5f] /Users/jhawthorn/.rubies/ruby-3.0.0/bin/ruby(rb_bug_for_fatal_signal+0x1d6) [0x1066dc556] /Users/jhawthorn/.rubies/ruby-3.0.0/bin/ruby(sigsegv+0x5b) [0x10681aa0b] /usr/lib/system/libsystem_platform.dylib(_sigtramp+0x1d) [0x7fff6efff5fd] /Users/jhawthorn/.rubies/ruby-3.0.0/bin/ruby(rb_postponed_job_register_one+0x1d) [0x1068c85fd] /usr/lib/system/libsystem_platform.dylib(0x7fff6efff5fd) [0x7fff6efff5fd] ``` `0x38` is the address of `((rb_execution_context_t *)0)->vm`. lldb shows that it comes from a second thread which was running `timer_pthread_fn` ``` $ lldb =ruby -- ./crash_stackprof.rb (lldb) target create "/Users/jhawthorn/.rubies/ruby-3.0.0/bin/ruby" ruCurrent executable set to '/Users/jhawthorn/.rubies/ruby-3.0.0/bin/ruby' (x86_64). (lldb) settings set -- target.run-args "./crash_stackprof.rb" (lldb) run Process 92893 launched: '/Users/jhawthorn/.rubies/ruby-3.0.0/bin/ruby' (x86_64) Process 92893 stopped * thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGALRM frame #0: 0x00000001000dfbcd ruby`rgengc_check_relation [inlined] RVALUE_OLD_P_RAW(obj=4303689480) at gc.c:1419:32 1416 RVALUE_OLD_P_RAW(VALUE obj) 1417 { 1418 const VALUE promoted = FL_PROMOTED0 | FL_PROMOTED1; -> 1419 return (RBASIC(obj)->flags & promoted) == promoted; 1420 } 1421 1422 static inline int thread #2, stop reason = EXC_BAD_ACCESS (code=1, address=0x38) frame #0: 0x000000010029a5fd ruby`rb_postponed_job_register_one(flags=3492904, func=(stackprof.bundle`stackprof_gc_job_handler at stackprof.c:598), data=0x0000000000000000) at vm_trace.c:1622:19 1619 rb_postponed_job_register_one(unsigned int flags, rb_postponed_job_func_t func, void *data) 1620 { 1621 rb_execution_context_t *ec = GET_EC(); -> 1622 rb_vm_t *vm = rb_ec_vm_ptr(ec); 1623 rb_postponed_job_t *pjob; 1624 rb_atomic_t i, index; 1625 Target 0: (ruby) stopped. (lldb) t 2 * thread #2 frame #0: 0x000000010029a5fd ruby`rb_postponed_job_register_one(flags=3492904, func=(stackprof.bundle`stackprof_gc_job_handler at stackprof.c:598), data=0x0000000000000000) at vm_trace.c:1622:19 1619 rb_postponed_job_register_one(unsigned int flags, rb_postponed_job_func_t func, void *data) 1620 { 1621 rb_execution_context_t *ec = GET_EC(); -> 1622 rb_vm_t *vm = rb_ec_vm_ptr(ec); 1623 rb_postponed_job_t *pjob; 1624 rb_atomic_t i, index; 1625 (lldb) bt * thread #2 * frame #0: 0x000000010029a5fd ruby`rb_postponed_job_register_one(flags=3492904, func=(stackprof.bundle`stackprof_gc_job_handler at stackprof.c:598), data=0x0000000000000000) at vm_trace.c:1622:19 frame #1: 0x00007fff6efff5fd libsystem_platform.dylib`_sigtramp + 29 frame #2: 0x00007fff6ef4e3d7 libsystem_kernel.dylib`poll + 11 frame #3: 0x0000000100238e1e ruby`timer_pthread_fn(p=) at thread_pthread.c:2189:15 frame #4: 0x00007fff6f00b109 libsystem_pthread.dylib`_pthread_start + 148 frame #5: 0x00007fff6f006b8b libsystem_pthread.dylib`thread_start + 15 ``` Attached is my attempted fix (also available at https://github.com/ruby/ruby/pull/4108) which uses the main-ractor's EC if there is none on the current thread. I *hope* this works (it seems to and fixes the crash) because before Ruby 3.0 there was a global EC, but I'm not entirely sure if this will cause other problems. If accepted this should be backported to the 3.0 branch. ---Files-------------------------------- use_main_ractor_ec_on_threads_without_ec.patch (3.28 KB) use_main_ractor_ec_on_threads_without_ec.patch (3.84 KB) -- https://bugs.ruby-lang.org/ Unsubscribe: