From: samuel@...
Date: 2019-07-12T06:17:21+00:00
Subject: [ruby-core:93713] [Ruby master Feature#15997] Improve performance of fiber creation by using pool allocation strategy.

Issue #15997 has been updated by ioquatix (Samuel Williams).

> I know you got measurements. please share us.

I added `show_limit` to the bootstrap test so we can see the limit on all platforms. However, all platforms I tested could allocate 10,000 fibers easily, e.g. all builds on Travis, AppVeyor, etc. When we explored increasing the fiber stack size (to the same as the thread stack size), we did create some problems for 32-bit platforms.

On Linux, we can artificially limit the memory (e.g. 4GB) to see how the behaviour changes.

```
2.7.0-fiber-pool $ bash -c "ulimit -v 4000000; ./ruby --disable-gems ./count.rb"
... snip ...
0.059s to create 5113 fibers [GC.count=0]
./count.rb:16:in `resume': can't alloc machine stack to fiber (1024 x 659456 bytes): Cannot allocate memory (FiberError)
```

```
2.6.3 $ bash -c "ulimit -v 4000000; ./ruby --disable-gems ./count.rb"
... snip ...
0.119s to create 6118 fibers [GC.count=0]
./count.rb:16:in `resume': can't alloc machine stack to fiber: Cannot allocate memory (FiberError)
```

The main concern I had for the 32-bit implementation is the fiber pool consuming all address space. Well, 32-bit address space is very limited. There is a simple fix for this if it's a major blocking point: we can revert back to individual fiber allocation and deallocation. It's actually straightforward to implement, since all fibers now just use two functions: `fiber_pool_stack_acquire` and `fiber_pool_stack_release`. We can just replace these with direct `mmap` and `munmap`. I didn't bother because I don't know if it's a problem in reality or just theoretical.

Regarding upper limits, I tested a more extreme case. I could allocate 4 million fibers in about 2 minutes on my server (same specs as listed in the summary), and it used 2.4TB of address space and 50GB of actual memory.
This is with GC disabled, so it's not exactly a realistic test, but it does show some kind of upper limit.

----------------------------------------
Feature #15997: Improve performance of fiber creation by using pool allocation strategy.
https://bugs.ruby-lang.org/issues/15997#change-79346

* Author: ioquatix (Samuel Williams)
* Status: Open
* Priority: Normal
* Assignee: ko1 (Koichi Sasada)
* Target version:
----------------------------------------
https://github.com/ruby/ruby/pull/2224

This PR improves the performance of fiber allocation and reuse by implementing a better stack cache.

The fiber pool manages a singly linked list of fiber pool allocations. Each fiber pool allocation contains 1 or more stacks (typically more, e.g. 512). It uses an exponential (2^N) allocation strategy, starting at 8 initial stacks; the next allocations are 8, 16, 32, etc.

```
//
// base = +-------------------------------+-----------------------+  +
//        |VM Stack       |VM Stack       |                       |  |
//        |               |               |                       |  |
//        |               |               |                       |  |
//        +-------------------------------+                       |  |
//        |Machine Stack  |Machine Stack  |                       |  |
//        |               |               |                       |  |
//        |               |               |                       |  |
//        |               |               |     .  .  .  .        |  |  size
//        |               |               |                       |  |
//        |               |               |                       |  |
//        |               |               |                       |  |
//        |               |               |                       |  |
//        |               |               |                       |  |
//        +-------------------------------+                       |  |
//        |Guard Page     |Guard Page     |                       |  |
//        +-------------------------------+-----------------------+  v
//
//        +------------------------------------------------------->
//
//                                  count
//
```

The performance improvement depends on usage:

```
Calculating -------------------------------------
                     compare-ruby  built-ruby
  vm2_fiber_allocate     132.900k    180.852k i/s -    100.000k times in 0.752447s 0.552939s
     vm2_fiber_count       5.317k    110.724k i/s -    100.000k times in 18.806479s 0.903145s
     vm2_fiber_reuse      160.128     347.663 i/s -     200.000 times in 1.249003s 0.575269s
    vm2_fiber_switch      13.429M     13.490M i/s -     20.000M times in 1.489303s 1.482549s

Comparison:
               vm2_fiber_allocate
          built-ruby:    180851.6 i/s
        compare-ruby:    132899.7 i/s - 1.36x  slower

                  vm2_fiber_count
          built-ruby:    110724.3 i/s
        compare-ruby:      5317.3 i/s - 20.82x  slower

                  vm2_fiber_reuse
          built-ruby:       347.7 i/s
        compare-ruby:       160.1 i/s - 2.17x  slower

                 vm2_fiber_switch
          built-ruby:  13490282.4 i/s
        compare-ruby:  13429100.0 i/s - 1.00x  slower
```

This test was run on a Linux server with 64GB memory and a 4-core Xeon (Intel Xeon CPU E3-1240 v6 @ 3.70GHz). "compare-ruby" is `master`, and "built-ruby" is `master+fiber-pool`.

Additionally, we conservatively use `madvise(free)` to avoid swap space usage for unused fiber stacks. However, if you remove this requirement, we can get a 6x - 10x performance improvement in the `vm2_fiber_reuse` benchmark. There are some options to deal with this (e.g. moving it to `GC.compact`), but as this is still a net win, I'd like to merge this PR as is.

--
https://bugs.ruby-lang.org/

Unsubscribe:
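
As a rough illustration of the numbers in this thread, the following Ruby sketch reproduces the allocation schedule from the issue summary (8, 8, 16, 32, ...) and the address-space arithmetic implied by the error message (`1024 x 659456 bytes`). The names `allocation_counts`, `address_space` and `STACK_SIZE` are hypothetical and do not appear in the PR. It assumes each new allocation holds as many stacks as the pool already contains, and that the per-stack size printed in the `FiberError` also applies to the 4 million fiber test; under those assumptions the result is consistent with the reported 2.4TB.

```ruby
# Hypothetical helper, not part of the PR: sizes of successive pool
# allocations, assuming each new allocation holds as many stacks as the
# pool already contains (which yields 8, 8, 16, 32, ...).
def allocation_counts(initial: 8, allocations: 4)
  counts = [initial]
  counts << counts.sum while counts.size < allocations
  counts
end

# Per-stack size in bytes, taken from the FiberError message in this thread.
# Assumption: the same size applies to the other tests mentioned.
STACK_SIZE = 659_456

# Address space (in bytes) reserved for the given number of stacks.
def address_space(stacks, stack_size: STACK_SIZE)
  stacks * stack_size
end

p allocation_counts                                  # => [8, 8, 16, 32]
p address_space(1024)                                # => 675282944
puts((address_space(4_000_000) / 2.0**40).round(1))  # prints 2.4 (TiB)
```

This only models reserved address space, not resident memory, which is why the reported 50GB of actual memory is much smaller than the figure computed here.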