From: samuel@...
Date: 2019-07-12T13:16:50+00:00
Subject: [ruby-core:93720] [Ruby master Feature#15997] Improve performance of fiber creation by using pool allocation strategy.

Issue #15997 has been updated by ioquatix (Samuel Williams).

Okay, so I implemented fiber pool changes which make it more suitable for 32-bit platforms. It required additional bookkeeping: essentially, the allocation list and free list became doubly linked, which allows us to remove allocations and vacant stacks as required. It's more bookkeeping, but the performance overhead is negligible.

Now, if a fiber pool allocation becomes empty, we can remove it entirely, which means its address space is freed too. So, on 32-bit platforms, if we cap the fiber pool size to a maximum of 4 - 8 stacks, maybe it's acceptable.

We can also now experiment with the following strategies:

- (1) When a fiber pool allocation becomes unused, `munmap` it; this reduces physical memory usage and address space.
- (2) When a fiber pool stack is released, `madvise(free)` it; this reduces physical memory/swap usage only (the address space stays reserved).
- (3) When a fiber pool stack is released, do nothing; it remains in the cache. If there is memory pressure, it can get swapped to disk.
- (4) When a fiber pool stack is released, do nothing immediately; on major GC, do one of the above.

The code for the above decision is in `fiber_pool_stack_release`:

```c
if (stack.allocation->used == 0) {
    // No stacks from this allocation are in use any more: free the whole
    // allocation, releasing its address space (strategy 1).
    fiber_pool_allocation_free(stack.allocation);
}

// Mark this stack as vacant (strategies 2 and 3 differ in whether this
// also advises the kernel that the pages can be reclaimed).
fiber_pool_stack_free(&vacancy->stack);
```
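For reference, the eager strategies map onto the following system calls. This is only a minimal sketch to make the trade-offs concrete; the function names and fallback logic here are illustrative, not code from the patch:

```c
#include <stddef.h>
#include <sys/mman.h>

// (1) munmap: releases physical memory *and* address space. Best suited
// to 32-bit platforms, where address space is the scarce resource.
static void stack_release_unmap(void *base, size_t size)
{
    munmap(base, size);
}

// (2) madvise(free): keeps the mapping, but tells the kernel the pages
// can be discarded rather than swapped out under memory pressure.
// MADV_FREE exists on macOS and Linux >= 4.5; MADV_DONTNEED is a more
// portable (but typically more expensive) fallback.
static void stack_release_advise(void *base, size_t size)
{
#ifdef MADV_FREE
    madvise(base, size, MADV_FREE);
#else
    madvise(base, size, MADV_DONTNEED);
#endif
}

// (3) Do nothing: the stack stays cached and warm, but under memory
// pressure the kernel may swap the unused pages to disk.
static void stack_release_nothing(void *base, size_t size)
{
    (void)base;
    (void)size;
}
```

Strategy (4) is then just a matter of deferring one of the first two until a major GC event.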
Here are the performance differences on macOS, comparing with Ruby 2.6.2:

(1) `munmap`

```
Calculating -------------------------------------
                     compare-ruby  built-ruby
  vm2_fiber_allocate      85.437k    130.066k i/s -    100.000k times in 1.170459s 0.768840s
     vm2_fiber_count       3.812k     88.426k i/s -    100.000k times in 26.233741s 1.130887s
     vm2_fiber_reuse       61.527     109.893 i/s -     200.000 times in 3.250625s 1.819951s
    vm2_fiber_switch       9.438M      8.894M i/s -     20.000M times in 2.119203s 2.248799s

Comparison:
  vm2_fiber_allocate
          built-ruby:    130066.1 i/s
        compare-ruby:     85436.6 i/s - 1.52x  slower

     vm2_fiber_count
          built-ruby:     88426.2 i/s
        compare-ruby:      3811.9 i/s - 23.20x  slower

     vm2_fiber_reuse
          built-ruby:       109.9 i/s
        compare-ruby:        61.5 i/s - 1.79x  slower

    vm2_fiber_switch
        compare-ruby:   9437510.2 i/s
          built-ruby:   8893636.1 i/s - 1.06x  slower
```

(2) `madvise(free)`

```
Comparison:
  vm2_fiber_allocate
          built-ruby:    129641.0 i/s
        compare-ruby:    101306.1 i/s - 1.28x  slower

     vm2_fiber_count
          built-ruby:     87447.4 i/s
        compare-ruby:      3945.7 i/s - 22.16x  slower

     vm2_fiber_reuse
          built-ruby:       110.6 i/s
        compare-ruby:        61.7 i/s - 1.79x  slower

    vm2_fiber_switch
        compare-ruby:   9397149.4 i/s
          built-ruby:   9095279.0 i/s - 1.03x  slower
```

(3) nothing

```
Calculating -------------------------------------
                     compare-ruby  built-ruby
  vm2_fiber_allocate     103.792k    129.309k i/s -    100.000k times in 0.963461s 0.773340s
     vm2_fiber_count       4.014k     90.957k i/s -    100.000k times in 24.914262s 1.099417s
     vm2_fiber_reuse       61.038     644.538 i/s -     200.000 times in 3.276653s 0.310300s
    vm2_fiber_switch       8.662M      9.196M i/s -     20.000M times in 2.309065s 2.174784s

Comparison:
  vm2_fiber_allocate
          built-ruby:    129309.2 i/s
        compare-ruby:    103792.5 i/s - 1.25x  slower

     vm2_fiber_count
          built-ruby:     90957.3 i/s
        compare-ruby:      4013.8 i/s - 22.66x  slower

     vm2_fiber_reuse
          built-ruby:       644.5 i/s
        compare-ruby:        61.0 i/s - 10.56x  slower
        (N.B. on a Linux server, it's about 7x.)

    vm2_fiber_switch
          built-ruby:   9196315.6 i/s
        compare-ruby:   8661514.5 i/s - 1.06x  slower
```

As you can see, trying to free address space or reduce memory/swap usage has a significant overhead in the `vm2_fiber_reuse` case, which is one of the most important for long-running servers. (1) & (2) look similar in terms of performance, with `munmap` perhaps being slightly worse.

(1) `munmap` releases address space back to the system, which is ideal for 32-bit address spaces.

(2) `madvise(free)` should be much faster than `munmap`, but the difference doesn't seem significant here. It leaves the address space intact, but tells the system that the stack memory region is no longer needed, which avoids swapping it to disk when there is memory pressure.

(3) leaves the address space in place. If the system experiences memory pressure, stack areas are swapped to disk, even if unused. Because of this, if the user allocated 1 million fibers, a large amount of address space and swap space may be consumed. However, I would like to believe this isn't such a big problem.

While I think the answer for 32-bit systems is clearly (1), the best option for 64-bit is not obvious: (2) is pessimistic, while (3) is optimistic and may over-commit memory.

There is one solution to this, however: we could utilise `GC.compact` or a similar mechanism. That way, we could use (3), but apply (1) and (2) as appropriate when `GC.compact` is invoked. There are other options here too: e.g. major GC, some kind of temporal GC (release the fiber pool if it was not used after some time), `madvise(free)` only if more than 50% of stacks are freed, etc. However, I like a simple, deterministic option, so maybe I personally lean towards `GC.compact`, or `Fiber::Pool.shared.compact`, or some other similar method.

----------------------------------------
Feature #15997: Improve performance of fiber creation by using pool allocation strategy.
https://bugs.ruby-lang.org/issues/15997#change-79353

* Author: ioquatix (Samuel Williams)
* Status: Open
* Priority: Normal
* Assignee: ko1 (Koichi Sasada)
* Target version: 
----------------------------------------
https://github.com/ruby/ruby/pull/2224

This PR improves the performance of fiber allocation and reuse by implementing a better stack cache.

The fiber pool manages a singly linked list of fiber pool allocations. Each fiber pool allocation contains 1 or more stacks (typically more, e.g. 512). The pool uses a geometric (doubling) allocation strategy: starting at 8 initial stacks, subsequent allocations grow to 16, 32, etc.

```
//
// base = +-------------------------------+-----------------------+  +
//        |VM Stack       |VM Stack       |                       |  |
//        |               |               |                       |  |
//        |               |               |                       |  |
//        +-------------------------------+                       |  |
//        |Machine Stack  |Machine Stack  |                       |  |
//        |               |               |                       |  |
//        |               |               |  . . . .              |  |
//        |               |               |                       |  |  size
//        |               |               |                       |  |
//        |               |               |                       |  |
//        |               |               |                       |  |
//        |               |               |                       |  |
//        |               |               |                       |  |
//        +-------------------------------+                       |  |
//        |Guard Page     |Guard Page     |                       |  |
//        +-------------------------------+-----------------------+  v
//
//        +------------------------------------------------------->
//
//                                 count
//
```
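To make the diagram concrete, here is a simplified sketch of the bookkeeping this layout implies. The field names loosely follow the patch, but this is an illustration, not the actual implementation:

```c
#include <stddef.h>

// One contiguous mapping holding `count` stack slots (illustrative).
struct fiber_pool_allocation {
    void *base;     // Base address of the entire mapped region.
    size_t stride;  // Bytes per slot: VM stack + machine stack + guard page.
    size_t count;   // Number of slots in this allocation.
    size_t used;    // Slots currently handed out to live fibers.
    struct fiber_pool_allocation *next; // Next allocation in the pool's list.
};

// Each slot occupies `stride` bytes from `base`, as in the diagram. The
// guard page at the bottom of each slot would be mprotect()'d to
// PROT_NONE so a stack overflow faults immediately instead of silently
// corrupting the neighbouring slot.
static void *fiber_pool_stack_base(struct fiber_pool_allocation *allocation, size_t index)
{
    return (char *)allocation->base + (allocation->stride * index);
}
```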
The performance improvement depends on usage:

```
Calculating -------------------------------------
                     compare-ruby  built-ruby
  vm2_fiber_allocate     132.900k    180.852k i/s -    100.000k times in 0.752447s 0.552939s
     vm2_fiber_count       5.317k    110.724k i/s -    100.000k times in 18.806479s 0.903145s
     vm2_fiber_reuse      160.128     347.663 i/s -     200.000 times in 1.249003s 0.575269s
    vm2_fiber_switch      13.429M     13.490M i/s -     20.000M times in 1.489303s 1.482549s

Comparison:
  vm2_fiber_allocate
          built-ruby:    180851.6 i/s
        compare-ruby:    132899.7 i/s - 1.36x  slower

     vm2_fiber_count
          built-ruby:    110724.3 i/s
        compare-ruby:      5317.3 i/s - 20.82x  slower

     vm2_fiber_reuse
          built-ruby:       347.7 i/s
        compare-ruby:       160.1 i/s - 2.17x  slower

    vm2_fiber_switch
          built-ruby:  13490282.4 i/s
        compare-ruby:  13429100.0 i/s - 1.00x  slower
```

This test was run on a Linux server with 64GB memory and a 4-core Xeon (Intel Xeon CPU E3-1240 v6 @ 3.70GHz). "compare-ruby" is `master`, and "built-ruby" is `master+fiber-pool`.

Additionally, we conservatively use `madvise(free)` to avoid swap space usage for unused fiber stacks. However, if you remove this requirement, we can get a 6x - 10x performance improvement in the `vm2_fiber_reuse` benchmark. There are some options to deal with this (e.g. moving it to `GC.compact`), but as this is still a net win, I'd like to merge this PR as is.
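To illustrate the `GC.compact` option mentioned above: instead of paying for cleanup on every stack release, the pool could defer that work to an explicit compaction pass. A hypothetical sketch follows; `fiber_pool_compact` and these pared-down structs are illustrative, and only `fiber_pool_allocation_free` is a name from the patch:

```c
#include <stddef.h>

// Minimal illustrative types (see the earlier sketch in this thread).
struct fiber_pool_allocation {
    size_t used;                        // Stacks currently handed out.
    struct fiber_pool_allocation *next; // Next allocation in the list.
};

struct fiber_pool {
    struct fiber_pool_allocation *allocations; // Head of the allocation list.
};

// From the patch: unmaps an allocation and unlinks it from the list.
void fiber_pool_allocation_free(struct fiber_pool_allocation *allocation);

// Hypothetical deferred cleanup, invoked from GC.compact (or a
// Fiber::Pool.shared.compact style method) rather than from
// fiber_pool_stack_release: free every allocation whose stacks are
// all vacant (strategy 1, applied lazily).
static void fiber_pool_compact(struct fiber_pool *pool)
{
    struct fiber_pool_allocation *allocation = pool->allocations;

    while (allocation) {
        struct fiber_pool_allocation *next = allocation->next;

        if (allocation->used == 0) {
            fiber_pool_allocation_free(allocation);
        }

        allocation = next;
    }
}
```

With this shape, the release fast path stays allocation-free, and the deterministic cleanup cost is paid only when the user (or the GC) explicitly asks for it.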