From: samuel@...
Date: 2019-07-12T13:16:50+00:00
Subject: [ruby-core:93720] [Ruby master Feature#15997] Improve performance of fiber creation by using pool allocation strategy.

Issue #15997 has been updated by ioquatix (Samuel Williams).

Okay, so I implemented fiber pool changes which make it more suitable for 32-bit platforms. It required additional bookkeeping: essentially, the allocation list and free list became doubly linked, which allows us to remove allocations and vacant stacks as required. It's more bookkeeping, but the performance overhead is negligible.

Now, if a fiber pool allocation becomes empty, we can remove it entirely, which means its address space is freed too. So, on 32-bit platforms, if we cap the fiber pool size to a maximum of 4 - 8 stacks, maybe it's acceptable.

We can also now experiment with the following strategies:

- (1) When a fiber pool allocation becomes unused, `munmap` it; this reduces physical memory usage and address space.
- (2) When a fiber pool stack is released, `madvise(free)` it; this reduces physical memory/swap usage only (the address space stays reserved).
- (3) When a fiber pool stack is released, do nothing; it remains in the cache. If there is memory pressure, it can get swapped to disk.
- (4) When a fiber pool stack is released, do nothing immediately; on major GC, do one of the above.

The code for the above decision is in `fiber_pool_stack_release`:

```c
if (stack.allocation->used == 0) {
    // No stacks from this allocation are in use any more: free the whole
    // allocation, releasing its address space (strategy 1).
    fiber_pool_allocation_free(stack.allocation);
}

// Mark this stack as vacant (strategies 2 and 3 differ in whether this
// also advises the kernel that the pages can be reclaimed).
fiber_pool_stack_free(&vacancy->stack);
```
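For reference, the eager strategies map onto the following system calls. This is only a minimal sketch to make the trade-offs concrete; the function names and fallback logic here are illustrative, not code from the patch:

```c
#include <stddef.h>
#include <sys/mman.h>

// (1) munmap: releases physical memory *and* address space. Best suited
// to 32-bit platforms, where address space is the scarce resource.
static void stack_release_unmap(void *base, size_t size)
{
    munmap(base, size);
}

// (2) madvise(free): keeps the mapping, but tells the kernel the pages
// can be discarded rather than swapped out under memory pressure.
// MADV_FREE exists on macOS and Linux >= 4.5; MADV_DONTNEED is a more
// portable (but typically more expensive) fallback.
static void stack_release_advise(void *base, size_t size)
{
#ifdef MADV_FREE
    madvise(base, size, MADV_FREE);
#else
    madvise(base, size, MADV_DONTNEED);
#endif
}

// (3) Do nothing: the stack stays cached and warm, but under memory
// pressure the kernel may swap the unused pages to disk.
static void stack_release_nothing(void *base, size_t size)
{
    (void)base;
    (void)size;
}
```

Strategy (4) is then just a matter of deferring one of the first two until a major GC event.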
Here are the performance differences on macOS, comparing with Ruby 2.6.2:

(1) `munmap`

```
Calculating -------------------------------------
                     compare-ruby  built-ruby
  vm2_fiber_allocate      85.437k    130.066k i/s -    100.000k times in 1.170459s 0.768840s
     vm2_fiber_count       3.812k     88.426k i/s -    100.000k times in 26.233741s 1.130887s
     vm2_fiber_reuse       61.527     109.893 i/s -     200.000 times in 3.250625s 1.819951s
    vm2_fiber_switch       9.438M      8.894M i/s -     20.000M times in 2.119203s 2.248799s

Comparison:
  vm2_fiber_allocate
          built-ruby:    130066.1 i/s
        compare-ruby:     85436.6 i/s - 1.52x  slower

     vm2_fiber_count
          built-ruby:     88426.2 i/s
        compare-ruby:      3811.9 i/s - 23.20x  slower

     vm2_fiber_reuse
          built-ruby:       109.9 i/s
        compare-ruby:        61.5 i/s - 1.79x  slower

    vm2_fiber_switch
        compare-ruby:   9437510.2 i/s
          built-ruby:   8893636.1 i/s - 1.06x  slower
```

(2) `madvise(free)`

```
Comparison:
  vm2_fiber_allocate
          built-ruby:    129641.0 i/s
        compare-ruby:    101306.1 i/s - 1.28x  slower

     vm2_fiber_count
          built-ruby:     87447.4 i/s
        compare-ruby:      3945.7 i/s - 22.16x  slower

     vm2_fiber_reuse
          built-ruby:       110.6 i/s
        compare-ruby:        61.7 i/s - 1.79x  slower

    vm2_fiber_switch
        compare-ruby:   9397149.4 i/s
          built-ruby:   9095279.0 i/s - 1.03x  slower
```

(3) nothing

```
Calculating -------------------------------------
                     compare-ruby  built-ruby
  vm2_fiber_allocate     103.792k    129.309k i/s -    100.000k times in 0.963461s 0.773340s
     vm2_fiber_count       4.014k     90.957k i/s -    100.000k times in 24.914262s 1.099417s
     vm2_fiber_reuse       61.038     644.538 i/s -     200.000 times in 3.276653s 0.310300s
    vm2_fiber_switch       8.662M      9.196M i/s -     20.000M times in 2.309065s 2.174784s

Comparison:
  vm2_fiber_allocate
          built-ruby:    129309.2 i/s
        compare-ruby:    103792.5 i/s - 1.25x  slower

     vm2_fiber_count
          built-ruby:     90957.3 i/s
        compare-ruby:      4013.8 i/s - 22.66x  slower

     vm2_fiber_reuse
          built-ruby:       644.5 i/s
        compare-ruby:        61.0 i/s - 10.56x  slower
        (N.B. on a Linux server, it's about 7x.)

    vm2_fiber_switch
          built-ruby:   9196315.6 i/s
        compare-ruby:   8661514.5 i/s - 1.06x  slower
```

As you can see, trying to free address space or reduce memory/swap usage has a significant overhead in the `vm2_fiber_reuse` case, which is one of the most important for long-running servers. (1) & (2) look similar in terms of performance, with `munmap` perhaps being slightly worse.

(1) `munmap` releases address space back to the system, which is ideal for 32-bit address spaces.

(2) `madvise(free)` should be much faster than `munmap`, but the difference doesn't seem significant here. It leaves the address space intact, but tells the system that the stack memory region is no longer needed, which avoids swapping it to disk when there is memory pressure.

(3) leaves the address space in place. If the system experiences memory pressure, stack areas are swapped to disk, even if unused. Because of this, if the user allocated 1 million fibers, a large amount of address space and swap space may be consumed. However, I would like to believe this isn't such a big problem.

While I think the answer for 32-bit systems is clearly (1), the best option for 64-bit is not obvious: (2) is pessimistic, while (3) is optimistic and may over-commit memory.

There is one solution to this, however: we could utilise `GC.compact` or a similar mechanism. That way, we could use (3), but apply (1) and (2) as appropriate when `GC.compact` is invoked. There are other options here too: e.g. major GC, some kind of temporal GC (release the fiber pool if it was not used after some time), `madvise(free)` only if more than 50% of stacks are freed, etc. However, I like a simple, deterministic option, so maybe I personally lean towards `GC.compact`, or `Fiber::Pool.shared.compact`, or some other similar method.

----------------------------------------
Feature #15997: Improve performance of fiber creation by using pool allocation strategy.
https://bugs.ruby-lang.org/issues/15997#change-79353

* Author: ioquatix (Samuel Williams)
* Status: Open
* Priority: Normal
* Assignee: ko1 (Koichi Sasada)
* Target version: 
----------------------------------------
https://github.com/ruby/ruby/pull/2224

This PR improves the performance of fiber allocation and reuse by implementing a better stack cache.

The fiber pool manages a singly linked list of fiber pool allocations. Each fiber pool allocation contains 1 or more stacks (typically more, e.g. 512). The pool uses a geometric (doubling) allocation strategy: starting at 8 initial stacks, subsequent allocations grow to 16, 32, etc.

```
//
// base = +-------------------------------+-----------------------+  +
//        |VM Stack       |VM Stack       |                       |  |
//        |               |               |                       |  |
//        |               |               |                       |  |
//        +-------------------------------+                       |  |
//        |Machine Stack  |Machine Stack  |                       |  |
//        |               |               |                       |  |
//        |               |               |  . . . .              |  |
//        |               |               |                       |  |  size
//        |               |               |                       |  |
//        |               |               |                       |  |
//        |               |               |                       |  |
//        |               |               |                       |  |
//        |               |               |                       |  |
//        +-------------------------------+                       |  |
//        |Guard Page     |Guard Page     |                       |  |
//        +-------------------------------+-----------------------+  v
//
//        +------------------------------------------------------->
//
//                                 count
//
```
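To make the diagram concrete, here is a simplified sketch of the bookkeeping this layout implies. The field names loosely follow the patch, but this is an illustration, not the actual implementation:

```c
#include <stddef.h>

// One contiguous mapping holding `count` stack slots (illustrative).
struct fiber_pool_allocation {
    void *base;     // Base address of the entire mapped region.
    size_t stride;  // Bytes per slot: VM stack + machine stack + guard page.
    size_t count;   // Number of slots in this allocation.
    size_t used;    // Slots currently handed out to live fibers.
    struct fiber_pool_allocation *next; // Next allocation in the pool's list.
};

// Each slot occupies `stride` bytes from `base`, as in the diagram. The
// guard page at the bottom of each slot would be mprotect()'d to
// PROT_NONE so a stack overflow faults immediately instead of silently
// corrupting the neighbouring slot.
static void *fiber_pool_stack_base(struct fiber_pool_allocation *allocation, size_t index)
{
    return (char *)allocation->base + (allocation->stride * index);
}
```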
The performance improvement depends on usage:

```
Calculating -------------------------------------
                     compare-ruby  built-ruby
  vm2_fiber_allocate     132.900k    180.852k i/s -    100.000k times in 0.752447s 0.552939s
     vm2_fiber_count       5.317k    110.724k i/s -    100.000k times in 18.806479s 0.903145s
     vm2_fiber_reuse      160.128     347.663 i/s -     200.000 times in 1.249003s 0.575269s
    vm2_fiber_switch      13.429M     13.490M i/s -     20.000M times in 1.489303s 1.482549s

Comparison:
  vm2_fiber_allocate
          built-ruby:    180851.6 i/s
        compare-ruby:    132899.7 i/s - 1.36x  slower

     vm2_fiber_count
          built-ruby:    110724.3 i/s
        compare-ruby:      5317.3 i/s - 20.82x  slower

     vm2_fiber_reuse
          built-ruby:       347.7 i/s
        compare-ruby:       160.1 i/s - 2.17x  slower

    vm2_fiber_switch
          built-ruby:  13490282.4 i/s
        compare-ruby:  13429100.0 i/s - 1.00x  slower
```

This test was run on a Linux server with 64GB memory and a 4-core Xeon (Intel Xeon CPU E3-1240 v6 @ 3.70GHz). "compare-ruby" is `master`, and "built-ruby" is `master+fiber-pool`.

Additionally, we conservatively use `madvise(free)` to avoid swap space usage for unused fiber stacks. However, if you remove this requirement, we can get a 6x - 10x performance improvement in the `vm2_fiber_reuse` benchmark. There are some options to deal with this (e.g. moving it to `GC.compact`), but as this is still a net win, I'd like to merge this PR as is.
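To illustrate the `GC.compact` option mentioned above: instead of paying for cleanup on every stack release, the pool could defer that work to an explicit compaction pass. A hypothetical sketch follows; `fiber_pool_compact` and these pared-down structs are illustrative, and only `fiber_pool_allocation_free` is a name from the patch:

```c
#include <stddef.h>

// Minimal illustrative types (see the earlier sketch in this thread).
struct fiber_pool_allocation {
    size_t used;                        // Stacks currently handed out.
    struct fiber_pool_allocation *next; // Next allocation in the list.
};

struct fiber_pool {
    struct fiber_pool_allocation *allocations; // Head of the allocation list.
};

// From the patch: unmaps an allocation and unlinks it from the list.
void fiber_pool_allocation_free(struct fiber_pool_allocation *allocation);

// Hypothetical deferred cleanup, invoked from GC.compact (or a
// Fiber::Pool.shared.compact style method) rather than from
// fiber_pool_stack_release: free every allocation whose stacks are
// all vacant (strategy 1, applied lazily).
static void fiber_pool_compact(struct fiber_pool *pool)
{
    struct fiber_pool_allocation *allocation = pool->allocations;

    while (allocation) {
        struct fiber_pool_allocation *next = allocation->next;

        if (allocation->used == 0) {
            fiber_pool_allocation_free(allocation);
        }

        allocation = next;
    }
}
```

With this shape, the release fast path stays allocation-free, and the deterministic cleanup cost is paid only when the user (or the GC) explicitly asks for it.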