From: peter@...
Date: 2021-05-07T13:18:12+00:00
Subject: [ruby-core:103772] [Ruby master Feature#17816] Move C heap allocations for RVALUE object data into GC heap

Issue #17816 has been updated by peterzhu2118 (Peter Zhu).

Status changed from Open to Closed

Closed as PR has been merged.

----------------------------------------
Feature #17816: Move C heap allocations for RVALUE object data into GC heap
https://bugs.ruby-lang.org/issues/17816#change-91884

* Author: eightbitraptor (Matthew Valentine-House)
* Status: Closed
* Priority: Normal
----------------------------------------
## Pull Request: [Github PR: 4391](https://github.com/ruby/ruby/pull/4391)

## Introduction

_**This work supersedes the work in [PR: 4107](https://github.com/ruby/ruby/pull/4107) and [Redmine: 17570](https://bugs.ruby-lang.org/issues/17570). We've reimplemented the feature to make the diff smaller, easier to maintain and less intrusive to existing data structures.**_

We're working at Shopify to restructure Ruby memory management in order to allow objects to occupy more than one heap slot. This will allow previously heap allocated data to be stored next to its associated `RVALUE` slot in a contiguous memory region.

We believe that this will simplify the internals of the GC by:

* Removing the distinction between embedded and heap allocated objects as everything will now effectively be embedded across multiple slots.
* Allowing us to remove the transient heap. The transient heap reduces the number of `malloc` calls for heap allocated objects by deferring them until the object is promoted to an old object. When objects no longer need to call `malloc`, the transient heap can be removed.

We believe that there will be performance improvements across most Ruby codebases as a result of these simplifications. Objects will also have improved data locality, resulting in improved hardware cache performance.

## Summary of changes

This is a rewrite of a feature initially proposed in [PR #4107](https://github.com/ruby/ruby/pull/4107).

![](https://i.imgur.com/8x22ylD.png)

The referenced PR adds the core implementation and API needed to store arbitrary-length data inside contiguous free slots on the heap. It also includes a reference implementation for `T_CLASS` objects, which would usually allocate the `rb_classext_t` struct on the system heap.

The current API is:

* `RVARGC_NEWOBJ_OF` - A reimplementation of the `NEWOBJ_OF` macro that takes an additional parameter `payload_length`, the length of the payload data to store in bytes.
* `rb_rvargc_payload_data_ptr` - a `void *` to the start of the region where the extra data can be allocated.

We've introduced a new type `T_PAYLOAD` and a `struct RPayload` that contains a single `VALUE flags`. We use the `FL_USER` bits to store the number of payload slots so that we can stride over the payload body in most places where heap walking is required (as these slots can now contain user-defined data, they will not have accurate `flags`, and so most type checks on them will be incorrect).

When `RVARGC_NEWOBJ_OF` is called with a payload size, we calculate the number of slots required to store the `RVALUE`, an `RPayload` and the payload data itself. We then search the ractor's `newobj_cache` for a region of the required size, remove the slots from the freelist and initialize them. A pointer to the first allocatable byte in the payload body section can then be found using `rb_rvargc_payload_data_ptr`.

These changes can be enabled using the compile-time flag `USE_RVARGC=1`.

* **We do not expect anyone to run production Ruby applications with this flag enabled. This is an experimental feature which we will improve incrementally.**
* **Should these experiments prove unsuccessful in the long term, we will completely remove this feature and all related code.**
* **This PR has no performance implications when `USE_RVARGC` is disabled.
Allocation of `RVALUE`s in a single slot behaves almost identically to before this change (see [Benchmarking data](#Benchmarking)).**

## Features (and challenges)

* `T_PAYLOAD` is fully integrated with the existing GC. The entire payload region will be treated as one single slot for marking, sweeping and generational purposes. In contrast with our previous attempt, this means we no longer need to disable incremental marking, nor do we need to use an extra bitmap attached to a `heap_page`.
* All slots that are part of a `T_CLASS` and its payload region are pinned, so compaction will not move them. This has reduced the effectiveness of compaction but, unlike our previous PR, doesn't require us to disable compaction completely.
* RSS is significantly larger when `USE_RVARGC` is enabled. This is due to our (currently) naive approach to free region allocation.

## Next steps

With this merged, we have several directions we intend to investigate:

* Performance benchmarking: Analysing L1, L2 and L3 cache performance to decide where best to introduce RVarGC first, and what (if any) performance gains we'll see by improving data locality. Our current speculative contenders are Arrays, ivars and strings.
* Improvements to the way the payload data is managed: move the payload length into the `RVALUE` itself, and inline the payload body, removing the need for the `T_PAYLOAD` object entirely.
* Compaction improvements: Investigating which compaction algorithms perform better with objects of variable size.
* Resizing payload regions: Currently we have no support for resizing payload regions. This must be fixed before we can support many of the different Ruby types.
* Free region allocation: Find a way of managing the freelist that performs better with allocations of contiguous regions than the current singly linked freelist approach.

The end game for this work is to remove the requirement for an `RVALUE` to be exactly 40 bytes wide.
This is obviously a long game, of which this PR takes the first steps.

### Benchmarking

We used [Railsbench](https://github.com/k0kubun/railsbench) to compare the performance of master with our branch, with `USE_RVARGC=0`:

```
ubuntu@ip-172-31-42-217:~/railsbench$ chruby master
ubuntu@ip-172-31-42-217:~/railsbench$ setarch x86_64 -R nice -20 taskset -c 75 ./bin/bench
ruby 3.1.0dev (2021-04-19T12:40:29Z master 50f17241a3) [x86_64-linux]
{"cppflags"=>"-DUSE_RVARGC=0", "optflags"=>"-O3 -fno-fast-math"}
Warming up...
Benchmark: 10000 requests

Request per second: 747.3 [#/s] (mean)

Percentage of the requests served within a certain time (ms)
  50%    1.32
  66%    1.36
  75%    1.38
  80%    1.39
  90%    1.42
  95%    1.46
  98%    1.53
  99%    1.84
 100%   11.40

ubuntu@ip-172-31-42-217:~/railsbench$ setarch x86_64 -R nice -20 taskset -c 75 ./bin/bench
ruby 3.1.0dev (2021-04-20T10:02:39Z mvh-rvargc 2045bfb7f7) [x86_64-linux]
{"cppflags"=>"-DUSE_RVARGC=0", "optflags"=>"-O3 -fno-fast-math"}
Warming up...
Benchmark: 10000 requests

Request per second: 746.3 [#/s] (mean)

Percentage of the requests served within a certain time (ms)
  50%    1.31
  66%    1.37
  75%    1.39
  80%    1.39
  90%    1.41
  95%    1.44
  98%    1.51
  99%    1.83
 100%    8.97
```

And the same comparison using Optcarrot:

```
ubuntu@ip-172-31-42-217:~/optcarrot$ chruby master
ubuntu@ip-172-31-42-217:~/optcarrot$ ./bin/optcarrot --benchmark examples/Lan_Master.nes
fps: 43.62907118228718
checksum: 59662

ubuntu@ip-172-31-42-217:~/optcarrot$ chruby rvargc
ubuntu@ip-172-31-42-217:~/optcarrot$ ./bin/optcarrot --benchmark examples/Lan_Master.nes
fps: 43.90831352849611
checksum: 59662
```
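
Returning to the slot calculation described under "Summary of changes": the following is a minimal sketch of how the number of contiguous slots for an object with a payload might be derived. It is not the code from the PR. The names `SKETCH_SLOT_SIZE`, `struct sketch_rpayload`, `sketch_payload_slots` and `sketch_total_slots` are hypothetical, the 40-byte slot comes from the `RVALUE` width mentioned above, and the sketch assumes that the payload body starts immediately after the `flags` word of the payload header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: every heap slot is 40 bytes, the RVALUE width given in the post. */
#define SKETCH_SLOT_SIZE ((size_t)40)

/* Hypothetical stand-in for struct RPayload, which the post describes as
 * holding a single `VALUE flags`. uintptr_t plays the role of VALUE here. */
struct sketch_rpayload {
    uintptr_t flags;
};

/* Hypothetical helper: slots needed for the payload header plus the body,
 * assuming the body begins right after the header's flags word. */
size_t
sketch_payload_slots(size_t payload_length)
{
    /* Guard against overflow when adding the header size. */
    assert(payload_length <= SIZE_MAX - sizeof(struct sketch_rpayload) - SKETCH_SLOT_SIZE);

    size_t bytes = sizeof(struct sketch_rpayload) + payload_length;
    return (bytes + SKETCH_SLOT_SIZE - 1) / SKETCH_SLOT_SIZE; /* round up */
}

/* Hypothetical helper: total contiguous slots to reserve, counting the
 * RVALUE slot itself. A zero-length payload needs no payload region. */
size_t
sketch_total_slots(size_t payload_length)
{
    if (payload_length == 0) {
        return 1;
    }
    return 1 + sketch_payload_slots(payload_length);
}
```

Under these assumptions, a 100-byte payload would occupy three payload slots plus the `RVALUE` slot, so the allocator would need to find a free region of four contiguous slots.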