From: "ko1 (Koichi Sasada)" Date: 2021-10-04T18:30:00+00:00 Subject: [ruby-core:105545] [Ruby master Feature#18239] Variable Width Allocation: Strings Issue #18239 has been updated by ko1 (Koichi Sasada). Thank you for your work. I have several questions. * Could you explain the RString layout? * Could you explain the variable size page strategy? https://bugs.ruby-lang.org/issues/18045 shows 4 sizes (40, 80, 160, 320). Same with it? For example, jemalloc employs more fine-grained pools. DId you try other sizes? * Did you measure the allocation overhead on non-main ractors? I didn't ask details because I didn't think it was enabled on 3.1. ---------------------------------------- Feature #18239: Variable Width Allocation: Strings https://bugs.ruby-lang.org/issues/18239#change-94001 * Author: peterzhu2118 (Peter Zhu) * Status: Open * Priority: Normal ---------------------------------------- # GitHub PR: https://github.com/ruby/ruby/issues/4933 # Feature description Since merging #18045 which introduced size pools inside the GC to allocate various sized slots, we've been working on expanding the usage of VWA into more types (before this patch, only classes are allocated through VWA). This patch changes strings to use VWA. ## Summary - This patch allocates strings using VWA. - String embedded through VWA are embedded. The patch changes strings to support dynamic capacity embedded strings. - We do not handle resizing in VWA in this patch (embedded strings resized up are moved back to malloc), which may result in wasted space. However, in benchmarks, this does not appear to be an issue. - We propose enabling VWA by default. We are confident about the stability and performance of this feature. ## String allocation Strings with known sizes at allocation time that are small enough is allocated as an embedded string. Embedded string headers are now 18 bytes, so the maximum embedded string length is now 302 bytes (currently VWA is configured to allocate up to 320 byte slots, but can be easily configured for even larger slots). Embedded strings have their contents directly follow the object headers. This aims to improve cache performance since the contents are on the same cache line as the object headers. For strings with unknown sizes, or with contents that are too large, it falls back to allocate 40 byte slots and store the contents in the malloc heap. ## String reallocation If an embedded string is expanded and can no longer fill the slot, it is moved into the malloc heap. This may mean that some space in the slot is wasted. For example, if the string was originally allocated in a 160 byte slot and moved to the malloc heap, 120 bytes of the slot is wasted (since we only need 40 bytes for a string with contents on the malloc heap). This patch does not aim to tackle this issue. Memory usage also does not appear to be an issue in any of the benchmarks. ## Incompatibility of VWA and non-VWA built gems Gems with native extensions built with VWA are not compatible with non-VWA built gems (and vice versa). When switching between VWA and non-VWA rubies, gems must be rebuilt. This is because the header file `rstring.h` changes depending on whether `USE_RVARGC` is set or not (the internal string implementation changes with VWA). ## Enabling VWA by default In this patch, we propose enabling VWA by default. We believe that the performance data supports this (see the [benchmark results](#Benchmark-results) section). We're also confident about its stability. It passess all tests on CI for the Shopify monolith and we've ran this version of Ruby on a small portion of production traffic in a Shopify service for about a week (where it served over 500 million requests). Although VWA is enabled by default, we plan on supporting the `USE_RVARGC` flag as an escape hatch to disable VWA until at least Ruby 3.1 is released (i.e. we may remove it starting 2022). # Benchmark setup Benchmarking was done on a bare-metal Ubuntu machine on AWS. All benchmark results are using glibc by default, except when jemalloc is explicitly specified. ``` $ uname -a Linux 5.8.0-1038-aws #40~20.04.1-Ubuntu SMP Thu Jun 17 13:25:28 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux ``` glibc version: ``` $ ldd --version ldd (Ubuntu GLIBC 2.31-0ubuntu9.2) 2.31 Copyright (C) 2020 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. Written by Roland McGrath and Ulrich Drepper. ``` jemalloc version: ``` $ apt list --installed | grep jemalloc WARNING: apt does not have a stable CLI interface. Use with caution in scripts. libjemalloc-dev/focal,now 5.2.1-1ubuntu1 amd64 [installed] libjemalloc2/focal,now 5.2.1-1ubuntu1 amd64 [installed,automatic] ``` To measure memory usage over time, the [mstat tool](https://github.com/bpowers/mstat) was used. Ruby master was benchmarked on commit [7adfb14f60](https://github.com/ruby/ruby/commit/7adfb14f60). The branch was rebased on top of the same commit. Performance benchmarks of this branch without VWA turned on are included to make sure this patch does not introduce a performance regression compared to master. # Benchmark results ## Summary - On a live, production Shopify service, we see no significant differences in response times. However, VWA has lower memory usage. - On railsbench, we see a minor performance improvement. However, we also see worse p100 response times. The reason is analyzed in the [railsbench](#railsbench) section below. VWA uses more memory when using glibc, but uses an equal amount of memory for jemalloc. - On rdoc generation, VWA uses significantly less memory for a small reduction in performance. - Microbenchmarks show significant performance improvement for strings that are embedded in VWA (e.g. `String#==` is significantly faster). ## Shopify production We deployed this branch with VWA enabled vs. Ruby master commit 7adfb14f60 on a small portion of live, production traffic for a Shopify web application to test real-world performance. This data was collected for a period of 1 day where they each served approximately 84 million requests. Average, median, p90, p99 response times were within the margin of error of each other (<1% difference). However, VWA consistently had lower memory usage (Resident Set Size) by 0.96x. ## railsbench For railsbench, we ran the [railsbench benchmark](https://github.com/k0kubun/railsbench/blob/master/bin/bench). For both the performance and memory benchmarks, 25 runs were conducted for each combination (branch + glibc, master + glibc, branch + jemalloc, master + jemalloc). In this benchmark, VWA suffers from poor p100. This is because railsbench is relatively small (only one controller and model), so there are not many pages in each size pools. Incremental marking requires there to be many pooled pages, but because there are not a lot of pooled pages (due to the small heap size), it has to mark a very large number of objects at every step. We do not expect this to be a problem for real apps with a larger number of objects allocated at boot, and this is confirmed by metrics collected in the [Shopify production application](#Shopify-production). ### glibc Using glibc, VWA is about 1.018x faster than master in railsbench in throughput (RPS). ``` +-----------+-----------------+------------------+--------+ | | Branch (VWA on) | Branch (VWA off) | Master | +-----------+-----------------+------------------+--------+ | RPS | 740.92 | 722.97 | 745.31 | | p50 (ms) | 1.31 | 1.37 | 1.33 | | p90 (ms) | 1.40 | 1.45 | 1.43 | | p99 (ms) | 2.36 | 1.80 | 1.78 | | p100 (ms) | 21.38 | 17.28 | 15.15 | +-----------+-----------------+------------------+--------+ ``` ![](https://i.imgur.com/9yJ3fO5.png) Average max memory usage for VWA: 104.82 MB Average max memory usage for master: 101.17 MB VWA uses 1.04x more memory. ### jemalloc Using glibc, VWA is about 1.06x faster than master in railsbench in RPS. ``` +-----------+-----------------+------------------+--------+ | | Branch (VWA on) | Branch (VWA off) | Master | +-----------+-----------------+------------------+--------+ | RPS | 781.21 | 742.12 | 739.04 | | p50 (ms) | 1.25 | 1.31 | 1.30 | | p90 (ms) | 1.32 | 1.39 | 1.38 | | p99 (ms) | 2.15 | 2.53 | 4.41 | | p100 (ms) | 21.11 | 16.60 | 15.75 | +-----------+-----------------+------------------+--------+ ``` ![](https://i.imgur.com/Nltbqq0.png) Average max memory usage for VWA: 102.68 MB Average max memory usage for master: 103.14 MB VWA uses 1.00x less memory. ## rdoc generation In rdoc generation, we see significant memory usage reduction at the cost of small performance reduction. ### glibc ``` +-----------+-----------------+------------------+--------+-------------+ | | Branch (VWA on) | Branch (VWA off) | Master | VWA speedup | +-----------+-----------------+------------------+--------+-------------+ | Time (s) | 16.68 | 16.08 | 15.87 | 0.95x | +-----------+-----------------+------------------+--------+-------------+ ``` ![](https://i.imgur.com/HJCtowS.png) Average max memory usage for VWA: 295.19 MB Average max memory usage for master: 365.06 MB VWA uses 0.81x less memory. ### jemalloc ``` +-----------+-----------------+------------------+--------+-------------+ | | Branch (VWA on) | Branch (VWA off) | Master | VWA speedup | +-----------+-----------------+------------------+--------+-------------+ | Time (s) | 16.34 | 15.64 | 15.51 | 0.95x | +-----------+-----------------+------------------+--------+-------------+ ``` ![](https://i.imgur.com/lrdUxXw.png) Average max memory usage for VWA: 281.16 MB Average max memory usage for master: 316.49 MB VWA uses 0.89x less memory. ## Liquid benchmarks For the liquid benchmarks, we ran the [liquid benchmark](https://github.com/Shopify/liquid/blob/master/performance/benchmark.rb) averaged over 5 runs each. We see that VWA is faster across the board. ``` +----------------------+-----------------+------------------+--------+-------------+ | | Branch (VWA on) | Branch (VWA off) | Master | VWA speedup | +----------------------+-----------------+------------------+--------+-------------+ | Parse (i/s) | 40.40 | 38.44 | 39.47 | 1.02x | | Render (i/s) | 126.47 | 121.97 | 121.20 | 1.04x | | Parse & Render (i/s) | 28.81 | 27.52 | 28.02 | 1.03x | +----------------------+-----------------+------------------+--------+-------------+ ``` ## Microbenchmarks These microbenchmarks are very favourable for VWA since the strings created have a length of 71, so they are embedded in VWA and allocated on the malloc heap for master. ``` +---------------------+-----------------+------------------+---------+-------------+ | | Branch (VWA on) | Branch (VWA off) | Master | VWA speedup | +--------------------+-----------------+------------------+---------+-------------+ | String#== (i/s) | 1.806k | 1.334k | 1.337k | 1.35x | | String#times (i/s) | 6.010M | 5.238M | 5.247M | 1.15x | | String#[]= (i/s) | 1.031k | 863.816 | 897.915 | 1.15x | +--------------------+-----------------+------------------+---------+-------------+ ``` {{collapse(Benchmark source code) ```ruby require "bundler/inline" gemfile do source "https://rubygems.org" gem "benchmark-ips" end COUNT = 10_000 strs1 = [] strs2 = [] nine = "9" COUNT.times do strs1 << [*"A".."Z", *"0".."9"].join(" ") strs2 << [*"A".."Z", *"0".."9"].join(" ") end Benchmark.ips do |x| x.report("String#==") do |times| i = 0 while i < times COUNT.times { |i| strs1[i] == strs2[i] } i += 1 end end x.report("String#times") do |times| i = 0 while i < times "a" * 100 i += 1 end end x.report("String#[]=") do |times| i = 0 while i < times strs1.each { |str| str[-1] = nine } i += 1 end end end ``` }} -- https://bugs.ruby-lang.org/ Unsubscribe: