From: "Eregon (Benoit Daloze) via ruby-core" Date: 2023-12-01T12:24:23+00:00 Subject: [ruby-core:115562] [Ruby master Bug#20031] Regexp using greedy quantifier and unions on a big string uses a lot of memory Issue #20031 has been updated by Eregon (Benoit Daloze). I played around a bit and both time and memory seem linear in the size of the input string on CRuby. However the memory usage is indeed huge, ~2GB for 50MB string, ~4GB for 100MB string, ~8GB for 200MB string, etc. I tried it on TruffleRuby which uses TRegex (a DFA regexp engine) for curiosity, and that uses 235MB for 50MB string, 334MB for 100MB string, 530MB for 200MB string. (with `/usr/bin/time -v ruby --experimental-options --engine.Compilation=false file.rb` to exclude the JIT memory to get more stable numbers). ---------------------------------------- Bug #20031: Regexp using greedy quantifier and unions on a big string uses a lot of memory https://bugs.ruby-lang.org/issues/20031#change-105497 * Author: FabienChaynes (Fabien Chaynes) * Status: Open * Priority: Normal * ruby -v: ruby 3.2.2 (2023-03-30 revision e51014f9c0) [x86_64-linux] * Backport: 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN ---------------------------------------- Trying to match on some regexp using a greedy quantifier on any character (`.*`) and unions (`foo|bar`) uses a lot of memory if the string we want to match on is big. # Reproduction code ```ruby puts RUBY_DESCRIPTION unless File.exist?("/tmp/fake_email.eml") content = "Accept-Language: fr-FR, en-US#{"0" * 50_000_000}" File.write("/tmp/fake_email.eml", content) end content = File.read("/tmp/fake_email.eml") def print_size(context, sike_kb) puts "#{context}: #{(sike_kb / 1024.0).round(1)} MB" end print_size("baseline", `ps ax -o pid,rss | grep -E "^[[:space:]]*#{$$}"`.strip.split.map(&:to_i).last) print_size("content", content.bytesize / 1024) REGEXP_UNION = Regexp.union( "Accept-Language:", "Thread-Topic:", ).freeze reg = /.*#{REGEXP_UNION}/ p reg GC.start print_size("before match?", `ps ax -o pid,rss | grep -E "^[[:space:]]*#{$$}"`.strip.split.map(&:to_i).last) allocated = GC.stat(:total_allocated_objects) p content.match(reg) p GC.stat(:total_allocated_objects) - allocated GC.start print_size("after match?", `ps ax -o pid,rss | grep -E "^[[:space:]]*#{$$}"`.strip.split.map(&:to_i).last) p "---" ``` Output: ```bash $ /usr/bin/time -v ruby full_repro.rb ruby 3.2.2 (2023-03-30 revision e51014f9c0) [x86_64-linux] baseline: 71.3 MB content: 47.7 MB /.*(?-mix:Accept\-Language:|Thread\-Topic:)/ before match?: 71.3 MB # 14 after match?: 71.6 MB "---" Command being timed: "ruby full_repro.rb" User time (seconds): 1.15 System time (seconds): 0.74 Percent of CPU this job got: 100% Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.89 Average shared text size (kbytes): 0 Average unshared data size (kbytes): 0 Average stack size (kbytes): 0 Average total size (kbytes): 0 Maximum resident set size (kbytes): 2423108 Average resident set size (kbytes): 0 Major (requiring I/O) page faults: 0 Minor (reclaiming a frame) page faults: 609575 Voluntary context switches: 52 Involuntary context switches: 12 Swaps: 0 File system inputs: 0 File system outputs: 0 Socket messages sent: 0 Socket messages received: 0 Signals delivered: 0 Page size (bytes): 4096 Exit status: 0 ``` strace: ```bash $ sudo strace -p 588834 strace: Process 588834 attached [...] mmap(NULL, 249856, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f8e92dd3000 mremap(0x7f8e92dd3000, 249856, 495616, MREMAP_MAYMOVE) = 0x7f8e92d5a000 mremap(0x7f8e92d5a000, 495616, 987136, MREMAP_MAYMOVE) = 0x7f8e92c69000 mremap(0x7f8e92c69000, 987136, 1970176, MREMAP_MAYMOVE) = 0x7f8e92a88000 mremap(0x7f8e92a88000, 1970176, 3936256, MREMAP_MAYMOVE) = 0x7f8e926c7000 mremap(0x7f8e926c7000, 3936256, 7868416, MREMAP_MAYMOVE) = 0x7f8e91f46000 mremap(0x7f8e91f46000, 7868416, 15732736, MREMAP_MAYMOVE) = 0x7f8e91045000 mremap(0x7f8e91045000, 15732736, 31461376, MREMAP_MAYMOVE) = 0x7f8e8a8af000 mremap(0x7f8e8a8af000, 31461376, 62918656, MREMAP_MAYMOVE) = 0x7f8e86cae000 mremap(0x7f8e86cae000, 62918656, 125833216, MREMAP_MAYMOVE) = 0x7f8e7f4ad000 mremap(0x7f8e7f4ad000, 125833216, 251662336, MREMAP_MAYMOVE) = 0x7f8e704ac000 mremap(0x7f8e704ac000, 251662336, 503320576, MREMAP_MAYMOVE) = 0x7f8e524ab000 mremap(0x7f8e524ab000, 503320576, 1006637056, MREMAP_MAYMOVE) = 0x7f8e164aa000 mremap(0x7f8e164aa000, 1006637056, 2013270016, MREMAP_MAYMOVE) = 0x7f8d9e4a9000 mremap(0x7f8d9e4a9000, 2013270016, 4026535936, MREMAP_MAYMOVE) = 0x7f8cae4a8000 mmap(NULL, 12500992, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f8e92224000 munmap(0x7f8cae4a8000, 4026535936) = 0 munmap(0x7f8e92224000, 12500992) = 0 writev(1, [{iov_base="#", iov_len=31}, {iov_base="\n", iov_len=1}], 2) = 32 [...] ``` We can see that the memory retained by Ruby is low (71.6 MB), but `content.match(reg)` used 2_423_108 kbytes (~2.4 GB) in the regex engine. # Findings To reproduce we need: - a greedy quantifier on any character (`.*`) - a union with at least two members (`foo|bar`). More members won't increase the memory consumption further - to perform the match on a long string The memory seems to increase linearly against the string size: |String size |Memory consumed | |--|--| | 11.9 MB | 629_988 kbytes | | 23.8 MB | 1_235_976 kbytes | | 47.7 MB | 2_423_076 kbytes | I was able to reproduce it on old Ruby versions as well. @byroot helped me to put all this together and according to his investigation the memory allocation would be coming from `regexec.c:4100` (https://github.com/ruby/ruby/blob/85092ecd6f5c4d12d0cb1d6dfa7040337a4f558b/regexec.c#L4100): ``` size_t match_cache_buf_length = (num_match_cache_points >> 3) + (num_match_cache_points & 7 ? 1 : 0) + 1; uint8_t* match_cache_buf = (uint8_t*)xmalloc(match_cache_buf_length * sizeof(uint8_t)); ``` It seems surprising to use that much memory compared to the string size. Is it expected? -- https://bugs.ruby-lang.org/ ______________________________________________ ruby-core mailing list -- ruby-core@ml.ruby-lang.org To unsubscribe send an email to ruby-core-leave@ml.ruby-lang.org ruby-core info -- https://ml.ruby-lang.org/mailman3/postorius/lists/ruby-core.ml.ruby-lang.org/