[#115212] [Ruby master Bug#19983] Nested * seems incorrect — "Eregon (Benoit Daloze) via ruby-core" <ruby-core@...>

Issue #19983 has been reported by Eregon (Benoit Daloze).

9 messages 2023/11/01

[#115226] [Ruby master Bug#19984] `make test-bundler-parallel` fails with ` --enable-shared` — "vo.x (Vit Ondruch) via ruby-core" <ruby-core@...>

Issue #19984 has been reported by vo.x (Vit Ondruch).

7 messages 2023/11/02

[#115227] [Ruby master Feature#19985] Support `Pathname` for `require` — "vo.x (Vit Ondruch) via ruby-core" <ruby-core@...>

Issue #19985 has been reported by vo.x (Vit Ondruch).

14 messages 2023/11/02

[#115259] [Ruby master Bug#19990] Could we reconsider the second argument to Kernel#load? — "fxn (Xavier Noria) via ruby-core" <ruby-core@...>

SXNzdWUgIzE5OTkwIGhhcyBiZWVuIHJlcG9ydGVkIGJ5IGZ4biAoWGF2aWVyIE5vcmlhKS4NDQoN

9 messages 2023/11/06

[#115304] [Ruby master Feature#19993] Optionally Free all memory at exit — "HParker (Adam Hess) via ruby-core" <ruby-core@...>

Issue #19993 has been reported by HParker (Adam Hess).

8 messages 2023/11/08

[#115333] [Ruby master Misc#19997] DevMeeting-2023-11-30 — "mame (Yusuke Endoh) via ruby-core" <ruby-core@...>

Issue #19997 has been reported by mame (Yusuke Endoh).

15 messages 2023/11/10

[#115334] [Ruby master Feature#19998] Emit deprecation warnings when the old (non-Typed) Data_XXX API is used — "byroot (Jean Boussier) via ruby-core" <ruby-core@...>

Issue #19998 has been reported by byroot (Jean Boussier).

12 messages 2023/11/10

[#115388] [Ruby master Feature#20005] Add C API to return symbols of native extensions resolved from features — "tagomoris (Satoshi Tagomori) via ruby-core" <ruby-core@...>

Issue #20005 has been reported by tagomoris (Satoshi Tagomori).

14 messages 2023/11/14

[#115422] [Ruby master Bug#20009] Marshal.load raises exception when load dumped class include non-ASCII — "ippachi (Kazuya Hatanaka) via ruby-core" <ruby-core@...>

SXNzdWUgIzIwMDA5IGhhcyBiZWVuIHJlcG9ydGVkIGJ5IGlwcGFjaGkgKEthenV5YSBIYXRhbmFr

14 messages 2023/11/19

[#115428] [Ruby master Feature#20011] Reduce implicit array allocations on caller side of method calling — "jeremyevans0 (Jeremy Evans) via ruby-core" <ruby-core@...>

Issue #20011 has been reported by jeremyevans0 (Jeremy Evans).

8 messages 2023/11/20

[#115438] [Ruby master Misc#20013] Travis CI status — "jaruga (Jun Aruga) via ruby-core" <ruby-core@...>

Issue #20013 has been reported by jaruga (Jun Aruga).

51 messages 2023/11/21

[#115484] [Ruby master Bug#20022] GC.verify_compaction_references does not actually move alll objects — "kjtsanaktsidis (KJ Tsanaktsidis) via ruby-core" <ruby-core@...>

Issue #20022 has been reported by kjtsanaktsidis (KJ Tsanaktsidis).

7 messages 2023/11/27

[#115491] [Ruby master Feature#20024] SyntaxError subclasses — "kddnewton (Kevin Newton) via ruby-core" <ruby-core@...>

Issue #20024 has been reported by kddnewton (Kevin Newton).

17 messages 2023/11/27

[#115525] [Ruby master Feature#20027] Range Deconstruction — "stuyam (Stuart Yamartino) via ruby-core" <ruby-core@...>

Issue #20027 has been reported by stuyam (Stuart Yamartino).

8 messages 2023/11/28

[#115552] [Ruby master Misc#20032] Propose @kjtsanaktsidis as a commiter — "jeremyevans0 (Jeremy Evans) via ruby-core" <ruby-core@...>

Issue #20032 has been reported by jeremyevans0 (Jeremy Evans).

15 messages 2023/11/30

[ruby-core:115550] [Ruby master Bug#20031] Regexp using greedy quantifier and unions on a big string uses a lot of memory

From: "FabienChaynes (Fabien Chaynes) via ruby-core" <ruby-core@...>
Date: 2023-11-30 17:41:26 UTC
List: ruby-core #115550
Issue #20031 has been reported by FabienChaynes (Fabien Chaynes).

----------------------------------------
Bug #20031: Regexp using greedy quantifier and unions on a big string uses a lot of memory
https://bugs.ruby-lang.org/issues/20031

* Author: FabienChaynes (Fabien Chaynes)
* Status: Open
* Priority: Normal
* ruby -v: ruby 3.2.2 (2023-03-30 revision e51014f9c0) [x86_64-linux]
* Backport: 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN
----------------------------------------
Trying to match on some regexp using a greedy quantifier on any character (`.*`) and unions (`foo|bar`) uses a lot of memory if the string we want to match on is big.

# Reproduction code

```ruby
puts RUBY_DESCRIPTION

unless File.exist?("/tmp/fake_email.eml")
  content = "Accept-Language: fr-FR, en-US#{"0" * 50_000_000}"
  File.write("/tmp/fake_email.eml", content)
end

content = File.read("/tmp/fake_email.eml")

def print_size(context, sike_kb)
  puts "#{context}: #{(sike_kb / 1024.0).round(1)} MB"
end

print_size("baseline", `ps ax -o pid,rss | grep -E "^[[:space:]]*#{$$}"`.strip.split.map(&:to_i).last)

print_size("content", content.bytesize / 1024)

REGEXP_UNION = Regexp.union(
  "Accept-Language:",
  "Thread-Topic:",
).freeze
reg = /.*#{REGEXP_UNION}/
p reg

GC.start
print_size("before match?", `ps ax -o pid,rss | grep -E "^[[:space:]]*#{$$}"`.strip.split.map(&:to_i).last)
allocated = GC.stat(:total_allocated_objects)

p content.match(reg)

p GC.stat(:total_allocated_objects) - allocated
GC.start
print_size("after match?", `ps ax -o pid,rss | grep -E "^[[:space:]]*#{$$}"`.strip.split.map(&:to_i).last)
p "---"
```

Output:
```bash
$ /usr/bin/time -v ruby full_repro.rb
ruby 3.2.2 (2023-03-30 revision e51014f9c0) [x86_64-linux]
baseline: 71.3 MB
content: 47.7 MB
/.*(?-mix:Accept\-Language:|Thread\-Topic:)/
before match?: 71.3 MB
#<MatchData "Accept-Language:">
14
after match?: 71.6 MB
"---"
  Command being timed: "ruby full_repro.rb"
  User time (seconds): 1.15
  System time (seconds): 0.74
  Percent of CPU this job got: 100%
  Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.89
  Average shared text size (kbytes): 0
  Average unshared data size (kbytes): 0
  Average stack size (kbytes): 0
  Average total size (kbytes): 0
  Maximum resident set size (kbytes): 2423108
  Average resident set size (kbytes): 0
  Major (requiring I/O) page faults: 0
  Minor (reclaiming a frame) page faults: 609575
  Voluntary context switches: 52
  Involuntary context switches: 12
  Swaps: 0
  File system inputs: 0
  File system outputs: 0
  Socket messages sent: 0
  Socket messages received: 0
  Signals delivered: 0
  Page size (bytes): 4096
  Exit status: 0
```

strace:
```bash
$ sudo strace -p 588834
strace: Process 588834 attached
[...]
mmap(NULL, 249856, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f8e92dd3000
mremap(0x7f8e92dd3000, 249856, 495616, MREMAP_MAYMOVE) = 0x7f8e92d5a000
mremap(0x7f8e92d5a000, 495616, 987136, MREMAP_MAYMOVE) = 0x7f8e92c69000
mremap(0x7f8e92c69000, 987136, 1970176, MREMAP_MAYMOVE) = 0x7f8e92a88000
mremap(0x7f8e92a88000, 1970176, 3936256, MREMAP_MAYMOVE) = 0x7f8e926c7000
mremap(0x7f8e926c7000, 3936256, 7868416, MREMAP_MAYMOVE) = 0x7f8e91f46000
mremap(0x7f8e91f46000, 7868416, 15732736, MREMAP_MAYMOVE) = 0x7f8e91045000
mremap(0x7f8e91045000, 15732736, 31461376, MREMAP_MAYMOVE) = 0x7f8e8a8af000
mremap(0x7f8e8a8af000, 31461376, 62918656, MREMAP_MAYMOVE) = 0x7f8e86cae000
mremap(0x7f8e86cae000, 62918656, 125833216, MREMAP_MAYMOVE) = 0x7f8e7f4ad000
mremap(0x7f8e7f4ad000, 125833216, 251662336, MREMAP_MAYMOVE) = 0x7f8e704ac000
mremap(0x7f8e704ac000, 251662336, 503320576, MREMAP_MAYMOVE) = 0x7f8e524ab000
mremap(0x7f8e524ab000, 503320576, 1006637056, MREMAP_MAYMOVE) = 0x7f8e164aa000
mremap(0x7f8e164aa000, 1006637056, 2013270016, MREMAP_MAYMOVE) = 0x7f8d9e4a9000
mremap(0x7f8d9e4a9000, 2013270016, 4026535936, MREMAP_MAYMOVE) = 0x7f8cae4a8000
mmap(NULL, 12500992, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f8e92224000
munmap(0x7f8cae4a8000, 4026535936)      = 0
munmap(0x7f8e92224000, 12500992)        = 0
writev(1, [{iov_base="#<MatchData \"Accept-Language:\">", iov_len=31}, {iov_base="\n", iov_len=1}], 2) = 32
[...]
```

We can see that the memory retained by Ruby is low (71.6 MB), but `content.match(reg)` used 2_423_108 kbytes (~2.4 GB) in the regex engine.

# Findings

To reproduce we need:
- a greedy quantifier on any character (`.*`)
- a union with at least two members (`foo|bar`). More members won't increase the memory consumption further
- to perform the match on a long string

The memory seems to increase linearly against the string size:
|String size |Memory consumed |
|--|--|
| 11.9 MB | 629_988 kbytes |
| 23.8 MB | 1_235_976 kbytes |
| 47.7 MB | 2_423_076 kbytes |

I was able to reproduce it on old Ruby versions as well.

@byroot helped me to put all this together and according to his investigation the memory allocation would be coming from `regexec.c:4100` (https://github.com/ruby/ruby/blob/85092ecd6f5c4d12d0cb1d6dfa7040337a4f558b/regexec.c#L4100):
```
size_t match_cache_buf_length = (num_match_cache_points >> 3) + (num_match_cache_points & 7 ? 1 : 0) + 1;
uint8_t* match_cache_buf = (uint8_t*)xmalloc(match_cache_buf_length * sizeof(uint8_t));
```

It seems surprising to use that much memory compared to the string size. Is it expected?



-- 
https://bugs.ruby-lang.org/
 ______________________________________________
 ruby-core mailing list -- ruby-core@ml.ruby-lang.org
 To unsubscribe send an email to ruby-core-leave@ml.ruby-lang.org
 ruby-core info -- https://ml.ruby-lang.org/mailman3/postorius/lists/ruby-core.ml.ruby-lang.org/

In This Thread

Prev Next