[ruby-core:108033] [Ruby master Feature#18653] Use RE2 for Regexp

From: "mame (Yusuke Endoh)" <noreply@...>
Date: 2022-03-23 02:47:36 UTC
List: ruby-core #108033
Issue #18653 has been updated by mame (Yusuke Endoh).


vo.x (Vit Ondruch) wrote in #note-1:
> Could you please elaborate what was the motivation for this experiment?

My original motivation was a security measure against ReDoS. If RE2 worked well, we might not have to introduce Regexp.timeout= (#17837).

Also, I just wanted to give it a try because @eregon's talk at RubyKaigi 2021 was interesting to me.

----------------------------------------
Feature #18653: Use RE2 for Regexp
https://bugs.ruby-lang.org/issues/18653#change-96992

* Author: mame (Yusuke Endoh)
* Status: Rejected
* Priority: Normal
----------------------------------------
I have tried to use [RE2](https://github.com/google/re2) as Ruby's regular expression engine. As it turns out, it seems difficult to merge it into Ruby right now. But I'd like to share some of my findings for those who may consider doing the same in the future.

## What I did

Here is my prototype.

https://github.com/mame/ruby/tree/re2-prototype

My prototype first attempts to use RE2 for any Regexp matching. If that is impossible for any reason, it falls back to onigmo, the regular expression engine that Ruby currently uses. This hybrid approach is the same as TruffleRuby's. In fact, I was inspired by [@eregon's talk at RubyKaigi takeout 2021](https://rubykaigi.org/2021-takeout/presentations/eregontp.html).

My prototype falls back to onigmo in the following cases.

* A Regexp uses any feature that RE2 does not support
  * For example, lookahead, back references (`\1`), and many advanced features are not supported by RE2.
  * Notably, RE2 does not support a large repetition count like `/a{0,9999}/`, which appears in some real projects.
* The encoding of the match target string is not UTF-8, US-ASCII, or ASCII-8BIT
  * This is because RE2 supports only the UTF-8 and Latin1 encodings.
  * This means that the backend engine is not fixed when the Regexp object is created. It can switch between the two engines depending on the encoding of the match target string.
* A Regexp uses any option but `//m`.
  * Even `//i` falls back to onigmo because case-insensitive matching is incompatible between the two engines. In RE2, `/ss/i =~ "ß"` does not match.
  * We could support `//x` by preprocessing the pattern string to remove whitespace.
* A Regexp includes `^`.
  * In onigmo, `^` matches "the beginning of the string" or "after \n and before any character".
  * In RE2, `^` matches "the beginning of the string" or "after \n".
  * For example, `"abc\n" =~ /^$/` does not match in onigmo, but does in RE2. Some real projects (like rdoc) seem to depend on this onigmo behavior.
* A Regexp includes `\b`.
  * In onigmo, `\b` matches a Unicode-aware word boundary (between a word character and a non-word character).
  * In RE2, `\b` matches only at an ASCII word boundary.
  * For example, `"α " =~ /.\b./` matches in onigmo, but does not in RE2.
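
The fallback conditions listed above can be sketched as a single predicate. This is only an illustration, not code from the prototype: `engine_for`, `unsupported_syntax?`, `RE2_ENCODINGS`, and `MAX_REPEAT` are hypothetical names, the limit of 1000 is an assumption about RE2's repetition cap, and the textual checks cover only the constructs named in the list. They are deliberately conservative, so a false positive merely sends a pattern to onigmo.

```ruby
# Hypothetical sketch: decide which engine a (regexp, string) pair would use
# under the fallback rules described in the list above.
RE2_ENCODINGS = [Encoding::UTF_8, Encoding::US_ASCII, Encoding::ASCII_8BIT].freeze
MAX_REPEAT = 1000 # assumed RE2 repetition limit

def unsupported_syntax?(source)
  return true if source.match?(/\(\?<?[=!]/)   # lookahead / lookbehind
  return true if source.match?(/\\[1-9]|\\k</) # back references
  return true if source.match?(/\^|\\b/)       # constructs whose semantics differ
  # Any {n}, {n,}, or {n,m} whose count exceeds the assumed limit.
  source.scan(/\{(\d*),?(\d*)\}/).flatten.any? { |n| n.to_i > MAX_REPEAT }
end

def engine_for(regexp, string)
  return :onigmo unless RE2_ENCODINGS.include?(string.encoding)
  return :onigmo unless (regexp.options & ~Regexp::MULTILINE).zero? # any option but //m
  return :onigmo if unsupported_syntax?(regexp.source)
  :re2
end

engine_for(/ab+c/, "xabbc")                   # => :re2
engine_for(/a{0,9999}/, "aaa")                # => :onigmo (large repeat)
engine_for(/^$/, "abc\n")                     # => :onigmo (contains ^)
engine_for(/ss/i, "ss")                       # => :onigmo (//i option)
engine_for(/ab+c/, "xabbc".encode("EUC-JP"))  # => :onigmo (encoding)
```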

Also, my prototype applies some preprocessing to the pattern string. For example, it replaces `\s` with `[\t\n\v\f\r\x20]` because `\s` does not match `"\v"` in RE2.
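
A minimal sketch of this kind of rewrite, operating on the pattern source text. The helper name `expand_whitespace_escape` and the constant are hypothetical and not taken from the prototype, and the sketch does not handle `\s` that appears inside a bracket expression.

```ruby
# Pattern text for the explicit class; single quotes keep the backslashes literal.
EXPLICIT_WHITESPACE = '[\t\n\v\f\r\x20]'

# Hypothetical helper: rewrite every \s escape in a pattern source string.
def expand_whitespace_escape(source)
  # Walk the source one escape pair at a time, so that an escaped backslash
  # followed by a plain "s" is not mistaken for the \s escape.
  source.gsub(/\\(.)/m) do
    Regexp.last_match(1) == 's' ? EXPLICIT_WHITESPACE : Regexp.last_match(0)
  end
end

puts expand_whitespace_escape('\d+\s*kg') # prints \d+[\t\n\v\f\r\x20]*kg
```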

## Evaluation

My prototype passes almost all tests in `make test-all` (except some tests that check warning messages emitted by onigmo).

When running `rails new foo && cd foo && rails s`, about half of the Regexps are processed by RE2: 865 unique Regexps are processed by RE2, and 811 unique Regexps by onigmo.

I think it is possible to increase the share handled by RE2 by adding more custom preprocessing, but I am not sure that it would justify the complexity of the new code.

In terms of performance, `make rdoc` takes 20.2s before the patch and 22.6s after the patch ;-( I guess that falling back to onigmo is the main overhead.
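
For reference, the evaluation figures above work out as follows. The helper name `percent` is only illustrative; the inputs are the numbers quoted in this section.

```ruby
# Hypothetical helper: part / whole as a percentage, rounded to one decimal.
def percent(part, whole)
  (part.to_f / whole * 100).round(1)
end

re2_share = percent(865, 865 + 811)     # unique Regexps handled by RE2
rdoc_slowdown = percent(22.6 - 20.2, 20.2) # extra time for `make rdoc`

puts "RE2 share: #{re2_share}%"          # prints RE2 share: 51.6%
puts "rdoc slowdown: #{rdoc_slowdown}%"  # prints rdoc slowdown: 11.9%
```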

## Notes / Problems

* According to `make test-spec`, there are still some minor incompatibilities: for example, `/(a|())*/.match("aaa"); $1 #=> RE2: "a", onigmo: ""`.
* One of the main motivations for using RE2 is as a security measure against ReDoS. However, RE2 supports only UTF-8 and Latin1, so it seems difficult for us to satisfy this motivation (unless we decide to ignore non-ASCII / non-UTF-8 encodings).
* My prototype requires a C++ compiler because RE2 provides only a C++ API.
* RE2 does not seem to provide an interruption API, so we cannot stop RE2 matching with Ctrl+C. (In general, RE2 matching is fast enough, but it can take a long time when the target string is long enough.)
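
The onigmo side of the incompatibilities quoted in this message can be checked in a stock Ruby, which uses onigmo. The sketch below only evaluates the expressions from the message; `onigmo_observations` is a hypothetical helper name, and the RE2 results in the comments are the ones reported above, not something this code runs.

```ruby
# Hypothetical helper: evaluate, with the running Ruby's own engine, the
# expressions the message uses to illustrate RE2/onigmo differences.
def onigmo_observations
  {
    # Message: no match in onigmo, match in RE2.
    caret_after_final_newline: ("abc\n" =~ /^$/),
    # Greek alpha followed by a space. Message: match in onigmo, not in RE2.
    non_ascii_word_boundary: ("\u03B1 " =~ /.\b./),
    # Message: "" in onigmo, "a" in RE2.
    empty_alternative_capture: /(a|())*/.match("aaa")[1]
  }
end

onigmo_observations.each { |name, value| puts "#{name}: #{value.inspect}" }
```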




-- 
https://bugs.ruby-lang.org/

Unsubscribe: <mailto:ruby-core-request@ruby-lang.org?subject=unsubscribe>
<http://lists.ruby-lang.org/cgi-bin/mailman/options/ruby-core>
