[ruby-core:104276] [Ruby master Bug#17990] Inconsistent behavior of Regexp quantifiers over characters with complex case foldings
From: jiri.marsik@...
Date: 2021-06-15 11:59:28 UTC
List: ruby-core #104276
Issue #17990 has been reported by jirkamarsik (Jirka Marsik).
----------------------------------------
Bug #17990: Inconsistent behavior of Regexp quantifiers over characters with complex case foldings
https://bugs.ruby-lang.org/issues/17990
* Author: jirkamarsik (Jirka Marsik)
* Status: Open
* Priority: Normal
* ruby -v: ruby 3.0.1p64 (2021-04-05 revision 0fb782ee38) [x86_64-linux]
* Backport: 2.6: UNKNOWN, 2.7: UNKNOWN, 3.0: UNKNOWN
----------------------------------------
With case-insensitive Regexps, the string `"ff"` is considered equal to the string `"\ufb00"`, which consists of a single ligature character (U+FB00, LATIN SMALL LIGATURE FF).
```
irb(main):001:0> /ff/i.match("\ufb00")
=> #<MatchData "ff">
```
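The equivalence follows from Unicode full case folding, under which U+FB00 folds to two characters. A minimal sketch, assuming `String#downcase(:fold)` (available since Ruby 2.4) as a stand-in for the folding the Regexp engine applies; the engine uses its own internal tables, so this is only an illustration and the variable names are hypothetical:

```ruby
# Illustration only: String#downcase(:fold) is used here as a stand-in
# for the case folding that the Regexp engine performs internally.
ligature = "\ufb00"                 # LATIN SMALL LIGATURE FF, one character
folded   = ligature.downcase(:fold) # full Unicode case folding

puts "characters before folding: #{ligature.length}"
puts "characters after folding:  #{folded.length}"
puts "folds to \"ff\":             #{folded == 'ff'}"
```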
This behavior also persists when the string `"ff"` doesn't appear literally in the Regexp source but is expressed using a fixed-length quantifier, as in the following:
```
irb(main):002:0> /f{2}/i.match("\ufb00")
=> #<MatchData "ff">
irb(main):003:0> /f{2,2}/i.match("\ufb00")
=> #<MatchData "ff">
```
However, this doesn't hold in general. When using other quantifiers, the ligature character `"\ufb00"` is not recognized as a sequence of two `"f"` characters.
```
irb(main):004:0> /f*/i.match("\ufb00")
=> #<MatchData "">
irb(main):005:0> /f+/i.match("\ufb00")
=> nil
irb(main):006:0> /f{1,}/i.match("\ufb00")
=> nil
irb(main):007:0> /f{1,2}/i.match("\ufb00")
=> nil
irb(main):008:0> /f{,2}/i.match("\ufb00")
=> #<MatchData "">
irb(main):009:0> /ff?/i.match("\ufb00")
=> nil
```
This leads to inconsistent behavior where a Regexp like `/f{1,2}/i` matches *fewer* strings than the stricter Regexp `/f{2,2}/i`.
I suspect that this is caused by the pattern analyzer directly expanding `/f{2}/i` and `/f{2,2}/i` into `/ff/i`. However, this optimization then changes the semantics of the Regexp, as it is otherwise impossible to match a single ligature character via multiple repetitions of a quantified expression.
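To compare the quantifiers side by side on a particular Ruby build, something like the following could be used. It is a rough sketch: the `patterns` and `results` names are hypothetical, and because the outcome depends on the Ruby version, the script only prints what it observes and asserts nothing about which patterns match.

```ruby
# Apply each pattern from the report to the single ligature character
# and record the matched text (or nil when there is no match).
# Results are version-dependent, so they are printed, not asserted.
ligature = "\ufb00"

patterns = {
  "ff"     => /ff/i,
  "f{2}"   => /f{2}/i,
  "f{2,2}" => /f{2,2}/i,
  "f*"     => /f*/i,
  "f+"     => /f+/i,
  "f{1,}"  => /f{1,}/i,
  "f{1,2}" => /f{1,2}/i,
  "f{,2}"  => /f{,2}/i,
  "ff?"    => /ff?/i
}

results = patterns.transform_values do |regexp|
  match = regexp.match(ligature)
  match && match[0]
end

results.each do |source, matched|
  puts "/#{source}/i => #{matched.inspect}"
end
```

If the quantifiers were handled consistently, every pattern that can consume two `f` characters would report the same matched text.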
While experimenting with this case, I also discovered a related issue, caused by the problematic expansion of `/f{n}/i` together with the issue reported at https://bugs.ruby-lang.org/issues/17989.
These match:
```
/f{100}/i.match("f" * 100)
/f{100}/i.match("\ufb00" * 50)
/f{100}/i.match("\ufb00" * 49 + "ff")
/f{100}/i.match("ff" + "\ufb00" * 49)
```
However, this doesn't match:
```
/f{100}/i.match("f" + "\ufb00" * 49 + "f")
```
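All five subject strings above are the same string once case folding is applied, which is why the differing results are surprising. A small check, again assuming `String#downcase(:fold)` as an approximation of the engine's folding (the `candidates` and `target` names are hypothetical):

```ruby
# Each candidate from the report folds to exactly one hundred "f"
# characters, even though their unfolded lengths differ.
# String#downcase(:fold) only approximates the Regexp engine's folding.
target = "f" * 100

candidates = {
  "100 x f"                 => "f" * 100,
  "50 x ligature"           => "\ufb00" * 50,
  "49 x ligature + ff"      => "\ufb00" * 49 + "ff",
  "ff + 49 x ligature"      => "ff" + "\ufb00" * 49,
  "f + 49 x ligature + f"   => "f" + "\ufb00" * 49 + "f"
}

candidates.each do |label, string|
  same = string.downcase(:fold) == target
  puts "#{label}: #{string.length} characters, folds to target: #{same}"
end
```

The last candidate differs from the others only in where the ligature boundaries fall relative to pairs of `f` characters.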
--
https://bugs.ruby-lang.org/
Unsubscribe: <mailto:ruby-core-request@ruby-lang.org?subject=unsubscribe>
<http://lists.ruby-lang.org/cgi-bin/mailman/options/ruby-core>