From: jiri.marsik@...
Date: 2021-06-28T09:09:48+00:00
Subject: [ruby-core:104422] [Ruby master Bug#18009] Regexps \w and \W with /i option and /u option produce inconsistent results under nested negation and intersection

Issue #18009 has been reported by jirkamarsik (Jirka Marsik).

----------------------------------------
Bug #18009: Regexps \w and \W with /i option and /u option produce inconsistent results under nested negation and intersection
https://bugs.ruby-lang.org/issues/18009

* Author: jirkamarsik (Jirka Marsik)
* Status: Open
* Priority: Normal
* ruby -v: ruby 3.0.1p64 (2021-04-05 revision 0fb782ee38) [x86_64-linux]
* Backport: 2.6: UNKNOWN, 2.7: UNKNOWN, 3.0: UNKNOWN
----------------------------------------
This is a follow up to [issue 4044](https://bugs.ruby-lang.org/issues/4044). Its fix (https://github.com/k-takata/Onigmo/issues/4) handled the cases that were reported in the original issue, but there are other cases, which were omitted and now produce inconsistent results.

If the `\w` character set is used inside a nested negated character class, it will not be picked up by the part of the character class analyzer that's responsible for limiting the case-folding of certain character sets (like `\w` and `\W`) across the ASCII boundary. We then end up with the situation where `/[^\w]/iu` and `/[[^\w]]/iu` match different sets of characters.

```
irb(main):001:0> ("a".."z").to_a.join.scan(/\W/iu)
=> []
irb(main):002:0> ("a".."z").to_a.join.scan(/[^\w]/iu)
=> []
irb(main):003:0> ("a".."z").to_a.join.scan(/[[^\w]]/iu)
=> ["k", "s"]
```

This can also be demonstrated using the inverted matcher:

```
irb(main):004:0> ("a".."z").to_a.join.scan(/\w/iu).length
=> 26
irb(main):005:0> ("a".."z").to_a.join.scan(/[^[^\w]]/iu).length
=> 24
```

A similar issue also arises when using character class intersection. The idea behind the pattern compiler's analysis is that characters are allowed to case-fold across the ASCII boundary only if they are included in the character class by some other means than just being included in `\w` (or in one of several other character sets which have special treatment). Therefore, in the below, `/[\w]/iu` will not match the Kelvin sign `\u212a`, because that would mean crossing the ASCII boundary from `k` to `\u212a`. However, `/[kx]/iu` will match the Kelvin sign, because the `k` was not contributed by `\w` and therefore is not subject to the ASCII boundary restriction (we have to use `/[kx]/iu` instead of `/[k]/iu` in our examples, or else the pattern analyzer would replace `[k]` with `k` and follow a different code path).

```
irb(main):006:0> /[\w]/iu.match("\u212a")
=> nil
irb(main):007:0> /[kx]/iu.match("\u212a")
=> #<MatchData "���">
```

The problem then is when we perform an intersection of these two character sets. Since `[kx]` is a subset of `\w`, we would expect their intersection to behave the same as `[kx]`, but that is not the case.

```
irb(main):008:0> /[\w&&kx]/i.match("\u212a")
=> nil
```

The underlying issue in these cases is the manner in which the `ascCc` character set is computed during the parsing of character classes. The `ascCc` character set should contain all characters of the character class except those which were contributed by `\w` and similar character sets. This is done in a way that these character sets are essentially ignored in the calculation of `ascCc`, which works well for set union and top-most negation (which is handled explicitly), but it doesn't handle nested set negation and set intersection.



-- 
https://bugs.ruby-lang.org/

Unsubscribe: <mailto:ruby-core-request@ruby-lang.org?subject=unsubscribe>
<http://lists.ruby-lang.org/cgi-bin/mailman/options/ruby-core>