From: jiri.marsik@...
Date: 2021-06-28T09:09:48+00:00
Subject: [ruby-core:104422] [Ruby master Bug#18009] Regexps \w and \W with /i option and /u option produce inconsistent results under nested negation and intersection

Issue #18009 has been reported by jirkamarsik (Jirka Marsik).

----------------------------------------
Bug #18009: Regexps \w and \W with /i option and /u option produce inconsistent results under nested negation and intersection
https://bugs.ruby-lang.org/issues/18009

* Author: jirkamarsik (Jirka Marsik)
* Status: Open
* Priority: Normal
* ruby -v: ruby 3.0.1p64 (2021-04-05 revision 0fb782ee38) [x86_64-linux]
* Backport: 2.6: UNKNOWN, 2.7: UNKNOWN, 3.0: UNKNOWN
----------------------------------------
This is a follow up to [issue 4044](https://bugs.ruby-lang.org/issues/4044). Its fix (https://github.com/k-takata/Onigmo/issues/4) handled the cases that were reported in the original issue, but there are other cases, which were omitted and now produce inconsistent results.

If the `\w` character set is used inside a nested negated character class, it will not be picked up by the part of the character class analyzer that's responsible for limiting the case-folding of certain character sets (like `\w` and `\W`) across the ASCII boundary. We then end up with the situation where `/[^\w]/iu` and `/[[^\w]]/iu` match different sets of characters.

```
irb(main):001:0> ("a".."z").to_a.join.scan(/\W/iu)
=> []
irb(main):002:0> ("a".."z").to_a.join.scan(/[^\w]/iu)
=> []
irb(main):003:0> ("a".."z").to_a.join.scan(/[[^\w]]/iu)
=> ["k", "s"]
```

This can also be demonstrated using the inverted matcher:

```
irb(main):004:0> ("a".."z").to_a.join.scan(/\w/iu).length
=> 26
irb(main):005:0> ("a".."z").to_a.join.scan(/[^[^\w]]/iu).length
=> 24
```

A similar issue also arises when using character class intersection.
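The nested-negation checks above can be bundled into a small script that compares pairs of patterns which ought to match the same characters. This is only a sketch: the helper names `matched_chars` and `inconsistencies` and the constant `EQUIVALENT_PAIRS` are made up for illustration, and what it prints depends on the Ruby build it runs on.

```ruby
# Hypothetical helper: every substring of `text` matched by `pattern`.
def matched_chars(pattern, text)
  text.scan(pattern)
end

LOWER = ("a".."z").to_a.join

# Each pair is expected to describe the same set of characters.
EQUIVALENT_PAIRS = [
  [/[^\w]/iu, /[[^\w]]/iu],
  [/\w/iu,    /[^[^\w]]/iu],
]

# Hypothetical helper: the pairs whose matches on `text` differ.
def inconsistencies(pairs, text)
  pairs.reject { |a, b| matched_chars(a, text) == matched_chars(b, text) }
end

inconsistencies(EQUIVALENT_PAIRS, LOWER).each do |a, b|
  puts "#{a.inspect} and #{b.inspect} disagree: " \
       "#{matched_chars(a, LOWER).inspect} vs #{matched_chars(b, LOWER).inspect}"
end
```

On a build without the inconsistency, the script prints nothing.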
The idea behind the pattern compiler's analysis is that characters are allowed to case-fold across the ASCII boundary only if they are included in the character class by some other means than just being included in `\w` (or in one of several other character sets which have special treatment). Therefore, in the examples below, `/[\w]/iu` will not match the Kelvin sign `\u212a`, because that would mean crossing the ASCII boundary from `k` to `\u212a`. However, `/[kx]/iu` will match the Kelvin sign, because the `k` was not contributed by `\w` and therefore is not subject to the ASCII boundary restriction. (We have to use `/[kx]/iu` instead of `/[k]/iu` in our examples, or else the pattern analyzer would replace `[k]` with `k` and follow a different code path.)

```
irb(main):006:0> /[\w]/iu.match("\u212a")
=> nil
irb(main):007:0> /[kx]/iu.match("\u212a")
=> #<MatchData "K">
```

The problem arises when we intersect these two character sets. Since `[kx]` is a subset of `\w`, we would expect their intersection to behave the same as `[kx]`, but that is not the case.

```
irb(main):008:0> /[\w&&kx]/i.match("\u212a")
=> nil
```

The underlying issue in these cases is the manner in which the `ascCc` character set is computed during the parsing of character classes. The `ascCc` character set should contain all characters of the character class except those which were contributed by `\w` and similar character sets. In practice, these character sets are essentially ignored when `ascCc` is calculated. That works well for set union and top-most negation (which is handled explicitly), but it doesn't handle nested set negation and set intersection.
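To illustrate why ignoring `\w` is fine for union but not for intersection or nested negation, here is a toy model using Ruby's `Set`. It is not Onigmo's code: the tiny `UNIVERSE`, the `[cc, asc]` pair representation, and all helper names are assumptions made for this sketch. `cc` stands for the full class, and `asc` plays the role of `ascCc` when it is computed by treating `\w` as empty.

```ruby
require "set"

KELVIN   = "\u212a"
# Toy universe: three ASCII word characters, one ASCII non-word
# character, and the Kelvin sign.
UNIVERSE = Set["k", "x", "a", "-", KELVIN]
WORD     = Set["k", "x", "a"] # stands in for \w

# A class is modelled as the pair [cc, asc].
def word_class
  [WORD, Set[]] # \w contributes to cc but is ignored for asc
end

def literal(*chars)
  s = Set.new(chars)
  [s, s] # explicitly listed characters count in both sets
end

def union(a, b)
  [a[0] | b[0], a[1] | b[1]]
end

def intersect(a, b)
  [a[0] & b[0], a[1] & b[1]]
end

def negate(a)
  [UNIVERSE - a[0], UNIVERSE - a[1]]
end

# Union, like [\wkx]: asc keeps exactly the explicit k and x.
p union(word_class, literal("k", "x"))[1].to_a

# Intersection, like [\w&&kx]: asc comes out empty, although k and x
# were written explicitly.
p intersect(word_class, literal("k", "x"))[1].to_a

# Nested negation, like [[^\w]]: asc becomes the whole universe,
# although nothing in the class was written explicitly.
p negate(word_class)[1].to_a
```

In this model the intersection loses the explicit `k`, which is consistent with `/[\w&&kx]/` refusing to match the Kelvin sign, and the nested negation treats every character as explicit, which is consistent with `/[[^\w]]/iu` folding across the ASCII boundary.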