From: jiri.marsik@...
Date: 2021-06-29T12:05:15+00:00
Subject: [ruby-core:104440] [Ruby master Bug#18013] Unexpected results when mixing negated character classes and case-folding

Issue #18013 has been updated by jirkamarsik (Jirka Marsik).

duerst (Martin Dürst) wrote in #note-2:
> Just a question: What's the purpose of nested character classes?

They are useful in combination with the set intersection operator `&&`. They let you, e.g., exclude characters from some character set, as in the example below, which matches all lowercase letters except the English vowels `aeiou`.

```
irb(main):001:0> /[\p{Ll}&&[^aeiou]]/u.match("a")
=> nil
irb(main):002:0> /[\p{Ll}&&[^aeiou]]/u.match("b")
=> #<MatchData "b">
irb(main):003:0> /[\p{Ll}&&[^aeiou]]/u.match(".")
=> nil
irb(main):004:0> /[\p{Ll}&&[^aeiou]]/u.match("��")
=> #<MatchData "��">
```

----------------------------------------
Bug #18013: Unexpected results when mixing negated character classes and case-folding
https://bugs.ruby-lang.org/issues/18013#change-92692

* Author: jirkamarsik (Jirka Marsik)
* Status: Open
* Priority: Normal
* ruby -v: ruby 3.0.1p64 (2021-04-05 revision 0fb782ee38) [x86_64-linux]
* Backport: 2.6: UNKNOWN, 2.7: UNKNOWN, 3.0: UNKNOWN
----------------------------------------
```
irb(main):001:0> /[^a-c]/i.match("A")
=> nil
irb(main):002:0> /[[^a-c]]/i.match("A")
=> #<MatchData "A">
```

The two regular expressions above match different strings, because the character classes denote different sets of characters. In order for `/[^a-c]/i` to produce correct results, Oniguruma introduced a fix that is still easy to spot in the code, because it is guarded by an always-on preprocessor flag (`CASE_FOLD_IS_APPLIED_INSIDE_NEGATIVE_CCLASS`, https://github.com/ruby/ruby/blob/9eae8cdefba61e9e51feb30a4b98525593169666/regparse.c#L5528). The idea of the fix is to first case-fold a character class and only then apply the negation (essentially moving the case-fold operator *inside* the negation).
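
To make the order of operations concrete, here is a rough model using plain Ruby sets. It is only a sketch and not how Oniguruma represents character classes: the toy `UNIVERSE`, and the helpers `case_fold` and `negate`, are hypothetical names introduced for illustration, and folding is approximated with ASCII-only `swapcase`.

```ruby
require "set"

# Hypothetical toy universe: ASCII letters plus two non-letters.
UNIVERSE = (("a".."z").to_a + ("A".."Z").to_a + %w[. 1]).to_set

# Approximate case-folding: add the other-case form of every member.
def case_fold(set)
  set.flat_map { |c| [c, c.swapcase] }.to_set
end

# Negation relative to the toy universe.
def negate(set)
  UNIVERSE - set
end

base = ("a".."c").to_set

# Order used by the fix: fold first, then negate.
fold_then_negate = negate(case_fold(base))

# Opposite order: negate first, then fold.
negate_then_fold = case_fold(negate(base))

puts fold_then_negate.include?("A")  # "A" is excluded together with "a"
puts negate_then_fold.include?("A")  # "A" survives the negation
puts negate_then_fold == UNIVERSE    # folding pulls "a".."c" back in
```

In this model, folding before negating excludes both cases of `a-c`, while negating first leaves `A-C` in the set, and folding then reintroduces `a-c`.
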
In the case of our first regular expression, `[a-c]` is case-folded into `[a-cA-C]` and that is then inverted into `[^a-cA-C]`, which is the expected result.

However, this case-folding logic is currently applied only to the top-most character class, so if we use a nested negated character class, the order of the operations is switched. With our second regular expression, `[a-c]` is first negated to yield `[^a-c]`, which is then case-folded into `.`, the set of all characters (since `[^a-c]` contains `A-C`, which case-fold into `a-c`).

A way to fix this would be to apply the same case-folding logic to nested character classes as well, so that they behave the same as the top-most character class. Then we would get the same semantics for both expressions.
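
A small script along the following lines could be used to check whether a given Ruby build treats the top-level and the nested negated class the same way. The sample strings are arbitrary choices for illustration, and the result depends on the interpreter version, so no particular output is assumed here.

```ruby
# Compare a top-level negated class with the same class nested one level deeper.
top_level = /[^a-c]/i
nested    = /[[^a-c]]/i

# Arbitrary sample inputs: both cases of letters inside and outside a-c, plus a non-letter.
samples = %w[a A c C d D .]

# Collect the inputs on which the two patterns give different answers.
disagreements = samples.reject { |s| top_level.match?(s) == nested.match?(s) }

if disagreements.empty?
  puts "both patterns agree on all samples"
else
  puts "patterns disagree on: #{disagreements.inspect}"
end
```

If nested classes were case-folded in the same way as the top-most class, the two patterns would be expected to agree on every sample.
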