[ruby-core:104422] [Ruby master Bug#18009] Regexps \w and \W with /i option and /u option produce inconsistent results under nested negation and intersection

From: jiri.marsik@...
Date: 2021-06-28 09:09:48 UTC
List: ruby-core #104422
Issue #18009 has been reported by jirkamarsik (Jirka Marsik).

----------------------------------------
Bug #18009: Regexps \w and \W with /i option and /u option produce inconsistent results under nested negation and intersection
https://bugs.ruby-lang.org/issues/18009

* Author: jirkamarsik (Jirka Marsik)
* Status: Open
* Priority: Normal
* ruby -v: ruby 3.0.1p64 (2021-04-05 revision 0fb782ee38) [x86_64-linux]
* Backport: 2.6: UNKNOWN, 2.7: UNKNOWN, 3.0: UNKNOWN
----------------------------------------
This is a follow-up to [issue 4044](https://bugs.ruby-lang.org/issues/4044). Its fix (https://github.com/k-takata/Onigmo/issues/4) handled the cases reported in the original issue, but other cases were missed and still produce inconsistent results.

If the `\w` character set is used inside a nested negated character class, it is not picked up by the part of the character class analyzer that limits the case-folding of certain character sets (like `\w` and `\W`) across the ASCII boundary. As a result, `/[^\w]/iu` and `/[[^\w]]/iu` match different sets of characters.

```
irb(main):001:0> ("a".."z").to_a.join.scan(/\W/iu)
=> []
irb(main):002:0> ("a".."z").to_a.join.scan(/[^\w]/iu)
=> []
irb(main):003:0> ("a".."z").to_a.join.scan(/[[^\w]]/iu)
=> ["k", "s"]
```

This can also be demonstrated using the inverted matcher:

```
irb(main):004:0> ("a".."z").to_a.join.scan(/\w/iu).length
=> 26
irb(main):005:0> ("a".."z").to_a.join.scan(/[^[^\w]]/iu).length
=> 24
```
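The comparisons above can be scripted so that they are easy to rerun on other Ruby builds. The following is a minimal sketch; `matched_chars` and `disagreement` are hypothetical helper names introduced here, not part of Ruby or Onigmo. What it prints depends on whether the interpreter in use still shows the behavior reported here.

```ruby
# Hypothetical helper: the characters of +alphabet+ that +pattern+ matches.
def matched_chars(pattern, alphabet = ("a".."z").to_a)
  alphabet.select { |ch| pattern.match?(ch) }
end

# Hypothetical helper: given several spellings that should describe the same
# character class, return a { source => matched characters } hash when they
# disagree, or an empty hash when they all match the same characters.
def disagreement(patterns, alphabet = ("a".."z").to_a)
  results = patterns.to_h { |pat| [pat.source, matched_chars(pat, alphabet)] }
  results.values.uniq.size > 1 ? results : {}
end

# The three spellings of "not a word character" from this report.
report = disagreement([/\W/iu, /[^\w]/iu, /[[^\w]]/iu])
puts(report.empty? ? "all spellings agree" : report.inspect)
```

Comparing whole match sets rather than single characters makes it harder to overlook a divergence that affects only one or two letters.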

A similar issue arises with character class intersection. The idea behind the pattern compiler's analysis is that characters are allowed to case-fold across the ASCII boundary only if they are included in the character class by some means other than being part of `\w` (or one of several other character sets that receive special treatment). Therefore, in the examples below, `/[\w]/iu` will not match the Kelvin sign `\u212a`, because that would mean crossing the ASCII boundary from `k` to `\u212a`. However, `/[kx]/iu` will match the Kelvin sign, because the `k` was not contributed by `\w` and is therefore not subject to the ASCII boundary restriction. (We have to use `/[kx]/iu` instead of `/[k]/iu` in our examples, or else the pattern analyzer would replace `[k]` with `k` and follow a different code path.)

```
irb(main):006:0> /[\w]/iu.match("\u212a")
=> nil
irb(main):007:0> /[kx]/iu.match("\u212a")
=> #<MatchData "K">
```
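For context on why `k` and `s` are the letters that show up in the earlier `scan` output: they appear to be the only ASCII letters that a non-ASCII character case-folds to as a single character (U+212A KELVIN SIGN folds to `k`, U+017F LATIN SMALL LETTER LONG S folds to `s`). The sketch below checks this with `String#downcase(:fold)`; the helper name `folds_to_ascii_letter` is hypothetical, and the scan is limited to code points below the surrogate range as a simplifying assumption.

```ruby
# Hypothetical helper: map each non-ASCII character (below the surrogate
# range) to its case-folded form when that form is one ASCII letter.
def folds_to_ascii_letter
  (0x80..0xD7FF).each_with_object({}) do |cp, found|
    char   = cp.chr(Encoding::UTF_8)
    folded = char.downcase(:fold)   # full Unicode case folding
    # Case-sensitive, ASCII-only check, so /i folding is not involved here.
    found[char] = folded if folded.match?(/\A[a-z]\z/)
  end
end

folds_to_ascii_letter.each do |char, folded|
  printf("U+%04X folds to %s\n", char.ord, folded)
end
```

Characters such as `ß`, whose folding is the two-character string `ss`, are deliberately excluded by the single-letter check.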

The problem arises when we intersect these two character sets. Since `[kx]` is a subset of `\w`, we would expect the intersection to behave the same as `[kx]`, but that is not the case.

```
irb(main):008:0> /[\w&&kx]/i.match("\u212a")
=> nil
```

The underlying issue in these cases is the manner in which the `ascCc` character set is computed during the parsing of character classes. The `ascCc` character set should contain all characters of the character class except those contributed by `\w` and similar character sets. The current computation achieves this by essentially ignoring these character sets, which works well for set union and for top-most negation (which is handled explicitly), but it does not handle nested set negation or set intersection.
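To make the failure mode concrete, here is a toy model of the idea, not Onigmo's actual code. It assumes a three-character universe and represents a character class as a pair of sets: `:full` (all members) and `:asc` (standing in for `ascCc`). All helper names are hypothetical, and "ignoring `\w`" is modeled as contributing an empty set to `:asc`.

```ruby
require "set"

# Toy model only: a tiny universe and a stand-in for \w.
UNIVERSE = Set["k", "x", "!"]
WORD     = Set["k", "x"]

# A literal such as [kx] contributes its characters to both sets.
def literal(*chars)
  set = Set.new(chars)
  { full: set, asc: set }
end

# \w contributes to :full but is ignored (treated as empty) for :asc.
def word_class
  { full: WORD, asc: Set.new }
end

def union(a, b)
  { full: a[:full] | b[:full], asc: a[:asc] | b[:asc] }
end

def intersection(a, b)
  { full: a[:full] & b[:full], asc: a[:asc] & b[:asc] }
end

def negation(a)
  { full: UNIVERSE - a[:full], asc: UNIVERSE - a[:asc] }
end

{
  "[\\w!]"    => union(word_class, literal("!")),
  "[\\w&&kx]" => intersection(word_class, literal("k", "x")),
  "[^\\w]"    => negation(word_class),
}.each do |source, cc|
  puts "#{source}: full=#{cc[:full].to_a.inspect} asc=#{cc[:asc].to_a.inspect}"
end
```

In this model, union gives the intended result: `:asc` holds only `!`. Intersection yields an empty `:asc` even though `k` and `x` were also supplied by the literal `[kx]`, and negating an ignored set makes `:asc` the whole universe, so it is no longer a subset of `:full`. Both outcomes are at odds with the definition of `ascCc` given above.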



