From: matthew@...
Date: 2016-02-03T22:44:23+00:00
Subject: [ruby-core:73689] [Ruby trunk Bug#4044] Regex matching errors when using \W character class and /i option

Issue #4044 has been updated by Matthew Kerwin.

Martin Dürst wrote:
> On 2016/02/03 12:21, matthew@kerwin.net.au wrote:
>
> > I want to write a spec for this, but some of the details are unclear to me. Can we confirm whether each of the following are spec?
>
> Please don't just assume that the current behavior is spec.

Indeed, that's why I asked.

> If it
> doesn't match with common sense in any way, it's very clear that we have
> to fix it. There may be borderline cases that are up for discussion, but
> at least most of the examples I have seen don't meet that criterion.

Confusion abounds. I thought that if there was a formal spec, at least that would give a solid grounding to start from. As it is we rely on implementations to describe what should/does happen, which is imperfect and allows us to confuse bugs with spec.

(Right now I'm particularly interested in why `/[\W]/i =~ 'k' #=> nil`)

> My understanding was that Ken Takata fixed the problem with r47598, but
> I'll try to have another look at that.
>
> When I looked at Ken's solution last time
> (the details are at the following link, in Japanese
> https://github.com/k-takata/Onigmo/issues/4), it included some aspects
> related to ASCII, which keeps confusing me.

I've looked at that issue, but I'm afraid I can't read Japanese (and Google translate only gets me so far.) I think I get the gist of it, but any subtlety is probably lost to me.

> The relevant specification is Unicode Technical Standard #18, Unicode
> Regular Expressions, in particular
> http://www.unicode.org/reports/tr18/#Simple_Loose_Matches. There are
> various choices at the end of that section that are relevant to this issue.
>
> My personal preference among the choices A-D is B.
> As far as I understand it, it would mean that while a /i option would
> change how literal characters are matched, it would not affect how it
> affects properties such as \W.

I suppose we're in choice D at the moment (that would explain why `/\W/i` and `/[\W]/i` match differently,) but just which "specific properties and/or explicit character classes" remains unclear. Documenting those (and writing a spec) would help.

> My justification for this is as follows: If I want e.g. a word
> character, then that already should include all the necessary
> characters, both upper and lower case (and title case just in case you
> forgot about it :-). It's difficult to see why I'd want the set of
> characters to change when adding /i. The same argument can be applied to
> \W and most if not all similar cases.

When we were discussing it on Ruby Talk the other day I came up with this:

* the 'ﬀ' ligature is a non-word character
* it has a case conversion, so is affected by the `//i` flag

So:

* `/ﬀ/` is a subset of `/\W/`
* `/ﬀ/i` matches 'ﬀ', 'FF', 'ff', 'fF', and 'Ff'
* therefore `/\W/i` should match all of the above

The first two dot points are where I see the contention. If I were to make a general rule, I'd say that "\W" should not be expanded for case-folding, since 'case' is a property of word characters. (If anything matches "\W" it is, by definition, not a word character, so should not be subject to word-type operations like case-folding.)

If that were so, `/ﬀ/i` (and therefore `/\W/i`) would match 'ﬀ' but not 'FF'. That would, I think, make `\W` a perfect complement to `\w` (identical to `[^\w]`); which seems to be what people expect.

I think that means you and I are saying the same thing, in different ways.

> The case that I think can be up for discussion is explicit character
> classes, such as [a-z]. Here, in effect automatically adding A-Z (and
> some other case equivalents) may indeed make sense.
Certainly; I use `/[0-9a-f]/i` myself for matching hexadecimal numbers (and similar patterns for similar things.) However where would that leave us with `/[a-e\W]/i` ?

----------------------------------------
Bug #4044: Regex matching errors when using \W character class and /i option
https://bugs.ruby-lang.org/issues/4044#change-56886

* Author: Ben Hoskings
* Status: Closed
* Priority: Normal
* Assignee: Yui NARUSE
* ruby -v: ruby 1.9.2p0 (2010-08-18 revision 29036) [x86_64-darwin10.4.0]
* Backport:
----------------------------------------
=begin
Hi all,

Josh Bassett and I just discovered an issue with regex matches on ruby-1.9.2p0. (We reduced it while we were hacking on gemcutter.)

The case-insensitive (/i) option together with the non-word character class (\W) match inconsistently against the alphabet. Specifically the regex doesn't match properly against the letters 'k' and 's'.

The following expression demonstrates the problem in irb:

  puts ('a'..'z').to_a.map {|c| [c, c.ord, c[/[^\W]/i] ].inspect }

As a reference, the following two expressions are working properly:

  puts ('a'..'z').to_a.map {|c| [c, c.ord, c[/[^\W]/] ].inspect }
  puts ('a'..'z').to_a.map {|c| [c, c.ord, c[/[\w]/i] ].inspect }

Cheers
Ben Hoskings & Josh Bassett
=end

--
https://bugs.ruby-lang.org/

Unsubscribe:
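The expressions in the original report, plus the `/\W/i` and `/[\W]/i` patterns from the discussion, can be gathered into one small script for checking a given Ruby build. This is only a sketch: `LETTERS` and `letters_matching` are hypothetical names that do not appear in the thread, and no expected output is given for the `/i` cases because the result depends on the Ruby/Onigmo version in use.

```ruby
# Hypothetical helper names; nothing here comes from Ruby itself or the thread.
LETTERS = ('a'..'z').to_a + ('A'..'Z').to_a

# Returns the ASCII letters that the given pattern matches.
def letters_matching(pattern)
  LETTERS.select { |c| c =~ pattern }
end

# Reference patterns from the report: the letters they FAIL to match
# (the report describes both as working properly).
p LETTERS - letters_matching(/[^\W]/)
p LETTERS - letters_matching(/[\w]/i)

# Pattern from the report: letters it fails to match
# (the report names 'k' and 's' on ruby 1.9.2p0).
p LETTERS - letters_matching(/[^\W]/i)

# Patterns from the discussion: letters matched by the non-word class.
# Any letter listed here is a word character matched by \W under /i.
p letters_matching(/\W/i)
p letters_matching(/[\W]/i)
```

Comparing the last three lines across Ruby versions shows whether the bare `\W` and the bracketed `[\W]` forms are treated the same way under `/i`.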