From: "ben_h (Ben Hoskings)"
Date: 2012-12-19T08:13:20+09:00
Subject: [ruby-core:50969] [ruby-trunk - Bug #4044] Regex matching errors when using \W character class and /i option

Issue #4044 has been updated by ben_h (Ben Hoskings).

Hi all, long time no see :)

naruse (Yui NARUSE) wrote:
> =begin
>
> The current behavior means that \W does not mean [^A-Za-z0-9_] in Ruby 1.9 in some cases.
>
> Unicode ignore case breaks it.
> http://unicode.org/reports/tr21/
>
> 212A; C; 006B; # KELVIN SIGN
> 00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
> http://www.unicode.org/Public/UNIDATA/CaseFolding.txt
>
> \W includes U+212A and U+00DF
> /i adds U+006B (k) and U+0073 (S) to [\W]
> ^ reverses the class; it doesn't include k & S.

I think I see the misunderstanding: there are multiple characters that render as 'k' and 's'.

K, S, k, s are basic word characters, and so [^\W] should match them (along with all A-Z and a-z):

  0x004B (Latin capital letter K)
  0x0053 (Latin capital letter S)
  0x006B (Latin small letter k)
  0x0073 (Latin small letter s)

But I'm not sure how [^\W] should treat these characters:

  0x00DF (Latin small letter sharp s)
  0x017F (Latin small letter long s)
  0x212A (Kelvin sign)

The important thing is that all the characters in A-Z (0x41-0x5A) & a-z (0x61-0x7A) are word characters, so [^\W] should match all of them.

Cheers,
Ben

----------------------------------------
Bug #4044: Regex matching errors when using \W character class and /i option
https://bugs.ruby-lang.org/issues/4044#change-34835

Author: ben_h (Ben Hoskings)
Status: Feedback
Priority: Normal
Assignee: naruse (Yui NARUSE)
Category: core
Target version: 1.9.2
ruby -v: ruby 1.9.2p0 (2010-08-18 revision 29036) [x86_64-darwin10.4.0]

=begin
Hi all,

Josh Bassett and I just discovered an issue with regex matches on ruby-1.9.2p0. (We reduced it while we were hacking on gemcutter.)

The case-insensitive (/i) option together with the non-word character class (\W) matches inconsistently against the alphabet.
Specifically the regex doesn't match properly against the letters 'k' and 's'.

The following expression demonstrates the problem in irb:

  puts ('a'..'z').to_a.map {|c| [c, c.ord, c[/[^\W]/i] ].inspect }

As a reference, the following two expressions are working properly:

  puts ('a'..'z').to_a.map {|c| [c, c.ord, c[/[^\W]/] ].inspect }
  puts ('a'..'z').to_a.map {|c| [c, c.ord, c[/[\w]/i] ].inspect }

Cheers
Ben Hoskings & Josh Bassett
=end

-- 
http://bugs.ruby-lang.org/
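
For anyone wanting to check a particular Ruby build, the three irb expressions in the report can be folded into one comparison that lists only the letters on which the patterns disagree. This is a minimal sketch, not part of the original report: the names patterns, results and disagreements are made up for illustration, and the output depends on the interpreter it runs on.

```ruby
# Sketch: run the three character classes from the report over a-z and
# list the letters on which they do not all agree.
# (patterns, results and disagreements are illustrative names only.)
patterns = {
  :negated_nonword_i => /[^\W]/i, # the combination reported as misbehaving
  :negated_nonword   => /[^\W]/,  # reference: same class without /i
  :word_i            => /[\w]/i   # reference: \w with /i
}

# For each letter, record whether each pattern matched it.
results = ('a'..'z').map do |c|
  hits = Hash[patterns.map { |name, re| [name, !(re =~ c).nil?] }]
  [c, hits]
end

# Keep only the letters where the three patterns give different answers.
disagreements = results.reject { |_, hits| hits.values.uniq.size == 1 }
                       .map { |c, _| c }

puts "ruby #{RUBY_VERSION}: disagreements = #{disagreements.inspect}"
```

On a build that shows the behaviour described above, the list would be expected to contain 'k' and 's'; on a build where the three patterns agree for every letter, it is empty.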