From: duerst@...
Date: 2014-07-27T09:15:07+00:00
Subject: [ruby-core:64072] [ruby-trunk - Bug #10097] Case-insensitive Regexp matching for Windows-1252 not working for ŠšŽžŒœÿŸ

Issue #10097 has been updated by Martin Dürst.

Nobuyoshi Nakada wrote:
> Is this correct?
> https://github.com/nobu/ruby/compare/windows-1252

Thanks a lot for this very quick work!

Unfortunately, it's not correct. I haven't checked everything, but at least cp1252_get_case_fold_codes_by_str doesn't deal with the special cases in get_case_fold_codes_by_str for ss/SS/ß.

I suggest that we do some more exploratory work before addressing this bug directly. First, I suspect that other (windows-12xx,...) encodings have very similar problems. Second, I found this bug because I was trying to find out what information the encoding primitives already provide for case folding and case conversion. I have only just started that, but depending on what I/we find, we may want/need to:

1) use this information and be done;
2) use this information and add some more information separately;
3) change this information (e.g. add or change some primitives) so that it covers all the needs for case conversion;
4) provide the information for case conversion completely separately.

I suggest that we hold off on fixing this bug until we are able to rule out choice 3).

If there is a (short, up-to-date) summary of what each of the encoding primitives does, that would help me a lot (Japanese would be okay).

----------------------------------------
Bug #10097: Case-insensitive Regexp matching for Windows-1252 not working for ŠšŽžŒœÿŸ
https://bugs.ruby-lang.org/issues/10097#change-48082

* Author: Martin Dürst
* Status: Open
* Priority: Normal
* Assignee:
* Category:
* Target version:
* ruby -v: 1.9.3p545
* Backport: 2.0.0: UNKNOWN, 2.1: UNKNOWN
----------------------------------------
By chance I had a look at enc/iso_8859_1.c and found

~~~C
ENC_REPLICATE("Windows-1252", "ISO-8859-1")
~~~

on line 288.
But this does not work for case folding:

~~~ruby
# http://en.wikipedia.org/wiki/Windows-1252
s1 = "\u0160".encode 'windows-1252' # 'Š'
r1 = Regexp.new("\u0161".encode('windows-1252'), Regexp::IGNORECASE) # /š/i
s1 =~ r1 # => nil

s2 = "\u0178".encode 'windows-1252' # 'Ÿ'
r2 = Regexp.new("\u00FF".encode('windows-1252'), Regexp::IGNORECASE) # /ÿ/i
s2 =~ r2 # => nil

s3 = "\u00C0".encode 'windows-1252' # 'À'
r3 = Regexp.new("\u00E0".encode('windows-1252'), Regexp::IGNORECASE) # /à/i
s3 =~ r3 # => 0
~~~

So case-insensitive matching works when both characters are in iso-8859-1, but not when one (ÿŸ) or both (ŠšŽžŒœ) characters are not in iso-8859-1.

--
https://bugs.ruby-lang.org/
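
To check all four affected pairs at once rather than one by one, something like the following might help. It is only a sketch: the names `PAIRS` and `case_fold_report` are made up for illustration and are not part of Ruby or of the encoding primitives. The À/à pair is included as a control because both of its characters are in iso-8859-1. The result for the other pairs depends on the Ruby version, so no output is shown.

~~~ruby
# Upper/lower pairs from this report, written as Unicode escapes.
# PAIRS and case_fold_report are illustrative names, not Ruby API.
PAIRS = {
  "\u0160" => "\u0161", # S with caron
  "\u017D" => "\u017E", # Z with caron
  "\u0152" => "\u0153", # OE ligature
  "\u0178" => "\u00FF", # Y with diaeresis (lower case is in iso-8859-1)
  "\u00C0" => "\u00E0", # A with grave (control: both in iso-8859-1)
}.freeze

# Returns [upper, lower, matched] triples, where matched says whether the
# upper-case string matches a case-insensitive regexp built from the
# lower-case character, both transcoded to the given encoding.
def case_fold_report(pairs = PAIRS, encoding = 'windows-1252')
  pairs.map do |upper, lower|
    s = upper.encode(encoding)
    r = Regexp.new(lower.encode(encoding), Regexp::IGNORECASE)
    [upper, lower, !(s =~ r).nil?]
  end
end

case_fold_report.each do |upper, lower, matched|
  puts format('U+%04X / U+%04X: %s', upper.ord, lower.ord,
              matched ? 'match' : 'no match')
end
~~~

Passing another encoding name as the second argument (for example one of the other windows-12xx encodings) would only make sense with a pair list adapted to that encoding, since `encode` raises for characters the target encoding does not contain.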