From: "NARUSE, Yui"
Date: 2008-09-16T04:53:18+09:00
Subject: [ruby-core:18610] Re: [Bug #564] Regexp fails on UTF-16 & UTF-32 character encodings

Hi,

James Gray wrote:
> On Sep 15, 2008, at 3:49 AM, Michael Selig wrote:
>
>> On Mon, 15 Sep 2008 18:08:14 +1000, Tanaka Akira wrote:
>>
>>> In article <48cddb5533ad_8725cd9524342@redmine.ruby-lang.org>,
>>> Michael Selig writes:
>>>
>>>> UTF-16 & UTF-32 (and maybe other non-ascii compatible encodings)
>>>> don't seem to be work as Regexp patterns.
>>>>
>>>> Regexp.new("abc".encode("UTF-16BE"))
>>>> ==> EncodingCompatibilityError: incompatible character encodings:
>>>> US-ASCII and UTF-16BE
>>>
>>> % ruby -ve 'p Regexp.new("abc".encode("UTF-16BE")) =~
>>> "abc".encode("UTF-16BE")'
>>> ruby 1.9.0 (2008-09-15 revision 19356) [i686-linux]
>>> 0
>>
>> I see, I have diagnosed the problem wrongly. I was using irb.
>>
>> ruby -ve 'p Regexp.new("abc".encode("UTF-16BE"))'
>> ruby 1.9.0 (2008-09-03 revision 19073) [i686-linux]
>> -e:1:in `p': incompatible character encodings: UTF-16BE and ASCII-8BIT
>> (EncodingCompatibilityError)
>> from -e:1:in `'
>>
>> This is the error I was getting in irb, and I mistakenly assumed it
>> was from the Regexp::new.
>> It is a different problem - not as bad as I thought!
>
> So it's inspect() that has the issues, right?

Yes, the cause of this problem is Regexp#inspect.
A patch follows.

--- re.c (revision 19371)
+++ re.c (working copy)
@@ -381,7 +381,7 @@ rb_reg_desc(const char *s, long len, VAL
 {
     VALUE str = rb_str_buf_new2("/");

-    rb_enc_copy(str, re);
+    rb_enc_associate(str, rb_usascii_encoding());
     rb_reg_expr_str(str, s, len);
     rb_str_buf_cat2(str, "/");
     if (re) {

The result of Regexp#inspect is only for viewing the content of a regexp
while debugging, so there may be no reason to keep the original encoding.
# Of course Regexp#source must keep it.

Anyway, Regexp#to_s is an alias of Regexp#source now.
But Regexp#inspect is more readable.
How about making Regexp#to_s an alias of Regexp#inspect?

 * r1 = /ab+c/ix         #=> /ab+c/ix
 * s1 = r1.to_s          #=> "(?ix-m:ab+c)"
 * r2 = Regexp.new(s1)   #=> /(?ix-m:ab+c)/
 * r1 == r2              #=> false
 * r1.source             #=> "ab+c"
 * r2.source             #=> "(?ix-m:ab+c)"

-- 
NARUSE, Yui
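
For reference, the three string forms compared in the list above can be reproduced with a short script. This is only a sketch using an ASCII pattern; `regexp_forms` is a hypothetical helper name introduced here for illustration, not part of Ruby or of the patch.

```ruby
# Hypothetical helper (not part of Ruby): collects the three string
# forms of a Regexp so they can be compared side by side.
def regexp_forms(re)
  {
    inspect: re.inspect, # literal form with delimiters and flags
    to_s:    re.to_s,    # embeddable form with an inline option group
    source:  re.source   # pattern text only, flags dropped
  }
end

r1 = /ab+c/ix
forms = regexp_forms(r1)
forms.each { |name, text| puts "#{name}: #{text}" }

# Rebuilding from to_s keeps the matching behaviour, but the new
# object has a different source and so does not compare equal.
r2 = Regexp.new(forms[:to_s])
puts "r2.source: #{r2.source}"
puts "r1 == r2: #{r1 == r2}"
```

The script only illustrates why the choice of what Regexp#to_s returns matters for round-tripping through Regexp.new; it does not exercise the UTF-16 case from the bug report.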