From: Michael Selig
Date: 2008-10-31T07:14:21+09:00
Subject: [ruby-core:19646] Re: [Feature #695] More flexibility when combining ASCII-8BIT strings with other encodings

Hi,

Feature #695 was closed & marked done, but unfortunately it does not seem
to have been implemented :-(

The request was:

> When combining 2 strings, with one being ASCII-8BIT, and the other is
> encoding "E":
> 1) If the ASCII-8BIT string is valid if forced to encoding E, then treat
> the ASCII-8BIT string as being in encoding E;
> 2) Otherwise treat both strings as ASCII-8BIT.
>
> Part (2) is less important, and can probably be omitted if it is hard to
> implement.

However:

ruby -Kn -ve 'p "abc\xD8\xB5" + "abc\u0635"'
ruby 1.9.0 (2008-10-30 revision 20062) [i686-linux]
-e:1:in `': incompatible character encodings: ASCII-8BIT and UTF-8 (Encoding::CompatibilityError)

(The -Kn is only necessary here because with -e ruby uses the locale to
determine the encoding of the string containing "\x".)

I thought this feature was implemented very quickly! What appears to have
been implemented is the encoding of "Array#pack" output with "U". However,
I am not totally convinced that even this was done correctly, as the pack
output now seems to be marked UTF-8 even if the pack template contains a
mixture of "U" with other directives, which can then result in an invalid
UTF-8 string.

My feature request would mean that "pack" and "\x" string literals could
be left as ASCII-8BIT, and be "forced" to another encoding transparently
depending on how the programmer uses them. You can liken this feature to
the transparent conversion of an integer to a float when doing arithmetic.

If you agree that this is a good idea, I don't mind trying to produce a
patch for it myself. Please let me know.

Cheers
Mike

On Wed, 29 Oct 2008 14:53:15 +1100, Michael Selig wrote:

> Feature #695: More flexibility when combining ASCII-8BIT strings with
> other encodings
> http://redmine.ruby-lang.org/issues/show/695
>
> Author: Michael Selig
> Status: Open, Priority: Normal
> Category: M17N
>
> Consider the following 3 Ruby statements:
>
> # Array#pack always returns ASCII-8BIT
> s1 = [97, 98, 99, 1589].pack("U*")
>
> # \xNN returns the source encoding (even if it is an invalid string), or
> # ASCII-8BIT if not set
> s2 = "abc\xD8\xB5"
>
> # \uNNNN always returns UTF-8
> s3 = "abc\u0635"
>
> All of s1, s2, and s3 have the same contents, but different encodings.
> When you try to combine them, you get different "encoding compatibility"
> problems, which can change depending on the source encoding, due to the
> treatment of s2.
>
> I would like to see Ruby be able to combine all the above without error.
> I don't think it is reasonable to have to use "force_encoding" in these
> cases.
> This would
> - give better compatibility with 1.8,
> - make handling of methods returning ASCII-8BIT strings much easier (e.g.
> Array#pack and libraries which return strings in ASCII-8BIT because the
> encoding is unknown),
> - reduce the confusion caused by "\x" producing a string whose encoding
> depends on the source encoding (which I dislike - I think it should
> always return ASCII-8BIT).
>
> So the feature request is:
>
> When combining 2 strings, with one being ASCII-8BIT, and the other is
> encoding "E":
> 1) If the ASCII-8BIT string is valid if forced to encoding E, then treat
> the ASCII-8BIT string as being in encoding E;
> 2) Otherwise treat both strings as ASCII-8BIT.
>
> Part (2) is less important, and can probably be omitted if it is hard to
> implement.
>
> Thank you
> Michael Selig
>
>
> ----------------------------------------
> http://redmine.ruby-lang.org
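
To make the two-part rule in the request concrete, here is a minimal sketch that emulates it in user code. The helper name `combine` is hypothetical and not part of Ruby; it only approximates the requested behaviour with `force_encoding` and `valid_encoding?`, and says nothing about how the interpreter itself would implement it.

```ruby
# Hypothetical helper (an assumption for illustration, not a Ruby API):
# concatenates two strings following the rule requested in Feature #695.
def combine(a, b)
  bin = Encoding::ASCII_8BIT

  # The rule only applies when exactly one side is ASCII-8BIT;
  # otherwise fall back to Ruby's normal concatenation.
  return a + b if (a.encoding == bin) == (b.encoding == bin)

  binary, other = a.encoding == bin ? [a, b] : [b, a]

  # Reinterpret a copy of the ASCII-8BIT bytes in the other string's
  # encoding "E"; the original strings are left untouched.
  candidate = binary.dup.force_encoding(other.encoding)

  if candidate.valid_encoding?
    # Part (1): the bytes are valid in E, so treat them as E.
    a.encoding == bin ? candidate + b : a + candidate
  else
    # Part (2): otherwise treat both strings as ASCII-8BIT.
    a.dup.force_encoding(bin) + b.dup.force_encoding(bin)
  end
end

s1 = "abc\xD8\xB5".dup.force_encoding(Encoding::ASCII_8BIT)
s3 = "abc\u0635"

result = combine(s1, s3)
puts result.encoding  # prints "UTF-8"
```

With this sketch, an ASCII-8BIT string whose bytes are not valid UTF-8 (for example a lone "\xFF") falls through to part (2), and the combined result stays ASCII-8BIT instead of raising Encoding::CompatibilityError.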