From: Martin Duerst
Date: 2008-12-26T19:22:28+09:00
Subject: [ruby-core:20884] Fwd: [ruby-dev:37603] Re: [BUG:trunk] [m17n] TestCSVFeatures fails because of r20905

Hello James,

Akira wrote the text below, and Matz said it should somehow get to you. I'm not sure whether Akira has the time to do this, so here's a short summary.

Akira thinks that what you tried to do in CSV#inspect is to somehow produce an ASCII-compatible encoding. If that's your intent, then a simple force_encoding won't work well for UTF-16, because it will leave some 0x00 bytes in the string. What Akira proposes is to use

    e = Encoding::Converter.asciicompat_encoding(s.encoding)
    e ? s.encode(e) : s.force_encoding("ASCII-8BIT")

i.e. to convert to an ASCII-compatible encoding from the current encoding if necessary and possible, otherwise to force the data to be interpreted as ASCII-8BIT.

I have to admit that I didn't think about UTF-16 at all, but my guess is that the above code might not (at least not by itself) solve the problem that different pieces of data with different encodings will be concatenated, because if there is a piece in ISO-2022-JP, it will be converted to something called "stateless-ISO-2022-JP", whereas some other piece, originally in an ASCII-compatible encoding (e.g. UTF-8 or whatever) will be forced to ASCII-8BIT.

On the same problem, Yugui suggested that the encoding of the string returned by inspect should be the encoding of the file.

Regards, Martin.
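To make the mixed-encoding concern concrete, here is a minimal sketch that only reports which encoding each piece would end up in under the proposed expression. The helper name `target_encoding` is hypothetical and is not part of CSV or of the proposal; it simply mirrors the `e ? ... : ...` branch without touching any string data.

```ruby
# Hypothetical helper (not from the original message): returns the encoding
# a string in `enc` would have after the proposed expression is applied.
def target_encoding(enc)
  # asciicompat_encoding returns nil when `enc` is already ASCII-compatible,
  # which is the case where the proposal falls back to ASCII-8BIT.
  Encoding::Converter.asciicompat_encoding(enc) || Encoding::ASCII_8BIT
end

# Pieces in different source encodings end up in different target encodings.
%w[UTF-16BE ISO-2022-JP UTF-8 EUC-JP].each do |name|
  puts "#{name} -> #{target_encoding(name)}"
end
```

If the pieces are later concatenated, they would still carry different encodings (for example stateless-ISO-2022-JP next to ASCII-8BIT), which is the situation the paragraph above worries about.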
>Date: Fri, 26 Dec 2008 13:13:29 +0900
>From: Tanaka Akira
>Subject: [ruby-dev:37603] Re: [BUG:trunk] [m17n] TestCSVFeatures fails
>because of r20905
>To: ruby-dev@ruby-lang.org (ruby developers list)
>
>In article <4953CC9F.7070603@airemix.jp>,
> "NARUSE, Yui" writes:
>
>> Intuitively, I feel that String#encode("ASCII-8BIT") should have
>> the same effect as String#force_encoding("ASCII-8BIT").
>
>That does not seem very intuitive to me. encode is supposed to
>convert the byte sequence so that the characters are preserved,
>but that is not what happens here.
>
>Looking at CSV#inspect, I sense an intent to end up with an
>ASCII-compatible encoding. Am I wrong? Something like a measure
>for the case where UTF-16 comes in.
>
>Considering UTF-16, if force_encoding is used, then even when the
>content is within the ASCII range as characters, a \0 will appear
>at every other byte, which is probably not what you want.
>
>I don't remember exactly how the discussion about UTF-16 ended,
>but if the conclusion was that UTF-16 need not be handled, how
>about simply removing .encode("ASCII-8BIT")?
>
>Alternatively, if UTF-16 is to be handled, the string could be
>converted to the ASCII-compatible encoding that corresponds to
>UTF-16, for example:
>
> e = Encoding::Converter.asciicompat_encoding(s.encoding)
> e ? s.encode(e) : s.force_encoding("ASCII-8BIT")
>
>How about something like that?
>--
>[田中 哲][たなか あきら][Tanaka Akira]

#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp
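The UTF-16 point in the quoted message can be checked with a small sketch. The helper name `ascii_compatible` is hypothetical; its body is the expression proposed above, with a `dup` added (an assumption of this sketch, not part of the proposal) so that the caller's string is not mutated by force_encoding.

```ruby
# Hypothetical wrapper around the proposed expression.
def ascii_compatible(s)
  e = Encoding::Converter.asciicompat_encoding(s.encoding)
  # dup is an addition in this sketch: force_encoding changes the receiver in place.
  e ? s.encode(e) : s.dup.force_encoding("ASCII-8BIT")
end

utf16 = "abc".encode("UTF-16BE")

# Relabelling the bytes keeps the 0x00 byte that UTF-16BE puts before each ASCII character.
forced = utf16.dup.force_encoding("ASCII-8BIT")
p forced.bytes

# Transcoding to the corresponding ASCII-compatible encoding removes those bytes.
converted = ascii_compatible(utf16)
p converted.encoding
p converted.bytes
```

A string that is already in an ASCII-compatible encoding such as UTF-8 takes the other branch and is only relabelled as ASCII-8BIT; its bytes are unchanged.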