From: "Martin J. Dürst"
Date: 2013-07-13T21:09:45+09:00
Subject: [ruby-core:55994] Re: [ruby-trunk - Bug #8630][Open] Transcoding high-bit bytes from ASCII-8BIT to a text encoding should be :invalid, not :undef

Hello Charles,

On 2013/07/13 6:26, Tanaka Akira wrote:
> 2013/7/13 headius (Charles Nutter):
>> Bug #8630: Transcoding high-bit bytes from ASCII-8BIT to a text encoding should be :invalid, not :undef
>> https://bugs.ruby-lang.org/issues/8630
>
>> When transcoding from ASCII-8BIT (BINARY) to a text encoding (e.g. UTF-8), MRI will raise an error for high-bit bytes:
>>
>> "\xC3".encode("utf-8", "binary") # => Encoding::UndefinedConversionError
>> I believe that "undef" is the wrong treatment for this error. Undef means that the input character has no representation in the target encoding. In this case, the error is raised because only US-ASCII range of bytes are *valid* for transcoding, so the transcoding of high-bit bytes is by definition *invalid*, not undefined. In other words, high-bit bytes in ASCII-8BIT/BINARY are *invalid* as characters.
>
> No.

I fully agree.

> ASCII-8BIT consists of 128 ASCII characters and 128 special characters to
> represent 0x80 to 0xff binary bytes.

That's one way to put it, but a better way is to say that ASCII-8BIT
consists of 128 ASCII characters and 128 unassigned codepoints. This is
similar to unassigned codepoints in UTF-8.

> The special characters are not representable in UTF-8.
> So UndefinedConversionError is raised.
>
> The validity of a character is defined by encoding, not transcoding.

Yes. Valid means that the original data, taken as it is, is valid in its
own encoding, nothing more. It does not depend on the target encoding.
And ASCII-8BIT of course can contain bytes 0x80 and beyond, that's its job.

Regards, Martin.
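
To illustrate the distinction drawn above between validity (a property of the source data in its own encoding) and an undefined conversion (a property of the source/target pair), here is a minimal sketch. The helper name `classify` is hypothetical and not part of Ruby; the sketch assumes current MRI behaviour for `String#encode` and `String#valid_encoding?`.

```ruby
# Report how a transcoding attempt ends: :ok, :undef or :invalid.
# `classify` is a hypothetical helper used only for this illustration.
def classify(str, target)
  str.encode(target)
  :ok
rescue Encoding::UndefinedConversionError
  :undef      # source is valid, but has no counterpart in the target
rescue Encoding::InvalidByteSequenceError
  :invalid    # source is malformed in its own encoding
end

binary      = "\xC3".b                                 # one byte, ASCII-8BIT
broken_utf8 = binary.dup.force_encoding(Encoding::UTF_8) # same byte, tagged UTF-8

# Validity is judged against the string's own encoding only.
binary_valid = binary.valid_encoding?       # any byte is valid in ASCII-8BIT
utf8_valid   = broken_utf8.valid_encoding?  # a lone 0xC3 is not valid UTF-8

binary_result = classify(binary, "UTF-8")          # the case from Bug #8630
utf8_result   = classify(broken_utf8, "UTF-16LE")  # malformed source data
ascii_result  = classify("abc".b, "UTF-8")         # ASCII range transcodes fine
```

Under these assumptions the ASCII-8BIT string is valid yet fails with the "undefined" error, while the same byte tagged as UTF-8 is invalid before any target encoding is considered.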