From: "bbxiao1 (Xiao Ba)"
Date: 2013-06-12T04:15:56+09:00
Subject: [ruby-core:55444] [ruby-trunk - Bug #8516][Open] IO#readchar returns wrong codepoints when converting encoding

Issue #8516 has been reported by bbxiao1 (Xiao Ba).

----------------------------------------
Bug #8516: IO#readchar returns wrong codepoints when converting encoding
https://bugs.ruby-lang.org/issues/8516

Author: bbxiao1 (Xiao Ba)
Status: Open
Priority: Normal
Assignee:
Category:
Target version:
ruby -v: ruby 1.9.3p429 (2013-05-15 revision 40747) [x86_64-darwin11.4.2]
Backport: 1.9.3: UNKNOWN, 2.0.0: UNKNOWN

I am trying to parse plain text files with various encodings that will ultimately be converted to UTF-8 strings. Non-ascii characters work fine with a file encoded as UTF-8, but problems come up with non-UTF-8 files.

  $ file -i utf_8.txt
  utf_8.txt: text/plain; charset=utf-8
  $ file -i iso_8859_1.txt
  iso_8859_1.txt: text/plain; charset=iso-8859-1

Code:

  utf_8_file = "utf_8.txt"
  iso_file = "iso_8859_1.txt"

  puts "Processing #{utf_8_file}"
  File.open(utf_8_file) do |io|
    line, char = "", nil
    until io.eof? || char == ?\n || char == ?\r
      char = io.readchar
      puts "Character #{char} has #{char.each_codepoint.count} codepoints"
      puts "Character #{char} codepoints: #{char.each_codepoint.to_a.join}"
      puts "SLICE FAIL" unless char == char.slice(0,1)
      line << char
    end
    line
  end

  puts "\n"

  puts "Processing #{iso_file}"
  File.open(iso_file) do |io|
    io.set_encoding("#{Encoding::ISO_8859_1}:#{Encoding::UTF_8}")
    line, char = "", nil
    until io.eof? || char == ?\n || char == ?\r
      char = io.readchar
      puts "Character #{char} has #{char.each_codepoint.count} codepoints"
      puts "Character #{char} codepoints: #{char.each_codepoint.to_a.join(', ')}"
      puts "SLICE FAIL" unless char == char.slice(0,1)
      line << char
    end
    line
  end

Output:

  Processing utf_8.txt
  Character á has 1 codepoints
  Character á codepoints: 225
  Character Á has 1 codepoints
  Character Á codepoints: 193
  Character ð has 1 codepoints
  Character ð codepoints: 240
  Character
   has 1 codepoints
  Character
   codepoints: 10

  Processing iso_8859_1.txt
  Character á has 2 codepoints
  Character á codepoints: 195, 161
  SLICE FAIL
  Character Á has 2 codepoints
  Character Á codepoints: 195, 129
  SLICE FAIL
  Character ð has 2 codepoints
  Character ð codepoints: 195, 176
  SLICE FAIL
  Character
   has 1 codepoints
  Character
   codepoints: 10

With the ISO-8859-1 encoded file, readchar is returning the character bytes when I would expect UTF-8 codepoints.

-- 
http://bugs.ruby-lang.org/
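
For comparison, here is a minimal sketch of a possible workaround, not a fix for the reported behaviour: set only the external encoding on the IO and transcode each character explicitly with String#encode, instead of relying on the "external:internal" conversion in set_encoding. The helper name read_line_as_utf8 and the temporary sample file are hypothetical, introduced only for illustration, and the per-character approach assumes a stateless encoding such as ISO-8859-1.

```ruby
require "tempfile"

# Hypothetical helper: read one line character by character and transcode
# each character to UTF-8 explicitly, with no IO-level internal encoding.
def read_line_as_utf8(path, external)
  line = String.new.force_encoding(Encoding::UTF_8)
  File.open(path, "rb") do |io|
    io.set_encoding(external)   # external encoding only
    until io.eof?
      char = io.readchar.encode(Encoding::UTF_8)
      line << char
      break if char == "\n" || char == "\r"
    end
  end
  line
end

# Build a small ISO-8859-1 sample containing the bytes for á, Á, ð and a newline.
sample = Tempfile.new("iso_8859_1")
sample.binmode
sample.write([0xE1, 0xC1, 0xF0, 0x0A].pack("C*"))
sample.close

line = read_line_as_utf8(sample.path, Encoding::ISO_8859_1)
line.each_char do |c|
  puts "#{c.inspect} has #{c.each_codepoint.count} codepoints: #{c.each_codepoint.to_a.join(', ')}"
end
```

If this prints one codepoint per character (225, 193, 240, 10) while the set_encoding("ISO-8859-1:UTF-8") version still reports two, that would suggest the difference lies in how readchar handles the IO-level conversion rather than in the file contents.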