[ruby-core:55455] [ruby-trunk - Bug #8516] IO#readchar returns wrong codepoints when converting encoding

From: "nobu (Nobuyoshi Nakada)" <nobu@...>
Date: 2013-06-12 03:45:38 UTC
List: ruby-core #55455

Issue #8516 has been updated by nobu (Nobuyoshi Nakada).

Backport changed from 1.9.3: UNKNOWN, 2.0.0: UNKNOWN to 1.9.3: REQUIRED, 2.0.0: REQUIRED


----------------------------------------
Bug #8516: IO#readchar returns wrong codepoints when converting encoding
https://bugs.ruby-lang.org/issues/8516#change-39878

Author: bbxiao1 (Xiao Ba)
Status: Closed
Priority: Normal
Assignee: 
Category: 
Target version: 
ruby -v: ruby 1.9.3p429 (2013-05-15 revision 40747) [x86_64-darwin11.4.2]
Backport: 1.9.3: REQUIRED, 2.0.0: REQUIRED


I am trying to parse plain text files in various encodings that will ultimately be converted to UTF-8 strings. Non-ASCII characters work fine when the file is encoded as UTF-8, but problems come up with non-UTF-8 files.

$ file -i utf_8.txt
utf_8.txt: text/plain; charset=utf-8

$ file -i iso_8859_1.txt
iso_8859_1.txt: text/plain; charset=iso-8859-1

Code:
utf_8_file = "utf_8.txt"
iso_file = "iso_8859_1.txt"

puts "Processing #{utf_8_file}"
File.open(utf_8_file) do |io|
  line, char = "", nil

  until io.eof? || char == ?\n || char == ?\r
    char = io.readchar
    puts "Character #{char} has #{char.each_codepoint.count} codepoints"
    puts "Character #{char} codepoints: #{char.each_codepoint.to_a.join(', ')}"
    puts "SLICE FAIL" unless char == char.slice(0,1)
    line << char
  end

  line
end
puts "\n" 
puts "Processing #{iso_file}"
File.open(iso_file) do |io|
  io.set_encoding("#{Encoding::ISO_8859_1}:#{Encoding::UTF_8}")
  line, char = "", nil

  until io.eof? || char == ?\n || char == ?\r
    char = io.readchar
    puts "Character #{char} has #{char.each_codepoint.count} codepoints"
    puts "Character #{char} codepoints: #{char.each_codepoint.to_a.join(', ')}"
    puts "SLICE FAIL" unless char == char.slice(0,1)
    line << char
  end

  line
end

Output:
Processing utf_8.txt
Character á has 1 codepoints
Character á codepoints: 225
Character Á has 1 codepoints
Character Á codepoints: 193
Character ð has 1 codepoints
Character ð codepoints: 240
Character 
 has 1 codepoints
Character 
 codepoints: 10

Processing iso_8859_1.txt
Character á has 2 codepoints
Character á codepoints: 195, 161
SLICE FAIL
Character Á has 2 codepoints
Character Á codepoints: 195, 129
SLICE FAIL
Character ð has 2 codepoints
Character ð codepoints: 195, 176
SLICE FAIL
Character 
 has 1 codepoints
Character 
 codepoints: 10

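With the ISO-8859-1 encoded file, readchar returns the bytes of each character's UTF-8 encoding as separate codepoints (for example 195, 161 for á), when I would expect a single UTF-8 codepoint (225, as in the UTF-8 file).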
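
For comparison, here is a minimal, self-contained sketch of the expected behaviour. It is not part of the original report: the helper name codepoints_per_char and the temporary sample file are hypothetical, and it assumes a Ruby version in which transcoded reads behave correctly. It passes the external and internal encodings in the File.open mode string and iterates with each_char instead of calling set_encoding and readchar.

```ruby
require "tempfile"

# Hypothetical helper (not from the report): open a file with the given
# external encoding, transcode to UTF-8 on read, and return the list of
# codepoints for each character.
def codepoints_per_char(path, external)
  File.open(path, "r:#{external}:UTF-8") do |io|
    io.each_char.map { |c| c.codepoints }
  end
end

# Build an ISO-8859-1 sample with the same characters as the report:
# 0xE1 = á, 0xC1 = Á, 0xF0 = ð, followed by a newline.
sample = Tempfile.new("iso_8859_1")
sample.binmode
sample.write("\xE1\xC1\xF0\n".b)
sample.close

result = codepoints_per_char(sample.path, "ISO-8859-1")
p result  # [[225], [193], [240], [10]]

sample.unlink
```

Each character yields exactly one codepoint here. The pairs seen in the report (195, 161 and so on) are the two UTF-8 bytes of the same characters, counted individually.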


-- 
http://bugs.ruby-lang.org/
