From: "stefan (Stefan Lang)"
Date: 2012-09-30T20:18:34+09:00
Subject: [ruby-core:47751] [ruby-trunk - Bug #7090][Open] UTF-16LE String#<< append 0x0 for certain codepoints

Issue #7090 has been reported by stefan (Stefan Lang).

----------------------------------------
Bug #7090: UTF-16LE String#<< append 0x0 for certain codepoints
https://bugs.ruby-lang.org/issues/7090

Author: stefan (Stefan Lang)
Status: Open
Priority: Normal
Assignee:
Category:
Target version:
ruby -v: ruby 1.9.3p194 (2012-04-20) [x86_64-linux]

$ irb193 -r unicode_utils/u
irb(main):001:0> RUBY_VERSION
=> "1.9.3"
irb(main):002:0> s1 = "".force_encoding('utf-16le')
=> ""
irb(main):003:0> s1 << 0x20
=> " "
irb(main):004:0> s1 << 0x300
=> " \u0000"
irb(main):005:0> U.debug s1
 Char | Ordinal | Sid   | General Category | UTF-8
------+---------+-------+------------------+-------
 " "  |      20 | SPACE | Space_Separator  | 20
 N/A  |       0 | NULL  | Control          | 00
=> nil
irb(main):006:0> s2 = "".force_encoding('utf-8')
=> ""
irb(main):007:0> s2 << 0x20
=> " "
irb(main):008:0> s2 << 0x300
=> " ̀"
irb(main):009:0> U.debug s2
 Char | Ordinal | Sid                    | General Category | UTF-8
------+---------+------------------------+------------------+-------
 " "  |      20 | SPACE                  | Space_Separator  | 20
 N/A  |     300 | COMBINING GRAVE ACCENT | Nonspacing_Mark  | CC 80
=> nil

IMO, the behaviour with the UTF-8 string is correct.

$ ri193 'String#<<'
= String#<<

(from ruby core)
------------------------------------------------------------------------------
  str << integer       -> str
  str.concat(integer)  -> str
  str << obj           -> str
  str.concat(obj)      -> str

------------------------------------------------------------------------------

Append---Concatenates the given object to str. If the object is a
Integer, it is considered as a codepoint, and is converted to a character
before concatenation.

  a = "hello "
  a << "world"   #=> "hello world"
  a.concat(33)   #=> "hello world!"

AFAIK, a Ruby 1.9 string can be viewed as either 1) a sequence of raw bytes, or 2) a sequence of codepoints. Except for maybe regexes, Ruby has no higher-level concept of a "character" than a codepoint, so I don't know what "and is converted to a character before concatenation" means.

If we take the sequence-of-codepoints view, then "str << integer" simply appends a codepoint. If we take the sequence-of-bytes view, then "str << integer" converts the codepoint into the sequence of bytes that corresponds to that codepoint in str.encoding and appends that sequence of bytes.

--
http://bugs.ruby-lang.org/
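
The two views described above can be made concrete with a small sketch. It is not how String#<< is implemented; append_codepoint is a hypothetical helper, introduced here only for illustration, that spells out the sequence-of-bytes reading: encode the codepoint into str.encoding first, then append the resulting bytes. It goes through Array#pack and String#encode rather than String#<< with an Integer, so it does not depend on the behaviour being reported.

```ruby
# Hypothetical helper (not part of Ruby's API): append a codepoint by
# explicitly encoding it into the receiver's encoding first.
def append_codepoint(str, codepoint)
  # pack('U') builds the UTF-8 form of the codepoint; encode then
  # transcodes it into the target string's encoding.
  char = [codepoint].pack('U').encode(str.encoding)
  str << char
end

s = String.new.force_encoding(Encoding::UTF_16LE)
append_codepoint(s, 0x20)
append_codepoint(s, 0x300)

# Sequence-of-bytes view: two little-endian 16-bit code units.
p s.bytes.to_a       # => [32, 0, 0, 3]
# Sequence-of-codepoints view: SPACE, COMBINING GRAVE ACCENT.
p s.codepoints.to_a  # => [32, 768]
```

Under this reading, the UTF-16LE string from the report would be expected to end in the bytes 00 03 (U+0300), not 00 00.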