From: nobu@...
Date: 2015-01-22T10:19:44+00:00
Subject: [ruby-dev:48839] [ruby-trunk - Feature #10770] chr and ord behavior for ill-formed byte sequences and surrogate code points

Issue #10770 has been updated by Nobuyoshi Nakada.

Description updated

Masaki Kagaya wrote:
> ~~~ruby
> str = "a\x80bc"
> str.each_char {|c| puts c }
> # no error
> ~~~

Sounds like a bug in `String#each_char`, but it may be intentional.

> One way to keep them consistent is to change `ord` to return a substitute code point such as 0xFFFD, as `scrub` does.

Implicit substitution doesn't feel like a nice idea to me.

> How about removing the restriction? One example of using surrogate code points is converting a 4-byte character to a pair of 3-byte characters for MySQL/MariaDB's utf8mb3.

That is primarily the responsibility of those bindings.

~~~ruby
# "n*" reads big-endian 16-bit units, matching UTF-16BE.
str.encode("UTF-16BE").unpack("n*").pack("U*")
~~~

----------------------------------------
Feature #10770: chr and ord behavior for ill-formed byte sequences and surrogate code points
https://bugs.ruby-lang.org/issues/10770#change-51181

* Author: Masaki Kagaya
* Status: Open
* Priority: Normal
* Assignee:
----------------------------------------
`ord` raises an error when it meets an ill-formed byte sequence, so `each_char` and `each_codepoint` behave differently.

~~~ruby
str = "a\x80bc"
str.each_char {|c| puts c }
# no error
str.each_codepoint {|c| puts c }
# invalid byte sequence in UTF-8 (ArgumentError)
~~~

One way to keep them consistent is to change `ord` to return a substitute code point such as 0xFFFD, as `scrub` does.

Another consistency problem concerns surrogate code points. Although CRuby allows surrogate code points in Unicode escape literals, `ord` and `chr` don't allow them.

~~~ruby
"\uD800".ord
# invalid byte sequence in UTF-8 (ArgumentError)
0xD800.chr('UTF-8')
# invalid codepoint 0xD800 in UTF-8 (RangeError)
~~~

How about removing the restriction?
One example of using surrogate code points is converting a 4-byte character to a pair of 3-byte characters for MySQL/MariaDB's utf8mb3.

~~~ruby
str = "\u{1F436}" # DOG FACE
cp = str.ord
if cp >= 0x10000 # supplementary planes start at U+10000
  # http://unicode.org/faq/utf_bom.html#utf16-4
  lead = 0xD800 - (0x10000 >> 10) + (cp >> 10)
  trail = 0xDC00 + (cp & 0x3FF)
  # With the restriction in place, chr raises RangeError here.
  ret = lead.chr('UTF-8') + trail.chr('UTF-8')
end
~~~

-- 
https://bugs.ruby-lang.org/
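
A minimal sketch of the utf8mb3 conversion that runs without lifting the `chr` restriction, following the `pack("U*")` approach from the reply above. It assumes that `Array#pack` with the `U` directive performs no surrogate check, which is the behavior the reply relies on; `to_surrogate_utf8` is a hypothetical helper name, not an existing API.

```ruby
# Hypothetical helper: re-encode every supplementary character as a pair
# of 3-byte sequences, one per UTF-16 surrogate.
def to_surrogate_utf8(str)
  # Split into big-endian 16-bit code units; characters above U+FFFF
  # become a lead/trail surrogate pair.
  units = str.encode("UTF-16BE").unpack("n*")
  # Encode each code unit as if it were a code point. Assumption: the
  # "U" directive does not reject values in the surrogate range.
  units.pack("U*")
end

converted = to_surrogate_utf8("\u{1F436}") # DOG FACE
p converted.bytes.map {|b| "%02X" % b }
p converted.valid_encoding?
```

The result is deliberately not well-formed UTF-8, so it should only be passed to code that expects this form, which fits the point that the conversion belongs in the database bindings.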