From: duerst
Date: 2022-03-18T00:24:22+00:00
Subject: [ruby-core:107962] [Ruby master Bug#18641] UTF-16 surrogate pairs

Issue #18641 has been updated by duerst (Martin Dürst).

Status changed from Open to Rejected

`"\uD83D\uDC69"` tries to create a UTF-8 string with surrogates. In UTF-8, surrogates are not allowed, and therefore you get an error. Adding `.force_encoding(Encoding::UTF_16)` does not change any of this; the error has already happened. It is also conceptually wrong, because it would label a sequence of UTF-8 bytes as UTF-16, which would give very strange results.

If you want the 'woman' emoji in UTF-16, then here are some choices:

```
"\u{1F469}".encode('UTF-16') # but this will prepend \uFEFF
"👩".encode('UTF-16') # but this will prepend \uFEFF
[0xD83D, 0xDC69].pack('S>*').force_encoding('UTF-16')
```

If it's something else that you want, please tell us what you want. Also, please note that the above worked on two of my systems, but may not work on your system, because it depends on the endianness of UTF-16 (whether it is actually UTF-16BE or UTF-16LE).

----------------------------------------
Bug #18641: UTF-16 surrogate pairs
https://bugs.ruby-lang.org/issues/18641#change-96911

* Author: noraj (Alexandre ZANNI)
* Status: Rejected
* Priority: Normal
* ruby -v: ruby 3.1.1p18 (2022-02-18 revision 53f5fc4236) [x86_64-linux]
* Backport: 2.6: UNKNOWN, 2.7: UNKNOWN, 3.0: UNKNOWN, 3.1: UNKNOWN
----------------------------------------
That Ruby triggers an *invalid Unicode codepoint* error when surrogate pairs are used in a UTF-8 string is expected; however, those codepoints should be valid in a UTF-16 string. It is also expected that unpaired surrogates are invalid, whereas paired surrogates are valid, cf. https://unicode.org/faq/utf_bom.html#utf16-7.
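
For reference, the relationship between the code point U+1F469 and the surrogate pair `D83D DC69` discussed in this thread can be worked through by hand. The following is only a minimal sketch of the standard UTF-16 arithmetic; the helper name `surrogate_pair` is hypothetical and not part of Ruby.

```ruby
# Split a supplementary code point (U+10000..U+10FFFF) into its UTF-16
# surrogate pair. `surrogate_pair` is a hypothetical helper name.
def surrogate_pair(codepoint)
  unless (0x10000..0x10FFFF).cover?(codepoint)
    raise ArgumentError, "not a supplementary code point"
  end

  offset = codepoint - 0x10000      # 20-bit value
  high = 0xD800 + (offset >> 10)    # top 10 bits go into the high surrogate
  low  = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits go into the low surrogate
  [high, low]
end

pair = surrogate_pair(0x1F469)
puts pair.map { |u| format("%04X", u) }.join(" ")  # D83D DC69

# Ruby's transcoder yields the same two 16-bit code units
# ('n*' reads unsigned 16-bit big-endian values).
units = "\u{1F469}".encode("UTF-16BE").unpack("n*")
puts units == pair
```

The surrogates only ever exist as UTF-16 code units; the string literal itself is written with the real code point, `\u{1F469}`.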
Version tested: 3.0.3p157, 3.1.0p0 and 3.1.1p18

``` ruby
irb
irb(main):001:0> a = ''.force_encoding(Encoding::UTF_16)
=> ""
irb(main):002:0> a += "\uD83D\uDC69".force_encoding(Encoding::UTF_16)
/home/noraj/.asdf/installs/ruby/3.1.0/lib/ruby/3.1.0/irb/workspace.rb:119:in `eval': (irb):2: invalid Unicode codepoint (SyntaxError)
a += "\uD83D\uDC69".force_encoding(Encodi...
  ^~~~
(irb):2: invalid Unicode codepoint
a += "\uD83D\uDC69".force_encoding(Encoding::UT...
  ^~~~
        from /home/noraj/.asdf/installs/ruby/3.1.0/lib/ruby/gems/3.1.0/gems/irb-1.4.1/exe/irb:11:in `'
        from /home/noraj/.asdf/installs/ruby/3.1.0/bin/irb:25:in `load'
        from /home/noraj/.asdf/installs/ruby/3.1.0/bin/irb:25:in `'
```

Also see [Unicode 14.0 Implementation Guidelines - 5.4 Handling Surrogate Pairs in UTF-16](https://www.unicode.org/versions/Unicode14.0.0/ch05.pdf)

-- 
https://bugs.ruby-lang.org/
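
The endianness caveat in the reply can be sidestepped by naming the byte order explicitly. The following is a minimal sketch, not taken from the thread, that assumes the goal is to build and append the UTF-16 form of U+1F469 from its two code units; the variable names are illustrative only.

```ruby
# The two UTF-16 code units (surrogate pair) for U+1F469.
units = [0xD83D, 0xDC69]

# 'n*' packs unsigned 16-bit big-endian, 'v*' packs unsigned 16-bit
# little-endian, so each result is labelled with the matching encoding.
be = units.pack("n*").force_encoding(Encoding::UTF_16BE)
le = units.pack("v*").force_encoding(Encoding::UTF_16LE)

puts be.valid_encoding?                 # true
puts le.valid_encoding?                 # true
puts be.encode("UTF-8") == "\u{1F469}"  # true
puts le.encode("UTF-8") == "\u{1F469}"  # true

# Appending to an initially empty string, as in the original report,
# but with a concrete byte order instead of the generic UTF-16 label.
a = String.new(encoding: Encoding::UTF_16BE)
a << be
puts a.bytes.map { |b| format("%02X", b) }.join(" ")  # D8 3D DC 69
```

Using `UTF-16BE` or `UTF-16LE` keeps the bytes and the label in agreement and avoids the byte order mark that `encode('UTF-16')` prepends.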