From: duerst
Date: 2022-03-23T00:25:21+00:00
Subject: [ruby-core:108030] [Ruby master Bug#18641] UTF-16 surrogate pairs

Issue #18641 has been updated by duerst (Martin Dürst).

noraj (Alexandre ZANNI) wrote in #note-3:

> As far as I understand, it's not possible to have a native UTF-16 string; it will always be UTF-8 converted to UTF-16, so my only option to write surrogates directly is to use pack?

Or write your own custom method, but that's unnecessary. When it comes to encodings for Unicode, Ruby is definitely heavily biased towards UTF-8, because UTF-8 is compatible with ASCII.

----------------------------------------
Bug #18641: UTF-16 surrogate pairs
https://bugs.ruby-lang.org/issues/18641#change-96989

* Author: noraj (Alexandre ZANNI)
* Status: Rejected
* Priority: Normal
* ruby -v: ruby 3.1.1p18 (2022-02-18 revision 53f5fc4236) [x86_64-linux]
* Backport: 2.6: UNKNOWN, 2.7: UNKNOWN, 3.0: UNKNOWN, 3.1: UNKNOWN
----------------------------------------
That Ruby raises an *invalid Unicode codepoint* error when surrogate pairs are used in a UTF-8 string is expected; however, those codepoints should be valid in a UTF-16 string. It is also expected that unpaired surrogates are invalid, whereas paired surrogates are valid; cf. https://unicode.org/faq/utf_bom.html#utf16-7.

Versions tested: 3.0.3p157, 3.1.0p0 and 3.1.1p18

``` ruby
$ irb
irb(main):001:0> a = ''.force_encoding(Encoding::UTF_16)
=> ""
irb(main):002:0> a += "\uD83D\uDC69".force_encoding(Encoding::UTF_16)
/home/noraj/.asdf/installs/ruby/3.1.0/lib/ruby/3.1.0/irb/workspace.rb:119:in `eval': (irb):2: invalid Unicode codepoint (SyntaxError)
a += "\uD83D\uDC69".force_encoding(Encodi...
        ^~~~
(irb):2: invalid Unicode codepoint
a += "\uD83D\uDC69".force_encoding(Encoding::UT...
              ^~~~
        from /home/noraj/.asdf/installs/ruby/3.1.0/lib/ruby/gems/3.1.0/gems/irb-1.4.1/exe/irb:11:in `<top (required)>'
        from /home/noraj/.asdf/installs/ruby/3.1.0/bin/irb:25:in `load'
        from /home/noraj/.asdf/installs/ruby/3.1.0/bin/irb:25:in `<main>'
```

Also see [Unicode 14.0 Implementation Guidelines - 5.4 Handling Surrogate Pairs in UTF-16](https://www.unicode.org/versions/Unicode14.0.0/ch05.pdf)

--
https://bugs.ruby-lang.org/
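
For the `pack` route discussed in the reply, a minimal sketch might look like the following. The helper name `surrogate_pair` is hypothetical and not from the thread. UTF-16BE is an assumption made here because the `"n"` pack directive writes big-endian 16-bit units; the original report used `Encoding::UTF_16` instead.

``` ruby
# Hypothetical helper (not from the thread): split a supplementary-plane
# code point (U+10000..U+10FFFF) into its UTF-16 high and low surrogates.
def surrogate_pair(code_point)
  unless (0x10000..0x10FFFF).cover?(code_point)
    raise ArgumentError, "not a supplementary code point"
  end

  offset = code_point - 0x10000
  high = 0xD800 + (offset >> 10)    # top 10 bits
  low  = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits
  [high, low]
end

# Write the surrogates directly as 16-bit big-endian code units, then
# label the resulting bytes as UTF-16BE.
units  = surrogate_pair(0x1F469)    # the code point behind \uD83D\uDC69
packed = units.pack("n*").force_encoding(Encoding::UTF_16BE)

# The more usual route: write the character in UTF-8 and transcode.
encoded = "\u{1F469}".encode(Encoding::UTF_16BE)

# An unpaired high surrogate, for comparison.
lone = [0xD83D].pack("n").force_encoding(Encoding::UTF_16BE)

puts units.map { |u| format("%04X", u) }.join(" ")  # prints "D83D DC69"
puts packed.valid_encoding?                          # prints "true"
puts packed == encoded                               # prints "true"
puts lone.valid_encoding?                            # prints "false"
```

Both routes should end up with the same four bytes, so transcoding from a UTF-8 literal such as `"\u{1F469}"` is usually the simpler choice; `pack` is mainly useful when the code units themselves are the input.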