From: "noraj (Alexandre ZANNI)"
Date: 2022-03-19T13:22:29+00:00
Subject: [ruby-core:107989] [Ruby master Bug#18641] UTF-16 surrogate pairs

Issue #18641 has been updated by noraj (Alexandre ZANNI).

Thank you, Martin. I'm actually working on a Unicode study. I was not interested in representing the emoji by its code point, but in being able to write a non-BMP glyph in UTF-16 by using the surrogates directly.

As far as I understand, it's not possible to have a native UTF-16 string: it will always be a UTF-8 string converted to UTF-16. So is my only option for writing surrogates directly to use pack?

----------------------------------------
Bug #18641: UTF-16 surrogate pairs
https://bugs.ruby-lang.org/issues/18641#change-96942

* Author: noraj (Alexandre ZANNI)
* Status: Rejected
* Priority: Normal
* ruby -v: ruby 3.1.1p18 (2022-02-18 revision 53f5fc4236) [x86_64-linux]
* Backport: 2.6: UNKNOWN, 2.7: UNKNOWN, 3.0: UNKNOWN, 3.1: UNKNOWN
----------------------------------------
It is expected that Ruby raises an *invalid Unicode codepoint* error when surrogate code points are used in a UTF-8 string; however, those code points should be valid in a UTF-16 string. It is also expected that unpaired surrogates are invalid, whereas paired surrogates are valid, cf. https://unicode.org/faq/utf_bom.html#utf16-7.

Versions tested: 3.0.3p157, 3.1.0p0 and 3.1.1p18

``` ruby
$ irb
irb(main):001:0> a = ''.force_encoding(Encoding::UTF_16)
=> ""
irb(main):002:0> a += "\uD83D\uDC69".force_encoding(Encoding::UTF_16)
/home/noraj/.asdf/installs/ruby/3.1.0/lib/ruby/3.1.0/irb/workspace.rb:119:in `eval': (irb):2: invalid Unicode codepoint (SyntaxError)
a += "\uD83D\uDC69".force_encoding(Encodi...
      ^~~~
(irb):2: invalid Unicode codepoint
a += "\uD83D\uDC69".force_encoding(Encoding::UT...
            ^~~~
	from /home/noraj/.asdf/installs/ruby/3.1.0/lib/ruby/gems/3.1.0/gems/irb-1.4.1/exe/irb:11:in `'
	from /home/noraj/.asdf/installs/ruby/3.1.0/bin/irb:25:in `load'
	from /home/noraj/.asdf/installs/ruby/3.1.0/bin/irb:25:in `'
```

Also see [Unicode 14.0 Implementation Guidelines - 5.4 Handling Surrogate Pairs in UTF-16](https://www.unicode.org/versions/Unicode14.0.0/ch05.pdf)

-- 
https://bugs.ruby-lang.org/
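
On the question of writing surrogates directly, here is a minimal sketch of the pack-based approach. It assumes big-endian UTF-16 (`Encoding::UTF_16BE`, so no BOM is needed) and uses a helper named `utf16be_from_units`, which is a hypothetical name chosen for illustration and not part of Ruby. The idea is to pack the 16-bit code units as raw bytes and then tag those bytes as UTF-16BE, so the `\u` escape (which only accepts Unicode scalar values) is never involved.

``` ruby
# Hypothetical helper (not a Ruby built-in): build a UTF-16BE string
# from raw 16-bit code units, surrogates included.
def utf16be_from_units(*units)
  # "n*" packs each integer as a 16-bit big-endian unsigned value.
  # force_encoding only relabels the bytes; it does not transcode them.
  units.pack("n*").force_encoding(Encoding::UTF_16BE)
end

# High and low surrogate for U+1F469, written directly as code units.
woman = utf16be_from_units(0xD83D, 0xDC69)
woman_valid = woman.valid_encoding?           # paired surrogates: true
woman_utf8  = woman.encode(Encoding::UTF_8)   # transcodes to U+1F469

# An unpaired high surrogate is rejected by the encoding check.
lone = utf16be_from_units(0xD83D)
lone_valid = lone.valid_encoding?             # false

# The reverse direction: transcoding from UTF-8 yields the same code units.
transcoded_units = "\u{1F469}".encode(Encoding::UTF_16BE).unpack("n*")
# transcoded_units == [0xD83D, 0xDC69]
```

The same bytes could also be written as a literal with `\x` escapes and then passed through `force_encoding`; the pack form just keeps the code units readable as 16-bit values.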