[ruby-core:108030] [Ruby master Bug#18641] UTF-16 surrogate pairs
From:
duerst <noreply@...>
Date:
2022-03-23 00:25:21 UTC
List:
ruby-core #108030
Issue #18641 has been updated by duerst (Martin Dürst).
noraj (Alexandre ZANNI) wrote in #note-3:
> As far as I understand, it's not possible to have a native UTF-16 string; it will always be UTF-8 converted to UTF-16, so my only option to write surrogates directly is to use pack?
Or write your own custom method, but that's unnecessary. When it comes to encodings for Unicode, Ruby is definitely heavily biased towards UTF-8, because UTF-8 is compatible with ASCII.
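For illustration, here is a minimal sketch of the `pack` approach asked about above, next to the usual route of writing the scalar value in a UTF-8 literal and transcoding. It assumes big-endian UTF-16 (`Encoding::UTF_16BE`); the variable names are only illustrative.

```ruby
# "n*" packs each integer as a 16-bit big-endian unsigned value,
# so the two surrogate code units become the bytes D8 3D DC 69.
code_units = [0xD83D, 0xDC69]
from_pack = code_units.pack("n*").force_encoding(Encoding::UTF_16BE)

# Alternative without pack: write the scalar value U+1F469 in a
# UTF-8 literal and let Ruby transcode it to UTF-16BE.
from_encode = "\u{1F469}".encode(Encoding::UTF_16BE)

puts from_pack == from_encode                           # true
puts from_pack.unpack("n*").map { |u| format("%04X", u) }.join(" ") # D83D DC69
puts from_pack.encode(Encoding::UTF_8) == "\u{1F469}"   # true
```

Both strings carry the same bytes, so `pack` is only needed when the input really is a list of UTF-16 code units; for little-endian data the `"v*"` directive with `Encoding::UTF_16LE` would be the counterpart.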
----------------------------------------
Bug #18641: UTF-16 surrogate pairs
https://bugs.ruby-lang.org/issues/18641#change-96989
* Author: noraj (Alexandre ZANNI)
* Status: Rejected
* Priority: Normal
* ruby -v: ruby 3.1.1p18 (2022-02-18 revision 53f5fc4236) [x86_64-linux]
* Backport: 2.6: UNKNOWN, 2.7: UNKNOWN, 3.0: UNKNOWN, 3.1: UNKNOWN
----------------------------------------
That Ruby triggers an *invalid Unicode codepoint* error when surrogate pairs are used in a UTF-8 string is expected; however, those codepoints should be valid in a UTF-16 string.
It is also expected that unpaired surrogates are invalid; paired surrogates, however, are valid (cf. https://unicode.org/faq/utf_bom.html#utf16-7).
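As a sketch of the arithmetic behind that distinction: the helpers `to_surrogates` and `from_surrogates` below are hypothetical (they are not part of Ruby's API) and follow the standard UTF-16 formula. The last lines check how Ruby treats paired and unpaired code units once they are stored as UTF-16BE bytes, which is an assumption about byte order made for this example.

```ruby
# Hypothetical helper: split a supplementary code point into a
# high and a low surrogate code unit.
def to_surrogates(code_point)
  unless (0x10000..0x10FFFF).cover?(code_point)
    raise ArgumentError, "code point is outside the supplementary planes"
  end
  offset = code_point - 0x10000
  [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]
end

# Hypothetical helper: the inverse, from a surrogate pair back to
# the code point.
def from_surrogates(high, low)
  0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
end

high, low = to_surrogates(0x1F469)
puts format("%04X %04X", high, low)            # D83D DC69
puts format("%X", from_surrogates(high, low))  # 1F469

# Store the code units as 16-bit big-endian values and label them UTF-16BE.
paired   = [high, low].pack("n*").force_encoding(Encoding::UTF_16BE)
unpaired = [high].pack("n*").force_encoding(Encoding::UTF_16BE)
puts paired.valid_encoding?    # true
puts unpaired.valid_encoding?  # false
```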
Versions tested: 3.0.3p157, 3.1.0p0, and 3.1.1p18
``` ruby
➜ irb
irb(main):001:0> a = ''.force_encoding(Encoding::UTF_16)
=> ""
irb(main):002:0> a += "\uD83D\uDC69".force_encoding(Encoding::UTF_16)
/home/noraj/.asdf/installs/ruby/3.1.0/lib/ruby/3.1.0/irb/workspace.rb:119:in `eval': (irb):2: invalid Unicode codepoint (SyntaxError)
a += "\uD83D\uDC69".force_encoding(Encodi...
^~~~
(irb):2: invalid Unicode codepoint
a += "\uD83D\uDC69".force_encoding(Encoding::UT...
^~~~
from /home/noraj/.asdf/installs/ruby/3.1.0/lib/ruby/gems/3.1.0/gems/irb-1.4.1/exe/irb:11:in `<top (required)>'
from /home/noraj/.asdf/installs/ruby/3.1.0/bin/irb:25:in `load'
from /home/noraj/.asdf/installs/ruby/3.1.0/bin/irb:25:in `<main>'
```
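The session above fails while the literal is being parsed, before `force_encoding` is ever called. A small sketch that reproduces this outside irb, assuming the source text is handed to `eval` as a single-quoted string so that the `\u` escapes reach the parser unchanged (the exact error message may differ between Ruby versions):

```ruby
# Single quotes keep the backslashes, so eval sees the literal "\uD83D\uDC69".
source = '"\uD83D\uDC69"'

begin
  eval(source)
  rejected = false
rescue SyntaxError => e
  # SyntaxError is not a StandardError, so it has to be rescued explicitly.
  rejected = true
  puts e.message
end

puts rejected
```

Since the error is raised by the parser, changing the encoding of the resulting string afterwards cannot avoid it; the code units have to be supplied by other means, such as `pack` or transcoding from the scalar value `\u{1F469}`.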
Also see [Unicode 14.0 Implementation Guidelines - 5.4 Handling Surrogate Pairs in UTF-16](https://www.unicode.org/versions/Unicode14.0.0/ch05.pdf)
--
https://bugs.ruby-lang.org/
Unsubscribe: <mailto:ruby-core-request@ruby-lang.org?subject=unsubscribe>
<http://lists.ruby-lang.org/cgi-bin/mailman/options/ruby-core>