From: daniel@...42.com
Date: 2021-02-01T15:25:04+00:00
Subject: [ruby-core:102361] [Ruby master Bug#17594] Sort order of UTF-16LE is based on binary representation instead of codepoints

Issue #17594 has been updated by Dan0042 (Daniel DeLorme).

I agree that real high-quality string sorting requires a specialized library, but there's no need to aim so high here. Perfect is the enemy of good. I think it would be good to have a minimum level of consistency between Unicode encodings.

Or maybe per-codepoint ordering could be only for casecmp? Since it already has a different order from case-sensitive sort anyway...

----------------------------------------
Bug #17594: Sort order of UTF-16LE is based on binary representation instead of codepoints
https://bugs.ruby-lang.org/issues/17594#change-90222

* Author: Dan0042 (Daniel DeLorme)
* Status: Open
* Priority: Normal
* Backport: 2.5: UNKNOWN, 2.6: UNKNOWN, 2.7: UNKNOWN, 3.0: UNKNOWN
----------------------------------------
I just discovered that string sorting is always based on bytes, so the order of UTF-16LE strings will give some peculiar results:

```ruby
BE, LE = 'UTF-16BE', 'UTF-16LE'
str = [*0..0x4ff].pack('U*').scan(/\p{Ll}/).join

puts str.encode(BE).chars.sort.first(50).join.encode('UTF-8')
#abcdefghijklmnopqrstuvwxyzµßàáâãäåæçèéêëìíîïðñòóôõ

puts str.encode(LE).chars.sort.first(50).join.encode('UTF-8')
#āȁăȃąȅćȇĉȉċȋčȍďȏđȑēȓĕȕėȗęșěțĝȝğȟġȡģȣĥȥħȧĩȩīȫĭȭįȯаı

'a'.encode(BE) < 'ā'.encode(BE) #=> true
'a'.encode(LE) < 'ā'.encode(LE) #=> false
```

Is this supposed to be correct? I mean, I somewhat understand the idea of just sorting by bytes, but I find the above output to be remarkably nonsensical.

A similar/related issue was found and fixed in #8653, so there's precedent for considering codepoints instead of bytes.
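
To make the byte-order effect concrete, here is a minimal sketch (not part of the original report) of the raw bytes behind the comparison, plus one possible user-level workaround. The helper name `codepoint_sort` is hypothetical and only for illustration; it is not an existing Ruby method or a proposed patch.

```ruby
BE, LE = 'UTF-16BE', 'UTF-16LE'

a     = 'a'       # U+0061
a_bar = "\u0101"  # U+0101, LATIN SMALL LETTER A WITH MACRON

# Raw bytes in each encoding. Comparing two strings of the same
# encoding compares these bytes from left to right.
a.encode(BE).bytes      #=> [0, 97]
a_bar.encode(BE).bytes  #=> [1, 1]
a.encode(LE).bytes      #=> [97, 0]  (low byte first, so 97 > 1)
a_bar.encode(LE).bytes  #=> [1, 1]

# Hypothetical helper: sort by the array of codepoints instead of by
# bytes, which gives the same order regardless of byte order.
def codepoint_sort(strings)
  strings.sort_by(&:codepoints)
end

le = [a_bar, a].map { |s| s.encode(LE) }
le.sort.map { |s| s.encode('UTF-8') }            #=> ["ā", "a"]
codepoint_sort(le).map { |s| s.encode('UTF-8') } #=> ["a", "ā"]
```

This only shows that a codepoint-based order is reachable from user code; whether `<=>` itself should behave this way is the question raised in this issue.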
The reason I'm asking is that I was working on some optimizations for `String#casecmp` (https://github.com/ruby/ruby/pull/4133) which, as a side effect, sort by codepoint for UTF-16LE. That resulted in a different order for `<=>` vs `casecmp`, and thus some tests broke. But I think sorting by codepoint would be better in this case.

-- 
https://bugs.ruby-lang.org/