From: duerst via ruby-core
Date: 2024-01-05T00:25:48+00:00
Subject: [ruby-core:116024] [Ruby master Bug#20148] Sorting not working as expected on Urdu words.

Issue #20148 has been updated by duerst (Martin Dürst).

Status changed from Open to Rejected

The characters involved (shown right-to-left in most environments) are:

U+0627 ا ARABIC LETTER ALEF
U+0628 ب ARABIC LETTER BEH
U+062A ت ARABIC LETTER TEH
U+0679 ٹ ARABIC LETTER TTEH
U+067E پ ARABIC LETTER PEH

The first three characters are widely used in most if not all languages written with the Arabic script. The last two are more specific; in the code charts (see https://www.unicode.org/charts/PDF/U0600.pdf), TTEH has an annotation of 'Urdu', and PEH has an annotation of 'Persian, Urdu,...'.

In the Urdu alphabet (see https://en.wikipedia.org/wiki/Urdu_alphabet), these are the first five letters, where PEH comes directly after BEH, and TTEH comes directly after TEH.

The Ruby `sort` method sorts these letters/strings in Unicode codepoint order, the same way it does for all characters/strings. That's because sorting text is language-dependent, so there is no single order that is correct everywhere. As an example, Swedish sorts 'ä' and 'ö' after 'z', whereas German sorts them with 'a' and 'o', respectively. It's impossible for `sort` to get it correct for both languages at the same time, and language-specific sorting would require a lot of data. I'm not sure how Arabic-speaking people would sort PEH or TTEH, if they recognize these letters at all.

This is also similar to expecting `['a', 'A', 'b', 'B'].sort` to produce `['A', 'a', 'B', 'b']`, when it actually produces `["A", "B", "a", "b"]`.

So I'm sorry to have to reject this, because `sort` works according to the specification. A feature request to provide language-specific string comparisons (e.g. `string1.<=>(string2, 'ur')`, so that this can be used in a block with `sort`) may be appropriate, but it will take quite some time to implement this.

Alternatively, I suggest you define a hash for the Urdu alphabet order, e.g.
```
{"ا" => 1, "ب" => 2, "پ" => 3, "ت" => 4, "ٹ" => 5 }
```
(the code above will look strange because of the effects of the Unicode Bidirectional Algorithm, but it should be correct), and use that with the `sort_by` method to sort Urdu strings.

----------------------------------------
Bug #20148: Sorting not working as expected on Urdu words.
https://bugs.ruby-lang.org/issues/20148#change-106018

* Author: zohaibnadeem13@gmail.com (Zohaib Nadeem)
* Status: Rejected
* Priority: Normal
* ruby -v: 3.1.4
* Backport: 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN, 3.3: UNKNOWN
----------------------------------------
I was trying to sort an array of Urdu characters and found an ambiguity in the result. Here is the script that I am using.

['ا', 'ب', 'پ', 'ت', 'ٹ'].sort

Actual Result: ["ا", "ب", "ت", "ٹ", "پ"]
Expected Result: ["ا", "ب", "پ", "ت", "ٹ"]

-- 
https://bugs.ruby-lang.org/
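
As a rough sketch of the `sort_by` suggestion in the reply above, the following shows one way the hash could be used. The names `URDU_ORDER`, `urdu_sort_key`, and `urdu_sort` are hypothetical and not part of Ruby, the table covers only the five letters discussed in this thread, and the fallback for unlisted characters is an assumption rather than anything the reply specifies. The letters are written as `\u` escapes so that the source does not get reordered visually by the Bidirectional Algorithm.

```ruby
# Hypothetical lookup table: position in the Urdu alphabet for the five
# letters discussed in this thread. A real table would cover every letter.
URDU_ORDER = {
  "\u0627" => 1, # ARABIC LETTER ALEF
  "\u0628" => 2, # ARABIC LETTER BEH
  "\u067E" => 3, # ARABIC LETTER PEH
  "\u062A" => 4, # ARABIC LETTER TEH
  "\u0679" => 5  # ARABIC LETTER TTEH
}.freeze

# Turn a string into an array of ranks, one per character.
# Assumption: characters missing from the table sort after all listed
# letters, ordered among themselves by codepoint.
def urdu_sort_key(string)
  string.each_char.map do |ch|
    URDU_ORDER.fetch(ch) { URDU_ORDER.size + 1 + ch.ord }
  end
end

# Arrays of integers compare element by element, so sort_by can use the
# rank arrays directly as sort keys.
def urdu_sort(strings)
  strings.sort_by { |s| urdu_sort_key(s) }
end

# Input given in codepoint order, i.e. what plain `sort` returns.
letters = ["\u0627", "\u0628", "\u062A", "\u0679", "\u067E"]

p urdu_sort(letters).map { |s| format("U+%04X", s.ord) }
# prints ["U+0627", "U+0628", "U+067E", "U+062A", "U+0679"]
```

Because the key is built per character, multi-letter words are compared letter by letter in the table's order. This is only a per-character approximation and does not attempt full language-aware collation.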