[ruby-core:115584] [Ruby master Bug#20025] Parsing identifiers/constants is case-folding dependent
From:
duerst via ruby-core <ruby-core@...>
Date:
2023-12-04 08:55:58 UTC
List:
ruby-core #115584
Issue #20025 has been updated by duerst (Martin Dürst).
@nobu (Nobuyoshi Nakada) wrote in #note-3:
> The reason is that micro sign is folded to small Mu in Windows-1253.
The micro sign is indeed folded to small mu in windows-1253. The reason is (most probably) that it is also folded this way in Unicode; see https://www.unicode.org/Public/UCD/latest/ucd/CaseFolding.txt. The actual data for this is the `'\354'` at https://github.com/ruby/ruby/blob/85bc80a51be0ceedcc57e7b6b779e6f8f885859e/enc/windows_1253.c#L67.
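The folding described above can be checked from Ruby itself. The following is a minimal sketch, assuming a Ruby version that supports `String#downcase(:fold)` and transcoding to Windows-1253; the constant names are made up for illustration and do not appear in the issue.

```ruby
# Unicode view: case folding maps MICRO SIGN (U+00B5) to
# GREEK SMALL LETTER MU (U+03BC), as listed in CaseFolding.txt.
MICRO  = "\u00B5"
MU     = "\u03BC"
FOLDED = MICRO.downcase(:fold)
puts "folds to small mu: #{FOLDED == MU}"

# Windows-1253 view: both characters are single bytes in this encoding.
MICRO_BYTE = MICRO.encode(Encoding::Windows_1253).bytes.first
MU_BYTE    = MU.encode(Encoding::Windows_1253).bytes.first
puts format("micro sign: 0x%02X, small mu: 0x%02X", MICRO_BYTE, MU_BYTE)

# The octal constant '\354' from the encoding table is the same byte as 0xEC.
puts "octal 354 is the small mu byte: #{0o354 == MU_BYTE}"
```

This only illustrates the data; it does not exercise the parser path discussed in the issue.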
P.S.: I really feel like proposing to change all these octal constants to hexadecimal, in order to bring them into the current century and align them with all the other data surrounding character encoding. But I guess that should be a separate issue.
----------------------------------------
Bug #20025: Parsing identifiers/constants is case-folding dependent
https://bugs.ruby-lang.org/issues/20025#change-105516
* Author: kddnewton (Kevin Newton)
* Status: Closed
* Priority: Normal
* Backport: 3.0: REQUIRED, 3.1: REQUIRED, 3.2: REQUIRED
----------------------------------------
When CRuby parses identifiers, it is encoding-dependent. Once the identifier is found, it determines if it starts with an uppercase or lowercase codepoint. This determines if the identifier is a constant or not.
The function in charge of this is `rb_sym_constant_char_p`. For non-Unicode encodings where the leading byte has the top bit set, this relies on Onigmo's `mbc_case_fold` to determine if it is a constant or not (as opposed to `is_code_ctype`).
This works for almost every single codepoint in every encoding, but has one very weird edge case. In the Windows-1253 encoding, the 0xB5 byte is the micro sign. The micro sign, when case folded, becomes the uppercase mu character, and then the lowercase mu character, or 0xEC. This means that even though 0xB5 reports itself as being a lowercase codepoint, it gets parsed as a constant. This example might make this more clear:
``` ruby
class Context < BasicObject
  def method_missing(name, *) = :identifier
  def self.const_missing(name) = :constant
end

encoding = Encoding::Windows_1253
character = 0xB5.chr(encoding)

source = "# encoding: #{encoding.name}\n#{character}\n"
result = Context.new.instance_eval(source)

puts "#{encoding.name} encoding of 0x#{character.ord.to_s(16).upcase}"
puts " [[:alpha:]] => #{character.match?(/[[:alpha:]]/)}"
puts " [[:alnum:]] => #{character.match?(/[[:alnum:]]/)}"
puts " [[:upper:]] => #{character.match?(/[[:upper:]]/)}"
puts " [[:lower:]] => #{character.match?(/[[:lower:]]/)}"
puts " parsed as #{result}"
```
This results in the following output:
```
Windows-1253 encoding of 0xB5
 [[:alpha:]] => true
 [[:alnum:]] => true
 [[:upper:]] => false
 [[:lower:]] => true
 parsed as constant
```
To be clear, I don't think the case-folding is incorrect here (and @duerst confirms that it is correct). I believe instead that it is incorrect to use case-folding here to determine if a codepoint is uppercase or not.
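To illustrate the distinction, here is a simplified model of the two strategies. It is a sketch, not the C implementation: both helper methods are hypothetical, and it uses UTF-8 strings with `String#downcase(:fold)` as a stand-in for the fold-based check rather than calling `mbc_case_fold` on Windows-1253 bytes.

```ruby
# Hypothetical model of a fold-based check: treat a character as
# "uppercase" whenever case folding changes it.
def constant_by_fold?(char)
  char.downcase(:fold) != char
end

# Hypothetical model of a ctype-based check: ask the character class directly.
def constant_by_ctype?(char)
  char.match?(/\A[[:upper:]]/)
end

# "A", "a", capital Mu, small mu, micro sign
["A", "a", "\u039C", "\u03BC", "\u00B5"].each do |char|
  puts format("U+%04X fold=%-5s ctype=%s",
              char.ord, constant_by_fold?(char), constant_by_ctype?(char))
end
```

In this model the two checks agree for the ordinary letters and disagree only for the micro sign (U+00B5), which folds to a different character without being uppercase.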
Note that this only impacts this one codepoint in this one encoding, so I don't believe this is actually a large-scale problem. But I found it surprising, and think we should change it.
-- 
https://bugs.ruby-lang.org/
______________________________________________
ruby-core mailing list -- ruby-core@ml.ruby-lang.org
To unsubscribe send an email to ruby-core-leave@ml.ruby-lang.org
ruby-core info -- https://ml.ruby-lang.org/mailman3/postorius/lists/ruby-core.ml.ruby-lang.org/