From: eregontp@...
Date: 2019-01-02T12:18:09+00:00
Subject: [ruby-core:90853] [Ruby trunk Bug#15497] Encoding of error messages should not depend on the locale encoding

Issue #15497 has been reported by Eregon (Benoit Daloze).

----------------------------------------
Bug #15497: Encoding of error messages should not depend on the locale encoding
https://bugs.ruby-lang.org/issues/15497

* Author: Eregon (Benoit Daloze)
* Status: Open
* Priority: Normal
* Assignee:
* Target version:
* ruby -v: ruby 2.6.0p0 (2018-12-25 revision 66547) [x86_64-linux]
* Backport: 2.4: UNKNOWN, 2.5: UNKNOWN, 2.6: UNKNOWN
----------------------------------------
This seems to happen mostly for internal errors, as `raise` in Ruby code of course just uses the passed String's encoding for the message.

Example:

```ruby
name = "été"
p name.encoding
begin
  Module.new.const_set(name, 1)
rescue => e
  p e
  p e.message.encoding
end
```

When run, it gives:

```
$ LANG=en_US.UTF-8 ruby c.rb
#<Encoding:UTF-8>
#<NameError: wrong constant name été>
#<Encoding:UTF-8>

$ LANG=C ruby c.rb
#<Encoding:UTF-8>
#<NameError: wrong constant name "\u00E9t\u00E9">
#<Encoding:US-ASCII>
```

Depending on the locale encoding, the encoding of the message changes!
This seems very unexpected, is inconvenient for testing (e.g., https://github.com/ruby/spec/commit/a6101a6e and any test checking exception messages with non-US-ASCII characters), and does not represent what is in the source code (here it's clearly a valid UTF-8 String).

I think for such a case, the encoding of the constant name should be used, i.e., UTF-8.
Another way to see it is the message should be built like `"wrong constant name ".force_encoding('us-ascii') + constant_name`.
Indeed, if we do build the message manually like that, it works as expected:

```ruby
name = "été"
p name.encoding
begin
  raise "wrong constant name ".force_encoding('US-ASCII') + name
rescue => e
  p e
  p e.message.encoding
end
```

gives

```
$ LANG=en_US.UTF-8 ruby c.rb
#<Encoding:UTF-8>
#<RuntimeError: wrong constant name été>
#<Encoding:UTF-8>

$ LANG=C ruby c.rb
#<Encoding:UTF-8>
#<RuntimeError: wrong constant name \u00E9t\u00E9>
#<Encoding:UTF-8>
```

Note that the message still looks different, but that's the effect of `Kernel#p`, because it does not know how to display UTF-8 characters in a US-ASCII terminal.
Nevertheless, both messages have the same bytes and encoding, which fixes all 3 problems mentioned above.

Setting `Encoding.default_internal` can work around this, but it's a bad workaround: it cannot work reliably in a multithreaded Ruby application, it affects many more things than just error messages, and the default behavior should be error messages with a deterministic encoding, just like `raise` in Ruby code.
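
The claim that both messages have the same bytes and encoding can also be checked without going through `Kernel#p`, whose rendering can vary with the locale. Here is a minimal sketch of that check. `build_wrong_constant_name_message` is a hypothetical helper written only for this illustration; it is not how Ruby builds the message internally, and the name is written with `\u` escapes so the file itself stays ASCII-only.

```ruby
# Hypothetical helper (not part of Ruby): builds the message the way the
# report suggests, from a US-ASCII prefix plus the caller-supplied name.
def build_wrong_constant_name_message(name)
  prefix = String.new("wrong constant name ", encoding: Encoding::US_ASCII)
  # An ASCII-only US-ASCII String is compatible with a UTF-8 String, and
  # the concatenation takes the encoding of the non-ASCII operand.
  prefix + name
end

# "\u00E9t\u00E9" is "été"; a literal using \u escapes is a UTF-8 String.
name = "\u00E9t\u00E9"
message = build_wrong_constant_name_message(name)

# Look at the encoding and the raw bytes directly rather than at the
# escaped form printed by Kernel#p.
puts message.encoding
puts message.valid_encoding?
puts message.bytes.inspect
```

Running this under both `LANG=en_US.UTF-8` and `LANG=C` and comparing the output is one way to confirm that nothing in the manually built message depends on the locale.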