From: Michael Selig
Date: 2008-10-26T11:25:58+09:00
Subject: [ruby-core:19515] String literal encoding (Was: Default source encoding (Was: [Bug #680] csv.rb: CSV.parse is toolate when encoding is mismatch))

Hi,

Sorry, perhaps I have been giving a (bad) solution, rather than stating the problem clearly, so let me try again! I certainly didn't mean to suggest there should be any transcoding of string literals by Ruby's parser.

So here are the problems as I see them. They are all to do with the default encoding of string literals, and they are all fairly minor, but I think addressing them has merit:

1) The encoding of string literals constructed with "\x..." is ambiguous.

Well, not strictly ambiguous, but it can certainly be confusing. The trouble is that a string literal like the example in bug #680

    "\x82\xA0,\x82\xA2"

can be used either as a "binary" string (ASCII-8BIT) or as an encoded character string (intended to be Shift_JIS in this case), and which one you get depends on the source encoding. While technically these are the same data, they are used in quite different ways in practice. Also, as we see in the bug report, it can cause mysterious errors such as "Bad UTF-8 string" because the source encoding was apparently UTF-8, not Shift_JIS (thank you to Martin for pointing this out).

Ruby treats strings constructed with "\u..." differently: they are set to UTF-8 no matter what the source encoding is. I think this is the correct behaviour - there is no ambiguity.

But "\x..." is not treated like this. When the source encoding is not specified (or is US-ASCII), a "\x.." string is set to ASCII-8BIT. Again I think this is the correct behaviour. However, if the source encoding is set to anything else, the encoding of the string is set to the source encoding. I think this is the part that is wrong, especially as the resultant string can be "broken", and no warning is given about this by the parser.

My preference would be to *always* encode string literals constructed with "\x.."
as ASCII-8BIT, ignoring the source encoding. This means that if you really want to use such a literal as an encoded string, you must use "force_encoding". I think this would be much clearer and would get rid of the "ambiguity".

2) I find it slightly redundant to have to specify BOTH the default_internal and the source encoding at the top of an m17n script which contains multibyte string literals, when in all practical cases they should be the same, e.g.:

    #! /usr/bin/ruby -E:UTF-8
    # encoding: UTF-8

My suggestion for "defaulting" the source encoding was an attempt to avoid having to do this (but probably not a good way!). It isn't a big deal, and I understand the argument that the source encoding is a property of the script. My original suggestion (last month) of a special magic comment was to have a way of specifying BOTH the default_internal and the source encoding once, but this idea was rejected.

3) I think there should be some check (a warning message?) that the (non ASCII-8BIT) string literals in a library file are compatible with the "default_internal" of the calling program (if it is set).

Ideally this check would be done when "require" is called, to flag possible incompatibilities early. Perhaps this check could be based on the library's source encoding? If this were done, most libraries would have to use a source encoding of US-ASCII (or just have no encoding magic comment), *not* UTF-8, so that non-Unicode default_internal settings will work. Perhaps Ruby could be smarter, and only flag an error if there actually is an incompatible string literal in the library?

4) I was surprised at the different source encoding behaviour when using "-e" compared to a script in a file. (Again, thank you to Martin for telling me about it.)

Matz wrote:
> -e takes programs from command line shell, which probably yields
> strings in locale encoding anyway. But we cannot assume that for
> scripts contained in files.

Again I understand the sentiment, but for a simple non-m17n, non-ASCII Ruby script that was likely written with an editor on the same machine or in the same locale, why force it to have an "encoding" magic comment? Also it means that:

    ruby test.rb

may perform differently than:

    ruby -e "`cat test.rb`"

Again potentially confusing, but not a big deal.

I hope I have made myself clearer this time!

Thanks,
Mike.
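
For reference, point 1 can be illustrated with a minimal sketch that takes the byte sequence from bug #680 and tags it explicitly with "force_encoding", which is the usage the proposal would require. This is only an illustration, not the parser behaviour discussed in the thread: it assumes a Ruby newer than the one under discussion (String#b appeared in Ruby 2.0), and the variable names are made up for the example.

```ruby
# Illustration only: the bytes from bug #680, tagged explicitly rather than
# relying on the source encoding. Assumes Ruby 2.0 or later for String#b.
raw = "\x82\xA0,\x82\xA2".b   # copy of the bytes tagged as ASCII-8BIT ("binary")

# The same bytes, reinterpreted under two different encodings.
sjis    = raw.dup.force_encoding("Shift_JIS")
as_utf8 = raw.dup.force_encoding("UTF-8")

puts sjis.valid_encoding?      # true: well-formed Shift_JIS
puts as_utf8.valid_encoding?   # false: a "broken" UTF-8 string

# Transcoding from the correct encoding gives HIRAGANA A, comma, HIRAGANA I.
puts sjis.encode("UTF-8") == "\u3042,\u3044"   # true

# A "\u..." literal is UTF-8 whatever the source encoding is.
puts "\u3042".encoding         # UTF-8
```

The bytes never change here; only the encoding tag does, which is why the same literal can be a valid Shift_JIS string and a broken UTF-8 string at once.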