From: Michael Selig
Date: 2008-10-24T16:48:04+09:00
Subject: [ruby-core:19473] Re: Default source encoding (Was: [Bug #680] csv.rb: CSV.parse is too late when encoding is mismatch)

Hi,

I am not sure I understand your argument about not defaulting the source
encoding.

The problem I am trying to solve is the compatibility of string literals in
your source with strings from other sources. "default_internal" was
introduced to try to make all strings the same encoding and so avoid
incompatibilities. But at the moment string literals seem to default to the
source encoding, or to UTF-8 if it is not set (please correct me if I am
wrong). What I was suggesting was a way to make string literals compatible.

This normally isn't a problem if:
a) All string literals are 7 bit ASCII, or
b) The source encoding matches "default_internal"

If the source encoding of a program containing non-ascii string literals is
set differently from default_internal, you are asking for trouble, and it
would defeat the purpose of default_internal. Therefore, to prevent the
programmer from having to remember to specify both, it makes sense to me
that the source encoding should default to default_internal. I think this is
important.

Now when default_internal is not set, we have some different issues. If the
source encoding defaults to the locale, this may cause different behaviour
and confusion if the script is run in different locales. However,
"default_external" already defaults to the locale, and with
"default_internal" not set, strings read from files are going to be in the
locale's encoding anyhow. So that possibly means encoding compatibility
issues between data read from files and string literals. Either way, there
are possible problems. Possibly the better solution when default_internal is
not set is to default the source encoding to "default_external" (which in
turn defaults to the locale).

(By the way, I am not talking about libraries here.
As I have stressed previously, libraries should be carefully written either
to use ASCII string literals only, or to make sure that they transcode them
properly.)

The only other (and perhaps better) solution I can think of is to separate
the notions of "source encoding" and "string literal encoding". Then you can
have the source encoding set to anything, but always force non-ascii string
literals to the "string literal encoding", which can default to
Encoding.default_internal || Encoding.default_external. But perhaps this is
going overboard with too many different encoding settings.

Finally, are you suggesting that "-e" should behave differently from a
single-line ruby script? That seems non-intuitive to me.

Cheers,
Mike

On Fri, 24 Oct 2008 17:52:17 +1100, Martin Duerst wrote:

> A default for the source encoding has been discussed quite a long
> time ago (in some Japanese meetings or on ruby-dev, I don't remember),
> and the conclusion was that the source encoding has to be given
> (with a magic comment) in the file itself (unless the file is all ascii).
>
> The reason for this is that the source encoding is a property of the
> source, and nothing else. On very simple scripts, it might occasionally
> be slightly easier if it were the same as default_external or
> default_internal, but this is only the case as long as you stay
> in exactly the same environment, and don't move the script.
> But scripts grow and move, so it's better to get the settings
> right at the start.
>
> However, as far as I remember, the idea was that for -e,
> default_external should be used, because that's what one
> is using in a shell. I'm not sure why this doesn't work below.
> (assuming Takeyuki is working in a Shift_JIS environment,
> which isn't completely sure).
>
> Regards,    Martin.
>
>
> At 12:12 08/10/24, Michael Selig wrote:
>> Hi,
>>
>> This bug actually brings up an interesting issue - should the source
>> encoding default to something other than UTF-8 (ie: if it is not
>> specified in the "magic comment")?
>>
>> Perhaps it should default to the encoding specified by the user's
>> locale?
>> Or perhaps it should default to the value of "default_internal" if it is
>> set? Or even default_external?
>>
>> I suggest that it should default to "default_internal" if that is set,
>> and then to the locale encoding if not.
>>
>> What do others think?
>> Having it default to the locale in this case would probably avoid the
>> encoding mismatch entirely (and the resulting confusion).
>>
>> Cheers
>> Mike
>>
>> On Fri, 24 Oct 2008 11:58:33 +1100, Takeyuki Fujioka wrote:
>>
>>> Bug #680: csv.rb: CSV.parse is too late when encoding is mismatch
>>> http://redmine.ruby-lang.org/issues/show/680
>>>
>>> Author: Takeyuki Fujioka
>>> Status: Open, Priority: Normal
>>> Category: lib, Target version: 1.9.x
>>>
>>> I think this result is true, but encoding mismatch raise is too late.
>>>
>>> see:
>>> % time ruby19 -rcsv -e 'CSV.parse(("\x82\xA0,\x82\xA2\n"*10000).force_encoding("shift_jis"))'
>>> ruby19 -rcsv -e   0.30s user 0.02s system 96% cpu 0.330 total
>>>
>>> % time ruby19 -rcsv -e 'CSV.parse(("\x82\xA0,\x82\xA2\n"*10000))'
>>> /Users/fujioka/local/lib/ruby/1.9.0/csv.rb:1981:in `=~': broken UTF-8 string (ArgumentError)
>>>         from /Users/fujioka/local/lib/ruby/1.9.0/csv.rb:1981:in `init_separators'
>>>         from /Users/fujioka/local/lib/ruby/1.9.0/csv.rb:1563:in `initialize'
>>>         from /Users/fujioka/local/lib/ruby/1.9.0/csv.rb:1350:in `new'
>>>         from /Users/fujioka/local/lib/ruby/1.9.0/csv.rb:1350:in `parse'
>>>         from -e:1:in `'
>>> ruby19 -rcsv -e 'CSV.parse(("\x82\xA0,\x82\xA2\n"*10000))'   1.55s user 2.57s system 90% cpu 4.530 total
>>>
>>>
>>> ----------------------------------------
>>> http://redmine.ruby-lang.org
>>
>>
>>
>
>
> #-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
> #-#-# http://www.sw.it.aoyama.ac.jp   mailto:duerst@it.aoyama.ac.jp
>
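
The compatibility question discussed in this thread can be illustrated with a small Ruby sketch. It is not taken from csv.rb or from any of the messages above. The helper name proposed_literal_encoding is hypothetical and only spells out the "Encoding.default_internal || Encoding.default_external" default suggested for string literals; the byte sequence is the one from the bug report. It assumes a Ruby version where Encoding.compatible? and String#encode are available.

```ruby
# Sketch only: proposed_literal_encoding is a hypothetical helper, not a Ruby API.
# It spells out the default suggested in the thread for a "string literal encoding".
def proposed_literal_encoding
  Encoding.default_internal || Encoding.default_external
end

# Bytes from the bug report, explicitly tagged as Shift_JIS.
sjis = "\x82\xA0".dup.force_encoding("Shift_JIS")

# A non-ASCII literal written with a \u escape, so it is UTF-8
# regardless of the source encoding of this file.
utf8 = "\u3042"

# A 7-bit ASCII literal, case (a) in the message above.
ascii = "abc"

# ASCII-only literals mix freely with data in an ASCII-compatible encoding.
ascii_ok = Encoding.compatible?(ascii, sjis)

# Non-ASCII strings in two different encodings are not compatible.
non_ascii_ok = Encoding.compatible?(utf8, sjis)

# Concatenating them raises instead of silently producing mixed bytes.
mismatch = begin
  utf8 + sjis
  false
rescue Encoding::CompatibilityError
  true
end

# Transcoding the external data first makes the two strings comparable.
transcoded = sjis.encode(Encoding::UTF_8)

puts "ascii compatible with: #{ascii_ok.inspect}"
puts "non-ascii compatible with: #{non_ascii_ok.inspect}"
puts "concatenation raised: #{mismatch}"
puts "transcoded equals literal: #{transcoded == utf8}"
puts "proposed literal encoding here: #{proposed_literal_encoding}"
```

The last line depends on the locale and on whether default_internal is set, which is the environment dependence the thread is arguing about.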