ruby-core

Hello Michael,

Many thanks for your proposal. Earlier, when I proposed some
general "encoding policies" to deal with this and similar
problems, the main objection raised was that they would
interoperate badly with libraries. But looking at your
concrete proposal, it seems to me that, overall, the problems
wouldn't actually be that serious.

Therefore, I think we should seriously consider this proposal,
and hopefully implement it before Sept. 25th. In terms of
implementation, I don't think it should be that difficult,
but it may be quite a bit of work to check
Encoding::default_internal in all the affected methods.

In terms of potential problems, I see the following:
- A library sets Encoding::default_internal. That would lead
  to serious problems, and should be clearly advised against
  in the documentation. Libraries either have to be written
  in a general way, or have to document that they only work
  with certain values of Encoding::default_internal
  (this proposal would therefore help you, but not e.g.
   James Gray for the CSV library)
- Encoding::default_internal is set to some dummy or non-ASCII-
  compatible encoding, which may lead to some hiccups.
  We may want to make that impossible or advise against it.
  (the main use is UTF-8 anyway)
- We should think through various scenarios for output.
  I can't think of any problems just now; I only noticed
  the absence of considerations for output below.
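
To illustrate the second point, here is a rough sketch of the kind of
check I have in mind. The method name suitable_internal_encoding? is
purely hypothetical (it is not an existing or proposed API); it only
relies on Encoding#dummy? and Encoding#ascii_compatible?.

```ruby
# Hypothetical helper (name invented for illustration): is this encoding
# a sensible candidate for Encoding::default_internal?
def suitable_internal_encoding?(enc)
  enc = Encoding.find(enc) unless enc.is_a?(Encoding)
  # Reject dummy encodings (e.g. UTF-7) and encodings that are not
  # ASCII-compatible (e.g. UTF-16LE).
  !enc.dummy? && enc.ascii_compatible?
end

utf8_ok    = suitable_internal_encoding?("UTF-8")
utf16le_ok = suitable_internal_encoding?("UTF-16LE")
utf7_ok    = suitable_internal_encoding?("UTF-7")
```

Whether such a check should raise an error or only be documented as
advice is exactly the open question above.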

The advantages that I see with this proposal are:
- It gets rid of the bad usability for "r:UTF-16LE:UTF-8"
  (matz, ruby-core:18666)
- It clearly helps "Unicode inside" applications, but is
  not limited to any encoding and may be helpful for other
  encodings as well.
- It fits well within the rest of the naming scheme and the
  overall idea of having several specific encodings to make
  the work of the user easier. If we didn't have
  Encoding::default_external, using Ruby with a single
  local encoding would be a big pain. Introducing
  Encoding::default_internal makes using Ruby with
  "Unicode inside" much less of a pain.
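
As a sketch of the first advantage, assuming Encoding::default_internal
gets a setter analogous to Encoding::default_external and that IO
consults it when no internal encoding is given (that is my reading of
the proposal, not a settled design), the two styles would compare like
this. The file name and sample text are made up for the example.

```ruby
require "tmpdir"

explicit = implicit = nil

Dir.mktmpdir do |dir|
  path = File.join(dir, "sample-utf16le.txt")  # made-up file name
  # Write the raw UTF-16LE bytes so there is something to read back.
  File.binwrite(path, "caf\u00E9\n".encode("UTF-16LE"))

  # Today: external and internal encoding spelled out on every open.
  explicit = File.open(path, "r:UTF-16LE:UTF-8") { |f| f.read }

  # With default_internal set once, only the external encoding is needed.
  previous = Encoding.default_internal
  begin
    Encoding.default_internal = Encoding::UTF_8
    implicit = File.open(path, "r:UTF-16LE") { |f| f.read }
  ensure
    # Restore the previous value so the rest of the program is unaffected.
    Encoding.default_internal = previous
  end
end
```

Both reads should give the same UTF-8 string; the difference is only
in how much the caller has to write at each open.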


At 08:56 08/09/22, Michael Selig wrote:
>On Sun, 21 Sep 2008 02:05:30 +1000, Yukihiro Matsumoto  
><matz@ruby-lang.org> wrote:
>
>> |- How a Japanese programmer would handle the situation of dealing with a
>> |combination of a Japanese non-Unicode compatible character set, and say a
>> |UTF-8 encoding which included non-ascii characters, and non-Japanese ones.
>> |ie: Is there a reasonable alternative to encoding both to Unicode &
>> |somehow dealing with the "difficult characters" as special cases?
>>
>> Unicode is getting better each day.  So it now covers almost all
>> day-to-day problems.  Some cellphone problems are covered by using
>> private area.
>
>I infer from this that really Unicode is the only (imperfect) solution for  
>true m17n where we have a mixture of completely different character sets  
>(eg: Japanese & Arabic)?
>What I think this means is that there is no "one size fits all" solution,  
>unfortunately.

Yes. Unicode fits most of the time, some local encoding fits in many
cases (in particular small scripts), and for some very special jobs,
you may have to use something else (a special encoding such as Mojikyo,
the Unicode private areas, an additional level of markup,...).

>So I have an alternate suggestion. Maybe I should rename this thread  
>"Character encodings - a less radical suggestion" :-)

I just did :-).

Regards,    Martin.

>Ruby already has "Encoding::default_external", so why not also have  
>"default_internal"? This option would either be left unset (or NIL I  
>guess) or set to an encoding, likely to be UTF-8 in practice, but maybe  
>there would be a use for it to choose say one of the Japanese encodings if  
>you have a variety of Japanese encodings to handle.
>
>When "default_internal" is nil, Ruby will work as it does now:
>- Ruby libraries such as I/O & network libraries will by default return  
>character data in the external encoding
>- No transcoding will take place unless specifically requested by the Ruby  
>program
>- The Ruby program is responsible for ensuring that the encodings are what  
>it expects, that strings passed to & from Ruby libraries are in the  
>encoding the library expects, and that "Encoding Compatibility Errors"  
>will occur if it is not careful etc.
>
>When "default_internal" is set to an encoding "E":
>- Ruby libraries such as I/O & networking libraries will by default  
>transcode to/from internal encoding E (unless specifically overridden by  
>an option to the class)
>- A Ruby program can then be confident that all strings it handles will be  
>in encoding E, so it doesn't have to worry about encoding compatibility.  
>For example it can be sure that if "s" is "abc" then "s == 'abc'" is true,  
>no matter where the string "s" originated from.
>- Assuming that E is an "ascii-compatible" encoding, the Ruby programmer  
>doesn't have to face issues like "The value is #{val}" substitution  
>failing because "val" is non-ascii compatible.
>- The "downside" as pointed out by a number of people is that not all  
>characters may be transcoded cleanly or even be supported (driving without  
>a seat-belt? :-)), but then programs requiring this level of control  
>should probably not use this feature.
>
>Consequences of this suggestion:
>- Don't have to change the current implementation of encodings, String or  
>Regexp
>- Avoids "automagical transcoding" within String & Regexp methods
>- Responsibility of implementing "default_internal" lies with a certain  
>set of Ruby libraries like IO & networking
>
>Hope this makes sense.
>Mike
>
>
>


#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst@it.aoyama.ac.jp
