ruby-core

Hi,

Thanks for all the replies - I am not an expert on all these encodings,  
and I (obviously mistakenly!) assumed that all other encodings could be  
converted to Unicode.

When I first looked at Ruby 1.9's encoding support I thought "that's neat  
- I think it will solve my m17n problems". However, as I got into it I soon  
discovered that it wasn't nearly that easy!

Here is a summary of my issues:

- Non "ASCII-compatible" data is almost impossible to work with. Just take  
a look at what James Gray was proposing to do for CSV.

- When developing standard classes & mixins that could be installed in any  
country, virtually all methods that handle more than 1 string are going to  
have to worry about the possibility of dealing with incompatible  
encodings. This is a major overhead to a programmer - it may not be  
acceptable to let it raise an error.

- Alternative languages to Ruby that represent all strings as  
Unicode don't have this problem. Although they may not be a 100% solution  
in Japan & China, they would certainly be fine for me to use.

- As my application is under my control, I can make the decision to  
transcode everything to UTF-8 if I want to. I was hoping not to, but I  
think the extra code I would have to write to test encoding compatibility  
would not be worthwhile as it would be in so many places. And yes, I could  
write a

- For people like James who are trying to modify a standard library like  
CSV, which on the surface looks like a simple task, it is really quite  
daunting.
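
To make the second and fourth points above concrete, here is a minimal sketch of the failure and of the "transcode everything to UTF-8" workaround. The helper names incompatible? and join_as_utf8 are made up for illustration; only Encoding.compatible?, String#encode and Encoding::CompatibilityError are Ruby's own.

```ruby
# Hypothetical helper: true when combining a and b would raise
# Encoding::CompatibilityError.
def incompatible?(a, b)
  Encoding.compatible?(a, b).nil?
end

# Hypothetical helper: normalise every argument to UTF-8 before joining.
# Note: String#encode raises for data that has no UTF-8 mapping
# (for example ASCII-8BIT strings containing high bytes).
def join_as_utf8(*strings)
  strings.map { |s| s.encode("UTF-8") }.join
end

utf8   = "caf\u00e9"                           # UTF-8 text
latin1 = "na\u00efve".encode("ISO-8859-1")     # non-ASCII text in another encoding

begin
  utf8 + latin1                                # both sides non-ASCII, encodings differ
rescue Encoding::CompatibilityError => e
  puts "raised: #{e.message}"
end

puts join_as_utf8(utf8, latin1)                # works once both sides are UTF-8
```

Every method that combines two strings would need a check or a normalisation step like this, which is the overhead I am complaining about.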

My "ideal" would be that Ruby automatically converted to a common encoding  
rather than raising an Encoding Compatibility Error. And although Unicode  
apparently may not cope with every character on the planet at present, I  
guess it will one day, and it seems to me to be the sensible thing to use  
as the "common encoding" - or UTF-8 to be precise.

That way, in the 99% of cases where the encodings ARE compatible, Ruby  
would work exactly as it does now.

But it also means that I can write methods and not have to worry about  
them blowing up because of encoding incompatibility.
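
A rough sketch of the behaviour I have in mind, written as an ordinary method rather than as a change to Ruby itself. concat_with_fallback is a hypothetical name; the assumption is that both strings can actually be transcoded to UTF-8.

```ruby
# Hypothetical helper, not part of Ruby: concatenate two strings and fall
# back to UTF-8 only when Ruby reports the encodings as incompatible.
def concat_with_fallback(a, b)
  a + b                                    # common case: behaves exactly as today
rescue Encoding::CompatibilityError
  # Rare case: meet in UTF-8. This still raises if either string cannot be
  # transcoded (e.g. binary data), which this sketch does not handle.
  a.encode("UTF-8") + b.encode("UTF-8")
end

sjis = "\u3042".encode("Shift_JIS")        # HIRAGANA LETTER A in Shift_JIS
euc  = "\u3044".encode("EUC-JP")           # HIRAGANA LETTER I in EUC-JP

result = concat_with_fallback(sjis, euc)
puts result.encoding                       # the fallback path produces UTF-8
```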

It *does* mean that strings may "magically" be converted to UTF-8, but I  
don't see this as a big deal as long as when they are output they are  
converted back to the necessary encoding (which I think Ruby does with  
files now). If the "magic" conversion is a problem, maybe there should be  
a switch to turn it on & off.

This auto-convert policy should also be used with non-destructive methods  
like String#== etc so the programmer needn't worry whether the same  
character has a different representation on each side of the "==".

The ASCII-8BIT encoding should be reserved as a "special case" and not be  
subject to auto-conversion, because it is going to be mainly used for  
"byte strings".

Yes, there may be a performance overhead doing this. But is this a big  
deal if it only happens in 1% of cases?
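
For the comparison case, something along these lines is what I mean. same_text? is a hypothetical helper, and treating untranscodable text as "different" is my own assumption, not anything Ruby does.

```ruby
# Hypothetical helper: compare two strings by their characters rather than
# by their bytes, leaving ASCII-8BIT "byte strings" alone.
def same_text?(a, b)
  binary = Encoding::ASCII_8BIT
  # Byte strings are the special case: use Ruby's normal comparison.
  return a == b if a.encoding == binary || b.encoding == binary
  a.encode("UTF-8") == b.encode("UTF-8")
rescue EncodingError
  false   # assumption: text that cannot be transcoded counts as different
end

utf8   = "\u00e9"                      # e-acute as UTF-8 (two bytes)
latin1 = utf8.encode("ISO-8859-1")     # the same character as one byte

puts utf8 == latin1                    # Ruby's String#== compares bytes and encoding
puts same_text?(utf8, latin1)          # the helper compares after transcoding
```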

Sure there are issues with this, like what to do with text that cannot be  
encoded to Unicode (now that I know it exists!), and also the  
implementation of these suggestions may not be easy, but I think *not*  
doing something about these issues may make the dev community have a  
negative impression of Ruby, which would be a great, great shame.

Cheers
Mike

On Thu, 18 Sep 2008 00:28:03 +1000, Yukihiro Matsumoto  
<matz@ruby-lang.org> wrote:

> Hi,
>
> In message "Re: [ruby-core:18640] Character encodings - a radical suggestion"
>     on Wed, 17 Sep 2008 10:20:13 +0900, "Michael Selig" <michael.selig@fs.com.au> writes:
>
> |So my radical suggestion is this:
> |
> |Remove internal support for non-ASCII encodings completely, and when
> |reading/writing UTF-16 (and UTF-32) files automatically transcode to/from
> |UTF-8.
>
> What happens with non Unicode text under your suggestion?
>
> My conservative suggestion is that:
>
> Put "r:UTF-16BE:UTF-8" for mode when you open a UTF-16 file to read,
> so that your internal strings are all UTF-8 encoding.
>
> |My reasons:
> |
> |- String & Regexp operations should just "work" without the programmer
> |worrying about encoding compatibility (I think!)
> |- The programmer only has to think about character encodings at the
> |"interfaces" (files, network interfaces) not throughout the program logic
>
> My "suggestion" satisfies above two.
>
> |- To my knowledge UTF-16 & UTF-32 are the only "non-ASCII compatible" as
> |Ruby defines it
>
> As akr stated this is wrong.
>
> |- To my knowledge no one actually uses UTF-16 or UTF-32 as a locale
>
> Yes.
>
> |- I would avoid having to use ugly modes to open a file like
> |"r:UTF-16LE:UTF-8" (very minor)
>
> This is ugly indeed.  We might add more Unicode support in the
> future.  But we are in no hurry.
>
> |- Ruby's internal code would be simpler & cleaner and therefore probably
> |faster and easier to maintain
>
> Dropping UTF-{16,32} is not enough.  Unless we abandon non-Unicode
> encoding support altogether, it won't be THAT simple.  And I am not
> going to remove their support.  I use them everyday.
>
> 							matz.
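
For reference, a small sketch of the "r:UTF-16BE:UTF-8" mode string from the quoted message: the first encoding is the file's external encoding and the second is the internal encoding the strings arrive in. The helper name read_utf16be_as_utf8 and the temporary file are illustrative only.

```ruby
require "tempfile"

# Hypothetical helper: read a UTF-16BE file so that the resulting string is
# already UTF-8 inside the program.
def read_utf16be_as_utf8(path)
  File.open(path, "r:UTF-16BE:UTF-8") { |f| f.read }
end

Tempfile.create("utf16") do |tmp|
  # Write raw UTF-16BE bytes without any transcoding.
  File.binwrite(tmp.path, "h\u00e9llo".encode("UTF-16BE"))

  text = read_utf16be_as_utf8(tmp.path)
  puts text.encoding   # internal encoding requested in the mode string
  puts text
end
```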
