[ruby-core:26194] [Feature #2034] Consider the ICU Library for Improving and Expanding Unicode Support

From: Perry Smith <redmine@...>
Date: 2009-10-21 01:09:55 UTC
List: ruby-core #26194
Issue #2034 has been updated by Perry Smith.


I will try to answer both of the posts above.

Mostly, you both asked about which encodings.  As absurd as this may
sound, I don't know.

When I fetch text from the legacy system, it has a two byte CCSID in
front of it.  I have a table that translates the CCSID to the name of
the encoding.  It is much like:

http://www-01.ibm.com/software/globalization/ccsid/ccsid_registered.jsp

I then translate the text to ICU's internal format which is UTF-16.
Later, I translate it to UTF-8 because that seems to work with
browsers.
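
A minimal sketch of the flow described above, written in Python with its
built-in codecs rather than ICU.  The table entries, the big-endian byte
order of the CCSID prefix, and the function names are assumptions made
for illustration, not details of the actual system.  Python's str stands
in for the UTF-16 pivot, and IBM-939 is left out because Python ships no
codec for it.

```python
import struct

# Hypothetical CCSID -> codec table; a real one would be far larger.
CCSID_TO_CODEC = {
    37: "cp037",     # IBM-037, EBCDIC US/Canada
    500: "cp500",    # IBM-500, EBCDIC International
    1208: "utf-8",   # UTF-8
}


def decode_record(raw: bytes) -> str:
    """Read the two-byte CCSID prefix and decode the rest of the record."""
    if len(raw) < 2:
        raise ValueError("record is too short to hold a CCSID prefix")
    # Assumption: the prefix is an unsigned big-endian 16-bit integer.
    (ccsid,) = struct.unpack(">H", raw[:2])
    codec = CCSID_TO_CODEC.get(ccsid)
    if codec is None:
        raise LookupError(f"no codec registered for CCSID {ccsid}")
    # Pivot step: legacy bytes -> Unicode text.
    return raw[2:].decode(codec)


def to_utf8(raw: bytes) -> bytes:
    """Legacy record -> Unicode pivot -> UTF-8 bytes for the browser."""
    return decode_record(raw).encode("utf-8")


# "Hello" in IBM-037, prefixed with CCSID 37.
record = struct.pack(">H", 37) + bytes([0xC8, 0x85, 0x93, 0x93, 0x96])
utf8_bytes = to_utf8(record)
```

An unknown CCSID raises an error here instead of guessing an encoding,
since a wrong guess would silently corrupt the text.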

I have no idea if ICU does this using tables or what.  My belief is
that it does not go to EUC-JP.  I also do not know for certain, but I
assume that many of these are not single-byte encodings.

I do know that for Japanese text, usually the page text is encoded
using IBM-939.  Most English is in IBM-037.  But the system I'm
interfacing to is used world wide and I would assume that other code
pages are used.

> Can you explain what you mean by 'envelope around UCS model'?

I think, based upon the reply, that you understand but to restate it:
use ICU everywhere you can but when you are forced to translate, do it
to and from some common (probably third) encoding.  It just seems like
it would be much easier to do that.

> On the file level, this would be similar to having a file with internal 
> change of character encoding. At the very, very early stages of Web 
> internationalization, some people proposed such a model, but the Web 
> went a different way. And so went most if not all text editors, you 
> can't have a file with many different encodings at the same time. Sure 
> file encodings and internal encodings work a bit differently, but it's 
> not a disadvantage if those two models match.

I was imagining doing this only for the internal encodings.  I later
mentioned that translations must be done for external reasons.  I
meant that the translation would be done when going to a file or to
any external data stream.

> rope

I see this as an incorrect name which may be why it has attributes
that we do not want.  To me, rope is made up of many strings.  But I
wanted String made up of many <things> -- e.g. SubStrings.

Strings would still be mutable in my scheme.  It seems plausible that
a data structure could be devised that would yield the Nth character
in the same time as the current implementation.  It may require more
space.
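
One possible shape for such a structure, sketched in Python under stated
assumptions: the class name SegmentedString and its methods are
hypothetical, each sub-string keeps its own bytes and encoding, and the
extra space is a table of cumulative character counts.  Finding the
right segment is a binary search over that table, so this lookup is
logarithmic in the number of segments plus the cost of decoding one
segment.  It does not show that the current implementation's timing can
be matched, and it leaves mutation out.

```python
from bisect import bisect_right
from itertools import accumulate


class SegmentedString:
    """A string made of sub-strings, each kept in its own encoding."""

    def __init__(self, segments):
        # segments: iterable of (raw_bytes, codec_name) pairs.
        self._segments = list(segments)
        # Character (not byte) length of each segment, computed once.
        lengths = [len(data.decode(codec)) for data, codec in self._segments]
        # _ends[i] is the number of characters up to and including segment i.
        self._ends = list(accumulate(lengths))

    def __len__(self):
        return self._ends[-1] if self._ends else 0

    def char_at(self, n):
        """Return the Nth character (0-based) across all segments."""
        if not 0 <= n < len(self):
            raise IndexError("character index out of range")
        # First segment whose cumulative end is greater than n.
        i = bisect_right(self._ends, n)
        start = self._ends[i - 1] if i else 0
        data, codec = self._segments[i]
        # Only the segment that holds the character is decoded.
        return data.decode(codec)[n - start]


s = SegmentedString([
    (bytes([0xC8, 0x89]), "cp037"),      # "Hi" in IBM-037
    ("日本".encode("utf-8"), "utf-8"),   # two characters, six bytes
])
third = s.char_at(2)
```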

> Regexp

Yes.  I totally forgot about Regexps.

There is one thing that confused me at the end of Martin's post.  To
me, data never has a language.  Perhaps I'm mistaken.  The data only
has a language when viewed by a user.  As he points out, a sort can
only be properly done when the language of the user is taken into
account.  At least, that is how I would rephrase what he said.

Am I missing a subtlety there?

----------------------------------------
http://redmine.ruby-lang.org/issues/show/2034

----------------------------------------
http://redmine.ruby-lang.org
