[ruby-core:35525] Re: [Feature #2350](Rejected) Unicode specific functionality on String in 1.9

From: Nikolai Weibull <now@...>
Date: 2011-03-18 12:52:27 UTC
List: ruby-core #35525
On Fri, Mar 18, 2011 at 11:53, Magnus Holm <judofyr@gmail.com> wrote:
> The problem is that the definition of #upcase doesn't only depend on the
> encoding used, but also the language of the encoded text. For instance, if
> you're writing in Turkish, you would expect "i".upcase to return a dotted
> uppercase I: http://www.i18nguy.com/unicode/turkish-i18n.html

I know.  The same goes for ‘i’ in Lithuanian.

> Doing this properly is *really* hard and needs to have a lot of flexibility,
> especially when it comes to non-Western languages.

This is simply not true.  Unicode defines how to deal with case
conversions.  I’m not saying that the Unicode standard is infallible,
but we can at least adhere to it.  I’m not saying that Unicode is the
only encoding that we should care about, but if we support the Unicode
transfer formats, why not support other interesting parts of the
standard?

> It's far easier for everyone that the built-in #upcase is
> simple and fast and you'll have to be explicit about any
> other I18n stuff IMO.

Easy, perhaps, but hardly useful.

My point is that the current #upcase (and similar methods) is
basically useless for anything other than ASCII.  I was looking for an
actual solution to this problem.  I have a library
(character-encodings) that does support these conversions, based on
locale and the Unicode character database (UCD).  How do we make it
easy for the user to deal with m17n?  I mean, if I say

# -*- coding: utf-8 -*-

puts "äbc".upcase

I expect this to do the right thing for Unicode under the current locale.
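To make "the right thing under the current locale" concrete, here is a minimal sketch of a locale-aware upcase. TinyCase, its tables, and the locale argument are hypothetical names made up for illustration; they are not part of Ruby or of character-encodings, and the hand-written tables stand in for the far larger data in the UCD.

```ruby
# -*- coding: utf-8 -*-

# Hypothetical helper for illustration only.
module TinyCase
  # A handful of lowercase => uppercase pairs; the real UCD covers far more.
  DEFAULT = { "ä" => "Ä", "ö" => "Ö", "ü" => "Ü" }

  # Turkish and Azeri tailoring: dotted i maps to dotted capital İ,
  # dotless ı maps to plain I.
  TURKIC = { "i" => "İ", "ı" => "I" }

  # Upcase +string+, applying the Turkic tailoring when +locale+ is tr or az.
  def self.upcase(string, locale = nil)
    table = %w[tr az].include?(locale.to_s) ? DEFAULT.merge(TURKIC) : DEFAULT
    string.each_char.map { |ch| table.fetch(ch) { ascii_upcase(ch) } }.join
  end

  # Fallback for characters not in the table: ASCII letters only.
  def self.ascii_upcase(ch)
    ch.tr("a-z", "A-Z")
  end
end

puts TinyCase.upcase("äbc")
puts TinyCase.upcase("i", :tr)
```

The point of the sketch is only that the result depends on both the character data and the locale; a real implementation would read both from the Unicode data files.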

As Unicode defines how to deal with case conversions, if I tell Ruby
that “this String is encoded as UTF-8” (or, in this case, “strings in
this file are encoded as UTF-8”), I expect Ruby to respond “OK, I’ll
use the Unicode rules that govern methods like #upcase for that
String”.
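As a rough sketch of what "use the rules that belong to this String's encoding" could look like, the helper below picks a casing rule from String#encoding. upcase_by_encoding is a hypothetical name, and the short tr table is only a stand-in for real Unicode rules; this is not how Ruby itself implements #upcase.

```ruby
# -*- coding: utf-8 -*-

# Hypothetical helper: choose casing rules based on the string's encoding.
def upcase_by_encoding(string)
  case string.encoding
  when Encoding::UTF_8
    # Stand-in for Unicode-aware rules: a few non-ASCII letters plus ASCII.
    string.tr("a-zäöü", "A-ZÄÖÜ")
  else
    # Any other encoding: only touch ASCII letters.
    string.tr("a-z", "A-Z")
  end
end

puts upcase_by_encoding("äbc")
```

A string in another encoding, for example "äbc".encode("ISO-8859-1"), takes the ASCII-only branch, so its ä is left alone.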

The UCD requires a lot of memory, so I suggested that a library, such
as character-encodings, should be able to seamlessly add this kind of
behavior without requiring the user to write "äbc".unicodify.upcase,
if the UCD can’t be included in the standard Ruby runtime.

But, come to think of it, doesn’t Oniguruma need most of the UCD
information, so isn’t most of it already included in the Ruby runtime?
Adding casing information perhaps wouldn’t require much additional
space.

If this isn’t of interest, then I’m still looking for a way to
override #upcase for Strings that use the UTF-8 encoding without
resorting to alias_method or extend (as shown earlier in this
discussion).  This seems impossible to do at the moment, as Encoding
is a completely opaque object.
