ruby-core

For what it's worth, I don't think that the \N{name} escape is 
necessary, even in the standard library.  Unicode names are so long 
(like "ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM") that I think this 
notation would be terribly cumbersome.

Also, \N escapes could easily be approximated (at runtime, instead of 
compile time) with #{} interpolation.  Define a class UN (for Unicode 
Name) and give it a const_missing method to look up (and define 
permanently) codepoints by name.  Then you can write strings like 
"#{UN::COPYRIGHT_SIGN} 2007", which isn't much longer or harder than 
"\N{COPYRIGHT SIGN} 2007".
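
A minimal sketch of that idea follows. It is only an illustration: the 
NAMES table is a hypothetical two-entry stand-in for a real Unicode name 
database, and UN is written as a module rather than a class.  Note that 
const_missing fires for the UN::NAME form (a constant lookup), not for 
UN.NAME, which Ruby treats as a method call.

```ruby
# Hypothetical sketch: resolve Unicode character names through const_missing.
module UN
  # Illustrative table only (an assumption); a real version would consult
  # the full Unicode name list. Spaces in names become underscores.
  NAMES = {
    "COPYRIGHT_SIGN"  => 0x00A9,
    "REGISTERED_SIGN" => 0x00AE
  }.freeze

  # Called once per unknown constant. The result is stored with const_set,
  # so later references are ordinary constant lookups.
  def self.const_missing(name)
    codepoint = NAMES[name.to_s]
    return super if codepoint.nil?   # unknown name: fall back to NameError
    const_set(name, [codepoint].pack("U").freeze)
  end
end

puts "#{UN::COPYRIGHT_SIGN} 2007"
```

The lookup cost is paid only on the first reference to each name, which 
is what "define permanently" buys you.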

I've done something similar (for codepoints instead of names) with 
const_missing here: http://www.davidflanagan.com/blog/2007_08.html#000136

	David

Yukihiro Matsumoto wrote:
> Hi,
> 
> In message "Re: Encodings of string literals; explicit codepoint escapes?"
>     on Fri, 31 Aug 2007 15:52:51 +0900, David Flanagan <david@davidflanagan.com> writes:
> 
> |I'm excited to see that strings have encodings now!  Thank you for your 
> |Unicode support!  I have a few questions:
> |
> |1) I gather that string literals are given the encoding specified by the 
> |-K option or by the encoding comment at the top of the file.  Do you 
> |plan any changes to the string literal syntax so that encodings can be 
> |specified for individual literals?  Will I be able to include a utf-8 
> |encoding string literal within a file that is otherwise in ASCII? I 
> |don't like Python's u"" syntax, but I'm hoping that you'll provide some 
> |more elegant alternative.
> 
> We will provide "binary" string literals probably via b"" or ""b (not
> fixed yet).  If you want a utf-8 encoded string in an ASCII-coded
> script, you can write the utf-8 bytes as a binary string and then
> specify utf-8 later, e.g.
> 
>   # my last name in Japanese
>   m = b"\343\201\276\343\201\244\343\202\202\343\201\250"
>   m.encoding="utf8"
> 
> or a possible alternative in the distant future may be:
> 
>   m = "\343\201\276\343\201\244\343\202\202\343\201\250".utf8
>   m = "\343\201\276\343\201\244\343\202\202\343\201\250"u
>   m = "\343\201\276\343\201\244\343\202\202\343\201\250"e:utf8
> 
> |2) This is really part of the same question: will you extend the string 
> |literal syntax to allow the inclusion of arbitrary codepoints in 
> |ASCII-encoded files using some kind of character escape?  I'm accustomed 
> |to Java's \uxxxx escape sequence and would like to see something like 
> |this.  (I don't know enough about SJIS and EUC to know if that would be 
> |relevant to those encodings or not.)
> |
> |Despite my relative ignorance, I suggest something along these lines:
> |
> |\uxxxx: represents Unicode codepoint U+xxxx
> |\Uxxxxxx: represents Unicode codepoint U+xxxxxx
> |\Exxxx: represents EUC codepoint xxxx
> |\Sxxxx: represents SJIS codepoint xxxx
> |
> |xxxx: is a string of four hex digits.
> 
> We had a meeting yesterday to discuss issues like this,
> and the end result was
> 
>   \xXX         -> single byte
>   \uXXXX       -> single Unicode character by codepoint (BMP)
>   \u{XXXXXXXX} -> single Unicode character up to 4 bytes
>   \N{name}     -> single character by name
> 
> But you need to require an additional library to get:
> 
>   * characters from Unicode name
>   * Unicode character embedded in non-Unicode encoding strings
> 
> |If a string literal ends with \u, \U, \E, or \S (with no hex digits 
> |following) then the escape specifies the encoding of the string, even 
> |when the string does not contain any characters outside of the ASCII subset.
> 
> This is an interesting idea.  We haven't made a way to specify
> encoding of literals yet.  This might be an input for inspiration.
> 
> 							matz.
>
