[#12372] Release compatibility/train — Prashant Srinivasan <Prashant.Srinivasan@...>

Hello all,

28 messages 2007/10/03
[#12373] Re: Release compatibility/train — Yukihiro Matsumoto <matz@...> 2007/10/03

Hi,

[#12374] Re: Release compatibility/train — David Flanagan <david@...> 2007/10/03

Yukihiro Matsumoto wrote:

[#12376] Re: Release compatibility/train — Prashant Srinivasan <Prashant.Srinivasan@...> 2007/10/03

[#12377] Re: Release compatibility/train — Yukihiro Matsumoto <matz@...> 2007/10/03

Hi,

[#12382] Re: Release compatibility/train — Charles Oliver Nutter <charles.nutter@...> 2007/10/03

Yukihiro Matsumoto wrote:

[#12385] Re: Release compatibility/train — Yukihiro Matsumoto <matz@...> 2007/10/03

Hi,

[#12388] Re: Release compatibility/train — Charles Oliver Nutter <charles.nutter@...> 2007/10/03

Yukihiro Matsumoto wrote:

[#12389] Re: Release compatibility/train — Yukihiro Matsumoto <matz@...> 2007/10/03

Hi,

[#12406] Re: Release compatibility/train — "David A. Black" <dblack@...> 2007/10/03

Hi --

[#12383] Include Rake in Ruby 1.9 — "NAKAMURA, Hiroshi" <nakahiro@...>

-----BEGIN PGP SIGNED MESSAGE-----

20 messages 2007/10/03

[#12539] Ordered Hashes in 1.9? — Michael Neumann <mneumann@...>

Hi all,

17 messages 2007/10/08
[#12542] Re: Ordered Hashes in 1.9? — Yukihiro Matsumoto <matz@...> 2007/10/08

Hi,

[#12681] Unicode: Progress? — murphy <murphy@...>

Hello!

17 messages 2007/10/15

[#12693] retry: revised 1.9 http patch — Hugh Sasse <hgs@...>

I'm reposting this because I've had little response to this version

11 messages 2007/10/15

[#12697] Range.first is incompatible with Enumerable.first — David Flanagan <david@...>

The new Enumerable.first method is a generalization of Array.first to

11 messages 2007/10/16

[#12754] Improving 'syntax error, unexpected $end, expecting kEND'? — Hugh Sasse <hgs@...>

I've had a look at this, but can't see how to do it: When I get

17 messages 2007/10/18
[#12886] Re: Improving 'syntax error, unexpected $end, expecting kEND'? — David Flanagan <david@...> 2007/10/23

The patch below changes this message to:

[#12758] Encoding::primary_encoding — David Flanagan <david@...>

Hi,

25 messages 2007/10/18
[#12763] Re: Encoding::primary_encoding — Nobuyoshi Nakada <nobu@...> 2007/10/19

Hi,

[#12802] Re: Encoding::primary_encoding — Wolfgang Nádasi-Donner <ed.odanow@...> 2007/10/21

Nobuyoshi Nakada schrieb:

[#12803] Re: Encoding::primary_encoding — Nobuyoshi Nakada <nobu@...> 2007/10/21

Hi,

[#12804] Re: Encoding::primary_encoding — Wolfgang Nádasi-Donner <ed.odanow@...> 2007/10/21

Nobuyoshi Nakada schrieb:

[#12808] Re: Encoding::primary_encoding — Nobuyoshi Nakada <nobu@...> 2007/10/22

Hi,

[#12818] Re: Encoding::primary_encoding — Wolfgang Nádasi-Donner <ed.odanow@...> 2007/10/22

Nobuyoshi Nakada schrieb:

[#12820] Re: Encoding::primary_encoding — "Michal Suchanek" <hramrach@...> 2007/10/22

On 22/10/2007, Wolfgang Nádasi-Donner <ed.odanow@wonado.

[#12823] Re: Encoding::primary_encoding — Wolfgang Nádasi-Donner <ed.odanow@...> 2007/10/22

Michal Suchanek schrieb:

[#12824] Re: Encoding::primary_encoding — Nobuyoshi Nakada <nobu@...> 2007/10/22

Hi,

[#12767] \u escapes in string literals: proof of concept implementation — David Flanagan <david@...>

Back at the end of August, Matz wrote (see

45 messages 2007/10/19
[#12769] Re: \u escapes in string literals: proof of concept implementation — "Nobuyoshi Nakada" <nobu@...> 2007/10/19

Hi,

[#12782] Re: \u escapes in string literals: proof of concept implementation — David Flanagan <david@...> 2007/10/20

Nobuyoshi Nakada wrote:

[#12831] Re: \u escapes in string literals: proof of concept implementation — Yukihiro Matsumoto <matz@...> 2007/10/22

Hi,

[#12841] Re: \u escapes in string literals: proof of concept implementation — David Flanagan <david@...> 2007/10/22

Yukihiro Matsumoto wrote:

[#12862] Re: \u escapes in string literals: proof of concept implementation — Martin Duerst <duerst@...> 2007/10/23

At 04:19 07/10/23, David Flanagan wrote:

[#12864] Re: \u escapes in string literals: proof of concept implementation — David Flanagan <david@...> 2007/10/23

Martin Duerst wrote:

[#12870] Re: \u escapes in string literals: proof of concept implementation — Martin Duerst <duerst@...> 2007/10/23

At 13:10 07/10/23, David Flanagan wrote:

[#12872] Re: \u escapes in string literals: proof of concept implementation — David Flanagan <david@...> 2007/10/23

Martin Duerst wrote:

[#12936] Re: \u escapes in string literals: proof of concept implementation — Yukihiro Matsumoto <matz@...> 2007/10/25

Hi,

[#12980] Re: \u escapes in string literals: proof of concept implementation — David Flanagan <david@...> 2007/10/26

Yukihiro Matsumoto wrote:

[#13028] Re: \u escapes in string literals: proof of concept implementation — Nobuyoshi Nakada <nobu@...> 2007/10/29

Hi,

[#13032] Re: \u escapes in string literals: proof of concept implementation — David Flanagan <david@...> 2007/10/29

Nobuyoshi Nakada wrote:

[#13034] Re: \u escapes in string literals: proof of concept implementation — Nobuyoshi Nakada <nobu@...> 2007/10/29

Hi,

[#13082] Re: \u escapes in string literals: proof of concept implementation — Martin Duerst <duerst@...> 2007/10/30

At 16:46 07/10/29, Nobuyoshi Nakada wrote:

[#13231] Re: \u escapes in string literals: proof of concept implementation — Nobuyoshi Nakada <nobu@...> 2007/11/06

Hi,

[#13234] Re: \u escapes in string literals: proof of concept implementation — Martin Duerst <duerst@...> 2007/11/06

At 11:29 07/11/06, Nobuyoshi Nakada wrote:

[#12825] clarification of ruby libraries installation paths? — Lucas Nussbaum <lucas@...>

Hi,

53 messages 2007/10/22
[#12830] Re: clarification of ruby libraries installation paths? — Ben Bleything <ben@...> 2007/10/22

On Mon, Oct 22, 2007, Lucas Nussbaum wrote:

[#12833] Re: clarification of ruby libraries installation paths? — Lucas Nussbaum <lucas@...> 2007/10/22

On 23/10/07 at 00:13 +0900, Ben Bleything wrote:

[#12835] Re: clarification of ruby libraries installation paths? — "Austin Ziegler" <halostatue@...> 2007/10/22

On 10/22/07, Lucas Nussbaum <lucas@lucas-nussbaum.net> wrote:

[#12836] Re: clarification of ruby libraries installation paths? — Lucas Nussbaum <lucas@...> 2007/10/22

On 23/10/07 at 01:55 +0900, Austin Ziegler wrote:

[#12888] Re: clarification of ruby libraries installation paths? — Gonzalo Garramu <ggarra@...> 2007/10/23

Lucas Nussbaum wrote:

[#12894] Re: clarification of ruby libraries installation paths? — Lucas Nussbaum <lucas@...> 2007/10/24

On 24/10/07 at 05:14 +0900, Gonzalo Garramu wrote:

[#13057] Re: clarification of ruby libraries installation paths? — Gonzalo Garramu <ggarra@...> 2007/10/29

Lucas Nussbaum wrote:

[#13058] Re: clarification of ruby libraries installation paths? — Lucas Nussbaum <lucas@...> 2007/10/29

On 30/10/07 at 07:28 +0900, Gonzalo Garramu wrote:

[#12848] Re: clarification of ruby libraries installation paths? — Sam Roberts <sroberts@...> 2007/10/22

On Tue, Oct 23, 2007 at 01:55:29AM +0900, Austin Ziegler wrote:

[#12855] Re: clarification of ruby libraries installation paths? — "Austin Ziegler" <halostatue@...> 2007/10/23

On 10/22/07, Sam Roberts <sroberts@uniserve.com> wrote:

[#13016] Re: clarification of ruby libraries installation paths? — bob@... (Bob Proulx) 2007/10/28

Austin Ziegler wrote:

[#13029] Re: clarification of ruby libraries installation paths? — "Austin Ziegler" <halostatue@...> 2007/10/29

On 10/28/07, Bob Proulx <bob@proulx.com> wrote:

[#13054] Austin Ziegler's behaviour (Was: clarification of ruby libraries installation paths?) — Lucas Nussbaum <lucas@...> 2007/10/29

Austin,

[#13055] Re: Austin Ziegler's behaviour (Was: clarification of ruby libraries installation paths?) — "Luis Lavena" <luislavena@...> 2007/10/29

On 10/29/07, Lucas Nussbaum <lucas@lucas-nussbaum.net> wrote:

[#13064] Re: Austin Ziegler's behaviour (Was: clarification of ruby libraries installation paths?) — "Austin Ziegler" <halostatue@...> 2007/10/30

On 10/29/07, Luis Lavena <luislavena@gmail.com> wrote:

[#13066] Re: Austin Ziegler's behaviour (Was: clarification of ruby libraries installation paths?) — "Luis Lavena" <luislavena@...> 2007/10/30

On 10/30/07, Austin Ziegler <halostatue@gmail.com> wrote:

[#13094] Re: Austin Ziegler's behaviour (Was: clarification of ruby libraries installation paths?) — "Rick Bradley" <rick@...> 2007/10/30

Do we think that maybe, just maybe, things went off the rails when the

[#13095] Re: Austin Ziegler's behaviour (Was: clarification of ruby libraries installation paths?) — "Luis Lavena" <luislavena@...> 2007/10/30

On 10/30/07, Rick Bradley <rick@rickbradley.com> wrote:

[#12900] Hopefully Complete List of Possible Encoding Specifications - Existing Ones — Wolfgang Nádasi-Donner <ed.odanow@...>

Dear Ruby 1.9 architects, developers, and testers!

31 messages 2007/10/24
[#12905] Re: Hopefully Complete List of Possible Encoding Specifications - Existing Ones — Yukihiro Matsumoto <matz@...> 2007/10/24

Hi,

[#12907] Re: Hopefully Complete List of Possible Encoding Specifications - Existing Ones — Wolfgang Nádasi-Donner <ed.odanow@...> 2007/10/24

Yukihiro Matsumoto schrieb:

[#12909] Re: Hopefully Complete List of Possible Encoding Specifications - Existing Ones — Yukihiro Matsumoto <matz@...> 2007/10/24

Hi,

[#12940] Re: Hopefully Complete List of Possible Encoding Specifications - Existing Ones — Wolfgang Nádasi-Donner <ed.odanow@...> 2007/10/25
[#12942] Re: Hopefully Complete List of Possible Encoding Specifications - Existing Ones — Wolfgang Nádasi-Donner <ed.odanow@...> 2007/10/25

I have a (hopefully) final question before testing all

[#12948] Re: Hopefully Complete List of Possible Encoding Specifications - Existing Ones — Nobuyoshi Nakada <nobu@...> 2007/10/26

Hi,

[#12951] Fluent programming in Ruby — David Flanagan <david@...>

From the ChangeLog:

16 messages 2007/10/26

[#12996] General hash keys for colon notation — murphy <murphy@...>

Dear language designer(s) and parser wizards,

16 messages 2007/10/28

[#13027] Implementation of "guessUTF" method - final questions — Wolfgang Nádasi-Donner <ed.odanow@...>

Dear Ruby designers, developers, and testers!

22 messages 2007/10/29

[#13069] new Enumerable.butfirst method — David Flanagan <david@...>

Matz,

17 messages 2007/10/30

Re: \u escapes in string literals: proof of concept implementation

From: David Flanagan <david@...>
Date: 2007-10-20 06:57:44 UTC
List: ruby-core #12784
This is the third version of my patch for \u escapes.  It is stronger 
than the first and cleaner than the second.  I'm more confident about 
this one.  If \u escapes are still desired for 1.9 (and I hope they are) 
I think this patch will be helpful.  Someone with more experience with 
parse.y needs to look it over carefully, but I don't think it is a 
complete hack, either.

The patch includes two different sets of code for converting codepoints
to UTF-8.  The shorter one relies on enc/utf-8.c.  The longer one does
the conversion explicitly and is probably a little faster, but I've
commented it out in favor of not duplicating that conversion code.
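
As a rough illustration of what the explicit conversion does, here is a minimal Ruby sketch of the same bit arithmetic as the commented-out branch of the patch. The method name utf8_bytes is hypothetical and is not part of the patch. Like the patch, it only checks the upper bound of the codepoint range.

```ruby
# Hypothetical helper (not part of the patch): encode one Unicode
# codepoint as an array of UTF-8 byte values, following the same
# shifts and masks as the #if 0 branch in parser_tokadd_utf8.
def utf8_bytes(codepoint)
  if codepoint < 0 || codepoint > 0x10FFFF
    raise ArgumentError, "codepoint out of range"
  end

  if codepoint < 0x80                      # 1 byte: plain ASCII
    [codepoint]
  elsif codepoint < 0x800                  # 2 bytes: 110xxxxx 10xxxxxx
    [((codepoint >> 6) & 0x1F) | 0xC0,
     (codepoint & 0x3F) | 0x80]
  elsif codepoint < 0x10000                # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    [((codepoint >> 12) & 0x0F) | 0xE0,
     ((codepoint >> 6) & 0x3F) | 0x80,
     (codepoint & 0x3F) | 0x80]
  else                                     # 4 bytes: 11110xxx followed by three 10xxxxxx
    [((codepoint >> 18) & 0x07) | 0xF0,
     ((codepoint >> 12) & 0x3F) | 0x80,
     ((codepoint >> 6) & 0x3F) | 0x80,
     (codepoint & 0x3F) | 0x80]
  end
end
```

The result can be compared against Ruby's own encoder, for example utf8_bytes(0xBBBB) against [0xBBBB].pack("U").unpack("C*").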

The patch is attached.  Here is some test code that you can run once
you have applied the patch:

# \u escapes work in these forms
puts "\ubbbb"
puts %Q{\ubbbb}
puts %W{\ubbbb}
puts <<EOS
\ubbbb
EOS

# \u escapes don't work in these forms
puts '\ubbbb'
puts %q{\ubbbb}
puts %w{\ubbbb}
puts <<'EOS'
\ubbbb
EOS

# \u escapes work in regexps
puts /\ubbbb/

# \u escapes in regexps are handled by the lexer, but
# all other regexp escapes are handled by the regexp engine.
# This can lead to confusing behavior: \u005c is converted
# by the lexer to \, and the regexp engine can then
# interpret that backslash as the start of an escape.
puts /\u{5c}(/    # match a single open parenthesis

# Here is the other form of \u escape
puts "\u{41}"       # Letter A
puts "\u{A0A}"      # Gurmukhi letter UU (U+0A0A)
puts "\u{10FFFF}"   # Largest Unicode codepoint

# Encoding stuff.  Any \u escape for a codepoint >= 128
# always forces UTF-8 encoding.
puts "\u0079".encoding   # ASCII, regardless of -K
puts "\u0080".encoding   # UTF-8, regardless of -K
puts "\x79".encoding     # ASCII, regardless of -K
puts "\x80".encoding     # encoding depends on -K
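
The encoding rule exercised by the last four lines above can be modeled in a few lines of Ruby. This is only a sketch of the choice made by the STR_NEW4 macro in the attached patch. The method name, the keyword arguments, and the use of plain strings for encoding names are all assumptions made for illustration. source_encoding stands in for whatever -K selects.

```ruby
# Hypothetical model of the STR_NEW4 decision (names are illustrative):
#   has_utf8        - a \u escape for a codepoint >= 0x80 was seen
#   single_byte     - the literal contains only single-byte (ASCII) content
#   source_encoding - the encoding of the source file, e.g. as set by -K
def literal_encoding(has_utf8:, single_byte:, source_encoding:)
  return "UTF-8" if has_utf8       # \u escapes above ASCII always win
  return "ASCII" if single_byte    # pure ASCII content ignores -K
  source_encoding                  # otherwise fall back to the source encoding
end
```

Under this model, "\u0079" and "\x79" both map to "ASCII", "\u0080" maps to "UTF-8", and "\x80" maps to the source encoding, assuming a \x escape above 0x7F marks the literal as not single-byte.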

Attachments (1)

unicode_patch4 (6.33 KB, text/x-diff)
Index: parse.y
===================================================================
--- parse.y	(revision 13739)
+++ parse.y	(working copy)
@@ -237,6 +237,8 @@
     int has_shebang;
     int parser_ruby_sourceline;	/* current line no. */
     rb_encoding *enc;
+    rb_encoding *ascii;
+    rb_encoding *utf8;
 
 #ifndef RIPPER
     /* Ruby core only */
@@ -264,6 +266,7 @@
 #define STR_NEW0() rb_enc_str_new(0,0,rb_enc_from_index(0))
 #define STR_NEW2(p) rb_enc_str_new((p),strlen(p),parser->enc)
 #define STR_NEW3(p,n,m) parser_str_new((p),(n),STR_ENC(!ENC_SINGLE(m)),(m))
+#define STR_NEW4(p,n,m,u) parser_str_new((p),(n),(u)?parser->utf8:(ENC_SINGLE(m)?parser->ascii:parser->enc), (m))
 #define STR_ENC(m) ((m)?parser->enc:rb_enc_from_index(0))
 #define ENC_SINGLE(cr) ((cr)==ENC_CODERANGE_SINGLE)
 #define TOK_INTERN(mb) rb_intern3(tok(), toklen(), STR_ENC(mb))
@@ -4483,7 +4486,8 @@
 # define yylval  (*((YYSTYPE*)(parser->parser_yylval)))
 
 static int parser_regx_options(struct parser_params*);
-static int parser_tokadd_string(struct parser_params*,int,int,int,long*,int*);
+static int parser_tokadd_string(struct parser_params*,int,int,int,long*,
+				int*, int*);
 static int parser_parse_string(struct parser_params*,NODE*);
 static int parser_here_document(struct parser_params*,NODE*);
 
@@ -4494,7 +4498,7 @@
 # define read_escape(m)            parser_read_escape(parser, m)
 # define tokadd_escape(t,m)        parser_tokadd_escape(parser, t, m)
 # define regx_options()            parser_regx_options(parser)
-# define tokadd_string(f,t,p,n,m)  parser_tokadd_string(parser,f,t,p,n,m)
+# define tokadd_string(f,t,p,n,m,u) parser_tokadd_string(parser,f,t,p,n,m,u)
 # define parse_string(n)           parser_parse_string(parser,n)
 # define here_document(n)          parser_here_document(parser,n)
 # define heredoc_identifier()      parser_heredoc_identifier(parser)
@@ -4674,7 +4678,9 @@
 	}
     }
 
-    parser->enc = rb_enc_get(lex_input);
+    parser->enc = rb_enc_get(lex_input);  /* encoding of source file */
+    parser->ascii = rb_enc_from_index(0); /* ASCII/binary */
+    parser->utf8 = rb_enc_find("utf-8");  /* UTF-8 */
     ruby_sourcefile = rb_source_filename(f);
     ruby_sourceline = line - 1;
     parser_prepare(parser);
@@ -5110,6 +5116,75 @@
     return 0;
 }
 
+static void
+parser_tokadd_utf8(struct parser_params *parser, int *mb, int *has_utf8)
+{
+    int numlen, brace, codepoint;
+    brace = nextc();
+    if (brace == '{') {  /* handle \u{...} form */
+	codepoint = scan_hex(lex_p, 6, &numlen);
+	if (numlen == 0)  {
+	    yyerror("Invalid Unicode escape");
+	    return;
+	}
+	if (codepoint > 0x10ffff) {
+	    yyerror("Illegal Unicode codepoint (too large)");
+	    return;
+	}
+	lex_p += numlen;
+	
+	if ((brace = nextc()) != '}') {
+	    pushback(brace);
+	    yyerror("Unterminated Unicode escape");
+	    return;
+	}
+    }
+    else {                /* handle \uxxxx form */
+	pushback(brace);
+	codepoint = scan_hex(lex_p, 4, &numlen);
+	if (numlen < 4) {
+	    yyerror("Invalid Unicode escape");
+	    return;
+	}
+	lex_p += 4;
+    }
+    
+    if (codepoint < 0x80) { /* \u escape encoded ordinary ASCII char */
+	tokadd(codepoint);
+    }
+    else {
+	UChar buf[4];
+	int i, n;
+
+	/* Set flags so that the resulting string has correct encoding */
+	if (mb) *mb = ENC_CODERANGE_MULTI;
+	if (has_utf8) *has_utf8 = 1;
+
+	/* Convert codepoint to UTF-8 bytes */
+	n = rb_enc_mbcput(codepoint, buf, parser->utf8);
+	for(i=0; i < n; i++) tokadd(buf[i]);
+
+#if 0
+	if (codepoint < 0x800) { /* && codepoint >= 0x80 */
+	    tokadd(((codepoint >> 6)&0x1f) | 0xC0);
+	    tokadd((codepoint & 0x3F) | 0x80);
+	}
+	else if (codepoint < 0x10000) {
+	    tokadd(((codepoint >> 12) & 0x0f) | 0xe0);
+	    tokadd(((codepoint >> 6)&0x3f) | 0x80);
+	    tokadd((codepoint & 0x3F) | 0x80);
+	}
+	else {  /* codepoint < 0x110000  */
+	    tokadd(((codepoint >> 18) & 0x07) | 0xf0);
+	    tokadd(((codepoint >> 12) & 0x3f) | 0x80);
+	    tokadd(((codepoint >> 6)&0x3f) | 0x80);
+	    tokadd((codepoint & 0x3F) | 0x80);
+	}
+#endif
+    }
+}
+
+
 static int
 parser_regx_options(struct parser_params *parser)
 {
@@ -5184,7 +5259,8 @@
 
 static int
 parser_tokadd_string(struct parser_params *parser,
-		     int func, int term, int paren, long *nest, int *mb)
+		     int func, int term, int paren, long *nest,
+		     int *mb, int *has_utf8)
 {
     int c;
 
@@ -5219,6 +5295,16 @@
 		if (func & STR_FUNC_ESCAPE) tokadd(c);
 		break;
 
+	      case 'u':
+		if ((func & STR_FUNC_EXPAND) == 0) {
+		    tokadd('\\');
+		    break;
+		}
+		else {
+		    parser_tokadd_utf8(parser, mb, has_utf8);
+		    continue;
+		}
+
 	      default:
 		if (func & STR_FUNC_REGEXP) {
 		    pushback(c);
@@ -5267,7 +5353,7 @@
     int func = quote->nd_func;
     int term = nd_term(quote);
     int paren = nd_paren(quote);
-    int c, space = 0, mb = ENC_CODERANGE_SINGLE;
+    int c, space = 0, mb = ENC_CODERANGE_SINGLE, has_utf8 = 0;
 
     if (func == -1) return tSTRING_END;
     c = nextc();
@@ -5301,7 +5387,8 @@
 	tokadd('#');
     }
     pushback(c);
-    if (tokadd_string(func, term, paren, &quote->nd_nest, &mb) == -1) {
+    if (tokadd_string(func, term, paren, &quote->nd_nest,
+		      &mb, &has_utf8) == -1) {
 	if (func & STR_FUNC_REGEXP) {
 	    ruby_sourceline = nd_line(quote);
 	    compile_error(PARSER_ARG "unterminated regexp meets end of file");
@@ -5315,7 +5402,7 @@
     }
 
     tokfix();
-    set_yylval_str(STR_NEW3(tok(), toklen(), mb));
+    set_yylval_str(STR_NEW4(tok(), toklen(), mb, has_utf8));
     return tSTRING_CONTENT;
 }
 
@@ -5479,6 +5566,7 @@
     }
     else {
 	int mb = ENC_CODERANGE_SINGLE, *mbp = &mb;
+	int has_utf8 = 0;
 	newtok();
 	if (c == '#') {
 	    switch (c = nextc()) {
@@ -5493,16 +5581,18 @@
 	}
 	do {
 	    pushback(c);
-	    if ((c = tokadd_string(func, '\n', 0, NULL, mbp)) == -1) goto error;
+	    if ((c = tokadd_string(func, '\n', 0, NULL,
+				   mbp, &has_utf8)) == -1)
+		goto error;
 	    if (c != '\n') {
-		set_yylval_str(STR_NEW3(tok(), toklen(), mb));
+		set_yylval_str(STR_NEW4(tok(), toklen(), mb, has_utf8));
 		return tSTRING_CONTENT;
 	    }
 	    tokadd(nextc());
 	    if (mbp && mb == ENC_CODERANGE_UNKNOWN) mbp = 0;
 	    if ((c = nextc()) == -1) goto error;
 	} while (!whole_match_p(eos, len, indent));
-	str = STR_NEW3(tok(), toklen(), mb);
+	str = STR_NEW4(tok(), toklen(), mb, has_utf8);
     }
     heredoc_restore(lex_strterm);
     lex_strterm = NEW_STRTERM(-1, 0, 0);
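
The scanning logic of parser_tokadd_utf8 in the patch above can also be sketched outside of the lexer. The following Ruby method is a hypothetical model, not code from the patch. It takes the text that follows "\u", returns the codepoint and the number of characters consumed, and raises with the same messages the patch passes to yyerror.

```ruby
# Hypothetical model of parser_tokadd_utf8 (not part of the patch).
# `text` is the source text immediately after "\u".
# Returns [codepoint, characters_consumed].
def scan_unicode_escape(text)
  if text.start_with?("{")                       # \u{...} form: 1 to 6 hex digits
    hex = text[1..-1][/\A[0-9a-fA-F]{1,6}/]
    raise ArgumentError, "Invalid Unicode escape" if hex.nil?
    codepoint = hex.to_i(16)
    if codepoint > 0x10FFFF
      raise ArgumentError, "Illegal Unicode codepoint (too large)"
    end
    unless text[1 + hex.length] == "}"
      raise ArgumentError, "Unterminated Unicode escape"
    end
    [codepoint, hex.length + 2]                  # digits plus both braces
  else                                           # \uxxxx form: exactly 4 hex digits
    hex = text[/\A[0-9a-fA-F]{1,4}/]
    if hex.nil? || hex.length < 4
      raise ArgumentError, "Invalid Unicode escape"
    end
    [hex.to_i(16), 4]
  end
end
```

For example, scan_unicode_escape("{41}") yields the codepoint 0x41 and consumes four characters, while scan_unicode_escape("bbb") raises because the short form needs four hex digits.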
