[#12372] Release compatibility/train — Prashant Srinivasan <Prashant.Srinivasan@...>

Hello all,

28 messages 2007/10/03
[#12373] Re: Release compatibility/train — Yukihiro Matsumoto <matz@...> 2007/10/03

Hi,

[#12374] Re: Release compatibility/train — David Flanagan <david@...> 2007/10/03

Yukihiro Matsumoto wrote:

[#12376] Re: Release compatibility/train — Prashant Srinivasan <Prashant.Srinivasan@...> 2007/10/03

[#12377] Re: Release compatibility/train — Yukihiro Matsumoto <matz@...> 2007/10/03

Hi,

[#12382] Re: Release compatibility/train — Charles Oliver Nutter <charles.nutter@...> 2007/10/03

Yukihiro Matsumoto wrote:

[#12385] Re: Release compatibility/train — Yukihiro Matsumoto <matz@...> 2007/10/03

Hi,

[#12388] Re: Release compatibility/train — Charles Oliver Nutter <charles.nutter@...> 2007/10/03

Yukihiro Matsumoto wrote:

[#12389] Re: Release compatibility/train — Yukihiro Matsumoto <matz@...> 2007/10/03

Hi,

[#12406] Re: Release compatibility/train — "David A. Black" <dblack@...> 2007/10/03

Hi --

[#12383] Include Rake in Ruby 1.9 — "NAKAMURA, Hiroshi" <nakahiro@...>

-----BEGIN PGP SIGNED MESSAGE-----

20 messages 2007/10/03

[#12539] Ordered Hashes in 1.9? — Michael Neumann <mneumann@...>

Hi all,

17 messages 2007/10/08
[#12542] Re: Ordered Hashes in 1.9? — Yukihiro Matsumoto <matz@...> 2007/10/08

Hi,

[#12681] Unicode: Progress? — murphy <murphy@...>

Hello!

17 messages 2007/10/15

[#12693] retry: revised 1.9 http patch — Hugh Sasse <hgs@...>

I'm reposting this because I've had little response to this version

11 messages 2007/10/15

[#12697] Range.first is incompatible with Enumerable.first — David Flanagan <david@...>

The new Enumerable.first method is a generalization of Array.first to

11 messages 2007/10/16

[#12754] Improving 'syntax error, unexpected $end, expecting kEND'? — Hugh Sasse <hgs@...>

I've had a look at this, but can't see how to do it: When I get

17 messages 2007/10/18
[#12886] Re: Improving 'syntax error, unexpected $end, expecting kEND'? — David Flanagan <david@...> 2007/10/23

The patch below changes this message to:

[#12758] Encoding::primary_encoding — David Flanagan <david@...>

Hi,

25 messages 2007/10/18
[#12763] Re: Encoding::primary_encoding — Nobuyoshi Nakada <nobu@...> 2007/10/19

Hi,

[#12802] Re: Encoding::primary_encoding — Wolfgang Nádasi-Donner <ed.odanow@...> 2007/10/21

Nobuyoshi Nakada schrieb:

[#12803] Re: Encoding::primary_encoding — Nobuyoshi Nakada <nobu@...> 2007/10/21

Hi,

[#12804] Re: Encoding::primary_encoding — Wolfgang Nádasi-Donner <ed.odanow@...> 2007/10/21

Nobuyoshi Nakada schrieb:

[#12808] Re: Encoding::primary_encoding — Nobuyoshi Nakada <nobu@...> 2007/10/22

Hi,

[#12818] Re: Encoding::primary_encoding — Wolfgang Nádasi-Donner <ed.odanow@...> 2007/10/22

Nobuyoshi Nakada schrieb:

[#12820] Re: Encoding::primary_encoding — "Michal Suchanek" <hramrach@...> 2007/10/22

On 22/10/2007, Wolfgang Nádasi-Donner <ed.odanow@wonado.

[#12823] Re: Encoding::primary_encoding — Wolfgang Nádasi-Donner <ed.odanow@...> 2007/10/22

Michal Suchanek schrieb:

[#12824] Re: Encoding::primary_encoding — Nobuyoshi Nakada <nobu@...> 2007/10/22

Hi,

[#12767] \u escapes in string literals: proof of concept implementation — David Flanagan <david@...>

Back at the end of August, Matz wrote (see

45 messages 2007/10/19
[#12769] Re: \u escapes in string literals: proof of concept implementation — "Nobuyoshi Nakada" <nobu@...> 2007/10/19

Hi,

[#12782] Re: \u escapes in string literals: proof of concept implementation — David Flanagan <david@...> 2007/10/20

Nobuyoshi Nakada wrote:

[#12831] Re: \u escapes in string literals: proof of concept implementation — Yukihiro Matsumoto <matz@...> 2007/10/22

Hi,

[#12841] Re: \u escapes in string literals: proof of concept implementation — David Flanagan <david@...> 2007/10/22

Yukihiro Matsumoto wrote:

[#12862] Re: \u escapes in string literals: proof of concept implementation — Martin Duerst <duerst@...> 2007/10/23

At 04:19 07/10/23, David Flanagan wrote:

[#12864] Re: \u escapes in string literals: proof of concept implementation — David Flanagan <david@...> 2007/10/23

Martin Duerst wrote:

[#12870] Re: \u escapes in string literals: proof of concept implementation — Martin Duerst <duerst@...> 2007/10/23

At 13:10 07/10/23, David Flanagan wrote:

[#12872] Re: \u escapes in string literals: proof of concept implementation — David Flanagan <david@...> 2007/10/23

Martin Duerst wrote:

[#12936] Re: \u escapes in string literals: proof of concept implementation — Yukihiro Matsumoto <matz@...> 2007/10/25

Hi,

[#12980] Re: \u escapes in string literals: proof of concept implementation — David Flanagan <david@...> 2007/10/26

Yukihiro Matsumoto wrote:

[#13028] Re: \u escapes in string literals: proof of concept implementation — Nobuyoshi Nakada <nobu@...> 2007/10/29

Hi,

[#13032] Re: \u escapes in string literals: proof of concept implementation — David Flanagan <david@...> 2007/10/29

Nobuyoshi Nakada wrote:

[#13034] Re: \u escapes in string literals: proof of concept implementation — Nobuyoshi Nakada <nobu@...> 2007/10/29

Hi,

[#13082] Re: \u escapes in string literals: proof of concept implementation — Martin Duerst <duerst@...> 2007/10/30

At 16:46 07/10/29, Nobuyoshi Nakada wrote:

[#13231] Re: \u escapes in string literals: proof of concept implementation — Nobuyoshi Nakada <nobu@...> 2007/11/06

Hi,

[#13234] Re: \u escapes in string literals: proof of concept implementation — Martin Duerst <duerst@...> 2007/11/06

At 11:29 07/11/06, Nobuyoshi Nakada wrote:

[#12825] clarification of ruby libraries installation paths? — Lucas Nussbaum <lucas@...>

Hi,

53 messages 2007/10/22
[#12830] Re: clarification of ruby libraries installation paths? — Ben Bleything <ben@...> 2007/10/22

On Mon, Oct 22, 2007, Lucas Nussbaum wrote:

[#12833] Re: clarification of ruby libraries installation paths? — Lucas Nussbaum <lucas@...> 2007/10/22

On 23/10/07 at 00:13 +0900, Ben Bleything wrote:

[#12835] Re: clarification of ruby libraries installation paths? — "Austin Ziegler" <halostatue@...> 2007/10/22

On 10/22/07, Lucas Nussbaum <lucas@lucas-nussbaum.net> wrote:

[#12836] Re: clarification of ruby libraries installation paths? — Lucas Nussbaum <lucas@...> 2007/10/22

On 23/10/07 at 01:55 +0900, Austin Ziegler wrote:

[#12888] Re: clarification of ruby libraries installation paths? — Gonzalo Garramuño <ggarra@...> 2007/10/23

Lucas Nussbaum wrote:

[#12894] Re: clarification of ruby libraries installation paths? — Lucas Nussbaum <lucas@...> 2007/10/24

On 24/10/07 at 05:14 +0900, Gonzalo Garramuño wrote:

[#13057] Re: clarification of ruby libraries installation paths? — Gonzalo Garramuño <ggarra@...> 2007/10/29

Lucas Nussbaum wrote:

[#13058] Re: clarification of ruby libraries installation paths? — Lucas Nussbaum <lucas@...> 2007/10/29

On 30/10/07 at 07:28 +0900, Gonzalo Garramuño wrote:

[#12848] Re: clarification of ruby libraries installation paths? — Sam Roberts <sroberts@...> 2007/10/22

On Tue, Oct 23, 2007 at 01:55:29AM +0900, Austin Ziegler wrote:

[#12855] Re: clarification of ruby libraries installation paths? — "Austin Ziegler" <halostatue@...> 2007/10/23

On 10/22/07, Sam Roberts <sroberts@uniserve.com> wrote:

[#13016] Re: clarification of ruby libraries installation paths? — bob@... (Bob Proulx) 2007/10/28

Austin Ziegler wrote:

[#13029] Re: clarification of ruby libraries installation paths? — "Austin Ziegler" <halostatue@...> 2007/10/29

On 10/28/07, Bob Proulx <bob@proulx.com> wrote:

[#13054] Austin Ziegler's behaviour (Was: clarification of ruby libraries installation paths?) — Lucas Nussbaum <lucas@...> 2007/10/29

Austin,

[#13055] Re: Austin Ziegler's behaviour (Was: clarification of ruby libraries installation paths?) — "Luis Lavena" <luislavena@...> 2007/10/29

On 10/29/07, Lucas Nussbaum <lucas@lucas-nussbaum.net> wrote:

[#13064] Re: Austin Ziegler's behaviour (Was: clarification of ruby libraries installation paths?) — "Austin Ziegler" <halostatue@...> 2007/10/30

On 10/29/07, Luis Lavena <luislavena@gmail.com> wrote:

[#13066] Re: Austin Ziegler's behaviour (Was: clarification of ruby libraries installation paths?) — "Luis Lavena" <luislavena@...> 2007/10/30

On 10/30/07, Austin Ziegler <halostatue@gmail.com> wrote:

[#13094] Re: Austin Ziegler's behaviour (Was: clarification of ruby libraries installation paths?) — "Rick Bradley" <rick@...> 2007/10/30

Do we think that maybe, just maybe, things went off the rails when the

[#13095] Re: Austin Ziegler's behaviour (Was: clarification of ruby libraries installation paths?) — "Luis Lavena" <luislavena@...> 2007/10/30

On 10/30/07, Rick Bradley <rick@rickbradley.com> wrote:

[#12900] Hopefully Complete List of Possible Encoding Specifications - Existing Ones — Wolfgang Nádasi-Donner <ed.odanow@...>

Dear Ruby 1.9 architects, developers, and testers!

31 messages 2007/10/24
[#12905] Re: Hopefully Complete List of Possible Encoding Specifications - Existing Ones — Yukihiro Matsumoto <matz@...> 2007/10/24

Hi,

[#12907] Re: Hopefully Complete List of Possible Encoding Specifications - Existing Ones — Wolfgang Nádasi-Donner <ed.odanow@...> 2007/10/24

Yukihiro Matsumoto schrieb:

[#12909] Re: Hopefully Complete List of Possible Encoding Specifications - Existing Ones — Yukihiro Matsumoto <matz@...> 2007/10/24

Hi,

[#12940] Re: Hopefully Complete List of Possible Encoding Specifications - Existing Ones — Wolfgang Nádasi-Donner <ed.odanow@...> 2007/10/25
[#12942] Re: Hopefully Complete List of Possible Encoding Specifications - Existing Ones — Wolfgang Nádasi-Donner <ed.odanow@...> 2007/10/25

I have a (hopefully) final question before testing all

[#12948] Re: Hopefully Complete List of Possible Encoding Specifications - Existing Ones — Nobuyoshi Nakada <nobu@...> 2007/10/26

Hi,

[#12951] Fluent programming in Ruby — David Flanagan <david@...>

From the ChangeLog:

16 messages 2007/10/26

[#12996] General hash keys for colon notation — murphy <murphy@...>

Dear language designer(s) and parser wizards,

16 messages 2007/10/28

[#13027] Implementation of "guessUTF" method - final questions — Wolfgang Nádasi-Donner <ed.odanow@...>

Dear Ruby designers, developers, and testers!

22 messages 2007/10/29

[#13069] new Enumerable.butfirst method — David Flanagan <david@...>

Matz,

17 messages 2007/10/30

Re: \u escapes in string literals: proof of concept implementation

From: Nobuyoshi Nakada <nobu@...>
Date: 2007-10-25 06:17:51 UTC
List: ruby-core #12922
Hi,

At Tue, 23 Oct 2007 13:10:40 +0900,
David Flanagan wrote in [ruby-core:12864]:
> > That things are simpler is quite clear. But option a wouldn't be
> > difficult to implement, either, I guess. My suggestion is to stay
> > with option a until we have a better idea of which of b and c is
> > really needed/implementable/... 
> 
> The appeal, to me, of my current version is that it is independent of 
> the primary encoding.  \u escapes work no matter what -K option you 
> specify, and they always translate to a specific byte sequence. On the 
> other hand, I really don't know how to handle strings that mix 
> sjis-encoded Kanji characters with \u escapes.  What should the encoding 
> of the resulting string be?

That's garbage.

> So choosing option a might be the best bet: \u escapes cause an error 
> with -Ks or -Ke.  They are only allowed when the primary encoding is 
> ascii or utf-8.  I think that would mean that no transcoding would be 
> necessary.  I also think that my current patch doesn't need much 
> modification: just the addition of errors when the encoding does not 
> allow \u.

Another option is c of [ruby-core:12769].


Index: parse.y
===================================================================
--- parse.y	(revision 13774)
+++ parse.y	(working copy)
@@ -238,4 +238,5 @@ struct parser_params {
     int parser_ruby_sourceline;	/* current line no. */
     rb_encoding *enc;
+    rb_encoding *utf8;
 
 #ifndef RIPPER
@@ -261,8 +262,11 @@ struct parser_params {
 };
 
+#define UTF8_ENC() (parser->utf8 ? parser->utf8 : \
+		    (parser->utf8 = rb_enc_find("utf-8")))
 #define STR_NEW(p,n) rb_enc_str_new((p),(n),parser->enc)
 #define STR_NEW0() rb_str_new(0,0)
 #define STR_NEW2(p) rb_enc_str_new((p),strlen(p),parser->enc)
 #define STR_NEW3(p,n,m) parser_str_new((p),(n),STR_ENC(!ENC_SINGLE(m)),(m))
+#define STR_NEW4(p,n,e,m) parser_str_new((p),(n), (e), (m))
 #define STR_ENC(m) ((m)?parser->enc:rb_enc_from_index(0))
 #define ENC_SINGLE(cr) ((cr)==ENC_CODERANGE_SINGLE)
@@ -4488,5 +4492,5 @@ none		: /* none */
 
 static int parser_regx_options(struct parser_params*);
-static int parser_tokadd_string(struct parser_params*,int,int,int,long*,int*);
+static int parser_tokadd_string(struct parser_params*,int,int,int,long*,int*,rb_encoding**);
 static int parser_parse_string(struct parser_params*,NODE*);
 static int parser_here_document(struct parser_params*,NODE*);
@@ -4497,8 +4501,10 @@ static int parser_here_document(struct p
 # define tokspace(n)               parser_tokspace(parser, n)
 # define tokadd(c)                 parser_tokadd(parser, c)
-# define read_escape(m)            parser_read_escape(parser, m)
-# define tokadd_escape(t,m)        parser_tokadd_escape(parser, t, m)
+# define tok_hex(numlen)           parser_tok_hex(parser, numlen)
+# define tok_utf8(numlen,e)        parser_tok_utf8(parser, numlen, e)
+# define read_escape(flags,m,e)    parser_read_escape(parser, flags, m, e)
+# define tokadd_escape(t,m,e)      parser_tokadd_escape(parser, t, m, e)
 # define regx_options()            parser_regx_options(parser)
-# define tokadd_string(f,t,p,n,m)  parser_tokadd_string(parser,f,t,p,n,m)
+# define tokadd_string(f,t,p,n,m,e) parser_tokadd_string(parser,f,t,p,n,m, e)
 # define parse_string(n)           parser_parse_string(parser,n)
 # define here_document(n)          parser_here_document(parser,n)
@@ -4938,5 +4944,73 @@ parser_tokadd(struct parser_params *pars
 
 static int
-parser_read_escape(struct parser_params *parser, int *mb)
+parser_tok_hex(struct parser_params *parser, int *numlen)
+{
+    int c;
+
+    if (peek('{')) {
+	nextc();
+	c = scan_hex(lex_p, 8, numlen);
+	if (!*numlen) goto invalid;
+	if (!peek('}')) {
+	    yyerror("unterminated hex escape");
+	    return 0;
+	}
+	nextc();
+	*numlen += 2;
+    }
+    else {
+	c = scan_hex(lex_p, 2, numlen);
+	if (!*numlen) {
+	  invalid:
+	    yyerror("invalid hex escape");
+	    return 0;
+	}
+    }
+    return c;
+}
+
+static int
+parser_tok_utf8(struct parser_params *parser, int *numlen, rb_encoding **encp)
+{
+    int codepoint;
+
+    if (peek('{')) {  /* handle \u{...} form */
+	nextc();
+	codepoint = scan_hex(lex_p, 8, numlen);
+	if (*numlen == 0)  {
+	    yyerror("invalid Unicode escape");
+	    return 0;
+	}
+	if (codepoint > 0x7fffffff) {
+	    yyerror("illegal Unicode codepoint (too large)");
+	    return 0;
+	}
+	lex_p += *numlen;
+	if (!peek('}')) {
+	    yyerror("unterminated Unicode escape");
+	    return 0;
+	}
+	nextc();
+    }
+    else {			/* handle \uxxxx form */
+	codepoint = scan_hex(lex_p, 4, numlen);
+	if (*numlen < 4) {
+	    yyerror("invalid Unicode escape");
+	    return 0;
+	}
+	lex_p += 4;
+    }
+    if (codepoint >= 0x80) {
+	*encp = UTF8_ENC();
+    }
+
+    return codepoint;
+}
+
+#define ESCAPE_CONTROL 1
+#define ESCAPE_META    2
+
+static int
+parser_read_escape(struct parser_params *parser, int flags, int *mb, rb_encoding **encp)
 {
     int c;
@@ -4969,4 +5043,5 @@ parser_read_escape(struct parser_params 
       case '0': case '1': case '2': case '3': /* octal constant */
       case '4': case '5': case '6': case '7':
+	if (flags & (ESCAPE_CONTROL|ESCAPE_META)) goto eof;
 	{
 	    int numlen;
@@ -4980,13 +5055,21 @@ parser_read_escape(struct parser_params 
 
       case 'x':	/* hex constant */
+	if (flags & (ESCAPE_CONTROL|ESCAPE_META)) goto eof;
 	{
 	    int numlen;
 
-	    c = scan_hex(lex_p, 2, &numlen);
-	    if (numlen == 0) {
-		yyerror("Invalid escape character syntax");
-		return 0;
-	    }
-	    lex_p += numlen;
+	    c = tok_hex(&numlen);
+	    if (numlen == 0) goto eof;
+	}
+	if (mb && (c >= 0x80)) *mb = ENC_CODERANGE_UNKNOWN;
+	return c;
+
+      case 'u':	/* Unicode constant */
+	if (flags & (ESCAPE_CONTROL|ESCAPE_META)) goto eof;
+	{
+	    int numlen;
+
+	    c = tok_utf8(&numlen, encp);
+	    if (numlen == 0) goto eof;
 	}
 	if (mb && (c >= 0x80)) *mb = ENC_CODERANGE_UNKNOWN;
@@ -5000,12 +5083,12 @@ parser_read_escape(struct parser_params 
 
       case 'M':
+	if (flags & ESCAPE_META) goto eof;
 	if ((c = nextc()) != '-') {
-	    yyerror("Invalid escape character syntax");
 	    pushback(c);
-	    return '\0';
+	    goto eof;
 	}
 	if ((c = nextc()) == '\\') {
 	    if (mb) *mb = ENC_CODERANGE_UNKNOWN;
-	    return read_escape(0) | 0x80;
+	    return read_escape(flags|ESCAPE_META, 0, encp) | 0x80;
 	}
 	else if (c == -1) goto eof;
@@ -5017,11 +5100,11 @@ parser_read_escape(struct parser_params 
       case 'C':
 	if ((c = nextc()) != '-') {
-	    yyerror("Invalid escape character syntax");
 	    pushback(c);
-	    return '\0';
+	    goto eof;
 	}
       case 'c':
+	if (flags & ESCAPE_CONTROL) goto eof;
 	if ((c = nextc())== '\\') {
-	    c = read_escape(mb);
+	    c = read_escape(flags|ESCAPE_CONTROL, mb, encp);
 	}
 	else if (c == '?')
@@ -5040,9 +5123,13 @@ parser_read_escape(struct parser_params 
 }
 
+#define tokcopy(n) memcpy(tokspace(n), lex_p - (n), (n))
+
 static int
-parser_tokadd_escape(struct parser_params *parser, int term, int *mb)
+parser_tokadd_escape(struct parser_params *parser, int term, int *mb, rb_encoding **encp)
 {
     int c;
+    int flags = 0;
 
+  first:
     switch (c = nextc()) {
       case '\n':
@@ -5051,17 +5138,13 @@ parser_tokadd_escape(struct parser_param
       case '0': case '1': case '2': case '3': /* octal constant */
       case '4': case '5': case '6': case '7':
+	if (flags & (ESCAPE_CONTROL|ESCAPE_META)) goto eof;
 	{
 	    int numlen;
 	    int oct;
 
-	    tokadd('\\');
-	    pushback(c);
-	    oct = scan_oct(lex_p, 3, &numlen);
-	    if (numlen == 0) {
-		yyerror("Invalid escape character syntax");
-		return -1;
-	    }
-	    while (numlen--)
-		tokadd(nextc());
+	    oct = scan_oct(--lex_p, 3, &numlen);
+	    if (numlen == 0) goto eof;
+	    lex_p += numlen;
+	    tokcopy(numlen + 1);
 	    if (mb && (oct >= 0200)) *mb = ENC_CODERANGE_UNKNOWN;
 	}
@@ -5069,45 +5152,59 @@ parser_tokadd_escape(struct parser_param
 
       case 'x':	/* hex constant */
+	if (flags & (ESCAPE_CONTROL|ESCAPE_META)) goto eof;
 	{
 	    int numlen;
 	    int hex;
 
-	    tokadd('\\');
-	    tokadd(c);
-	    hex = scan_hex(lex_p, 2, &numlen);
-	    if (numlen == 0) {
-		yyerror("Invalid escape character syntax");
-		return -1;
-	    }
-	    while (numlen--)
-		tokadd(nextc());
+	    hex = tok_hex(&numlen);
+	    if (numlen == 0) goto eof;
+	    lex_p += numlen;
+	    tokcopy(numlen + 2);
 	    if (mb && (hex >= 0x80)) *mb = ENC_CODERANGE_UNKNOWN;
 	}
 	return 0;
 
+      case 'u':	/* Unicode constant */
+	if (flags & (ESCAPE_CONTROL|ESCAPE_META)) goto eof;
+	{
+	    int numlen;
+	    int uc;
+
+	    uc = tok_utf8(&numlen, encp);
+	    if (numlen == 0) goto eof;
+	    lex_p += numlen;
+	    tokcopy(numlen + 2);
+	    if (mb && (uc >= 0x80)) *mb = ENC_CODERANGE_MULTI;
+	    if (uc >= 0x80) return 1;
+	}
+	return 0;
+
       case 'M':
+	if (flags & ESCAPE_META) goto eof;
 	if ((c = nextc()) != '-') {
-	    yyerror("Invalid escape character syntax");
 	    pushback(c);
-	    return 0;
+	    goto eof;
 	}
-	tokadd('\\'); tokadd('M'); tokadd('-');
+	tokcopy(3);
 	if (mb) *mb = ENC_CODERANGE_UNKNOWN;
+	flags |= ESCAPE_META;
 	goto escaped;
 
       case 'C':
+	if (flags & ESCAPE_CONTROL) goto eof;
 	if ((c = nextc()) != '-') {
-	    yyerror("Invalid escape character syntax");
 	    pushback(c);
-	    return 0;
+	    goto eof;
 	}
-	tokadd('\\'); tokadd('C'); tokadd('-');
+	tokcopy(3);
 	goto escaped;
 
       case 'c':
-	tokadd('\\'); tokadd('c');
+	if (flags & ESCAPE_CONTROL) goto eof;
+	tokcopy(2);
+	flags |= ESCAPE_CONTROL;
       escaped:
 	if ((c = nextc()) == '\\') {
-	    return tokadd_escape(term, mb);
+	    goto first;
 	}
 	else if (c == -1) goto eof;
@@ -5190,16 +5287,47 @@ parser_tokadd_mbchar(struct parser_param
 {
     int len = parser_mbclen();
-    do {
-	tokadd(c);
-    } while (--len > 0 && (c = nextc()) != -1);
+    tokadd(c);
+    lex_p += --len;
+    if (len > 0) tokcopy(len);
 }
 
 #define tokadd_mbchar(c) parser_tokadd_mbchar(parser, c)
 
+static void
+parser_tokaddmbc(struct parser_params *parser, int c, rb_encoding *enc)
+{
+    int len = rb_enc_codelen(c, enc);
+    rb_enc_mbcput(c, tokspace(len), enc);
+}
+
+#define tokaddmbc(c, enc) parser_tokaddmbc(parser, c, enc)
+
 static int
 parser_tokadd_string(struct parser_params *parser,
-		     int func, int term, int paren, long *nest, int *mb)
+		     int func, int term, int paren, long *nest,
+		     int *mb, rb_encoding **encp)
 {
     int c;
+    int has_mb = 0;
+    rb_encoding *enc = *encp;
+    char *errbuf = 0;
+    static const char mixed_msg[] = "%s mixed within %s source";
+
+#define mixed_error(enc1, enc2) if (!errbuf) {	\
+	int len = sizeof(mixed_msg) - 4;	\
+	len += strlen(rb_enc_name(enc1));	\
+	len += strlen(rb_enc_name(enc2));	\
+	errbuf = ALLOCA_N(char, len);		\
+	snprintf(errbuf, len, mixed_msg,	\
+		 rb_enc_name(enc1),		\
+		 rb_enc_name(enc2));		\
+	yyerror(errbuf);			\
+    }
+#define mixed_escape(beg, enc1, enc2) do {	\
+	const char *pos = lex_p;		\
+	lex_p = beg;				\
+	mixed_error(enc1, enc2);		\
+	lex_p = pos;				\
+    } while (0)
 
     while ((c = nextc()) != -1) {
@@ -5222,4 +5350,5 @@ parser_tokadd_string(struct parser_param
 	}
 	else if (c == '\\') {
+	    const char *beg = lex_p - 1;
 	    c = nextc();
 	    switch (c) {
@@ -5237,6 +5366,9 @@ parser_tokadd_string(struct parser_param
 		if (func & STR_FUNC_REGEXP) {
 		    pushback(c);
-		    if (tokadd_escape(term, mb) < 0)
+		    if ((c = tokadd_escape(term, mb, &enc)) < 0)
 			return -1;
+		    if (has_mb && enc != *encp) {
+			mixed_escape(beg, enc, *encp);
+		    }
 		    continue;
 		}
@@ -5244,5 +5376,13 @@ parser_tokadd_string(struct parser_param
 		    pushback(c);
 		    if (func & STR_FUNC_ESCAPE) tokadd('\\');
-		    c = read_escape(mb);
+		    c = read_escape(0, mb, &enc);
+		    if (has_mb && enc != *encp) {
+			mixed_escape(beg, enc, *encp);
+			continue;
+		    }
+		    if (c >= 0x80) {
+			tokaddmbc(c, enc);
+			continue;
+		    }
 		}
 		else if ((func & STR_FUNC_QWORDS) && ISSPACE(c)) {
@@ -5255,4 +5395,9 @@ parser_tokadd_string(struct parser_param
 	}
 	else if (parser_ismbchar()) {
+	    has_mb = 1;
+	    if (enc != *encp) {
+		mixed_error(enc, *encp);
+		continue;
+	    }
 	    tokadd_mbchar(c);
 	    if (mb) *mb = ENC_CODERANGE_MULTI;
@@ -5270,4 +5415,5 @@ parser_tokadd_string(struct parser_param
 	tokadd(c);
     }
+    *encp = enc;
     return c;
 }
@@ -5283,4 +5429,5 @@ parser_parse_string(struct parser_params
     int paren = nd_paren(quote);
     int c, space = 0, mb = ENC_CODERANGE_SINGLE;
+    rb_encoding *enc = parser->enc;
 
     if (func == -1) return tSTRING_END;
@@ -5316,12 +5463,11 @@ parser_parse_string(struct parser_params
     }
     pushback(c);
-    if (tokadd_string(func, term, paren, &quote->nd_nest, &mb) == -1) {
+    if (tokadd_string(func, term, paren, &quote->nd_nest, &mb, &enc) == -1) {
+	ruby_sourceline = nd_line(quote);
 	if (func & STR_FUNC_REGEXP) {
-	    ruby_sourceline = nd_line(quote);
 	    compile_error(PARSER_ARG "unterminated regexp meets end of file");
 	    return tREGEXP_END;
 	}
 	else {
-	    ruby_sourceline = nd_line(quote);
 	    compile_error(PARSER_ARG "unterminated string meets end of file");
 	    return tSTRING_END;
@@ -5330,5 +5476,5 @@ parser_parse_string(struct parser_params
 
     tokfix();
-    set_yylval_str(STR_NEW3(tok(), toklen(), mb));
+    set_yylval_str(STR_NEW4(tok(), toklen(), enc, mb));
     return tSTRING_CONTENT;
 }
@@ -5494,4 +5640,5 @@ parser_here_document(struct parser_param
     else {
 	int mb = ENC_CODERANGE_SINGLE, *mbp = &mb;
+	rb_encoding *enc = parser->enc;
 	newtok();
 	if (c == '#') {
@@ -5508,7 +5655,7 @@ parser_here_document(struct parser_param
 	do {
 	    pushback(c);
-	    if ((c = tokadd_string(func, '\n', 0, NULL, mbp)) == -1) goto error;
+	    if ((c = tokadd_string(func, '\n', 0, NULL, mbp, &enc)) == -1) goto error;
 	    if (c != '\n') {
-		set_yylval_str(STR_NEW3(tok(), toklen(), mb));
+		set_yylval_str(STR_NEW4(tok(), toklen(), enc, mb));
 		return tSTRING_CONTENT;
 	    }
@@ -5517,5 +5664,5 @@ parser_here_document(struct parser_param
 	    if ((c = nextc()) == -1) goto error;
 	} while (!whole_match_p(eos, len, indent));
-	str = STR_NEW3(tok(), toklen(), mb);
+	str = STR_NEW4(tok(), toklen(), enc, mb);
     }
     heredoc_restore(lex_strterm);
@@ -5778,4 +5925,5 @@ parser_yylex(struct parser_params *parse
     enum lex_state_e last_state;
     int mb;
+    rb_encoding *enc;
 #ifdef RIPPER
     int fallthru = Qfalse;
@@ -6099,4 +6247,5 @@ parser_yylex(struct parser_params *parse
 	}
 	newtok();
+	enc = parser->enc;
 	if (parser_ismbchar()) {
 	    mb = ENC_CODERANGE_MULTI;
@@ -6107,8 +6256,7 @@ parser_yylex(struct parser_params *parse
 	    goto ternary;
 	}
-	else if (c == '\\' && (c = read_escape(0)) >= 0x80) {
-	    rb_encoding *enc = parser->enc;
+	else if (c == '\\' && (c = read_escape(0, 0, &enc)) >= 0x80) {
 	    mb = ENC_CODERANGE_UNKNOWN;
-	    rb_enc_mbcput(c, tokspace(rb_enc_codelen(c, enc)), enc);
+	    tokaddmbc(c, enc);
 	}
 	else {
@@ -6117,5 +6265,5 @@ parser_yylex(struct parser_params *parse
 	}
 	tokfix();
-	set_yylval_str(STR_NEW3(tok(), toklen(), mb));
+	set_yylval_str(parser_str_new(tok(), toklen(), enc, mb));
 	lex_state = EXPR_ENDARG;
 	return tCHAR;
@@ -7187,4 +7335,15 @@ list_concat_gen(struct parser_params *pa
 }
 
+static void
+literal_concat0(struct parser_params *parser, VALUE head, VALUE tail)
+{
+    if (!rb_enc_compatible(head, tail)) {
+	compile_error(PARSER_ARG "string literal encodings differ (%s / %s)",
+		      rb_enc_name(rb_enc_get(head)),
+		      rb_enc_name(rb_enc_get(tail)));
+    }
+    rb_str_buf_append(head, tail);
+}
+
 /* concat two string literals */
 static NODE *
@@ -7204,5 +7363,5 @@ literal_concat_gen(struct parser_params 
       case NODE_STR:
 	if (htype == NODE_STR) {
-	    rb_str_concat(head->nd_lit, tail->nd_lit);
+	    literal_concat0(parser, head->nd_lit, tail->nd_lit);
 	    rb_gc_force_recycle((VALUE)tail);
 	}
@@ -7214,5 +7373,5 @@ literal_concat_gen(struct parser_params 
       case NODE_DSTR:
 	if (htype == NODE_STR) {
-	    rb_str_concat(head->nd_lit, tail->nd_lit);
+	    literal_concat0(parser, head->nd_lit, tail->nd_lit);
 	    tail->nd_lit = head->nd_lit;
 	    rb_gc_force_recycle((VALUE)head);

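For readers who want to try the escape syntax outside the parser, here is a minimal standalone sketch of the two forms that parser_tok_utf8 in the patch accepts: \uXXXX (exactly four hex digits) and \u{...} (one to eight hex digits in braces), followed by encoding the code point as UTF-8. It is not part of the patch. The names hex_value, parse_u_escape and utf8_encode are hypothetical, and the upper limit of 0x10FFFF is an assumption of this sketch (the patch itself only rejects values above 0x7fffffff).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: value of one hex digit, or -1 if ch is not one. */
static int hex_value(char ch)
{
    if (ch >= '0' && ch <= '9') return ch - '0';
    if (ch >= 'a' && ch <= 'f') return ch - 'a' + 10;
    if (ch >= 'A' && ch <= 'F') return ch - 'A' + 10;
    return -1;
}

/*
 * Parse the text that follows "\u". Accepts "XXXX" (exactly four hex
 * digits) or "{X...}" (one to eight hex digits inside braces).
 * On success stores the code point in *cp and returns the number of
 * characters consumed; returns 0 on any error.
 */
static size_t parse_u_escape(const char *s, uint32_t *cp)
{
    uint32_t value = 0;
    size_t i = 0, digits = 0;
    int braced = (s[0] == '{');
    size_t max_digits = braced ? 8 : 4;

    if (braced) i = 1;
    while (digits < max_digits && hex_value(s[i]) >= 0) {
        value = (value << 4) | (uint32_t)hex_value(s[i]);
        i++;
        digits++;
    }
    if (braced) {
        if (digits == 0 || s[i] != '}') return 0; /* empty or unterminated */
        i++;                                      /* consume '}' */
    }
    else if (digits < 4) {
        return 0;                                 /* \uXXXX needs four digits */
    }
    if (value > 0x10FFFF) return 0;               /* assumption: Unicode range only */
    *cp = value;
    return i;
}

/* Encode cp as UTF-8 into out; returns the byte count, or 0 if out of range. */
static size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Calling parse_u_escape("00e9", &cp) and then utf8_encode(cp, buf) gives the two bytes 0xC3 0xA9, whatever the source encoding is. This sketch does not reject surrogate code points; a real implementation may want to.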

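The mixing check that the patch adds to parser_tokadd_string can be modelled roughly as follows. This is a simplified sketch of my reading of the patch, not code taken from it: literal_state, note_source_multibyte and note_u_escape are hypothetical names, and a single flag stands in for the comparison of rb_encoding pointers. The rule modelled is that a \u escape at or above 0x80 makes the literal UTF-8, and such an escape may not share a literal with raw multibyte characters from a non-UTF-8 source encoding.

```c
#include <stdbool.h>

/* Hypothetical model of the per-literal state tracked while scanning. */
typedef struct {
    bool source_is_utf8; /* source encoding is already UTF-8 */
    bool has_source_mb;  /* a raw multibyte source character was seen */
    bool has_u_escape;   /* a \u escape >= 0x80 was seen */
} literal_state;

/* Record a raw multibyte character; returns false for an invalid mix. */
static bool note_source_multibyte(literal_state *st)
{
    st->has_source_mb = true;
    return st->source_is_utf8 || !st->has_u_escape;
}

/* Record a \u escape; returns false for an invalid mix. */
static bool note_u_escape(literal_state *st, unsigned long codepoint)
{
    if (codepoint < 0x80) return true; /* ASCII range never changes encoding */
    st->has_u_escape = true;
    return st->source_is_utf8 || !st->has_source_mb;
}
```

With a Shift_JIS source, a literal containing only \u3042 is accepted and ends up as UTF-8, while a literal containing both a raw Kanji character and \u3042 is reported as mixed, whichever of the two comes first.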
-- 
Nobu Nakada
