[ruby-core:18616] Re: Ruby 1.9 string performance

From: "Michael Selig" <michael.selig@...>
Date: 2008-09-16 05:11:59 UTC
List: ruby-core #18616
> On Fri, 12 Sep 2008 02:16:51 +1000, NARUSE, Yui <naruse@airemix.jp>  
> wrote:
>
>>
>> If you split your patch into small atomic patches,
>> your patch will be merged rapidly.
>>
Here are 3 other patches for String performance. Please apply
"codepoint.pat" last, after all the other patches (including the 2 from the
previous mail), because it overlaps with them.

Details for ChangeLog:

casecmp.pat:
- Optimize String#casecmp for single-byte character strings

case.pat:
- Optimize String#upcase, downcase & swapcase for single-byte character  
strings

codepoint.pat:
- Added new rb_enc_codepoint_l() function to encoding.c which returns the  
codepoint, same as rb_enc_codepoint(), plus returns the character length
- Modified string.c to use it, avoiding extra calls to determine length of  
character
- Changed "single_byte_optimizable()" to a #define (for compilers which  
don't do "inline" properly)
- All these changes make many methods on multi-byte character strings
somewhat faster (maybe 4-5% on UTF-8; I haven't tested other encodings, but I
think they should be similar)
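
To illustrate the idea behind rb_enc_codepoint_l(), here is a minimal
stand-alone sketch. It is not the code from the patch: it handles UTF-8 only,
and utf8_codepoint_l() and utf8_count_chars() are hypothetical names chosen
for this example. The point is that the decoder already knows the character
length, so handing it back through an out-parameter saves the caller a second
call just to find out how far to advance.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical UTF-8-only stand-in for the idea behind rb_enc_codepoint_l().
 * Decodes the character at p (bounded by e), stores its byte length in *len
 * and returns the codepoint. On a truncated or malformed sequence it sets
 * *len to 0 and returns (unsigned int)-1. Overlong forms and surrogates are
 * not rejected, to keep the sketch short. */
static unsigned int
utf8_codepoint_l(const char *p, const char *e, int *len)
{
    const unsigned char *s = (const unsigned char *)p;
    unsigned int c;
    int n, i;

    assert(p < e && len != NULL);

    /* The lead byte determines both the payload bits and the length. */
    if (s[0] < 0x80)                { c = s[0];        n = 1; }
    else if ((s[0] & 0xE0) == 0xC0) { c = s[0] & 0x1F; n = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { c = s[0] & 0x0F; n = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { c = s[0] & 0x07; n = 4; }
    else { *len = 0; return (unsigned int)-1; }

    if (e - p < n) { *len = 0; return (unsigned int)-1; }
    for (i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80) { *len = 0; return (unsigned int)-1; }
        c = (c << 6) | (s[i] & 0x3F);
    }
    *len = n;
    return c;
}

/* Counts characters, advancing by the length the decoder reported instead
 * of asking for it in a separate call. Returns -1 on invalid input. */
static long
utf8_count_chars(const char *s, const char *send)
{
    long count = 0;

    while (s < send) {
        int clen;
        if (utf8_codepoint_l(s, send, &clen) == (unsigned int)-1) return -1;
        s += clen;
        count++;
    }
    return count;
}
```

The loop in utf8_count_chars() has the same shape as the "s += clen" loops in
codepoint.pat, where the old code called rb_enc_codelen() on every iteration.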

I also have some comments and questions:

1) Currently "String#rstrip" on multi-byte character sets seems to work  
 from the start to the end of the string. Can't it work backwards, which  
would be faster?
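
As a rough sketch of what scanning backwards could look like, assuming UTF-8
only: bytes below 0x80 never occur inside a UTF-8 multi-byte sequence, so
trailing ASCII whitespace and NUL bytes can be dropped by walking back from
the end one byte at a time. The names is_strip_byte() and utf8_rstrip_len()
are hypothetical, and this does not carry over to every encoding (in
Shift_JIS, for example, a trail byte can fall in the ASCII range), so a real
implementation would need a per-encoding way to step backwards.

```c
#include <assert.h>
#include <stddef.h>

/* ASCII whitespace (tab through carriage return, and space) or NUL. */
static int
is_strip_byte(unsigned char c)
{
    return c == '\0' || c == ' ' || (c >= '\t' && c <= '\r');
}

/* Returns the length the UTF-8 buffer would have after removing trailing
 * whitespace, looking only at the bytes that are actually stripped. */
static size_t
utf8_rstrip_len(const char *s, size_t len)
{
    assert(s != NULL || len == 0);

    while (len > 0 && is_strip_byte((unsigned char)s[len - 1]))
        len--;
    return len;
}
```

The cost is proportional to the amount of trailing whitespace rather than to
the length of the whole string.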

2) A recent change to tr_trans() (used by String#tr & others), made to fix the
"coderange" issue mentioned earlier, sets the coderange of the result to
that of the calling string object, but only if the coderanges of both the
"from" string and the "to" string are the same as that of the input string.
It is my understanding that the coderange of the result depends only on
the "calling" string object and the "to" string, not on the "from" string.
If this is right (please tell me if it is not!), then I think a better
implementation is to use something like
	cr = ENC_CODERANGE_AND(ENC_CODERANGE(str), ENC_CODERANGE(to_str));
because this will preserve the "valid" flag if one of the strings is 7-bit
ASCII and the other isn't (eg: UTF-8).
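
The combination rule I have in mind can be sketched as below. This is a
stand-alone illustration with hypothetical names (enum coderange,
coderange_and()), not the definition of ENC_CODERANGE_AND in the Ruby source,
and it is deliberately conservative: anything involving an unknown or broken
input comes out as "unknown", which simply forces a rescan later.

```c
/* Hypothetical stand-ins for the coderange states discussed above. */
enum coderange { CR_UNKNOWN, CR_7BIT, CR_VALID, CR_BROKEN };

/* Combines the coderange of the receiver with that of the "to" string:
 *   7bit  + 7bit             -> 7bit
 *   7bit  + valid (or v.v.)  -> valid
 *   valid + valid            -> valid
 *   anything else            -> unknown (rescan when next needed) */
static enum coderange
coderange_and(enum coderange a, enum coderange b)
{
    int a_ok = (a == CR_7BIT || a == CR_VALID);
    int b_ok = (b == CR_7BIT || b == CR_VALID);

    if (a == CR_7BIT && b == CR_7BIT) return CR_7BIT;
    if (a_ok && b_ok) return CR_VALID;
    return CR_UNKNOWN;
}
```

With this rule a 7-bit receiver translated with a UTF-8 "to" string keeps a
known coderange ("valid") instead of being reset.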

3) rb_str_modify() is actually a slight problem, because it clears the
coderange flags. In many cases you then have to set them back to what they
were to avoid costly string re-scans. But sometimes you actually do want
to clear the flags: if they indicated "broken" and an "innocuous" change
is then made (eg: changing a byte), the change may make the string valid
again. It seems to me that a neat implementation would be to have a
function called, say, "str_modify()", which is almost the same as
"rb_str_modify()", except that if the coderange says "broken" it clears it
(forcing a later rescan), and if the coderange is valid it leaves it alone.
This new function could then be used in most places in string.c to save
mucking around with the coderange flags.
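
A minimal sketch of that behaviour, using a toy string type rather than the
real RString: struct toy_str, enum toy_cr and toy_str_modify() are all
hypothetical names, and the frozen-check and buffer-unsharing work that
rb_str_modify() does is left out.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the coderange states. */
enum toy_cr { TOY_CR_UNKNOWN, TOY_CR_7BIT, TOY_CR_VALID, TOY_CR_BROKEN };

/* Toy string: just enough state to show the flag handling. */
struct toy_str {
    char *ptr;
    size_t len;
    enum toy_cr cr;
};

/* Sketch of the proposed str_modify(): a "broken" coderange is cleared so
 * the string is rescanned later (the edit may have repaired it), while any
 * other coderange is left as it was. */
static void
toy_str_modify(struct toy_str *str)
{
    if (str->cr == TOY_CR_BROKEN)
        str->cr = TOY_CR_UNKNOWN;
}
```

This only pays off where the caller's edit cannot turn a valid string into an
invalid one; callers that cannot promise that would still need to clear the
flags themselves.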

Cheers,
Mike

Attachments (3)

case.pat (3.48 KB, text/x-diff)
Index: string.c
===================================================================
--- string.c	(revision 19374)
+++ string.c	(working copy)
@@ -4041,17 +4041,30 @@
     rb_str_modify(str);
     enc = STR_ENC_GET(str);
     s = RSTRING_PTR(str); send = RSTRING_END(str);
-    while (s < send) {
-	unsigned int c = rb_enc_codepoint(s, send, enc);
+    if (single_byte_optimizable(str)) {
+	while (s < send) {
+	    unsigned int c = *(unsigned char *)s;
 
-	if (rb_enc_islower(c, enc)) {
-	    /* assuming toupper returns codepoint with same size */
-	    rb_enc_mbcput(rb_enc_toupper(c, enc), s, enc);
-	    modify = 1;
+	    if (rb_enc_islower(c, enc)) {
+		*s = rb_enc_toupper(c , enc);
+		modify = 1;
+	    }
+	    s++;
 	}
-	s += rb_enc_codelen(c, enc);
     }
+    else {
+	while (s < send) {
+	    unsigned int c = rb_enc_codepoint(s, send, enc);
 
+	    if (rb_enc_islower(c, enc)) {
+		/* assuming toupper returns codepoint with same size */
+		rb_enc_mbcput(rb_enc_toupper(c, enc), s, enc);
+		modify = 1;
+	    }
+	    s += rb_enc_codelen(c, enc);
+	}
+    }
+
     ENC_CODERANGE_SET(str, cr);
     if (modify) return str;
     return Qnil;
@@ -4099,17 +4112,30 @@
     rb_str_modify(str);
     enc = STR_ENC_GET(str);
     s = RSTRING_PTR(str); send = RSTRING_END(str);
-    while (s < send) {
-	unsigned int c = rb_enc_codepoint(s, send, enc);
+    if (single_byte_optimizable(str)) {
+	while (s < send) {
+	    unsigned int c = *(unsigned char *)s;
 
-	if (rb_enc_isupper(c, enc)) {
-	    /* assuming toupper returns codepoint with same size */
-	    rb_enc_mbcput(rb_enc_tolower(c, enc), s, enc);
-	    modify = 1;
+	    if (rb_enc_isupper(c, enc)) {
+		*s = rb_enc_tolower(c , enc);
+		modify = 1;
+	    }
+	    s++;
 	}
-	s += rb_enc_codelen(c, enc);
     }
+    else {
+	while (s < send) {
+	    unsigned int c = rb_enc_codepoint(s, send, enc);
 
+	    if (rb_enc_isupper(c, enc)) {
+		/* assuming tolower returns codepoint with same size */
+		rb_enc_mbcput(rb_enc_tolower(c, enc), s, enc);
+		modify = 1;
+	    }
+	    s += rb_enc_codelen(c, enc);
+	}
+    }
+
     ENC_CODERANGE_SET(str, cr);
     if (modify) return str;
     return Qnil;
@@ -4228,20 +4254,37 @@
     rb_str_modify(str);
     enc = STR_ENC_GET(str);
     s = RSTRING_PTR(str); send = RSTRING_END(str);
-    while (s < send) {
-	unsigned int c = rb_enc_codepoint(s, send, enc);
+    if (single_byte_optimizable(str)) {
+	while (s < send) {
+	    unsigned int c = *(unsigned char *)s;
 
-	if (rb_enc_isupper(c, enc)) {
-	    /* assuming toupper returns codepoint with same size */
-	    rb_enc_mbcput(rb_enc_tolower(c, enc), s, enc);
-	    modify = 1;
+	    if (rb_enc_isupper(c, enc)) {
+		*s = rb_enc_tolower(c , enc);
+		modify = 1;
+	    }
+	    else if (rb_enc_islower(c, enc)) {
+		*s = rb_enc_toupper(c , enc);
+		modify = 1;
+	    }
+	    s++;
 	}
-	else if (rb_enc_islower(c, enc)) {
-	    /* assuming toupper returns codepoint with same size */
-	    rb_enc_mbcput(rb_enc_toupper(c, enc), s, enc);
-	    modify = 1;
+    }
+    else {
+	while (s < send) {
+	    unsigned int c = rb_enc_codepoint(s, send, enc);
+
+	    if (rb_enc_isupper(c, enc)) {
+		/* assuming toupper returns codepoint with same size */
+		rb_enc_mbcput(rb_enc_tolower(c, enc), s, enc);
+		modify = 1;
+	    }
+	    else if (rb_enc_islower(c, enc)) {
+		/* assuming toupper returns codepoint with same size */
+		rb_enc_mbcput(rb_enc_toupper(c, enc), s, enc);
+		modify = 1;
+	    }
+	    s += rb_enc_codelen(c, enc);
 	}
-	s += rb_enc_codelen(c, enc);
     }
 
     ENC_CODERANGE_SET(str, cr);
casecmp.pat (1.53 KB, text/x-diff)
Index: string.c
===================================================================
--- string.c	(revision 19374)
+++ string.c	(working copy)
@@ -2067,19 +2067,33 @@
 
     p1 = RSTRING_PTR(str1); p1end = RSTRING_END(str1);
     p2 = RSTRING_PTR(str2); p2end = RSTRING_END(str2);
-    while (p1 < p1end && p2 < p2end) {
-	unsigned int c1 = rb_enc_codepoint(p1, p1end, enc);
-	unsigned int c2 = rb_enc_codepoint(p2, p2end, enc);
+    if (single_byte_optimizable(str1) && single_byte_optimizable(str2)) {
+	while (p1 < p1end && p2 < p2end) {
+	    if (*p1 != *p2) {
+		int c1 = rb_enc_toupper(*(unsigned char *)p1, enc);
+		int c2 = rb_enc_toupper(*(unsigned char *)p2, enc);
+		if (c1 > c2) return INT2FIX(1);
+		if (c1 < c2) return INT2FIX(-1);
+	    }
+	    p1++;
+	    p2++;
+	}
+    }
+    else {
+	while (p1 < p1end && p2 < p2end) {
+	    unsigned int c1 = rb_enc_codepoint(p1, p1end, enc);
+	    unsigned int c2 = rb_enc_codepoint(p2, p2end, enc);
 
-	if (c1 != c2) {
-	    c1 = rb_enc_toupper(c1, enc);
-	    c2 = rb_enc_toupper(c2, enc);
-	    if (c1 > c2) return INT2FIX(1);
-	    if (c1 < c2) return INT2FIX(-1);
+	    if (c1 != c2) {
+		c1 = rb_enc_toupper(c1, enc);
+		c2 = rb_enc_toupper(c2, enc);
+		if (c1 > c2) return INT2FIX(1);
+		if (c1 < c2) return INT2FIX(-1);
+	    }
+	    len = rb_enc_codelen(c1, enc);
+	    p1 += len;
+	    p2 += len;
 	}
-	len = rb_enc_codelen(c1, enc);
-	p1 += len;
-	p2 += len;
     }
     if (RSTRING_LEN(str1) == RSTRING_LEN(str2)) return INT2FIX(0);
     if (RSTRING_LEN(str1) > RSTRING_LEN(str2)) return INT2FIX(1);
codepoint.pat (8.95 KB, text/x-diff)
Index: encoding.c
===================================================================
--- encoding.c	(revision 19374)
+++ encoding.c	(working copy)
@@ -768,6 +768,22 @@
 	rb_raise(rb_eArgError, "invalid byte sequence in %s", rb_enc_name(enc));
 }
 
+/* As above, but also return character length */
+unsigned int
+rb_enc_codepoint_l(const char *p, const char *e, int *len, rb_encoding *enc)
+{
+    int r;
+    if (e <= p)
+        rb_raise(rb_eArgError, "empty string");
+    r = rb_enc_precise_mbclen(p, e, enc);
+    if (MBCLEN_CHARFOUND_P(r)) {
+	*len = r;
+        return rb_enc_mbc_to_codepoint(p, e, enc);
+    }
+    else
+	rb_raise(rb_eArgError, "invalid byte sequence in %s", rb_enc_name(enc));
+}
+
 int
 rb_enc_codelen(int c, rb_encoding *enc)
 {
Index: include/ruby/encoding.h
===================================================================
--- include/ruby/encoding.h	(revision 19374)
+++ include/ruby/encoding.h	(working copy)
@@ -121,6 +121,7 @@
 
 /* -> code or raise exception */
 unsigned int rb_enc_codepoint(const char *p, const char *e, rb_encoding *enc);
+unsigned int rb_enc_codepoint_l(const char *p, const char *e, int *len, rb_encoding *enc);
 #define rb_enc_mbc_to_codepoint(p, e, enc) ONIGENC_MBC_TO_CODE(enc,(UChar*)(p),(UChar*)(e))
 
 /* -> codelen>0 or raise exception */
Index: string.c
===================================================================
--- string.c.old	2008-09-16 13:00:12.000000000 +1000
+++ string.c	2008-09-16 13:05:11.000000000 +1000
@@ -112,23 +112,9 @@
 
 #define STR_ENC_GET(str) rb_enc_from_index(ENCODING_GET(str))
 
-static inline int
-single_byte_optimizable(VALUE str)
-{
-    rb_encoding *enc;
-
-    /* Conservative.  It may be ENC_CODERANGE_UNKNOWN. */
-    if (ENC_CODERANGE(str) == ENC_CODERANGE_7BIT)
-        return 1;
-
-    enc = STR_ENC_GET(str);
-    if (rb_enc_mbmaxlen(enc) == 1)
-        return 1;
-
-    /* Conservative.  Possibly single byte.
-     * "\xa1" in Shift_JIS for example. */
-    return 0;
-}
+/* Conservative.  Possibly single byte.
+ * "\xa1" in Shift_JIS for example. */
+#define single_byte_optimizable(str) (ENC_CODERANGE(str) == ENC_CODERANGE_7BIT || rb_enc_mbmaxlen(STR_ENC_GET(str)) == 1)
 
 VALUE rb_fs;
 
@@ -2076,7 +2062,7 @@
 static VALUE
 rb_str_casecmp(VALUE str1, VALUE str2)
 {
-    long len;
+    int len;
     rb_encoding *enc;
     char *p1, *p1end, *p2, *p2end;
 
@@ -2102,7 +2088,7 @@
     }
     else {
 	while (p1 < p1end && p2 < p2end) {
-	    unsigned int c1 = rb_enc_codepoint(p1, p1end, enc);
+	    unsigned int c1 = rb_enc_codepoint_l(p1, p1end, &len, enc);
 	    unsigned int c2 = rb_enc_codepoint(p2, p2end, enc);
 
 	    if (c1 != c2) {
@@ -2111,7 +2097,6 @@
 		if (c1 > c2) return INT2FIX(1);
 		if (c1 < c2) return INT2FIX(-1);
 	    }
-	    len = rb_enc_codelen(c1, enc);
 	    p1 += len;
 	    p2 += len;
 	}
@@ -3876,8 +3861,7 @@
         }
         n = MBCLEN_CHARFOUND_LEN(n);
 
-	c = rb_enc_codepoint(p, pend, enc);
-	n = rb_enc_codelen(c, enc);
+	c = rb_enc_codepoint_l(p, pend, &n, enc);
 
 	p += n;
 	if (c == '"'|| c == '\\' ||
@@ -4089,14 +4073,15 @@
     }
     else {
 	while (s < send) {
-	    unsigned int c = rb_enc_codepoint(s, send, enc);
+	    int clen;
+	    unsigned int c = rb_enc_codepoint_l(s, send, &clen, enc);
 
 	    if (rb_enc_islower(c, enc)) {
 		/* assuming toupper returns codepoint with same size */
 		rb_enc_mbcput(rb_enc_toupper(c, enc), s, enc);
 		modify = 1;
 	    }
-	    s += rb_enc_codelen(c, enc);
+	    s += clen;
 	}
     }
 
@@ -4160,14 +4145,15 @@
     }
     else {
 	while (s < send) {
-	    unsigned int c = rb_enc_codepoint(s, send, enc);
+	    int clen;
+	    unsigned int c = rb_enc_codepoint_l(s, send, &clen, enc);
 
 	    if (rb_enc_isupper(c, enc)) {
 		/* assuming tolower returns codepoint with same size */
 		rb_enc_mbcput(rb_enc_tolower(c, enc), s, enc);
 		modify = 1;
 	    }
-	    s += rb_enc_codelen(c, enc);
+	    s += clen;
 	}
     }
 
@@ -4220,25 +4206,26 @@
     int modify = 0;
     unsigned int c;
     int cr = ENC_CODERANGE(str);
+    int clen;
 
     rb_str_modify(str);
     enc = STR_ENC_GET(str);
     if (RSTRING_LEN(str) == 0 || !RSTRING_PTR(str)) return Qnil;
     s = RSTRING_PTR(str); send = RSTRING_END(str);
 
-    c = rb_enc_codepoint(s, send, enc);
+    c = rb_enc_codepoint_l(s, send, &clen, enc);
     if (rb_enc_islower(c, enc)) {
 	rb_enc_mbcput(rb_enc_toupper(c, enc), s, enc);
 	modify = 1;
     }
-    s += rb_enc_codelen(c, enc);
+    s += clen;
     while (s < send) {
-	c = rb_enc_codepoint(s, send, enc);
+	c = rb_enc_codepoint_l(s, send, &clen, enc);
 	if (rb_enc_isupper(c, enc)) {
 	    rb_enc_mbcput(rb_enc_tolower(c, enc), s, enc);
 	    modify = 1;
 	}
-	s += rb_enc_codelen(c, enc);
+	s += clen;
     }
 
     ENC_CODERANGE_SET(str, cr);
@@ -4306,7 +4293,8 @@
     }
     else {
 	while (s < send) {
-	    unsigned int c = rb_enc_codepoint(s, send, enc);
+	    int clen;
+	    unsigned int c = rb_enc_codepoint_l(s, send, &clen, enc);
 
 	    if (rb_enc_isupper(c, enc)) {
 		/* assuming toupper returns codepoint with same size */
@@ -4318,7 +4306,7 @@
 		rb_enc_mbcput(rb_enc_toupper(c, enc), s, enc);
 		modify = 1;
 	    }
-	    s += rb_enc_codelen(c, enc);
+	    s += clen;
 	}
     }
 
@@ -4359,19 +4347,21 @@
 static unsigned int
 trnext(struct tr *t, rb_encoding *enc)
 {
+    int len;
+
     for (;;) {
 	if (!t->gen) {
 	    if (t->p == t->pend) return -1;
 	    if (t->p < t->pend - 1 && *t->p == '\\') {
 		t->p++;
 	    }
-	    t->now = rb_enc_codepoint(t->p, t->pend, enc);
-	    t->p += rb_enc_codelen(t->now, enc);
+	    t->now = rb_enc_codepoint_l(t->p, t->pend, &len, enc);
+	    t->p += len;
 	    if (t->p < t->pend - 1 && *t->p == '-') {
 		t->p++;
 		if (t->p < t->pend) {
-		    unsigned int c = rb_enc_codepoint(t->p, t->pend, enc);
-		    t->p += rb_enc_codelen(c, enc);
+		    unsigned int c = rb_enc_codepoint_l(t->p, t->pend, &len, enc);
+		    t->p += len;
 		    if (t->now > c) continue;
 		    t->gen = 1;
 		    t->max = c;
@@ -4490,8 +4480,8 @@
 	char *buf = ALLOC_N(char, max), *t = buf;
 
 	while (s < send) {
-	    c0 = c = rb_enc_codepoint(s, send, enc);
-	    tlen = clen = rb_enc_codelen(c, enc);
+	    c0 = c = rb_enc_codepoint_l(s, send, &clen, enc);
+	    tlen = clen;
 
 	    s += clen;
 	    if (c < 256) {
@@ -4557,8 +4547,8 @@
 	char *buf = ALLOC_N(char, max), *t = buf;
 
 	while (s < send) {
-	    c0 = c = rb_enc_codepoint(s, send, enc);
-	    tlen = clen = rb_enc_codelen(c, enc);
+	    c0 = c = rb_enc_codepoint_l(s, send, &clen, enc);
+	    tlen = clen;
 
 	    if (c < 256) {
 		c = trans[c];
@@ -4764,8 +4754,8 @@
     if (!s || RSTRING_LEN(str) == 0) return Qnil;
     send = RSTRING_END(str);
     while (s < send) {
-	unsigned int c = rb_enc_codepoint(s, send, enc);
-	int clen = rb_enc_codelen(c, enc);
+	int clen;
+	unsigned int c = rb_enc_codepoint_l(s, send, &clen, enc);
 
 	if (tr_find(c, squeez, del, nodel)) {
 	    modify = 1;
@@ -4867,8 +4857,7 @@
 		s++;
 	    }
 	    else {
-		c = rb_enc_codepoint(s, send, enc);
-		clen = rb_enc_codelen(c, enc);
+		c = rb_enc_codepoint_l(s, send, &clen, enc);
 
 		if (c != save || (argc > 0 && !tr_find(c, squeez, del, nodel))) {
 		    if (t != s) rb_enc_mbcput(c, t, enc);
@@ -5008,8 +4997,7 @@
 	    s++;
 	}
 	else {
-	    c = rb_enc_codepoint(s, send, enc);
-	    clen = rb_enc_codelen(c, enc);
+	    c = rb_enc_codepoint_l(s, send, &clen, enc);
 	    if (tr_find(c, table, del, nodel)) {
 		i++;
 	    }
@@ -5131,11 +5119,12 @@
 	char *bptr = ptr;
 	int skip = 1;
 	unsigned int c;
+	int clen;
 
 	end = beg;
 	while (ptr < eptr) {
-	    c = rb_enc_codepoint(ptr, eptr, enc);
-	    ptr += rb_enc_mbclen(ptr, eptr, enc);
+	    c = rb_enc_codepoint_l(ptr, eptr, &clen, enc);
+	    ptr += clen;
 	    if (skip) {
 		if (rb_enc_isspace(c, enc)) {
 		    beg = ptr - bptr;
@@ -5362,13 +5351,12 @@
     }
 
     while (p < pend) {
-	unsigned int c = rb_enc_codepoint(p, pend, enc);
+	unsigned int c = rb_enc_codepoint_l(p, pend, &n, enc);
 
       again:
-	n = rb_enc_codelen(c, enc);
 	if (rslen == 0 && c == newline) {
 	    p += n;
-	    if (p < pend && (c = rb_enc_codepoint(p, pend, enc)) != newline) {
+	    if (p < pend && (c = rb_enc_codepoint_l(p, pend, &n, enc)) != newline) {
 		goto again;
 	    }
 	    while (p < pend && rb_enc_codepoint(p, pend, enc) == newline) {
@@ -5715,10 +5703,11 @@
     e = t = RSTRING_END(str);
     /* remove spaces at head */
     while (s < e) {
-	unsigned int cc = rb_enc_codepoint(s, e, enc);
+	int clen;
+	unsigned int cc = rb_enc_codepoint_l(s, e, &clen, enc);
 	
 	if (!rb_enc_isspace(cc, enc)) break;
-	s += rb_enc_codelen(cc, enc);
+	s += clen;
     }
 
     if (s > RSTRING_PTR(str)) {
@@ -5787,7 +5776,8 @@
 	while (s < t && rb_enc_isspace(*(t-1), enc)) t--;
     } else {
         while (s < e) {
-	    unsigned int cc = rb_enc_codepoint(s, e, enc);
+	    int clen;
+	    unsigned int cc = rb_enc_codepoint_l(s, e, &clen, enc);
 
 	    if (!cc || rb_enc_isspace(cc, enc)) {
 	        if (!space_seen) t = s;
@@ -5796,7 +5786,7 @@
 	    else {
 	        space_seen = Qfalse;
 	    }
-	    s += rb_enc_codelen(cc, enc);
+	    s += clen;
 	}
 	if (!space_seen) t = s;
     }
