ruby-dev

[ruby-dev:47240] [ruby-trunk - Feature #6752] Replacing ill-formed subsequence

From: "naruse (Yui NARUSE)" <naruse@...>

Date: 2013-04-08 19:38:13 UTC

List: ruby-dev #47240

Issue #6752 has been updated by naruse (Yui NARUSE).


duerst (Martin Dürst) wrote:
> I have thought about this a bit. Yui's patch to string treats this as a problem separate from transcoding. I think it is preferable to use the transcoding logic to implement this. The advantage is that exactly the same options and fallbacks can be used, and if we add a new option or fallback to transcode, it will be usable, too.

This method doesn't need the same options and fallbacks.
It needs only the invalid-related ones, not the undef-related ones.
Moreover, a transcoder is usable only if Ruby has a transcoder for the target encoding.
But Ruby has some encodings that have no transcoder, for example emacs-mule.
Therefore this can't be built on the transcoder.
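
Whether a transcoder exists for a given pair can be probed from Ruby. The following is only a minimal sketch using the standard Encoding::Converter API; transcoder_available? is a hypothetical helper name, and no result is claimed for Emacs-Mule because it depends on the Ruby build in use:

```ruby
# Hypothetical helper (not part of Ruby): true when Ruby can build a
# converter from +from+ to +to+, false when no transcoder is registered.
def transcoder_available?(from, to)
  Encoding::Converter.new(from, to)
  true
rescue Encoding::ConverterNotFoundError
  false
end

puts transcoder_available?("EUC-JP", "UTF-8")
puts transcoder_available?("Emacs-Mule", "UTF-8")
```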

> Some more notes: The checks for converting from one encoding to the same encoding are in str_transcode0. Anywhere else? We need some data to drive the conversion, but this should be easy to generate, and will be the same for many 8-bit encodings.

Yes, I arrived at str_transcode0 as well, and it is the correct place.

The data we need is the problem.
transcode doesn't have all the data, though tool/transcode-tblgen.rb has some base data.
The only place that has all the data we need is enc/*.

> It will be easy to catch invalid byte sequences, but I'm not sure it's worth checking unassigned codepoints, at least not in Unicode.

If we need to check unassigned codepoints, we must define encodings more strictly.
Even for Unicode, that would require version information.
I don't think it's worth checking.
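
The difference between an unassigned codepoint and an ill-formed byte sequence can be seen from Ruby. This is an illustrative sketch only; it assumes U+0378 is still unassigned in the Unicode version the reader has in mind:

```ruby
# U+0378 is assumed here to be an unassigned codepoint, yet its UTF-8
# byte sequence is well-formed, so Ruby accepts it.
unassigned = "\u0378"
puts unassigned.valid_encoding?   # true

# A lone continuation byte is ill-formed in every Unicode version.
ill_formed = "\x80"
puts ill_formed.valid_encoding?   # false
```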
----------------------------------------
Feature #6752: Replacing ill-formed subsequence
https://bugs.ruby-lang.org/issues/6752#change-38370

Author: naruse (Yui NARUSE)
Status: Assigned
Priority: Normal
Assignee: matz (Yukihiro Matsumoto)
Category: core
Target version: next minor


=begin
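== Summary
When a String contains an invalid byte sequence for some reason, I want to replace it with a replacement character.

== Use cases
The use cases actually observed so far are as follows.
* Twitter titles
* IRC logs
* The Niconico Douga API
* Web crawling
These invalid byte sequences are probably produced when a string is truncated at a byte offset,
cutting its last character and leaving an invalid string with a broken tail (the first two cases).
When such strings are put into a container or concatenated, strings with invalid bytes in the middle are produced as well (the latter two cases).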
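
The two generation paths described above can be reproduced with a small sketch. The sample text and the cut position are arbitrary choices for illustration:

```ruby
text = "\u3042\u3044\u3046"   # three characters, 3 bytes each in UTF-8
cut  = text.byteslice(0, 7)   # two whole characters plus one stray lead byte
puts cut.valid_encoding?      # false: the tail is broken

joined = cut + "abc"          # the stray byte now sits in the middle
puts joined.valid_encoding?   # false
```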

* https://twitter.com/takahashim/status/18974040397
* https://twitter.com/n0kada/status/215674740705210368
* https://twitter.com/n0kada/status/215686490070585346
* https://twitter.com/hajimehoshi/status/215671146769682432
* http://po-ru.com/diary/fixing-invalid-utf-8-in-ruby-revisited/
* http://stackoverflow.com/questions/2982677/ruby-1-9-invalid-byte-sequence-in-utf-8

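== Required argument: replacement character
Optional; a String.
The default is U+FFFD for Unicode encodings and "?" otherwise.
The default is not the empty string because deleting the invalid bytes can create a token that did not exist before,
which can lead to vulnerabilities in higher layers.
http://unicode.org/reports/tr36/#UTF-8_Exploit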
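
The token-creation risk can be illustrated with String#scrub, which is available in Ruby 2.1 and later and is used here only as a stand-in for the method proposed in this ticket. The markup string is an invented example:

```ruby
input = "<scr\x80ipt>"        # an invalid byte splits the word

deleted  = input.scrub("")    # dropping the byte fuses a new token
replaced = input.scrub        # default replacement for UTF-8 is U+FFFD

puts deleted                      # <script>
puts replaced.include?("script")  # false
```

With the empty string as the replacement, a filter that ran before the cleanup would never have seen the fused token.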

== API
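--- str.encode(str.encoding, invalid: :replace, [replace: "〓"])
* It is not CSI-like (code set independent), which feels wrong
* With iconv this is possible only with glibc iconv or GNU libiconv when //IGNORE is given; other implementations cannot do it
* As described below, and contrary to intuition, it has little implementation benefit (I think)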
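
For reference, a sketch of how the existing invalid: and replace: options of String#encode behave when the source and destination encodings differ. The byte values are chosen for illustration. The same-encoding call at the end is the form discussed above; its result has differed between Ruby versions, so no output is claimed for it:

```ruby
# "\xA4\xA2" is HIRAGANA LETTER A in EUC-JP; "\xFF" is not a valid EUC-JP byte.
euc = "\xA4\xA2\xFF".dup.force_encoding(Encoding::EUC_JP)
puts euc.valid_encoding?      # false

utf8 = euc.encode(Encoding::UTF_8, invalid: :replace, replace: "?")
puts utf8 == "\u3042?"        # true

# Same-encoding form from the API line above (result is version-dependent).
broken = "abc\x80"
same = broken.encode(broken.encoding, invalid: :replace, replace: "?")
puts same.encoding
```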

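== A separate method
* It would be a new method
* The name would be something like fix/repair invalid/illegal bytes/sequence

== Implementation
=== Oniguruma-based
After int ret = rb_enc_precise_mbclen(p, e, enc);,
when MBCLEN_INVALID_P(ret) is true there is no way to tell which byte is invalid, which is awkward.
The cause is that ONIGENC_CONSTRUCT_MBCLEN_INVALID() does not take a byte count;
changing that would affect every Oniguruma encoding module, so it is hard to fix.
It can be worked around at the cost of efficiency, assuming invalid bytes are rare.

=== transcode-based
Unlike glibc iconv, GNU libiconv, Perl Encode and others, which normalize through UCS,
the CSI-style transcode would need, in order to convert an encoding to itself,
a "do nothing" conversion module for each encoding.


For now I am attaching an Oniguruma-based proof-of-concept implementation and tests.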

 diff --git a/string.c b/string.c
 index d038835..4808f15 100644
 --- a/string.c
 +++ b/string.c
 @@ -7426,6 +7426,199 @@ rb_str_ellipsize(VALUE str, long len)
      return ret;
  }
  
 +/*
 + *  call-seq:
 + *    str.fix_invalid -> new_str
 + *
 + *  If the string is well-formed, it returns a copy of self.
 + *  If the string has an invalid byte sequence, it repairs it with the
 + *  replacement character.
 + */
 +VALUE
 +rb_str_fix_invalid(VALUE str)
 +{
 +    int cr = ENC_CODERANGE(str);
 +    rb_encoding *enc;
 +    if (cr == ENC_CODERANGE_7BIT || cr == ENC_CODERANGE_VALID)
 +	return rb_str_dup(str);
 +
 +    enc = STR_ENC_GET(str);
 +    if (rb_enc_asciicompat(enc)) {
 +	const char *p = RSTRING_PTR(str);
 +	const char *e = RSTRING_END(str);
 +	const char *p1 = p;
 +	/* 10 should be enough for the usual use case,
 +	 * fixing a wrongly chopped character at the end of the string
 +	 */
 +	long room = 10;
 +	VALUE buf = rb_str_buf_new(RSTRING_LEN(str) + room);
 +	const char *rep;
 +	if (enc == rb_utf8_encoding())
 +	    rep = "\xEF\xBF\xBD";
 +	else
 +	    rep = "?";
 +	cr = ENC_CODERANGE_7BIT;
 +
 +	p = search_nonascii(p, e);
 +	if (!p) {
 +	    p = e;
 +	}
 +	while (p < e) {
 +	    int ret = rb_enc_precise_mbclen(p, e, enc);
 +	    if (MBCLEN_CHARFOUND_P(ret)) {
 +		if ((unsigned char)*p > 127) cr = ENC_CODERANGE_VALID;
 +		p += MBCLEN_CHARFOUND_LEN(ret);
 +	    }
 +	    else if (MBCLEN_INVALID_P(ret)) {
 +		const char *q;
 +		long clen = rb_enc_mbmaxlen(enc);
 +		if (p > p1) rb_str_buf_cat(buf, p1, p - p1);
 +		q = RSTRING_END(buf);
 +
 +		if (e - p < clen) clen = e - p;
 +		if (clen < 3) {
 +		    clen = 1;
 +		}
 +		else {
 +		    long len = RSTRING_LEN(buf);
 +		    clen--;
 +		    rb_str_buf_cat(buf, p, clen);
 +		    for (; clen > 1; clen--) {
 +			ret = rb_enc_precise_mbclen(q, q + clen, enc);
 +			if (MBCLEN_NEEDMORE_P(ret)) {
 +			    break;
 +			}
 +			else if (MBCLEN_INVALID_P(ret)) {
 +			    continue;
 +			}
 +			else {
 +			    rb_bug("shouldn't reach here '%s'", q);
 +			}
 +		    }
 +		    rb_str_set_len(buf, len);
 +		}
 +		p += clen;
 +		p1 = p;
 +		rb_str_buf_cat2(buf, rep);
 +		p = search_nonascii(p, e);
 +		if (!p) {
 +		    p = e;
 +		    break;
 +		}
 +	    }
 +	    else if (MBCLEN_NEEDMORE_P(ret)) {
 +		break;
 +	    }
 +	    else {
 +		rb_bug("shouldn't reach here");
 +	    }
 +	}
 +	if (p1 < p) {
 +	    rb_str_buf_cat(buf, p1, p - p1);
 +	}
 +	if (p < e) {
 +	    rb_str_buf_cat2(buf, rep);
 +	    cr = ENC_CODERANGE_VALID;
 +	}
 +	ENCODING_CODERANGE_SET(buf, rb_enc_to_index(enc), cr);
 +	return buf;
 +    }
 +    else if (rb_enc_dummy_p(enc)) {
 +	return rb_str_dup(str);
 +    }
 +    else {
 +	/* ASCII incompatible */
 +	const char *p = RSTRING_PTR(str);
 +	const char *e = RSTRING_END(str);
 +	const char *p1 = p;
 +	/* 10 should be enough for the usual use case,
 +	 * fixing a wrongly chopped character at the end of the string
 +	 */
 +	long room = 10;
 +	VALUE buf = rb_str_buf_new(RSTRING_LEN(str) + room);
 +	const char *rep;
 +	long mbminlen = rb_enc_mbminlen(enc);
 +	static rb_encoding *utf16be;
 +	static rb_encoding *utf16le;
 +	static rb_encoding *utf32be;
 +	static rb_encoding *utf32le;
 +	if (!utf16be) {
 +	    utf16be = rb_enc_find("UTF-16BE");
 +	    utf16le = rb_enc_find("UTF-16LE");
 +	    utf32be = rb_enc_find("UTF-32BE");
 +	    utf32le = rb_enc_find("UTF-32LE");
 +	}
 +	if (enc == utf16be) {
 +	    rep = "\xFF\xFD";
 +	}
 +	else if (enc == utf16le) {
 +	    rep = "\xFD\xFF";
 +	}
 +	else if (enc == utf32be) {
 +	    rep = "\x00\x00\xFF\xFD";
 +	}
 +	else if (enc == utf32le) {
 +	    rep = "\xFD\xFF\x00\x00";
 +	}
 +	else {
 +	    rep = "?";
 +	}
 +
 +	while (p < e) {
 +	    int ret = rb_enc_precise_mbclen(p, e, enc);
 +	    if (MBCLEN_CHARFOUND_P(ret)) {
 +		p += MBCLEN_CHARFOUND_LEN(ret);
 +	    }
 +	    else if (MBCLEN_INVALID_P(ret)) {
 +		const char *q;
 +		long clen = rb_enc_mbmaxlen(enc);
 +		if (p > p1) rb_str_buf_cat(buf, p1, p - p1);
 +		q = RSTRING_END(buf);
 +
 +		if (e - p < clen) clen = e - p;
 +		if (clen < mbminlen * 3) {
 +		    clen = mbminlen;
 +		}
 +		else {
 +		    long len = RSTRING_LEN(buf);
 +		    clen -= mbminlen;
 +		    rb_str_buf_cat(buf, p, clen);
 +		    for (; clen > mbminlen; clen-=mbminlen) {
 +			ret = rb_enc_precise_mbclen(q, q + clen, enc);
 +			if (MBCLEN_NEEDMORE_P(ret)) {
 +			    break;
 +			}
 +			else if (MBCLEN_INVALID_P(ret)) {
 +			    continue;
 +			}
 +			else {
 +			    rb_bug("shouldn't reach here '%s'", q);
 +			}
 +		    }
 +		    rb_str_set_len(buf, len);
 +		}
 +		p += clen;
 +		p1 = p;
 +		rb_str_buf_cat2(buf, rep);
 +	    }
 +	    else if (MBCLEN_NEEDMORE_P(ret)) {
 +		break;
 +	    }
 +	    else {
 +		rb_bug("shouldn't reach here");
 +	    }
 +	}
 +	if (p1 < p) {
 +	    rb_str_buf_cat(buf, p1, p - p1);
 +	}
 +	if (p < e) {
 +	    rb_str_buf_cat2(buf, rep);
 +	}
 +	ENCODING_CODERANGE_SET(buf, rb_enc_to_index(enc), ENC_CODERANGE_VALID);
 +	return buf;
 +    }
 +}
 +
  /**********************************************************************
   * Document-class: Symbol
   *
 @@ -7882,6 +8075,7 @@ Init_String(void)
      rb_define_method(rb_cString, "getbyte", rb_str_getbyte, 1);
      rb_define_method(rb_cString, "setbyte", rb_str_setbyte, 2);
      rb_define_method(rb_cString, "byteslice", rb_str_byteslice, -1);
 +    rb_define_method(rb_cString, "fix_invalid", rb_str_fix_invalid, 0);
  
      rb_define_method(rb_cString, "to_i", rb_str_to_i, -1);
      rb_define_method(rb_cString, "to_f", rb_str_to_f, 0);
 diff --git a/test/ruby/test_string.rb b/test/ruby/test_string.rb
 index 47f349c..2b0cfeb 100644
 --- a/test/ruby/test_string.rb
 +++ b/test/ruby/test_string.rb
 @@ -2031,6 +2031,29 @@ class TestString < Test::Unit::TestCase
  
      assert_equal(u("\x82")+("\u3042"*9), ("\u3042"*10).byteslice(2, 28))
    end
 +
 +  def test_fix_invalid
 +    assert_equal("\uFFFD\uFFFD\uFFFD", "\x80\x80\x80".fix_invalid)
 +    assert_equal("\uFFFDA", "\xF4\x80\x80A".fix_invalid)
 +
 +    # examples in Unicode 6.1.0 D93b
 +    assert_equal("\x41\uFFFD\uFFFD\x41\uFFFD\x41",
 +                 "\x41\xC0\xAF\x41\xF4\x80\x80\x41".fix_invalid)
 +    assert_equal("\x41\uFFFD\uFFFD\uFFFD\x41",
 +                 "\x41\xE0\x9F\x80\x41".fix_invalid)
 +    assert_equal("\u0061\uFFFD\uFFFD\uFFFD\u0062\uFFFD\u0063\uFFFD\uFFFD\u0064",
 +                 "\x61\xF1\x80\x80\xE1\x80\xC2\x62\x80\x63\x80\xBF\x64".fix_invalid)
 +
 +    assert_equal("abcdefghijklmnopqrstuvwxyz\u0061\uFFFD\uFFFD\uFFFD\u0062\uFFFD\u0063\uFFFD\uFFFD\u0064",
 +                 "abcdefghijklmnopqrstuvwxyz\x61\xF1\x80\x80\xE1\x80\xC2\x62\x80\x63\x80\xBF\x64".fix_invalid)
 +
 +    assert_equal("\uFFFD\u3042".encode("UTF-16BE"),
 +                 "\xD8\x00\x30\x42".force_encoding(Encoding::UTF_16BE).
 +                 fix_invalid)
 +    assert_equal("\uFFFD\u3042".encode("UTF-16LE"),
 +                 "\x00\xD8\x42\x30".force_encoding(Encoding::UTF_16LE).
 +                 fix_invalid)
 +  end
  end
  
  class TestString2 < TestString
=end



-- 
http://bugs.ruby-lang.org/
