[ruby-core:70807] [Ruby trunk - Bug #11522] URI::decode returns incorrectly encoding strings
From: duerst@...
Date: 2015-09-15 02:40:36 UTC
List: ruby-core #70807
Issue #11522 has been updated by Martin Dürst.

Nobuyoshi Nakada wrote:
> It has no hints for encoding.

In theory, that's correct. In practice, there are several better possibilities.

1) We can add an additional parameter that indicates the encoding.
2) We can default to UTF-8. That's because most URIs that contain non-ASCII byte values these days are based on UTF-8, and their percentage is increasing steadily.
3) We can check whether using UTF-8 makes sense or not. If the bytes are valid UTF-8, then the chance that they are anything else than UTF-8 is virtually 0.

1) and 2) are already done by CGI.unescape. But 3) isn't. Also, CGI.unescape changes '+' to ' ', which is desirable in some contexts (query parts in http(s) URIs), but not in others (e.g. mailto URIs).

----------------------------------------
Bug #11522: URI::decode returns incorrectly encoding strings
https://bugs.ruby-lang.org/issues/11522#change-54190

* Author: Charlie Anderson
* Status: Rejected
* Priority: Normal
* Assignee: akira yamada
* ruby -v: ruby 2.2.2p95 (2015-04-13 revision 50295) [x86_64-linux]
* Backport: 2.0.0: UNKNOWN, 2.1: UNKNOWN, 2.2: UNKNOWN
----------------------------------------
When given unicode characters to encode and decode, the URI module returns a string with an invalid encoding.

~~~
irb(main):026:0* unicode = 'œ´å∑®´ß∂†≈©ƒç˙©√∆˙∫˚∆~¬'
=> "œ´å∑®´ß∂†≈©ƒç˙©√∆˙∫˚∆~¬"
irb(main):027:0> unicode.encoding
=> #<Encoding:UTF-8>
irb(main):028:0> unicode.valid_encoding?
=> true
irb(main):029:0> encoded = URI::encode(unicode)
=> "%C5%93%C2%B4%C3%A5%E2%88%91%C2%AE%C2%B4%C3%9F%E2%88%82%E2%80%A0%E2%89%88%C2%A9%C6%92%C3%A7%CB%99%C2%A9%E2%88%9A%E2%88%86%CB%99%E2%88%AB%CB%9A%E2%88%86~%C2%AC"
irb(main):030:0> encoded.encoding
=> #<Encoding:US-ASCII>
irb(main):031:0> encoded.valid_encoding?
=> true
irb(main):032:0> decoded = URI::decode(encoded)
=> "\xC5\x93\xC2\xB4\xC3\xA5\xE2\x88\x91\xC2\xAE\xC2\xB4\xC3\x9F\xE2\x88\x82\xE2\x80\xA0\xE2\x89\x88\xC2\xA9\xC6\x92\xC3\xA7\xCB\x99\xC2\xA9\xE2\x88\x9A\xE2\x88\x86\xCB\x99\xE2\x88\xAB\xCB\x9A\xE2\x88\x86~\xC2\xAC"
irb(main):033:0> decoded.encoding
=> #<Encoding:US-ASCII>
irb(main):034:0> decoded.valid_encoding?
=> false
~~~

I would expect decoded to have a valid encoding - probably as UTF-8?

-- 
https://bugs.ruby-lang.org/
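
For illustration, possibility 3) from the reply above (tag the decoded bytes as UTF-8 only when they form valid UTF-8) might be sketched as follows. This is a minimal sketch, not what URI or CGI actually do: the helper name `decode_guessing_utf8` is hypothetical, the percent-decoding is done by hand so the example does not depend on `URI::decode`, and falling back to a binary string is an assumption about what a caller would want.

```ruby
# Hypothetical helper (not part of URI or CGI): percent-decode a string,
# then label the result UTF-8 only if the bytes are valid UTF-8.
def decode_guessing_utf8(str)
  # Work on a binary copy so each %XX escape becomes exactly one raw byte.
  bytes = str.b.gsub(/%(\h\h)/) { $1.hex.chr }

  # Tag a copy as UTF-8 and keep it only if the byte sequence is valid.
  utf8 = bytes.dup.force_encoding(Encoding::UTF_8)
  utf8.valid_encoding? ? utf8 : bytes
end

decoded = decode_guessing_utf8("%C5%93%C2%B4")
p decoded.encoding         # the bytes are valid UTF-8, so this is UTF-8
p decoded.valid_encoding?

# A lone 0xE9 byte (e.g. Latin-1 "e" with acute) is not valid UTF-8,
# so the result stays a binary string instead of being mislabeled.
p decode_guessing_utf8("%E9").encoding
```

Unlike CGI.unescape, this sketch leaves '+' untouched, which matches the concern about mailto URIs raised in the reply.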