From: "naruse (Yui NARUSE)" <naruse@...>
Date: 2013-02-18T20:53:46+09:00
Subject: [ruby-core:52437] [ruby-trunk - RubySpec #7282] Invalid UTF-8 from emoji allowed through silently

Issue #7282 has been updated by naruse (Yui NARUSE).

Tracker changed from Bug to RubySpec

headius (Charles Nutter) wrote:
> duerst (Martin Dürst) wrote:
> > > Nor does character-walking:
> > >
> > > system ~/projects/jruby $ ruby-2.0.0 -e '"Hello, \x96 world!".each_char {|x| print x}'
> > > Hello, ? world!
> > >
> > > Nor does []:
> > >
> > > system ~/projects/jruby $ ruby-2.0.0 -e 'p "Hello, \x96 world!"[7]'
> > > "\x96"
> >
> > The underlying machinery is the same.
>
> Makes sense. JRuby also allows these cases through. Perhaps both cases should fail once they encounter a non-7bit, non-surrogate byte like \x96?

On string index access, Ruby doesn't raise an error even if the string contains an invalid byte sequence.

> > > system ~/projects/jruby $ ruby-2.0.0 -e '"Hello, \x96 world!".match /.+/'
> > > -e:1:in `match': invalid byte sequence in UTF-8 (ArgumentError)
> > >         from -e:1:in `match'
> > >         from -e:1:in `<main>'
> >
> > We'd need to dig in the code to figure out why it happens here.
>
> Well, at the very least it would have to be using the encoding subsystem for Oniguruma/Onigmo to walk characters; that logic almost certainly rejects \x96.

On regexp match, Ruby raises an error.

----------------------------------------
RubySpec #7282: Invalid UTF-8 from emoji allowed through silently
https://bugs.ruby-lang.org/issues/7282#change-36498

Author: headius (Charles Nutter)
Status: Assigned
Priority: Normal
Assignee: naruse (Yui NARUSE)
Category: M17N
Target version: 2.0.0

On my system, where the default encoding is UTF-8, the following should not parse:

ruby-2.0.0 -e 'p "Hello, \x96 world!\"}'

But it does.
And it is apparently marked as "ok" as far as code range goes, because encoding to UTF-8 does not catch the problem:

system ~/projects/jruby $ ruby-1.9.3 -e 'p "{\"sample\": \"Hello, \x96 world!\"}".encode("UTF-8")'
"{\"sample\": \"Hello, \x96 world!\"}"

system ~/projects/jruby $ ruby-2.0.0 -e 'p "{\"sample\": \"Hello, \x96 world!\"}".encode("UTF-8")'
"{\"sample\": \"Hello, \x96 world!\"}"

Nor does character-walking:

system ~/projects/jruby $ ruby-1.9.3 -e '"Hello, \x96 world!".each_char {|x| print x}'
Hello, ? world!

system ~/projects/jruby $ ruby-2.0.0 -e '"Hello, \x96 world!".each_char {|x| print x}'
Hello, ? world!

Nor does []:

system ~/projects/jruby $ ruby-1.9.3 -e 'p "Hello, \x96 world!"[7]'
"\x96"

system ~/projects/jruby $ ruby-1.9.3 -e 'p "Hello, \x96 world!"[8]'
" "

system ~/projects/jruby $ ruby-2.0.0 -e 'p "Hello, \x96 world!"[7]'
"\x96"

system ~/projects/jruby $ ruby-2.0.0 -e 'p "Hello, \x96 world!"[8]'
" "

But the malformed String does get caught by transcoding to UTF-16:

system ~/projects/jruby $ ruby-1.9.3 -e 'p "{\"sample\": \"Hello, \x96 world!\"}".encode("UTF-16")'
-e:1:in `encode': "\x96" on UTF-8 (Encoding::InvalidByteSequenceError)
        from -e:1:in `<main>'

system ~/projects/jruby $ ruby-2.0.0 -e 'p "{\"sample\": \"Hello, \x96 world!\"}".encode("UTF-16")'
-e:1:in `encode': "\x96" on UTF-8 (Encoding::InvalidByteSequenceError)
        from -e:1:in `<main>'

Or by doing a simple regexp match:

system ~/projects/jruby $ ruby-1.9.3 -e '"Hello, \x96 world!".match /.+/'
-e:1:in `match': invalid byte sequence in UTF-8 (ArgumentError)
        from -e:1:in `match'
        from -e:1:in `<main>'

system ~/projects/jruby $ ruby-2.0.0 -e '"Hello, \x96 world!".match /.+/'
-e:1:in `match': invalid byte sequence in UTF-8 (ArgumentError)
        from -e:1:in `match'
        from -e:1:in `<main>'

And of course I am ignoring the fact that it should never have parsed to begin with. This kind of inconsistency in rejecting malformed UTF-8 does not inspire a lot of confidence.
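
For anyone who wants to reproduce the comparison in one script rather than several one-liners, here is a rough sketch. The `probe` helper and the `results` hash are hypothetical names made up for this illustration, not part of Ruby or RubySpec. The script only records whether each operation returns or raises; which ones raise may differ between implementations and versions, so no expected output is listed.

```ruby
# Hypothetical helper (not part of Ruby): run a block and report either
# its return value or the class of the exception it raised.
def probe
  [:ok, yield]
rescue StandardError => e
  [:error, e.class]
end

# 0x96 is a UTF-8 continuation byte with no lead byte before it, so this
# string is tagged as UTF-8 but is not well-formed UTF-8.
sample = "Hello, \x96 world!".b.force_encoding(Encoding::UTF_8)

results = {
  valid_encoding: probe { sample.valid_encoding? }, # explicit validity check
  index_7:        probe { sample[7] },              # character index access
  index_8:        probe { sample[8] },
  to_utf8:        probe { sample.encode("UTF-8") }, # same-encoding "transcode"
  to_utf16:       probe { sample.encode("UTF-16") },
  regexp_match:   probe { sample.match(/.+/) },
}

results.each do |name, (status, value)|
  puts "#{name}: #{status} #{value.inspect}"
end
```

String#valid_encoding? is the one call here whose purpose is to answer the question directly, so code that must reject malformed input can check it up front instead of relying on a later operation to raise.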
JRuby allows it through the parser (this is a bug) but does fail in other places because the string is malformed.

--
http://bugs.ruby-lang.org/