From: "Martin J. Dürst"
Date: 2012-11-06T15:06:56+09:00
Subject: [ruby-core:48963] Re: [ruby-trunk - Bug #7282][Open] Invalid UTF-8 from emoji allowed through silently

Hello Charles,

On 2012/11/06 11:51, headius (Charles Nutter) wrote:
>
> Issue #7282 has been reported by headius (Charles Nutter).
>
> ----------------------------------------
> Bug #7282: Invalid UTF-8 from emoji allowed through silently
> https://bugs.ruby-lang.org/issues/7282
>
> Author: headius (Charles Nutter)
> Status: Open
> Priority: Normal
> Assignee:
> Category:
> Target version:
> ruby -v: 2.0.0
>
>
> On my system, where the default encoding is UTF-8, the following should not parse:
>
> ruby-2.0.0 -e 'p "Hello, \x96 world!\"}'

It doesn't. It should be
ruby-2.0.0 -e 'p "Hello, \x96 world!"}'
or
ruby-2.0.0 -e 'p "Hello, \x96 world!\"}"'
or
ruby-2.0.0 -e 'p "Hello, \x96 world!"'
or some such. But apart from that, you are right.

I'm no longer sure, but I think at some point there was an argument for allowing \x in UTF-8 literals, and a reason not to check. But I can't remember what it was, and if we can't remember, then we'd better make it check.

> But it does. And it is apparently marked as "ok" as far as code range goes, because encoding to UTF-8 does not catch the problem:
> system ~/projects/jruby $ ruby-2.0.0 -e 'p "{\"sample\": \"Hello, \x96 world!\"}".encode("UTF-8")'
> "{\"sample\": \"Hello, \x96 world!\"}"

Encoding to the encoding you're already in is a no-op. See also https://bugs.ruby-lang.org/issues/6321.

> Nor does character-walking:
> system ~/projects/jruby $ ruby-2.0.0 -e '"Hello, \x96 world!".each_char {|x| print x}'
> Hello, ? world!
>
> Nor does []:
> system ~/projects/jruby $ ruby-2.0.0 -e 'p "Hello, \x96 world!"[7]'
> "\x96"

The underlying machinery is the same.

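To make the no-op point concrete, here is a minimal sketch. It assumes a UTF-8 source encoding and uses only String#valid_encoding?, String#encode and String#bytes; the variable names are illustrative and not from the bug report, and details may differ between Ruby versions.

```ruby
# Sketch only: illustrates the behaviour discussed above, not the interpreter internals.
# force_encoding makes the UTF-8 tag explicit regardless of how the script is run.
s = "Hello, \x96 world!".dup.force_encoding(Encoding::UTF_8)

# 0x96 is a UTF-8 continuation byte (10xxxxxx) with no lead byte before it,
# so the string is malformed. An explicit check reports this.
puts "valid_encoding? => #{s.valid_encoding?}"

# Encoding to the encoding the string is already tagged with does no
# transcoding work, so the bad byte passes through untouched.
same = s.encode("UTF-8")
puts "bytes unchanged => #{same.bytes == s.bytes}"
puts "still invalid   => #{!same.valid_encoding?}"
```

In other words, calling encode("UTF-8") on a string already tagged as UTF-8 is not a validity check; valid_encoding? is the call that actually inspects the bytes.
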
> But the malformed String does get caught by transcoding to UTF-16:
> system ~/projects/jruby $ ruby-2.0.0 -e 'p "{\"sample\": \"Hello, \x96 world!\"}".encode("UTF-16")'
> -e:1:in `encode': "\x96" on UTF-8 (Encoding::InvalidByteSequenceError)
> from -e:1:in `<main>'

Yes, here you're actually transcoding, so this is checked.

> Or by doing a simple regexp match:
> system ~/projects/jruby $ ruby-2.0.0 -e '"Hello, \x96 world!".match /.+/'
> -e:1:in `match': invalid byte sequence in UTF-8 (ArgumentError)
> from -e:1:in `match'
> from -e:1:in `<main>'

We'd need to dig in the code to figure out why it happens here.

> And of course I am ignoring the fact that it should never have parsed to begin with.
>
> This kind of inconsistency in rejecting malformed UTF-8 does not inspire a lot of confidence.
>
> JRuby allows it through the parser (this is a bug) but does fail in other places because the string is malformed.

Overall, the idea (I think) is to strike a balance between efficiency and correctness. But checking at parsing time would probably be a rather efficient way of avoiding errors.

Regards, Martin.
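
As a rough illustration of checking once, up front, rather than relying on whichever later operation happens to notice, the sketch below validates a string at a single point. The helper name ensure_valid_encoding! is hypothetical and not part of Ruby; it only wraps String#valid_encoding?. This is user-level code, not a description of what the parser does or would do.

```ruby
# Hypothetical helper (not a Ruby built-in): reject malformed strings at one
# well-defined point instead of at the first operation that happens to check.
def ensure_valid_encoding!(str)
  return str if str.valid_encoding?
  raise ArgumentError, "invalid byte sequence in #{str.encoding}"
end

good = "Hello, world!".dup.force_encoding(Encoding::UTF_8)
bad  = "Hello, \x96 world!".dup.force_encoding(Encoding::UTF_8)

# A well-formed string is returned unchanged.
ensure_valid_encoding!(good)

# A malformed string is rejected immediately.
begin
  ensure_valid_encoding!(bad)
rescue ArgumentError => e
  puts "rejected: #{e.message}"
end

# For comparison: real transcoding has to decode every byte, so it reports
# the same problem on its own, as in the UTF-16 example quoted above.
begin
  bad.encode("UTF-16")
rescue Encoding::InvalidByteSequenceError => e
  puts "transcoding failed: #{e.class}"
end
```

The trade-off is the one mentioned above: the up-front check costs one scan of the string, but it makes the failure point predictable.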