From: cohencarlisle+bugs.ruby-lang@...
Date: 2019-01-19T03:30:02+00:00
Subject: [ruby-core:91167] [Ruby trunk Bug#15517] Net::HTTP not recognizing valid UTF-8

Issue #15517 has been updated by cohen (Cohen Carlisle).

I'm not sure this is exactly the same as https://bugs.ruby-lang.org/issues/2567, as that one has focused on using the HTTP headers to guess the content type. Here I'm pointing out that ASCII-only strings are recognized as UTF-8, but valid, multi-byte UTF-8 strings are not recognized as UTF-8 encoded.

I suppose the trouble is that checking whether a string is valid UTF-8 is not trivial, but other core/stdlib functions, like File.read, seem to perform this check.

----------------------------------------
Bug #15517: Net::HTTP not recognizing valid UTF-8
https://bugs.ruby-lang.org/issues/15517#change-76395

* Author: cohen (Cohen Carlisle)
* Status: Open
* Priority: Normal
* Assignee:
* Target version:
* ruby -v: 2.6.0
* Backport: 2.4: UNKNOWN, 2.5: UNKNOWN, 2.6: UNKNOWN
----------------------------------------
I created a case at https://github.com/Cohen-Carlisle/utf8app that shows Net::HTTP labeling a response body as ASCII-8BIT encoded because it contains a non-ASCII character (specifically, the double prime symbol: ″), while recognizing ASCII-only strings as UTF-8 encoded. The example is live on Heroku, but because it's a free dyno, it goes to sleep when idle and takes a while to start up on the first request afterwards.

As explained there, I would expect response body strings with the double prime symbol to still have an encoding of UTF-8, since they are valid UTF-8.

The README from the repo (which shows the behavior) is reproduced below:

The purpose of this app is to demonstrate unexpected behavior in Ruby's net/http library. Valid UTF-8 response bodies are labeled ASCII-8BIT, which apparently means Ruby is treating them as pure binary data, even when Content-Type headers label the body as UTF-8.
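The labeling difference described above can be reproduced without a network call. What follows is a minimal sketch, not Net::HTTP's actual code: it assumes the body arrives as raw bytes tagged ASCII-8BIT, and the helper name `relabel_if_utf8` is hypothetical. It relies only on `String#b`, `String#force_encoding`, and `String#valid_encoding?` to show that the bytes are already valid UTF-8 and that only the encoding tag differs.

```ruby
# Hypothetical helper (not part of net/http): returns a copy of the string
# tagged as UTF-8 when its bytes form valid UTF-8; otherwise returns the
# original string unchanged.
def relabel_if_utf8(body)
  candidate = body.dup.force_encoding(Encoding::UTF_8)
  candidate.valid_encoding? ? candidate : body
end

# Simulate a response body: the same bytes as the UTF-8 text, but tagged
# ASCII-8BIT (binary), as in the report. "\u2033" is the double prime symbol.
text = "The symbol for the inch unit of measurement is \u2033."
raw  = text.b

raw.encoding == Encoding::ASCII_8BIT   # => true
raw.bytes == text.bytes                # => true (identical bytes)

fixed = relabel_if_utf8(raw)
fixed.encoding == Encoding::UTF_8      # => true
fixed == text                          # => true

# Bytes that are not valid UTF-8 keep their binary tag.
relabel_if_utf8("\xFF\xFE".b).encoding == Encoding::ASCII_8BIT  # => true
```

Note that `force_encoding` only changes the tag on the string; it does not transcode any bytes, which is why the validity check is needed before trusting the new tag.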
In the example below, I would expect the response body to have UTF-8 encoding, especially because when I copy and paste the body into a new string literal in my console, that string is UTF-8 encoded.

~~~
require 'net/http'
uri = URI('https://utf8app.herokuapp.com')
uri.path = '/utf8/example'
res = Net::HTTP.get_response(uri)
res['Content-Type'] # => "text/plain; charset=utf-8"
puts res.body # The symbol for the inch unit of measurement is ″.
res.body.encoding # => #<Encoding:ASCII-8BIT>
res.body.ascii_only? # => false
'The symbol for the inch unit of measurement is ″.'.encoding # => #<Encoding:UTF-8>
~~~

We can demonstrate that the encoding issue is due to the non-ASCII inches symbol by replacing it with a double quote instead.

~~~
uri.path = '/ascii/example'
res = Net::HTTP.get_response(uri)
res['Content-Type'] # => "text/plain; charset=utf-8"
puts res.body # The symbol for the inch unit of measurement is ".
res.body.encoding # => #<Encoding:UTF-8>
res.body.ascii_only? # => true
~~~

Finally, as an extra WTF, JSON.parse recognizes the non-ASCII characters as valid UTF-8 in a JSON example.

~~~
require 'json'
uri.path = '/utf8/example_json'
res = Net::HTTP.get_response(uri)
res['Content-Type'] # => "application/json; charset=utf-8"
puts res.body # {"feet":"′","inches":"″"}
res.body.encoding # => #<Encoding:ASCII-8BIT>
json = JSON.parse(res.body) # => {"feet"=>"′", "inches"=>"″"}
json.values.map { |v| [v.encoding.to_s, v] } # => [["UTF-8", "′"], ["UTF-8", "″"]]
~~~

-- 
https://bugs.ruby-lang.org/