From: merch-redmine@...
Date: 2021-03-09T22:41:36+00:00
Subject: [ruby-core:102792] [Ruby master Feature#15517] Net::HTTP not recognizing valid UTF-8

Issue #15517 has been updated by jeremyevans0 (Jeremy Evans).

Backport deleted (2.4: UNKNOWN, 2.5: UNKNOWN, 2.6: UNKNOWN)
ruby -v deleted (2.6.0)
Assignee set to naruse (Yui NARUSE)
Status changed from Open to Assigned
Tracker changed from Bug to Feature

I've submitted a pull request (https://github.com/ruby/net-http/pull/17) that takes the patch provided by @naruse in #2567 and modifies it to be opt-in in a backwards-compatible manner. It also fixes various issues with the patch and adds some basic tests. There are definitely cases that would not be handled correctly in terms of detecting content through meta tags, but since the behavior is opt-in, it should not break existing code.

The current behavior is not considered a bug, so I'm switching this to a feature request, the same as #2567.

----------------------------------------
Feature #15517: Net::HTTP not recognizing valid UTF-8
https://bugs.ruby-lang.org/issues/15517#change-90814

* Author: cohen (Cohen Carlisle)
* Status: Assigned
* Priority: Normal
* Assignee: naruse (Yui NARUSE)
----------------------------------------
I created a case at https://github.com/Cohen-Carlisle/utf8app that shows Net::HTTP labeling a response body as ASCII-8BIT encoded because it contains a non-ASCII character (specifically, the double prime symbol: ″), while recognizing ASCII-only strings as UTF-8 encoded. The example is live on Heroku, but because it runs on a free dyno it goes to sleep and takes a while to start up the first time it is hit after a period of inactivity.

As explained there, I would expect response body strings containing the double prime symbol to still have an encoding of UTF-8, since they are valid UTF-8. The README from the repo (which shows the behavior) is reproduced below:

The purpose of this app is to demonstrate unexpected behavior in Ruby's net/http library. Valid UTF-8 response bodies are encoded as ASCII-8BIT, which apparently means Ruby is treating them as pure binary data, even when Content-Type headers label the body as UTF-8.

In the example below, I would expect the response body to have UTF-8 encoding, especially because when I copy and paste the body into a new string literal in my console, that string is UTF-8 encoded.

~~~
require 'net/http'

uri = URI('https://utf8app.herokuapp.com')
uri.path = '/utf8/example'
res = Net::HTTP.get_response(uri)
res['Content-Type']
# => "text/plain; charset=utf-8"
puts res.body
# The symbol for the inch unit of measurement is ″.
res.body.encoding
# => #<Encoding:ASCII-8BIT>
res.body.ascii_only?
# => false
'The symbol for the inch unit of measurement is ″.'.encoding
# => #<Encoding:UTF-8>
~~~

We can demonstrate that the encoding issue is due to the non-ASCII inches symbol by replacing it with a double quote instead.

~~~
uri.path = '/ascii/example'
res = Net::HTTP.get_response(uri)
res['Content-Type']
# => "text/plain; charset=utf-8"
puts res.body
# The symbol for the inch unit of measurement is ".
res.body.encoding
# => #<Encoding:UTF-8>
res.body.ascii_only?
# => true
~~~
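One way to work around this today is to retag the body by hand, using the charset parameter from the Content-Type header. A minimal sketch follows; the charset extraction here is ad hoc and my own, not something net/http does for you:

~~~
require 'net/http'

uri = URI('https://utf8app.herokuapp.com/utf8/example')
res = Net::HTTP.get_response(uri)

# Pull the charset parameter out of the Content-Type header ourselves.
# force_encoding only relabels the bytes; it does not transcode them.
charset = res['Content-Type'].to_s[/charset=([^\s;]+)/i, 1]
body = res.body.force_encoding(charset || Encoding::ASCII_8BIT)

body.encoding        # => #<Encoding:UTF-8>
body.valid_encoding? # => true
~~~

This leaves the bytes untouched and only fixes the label, which is all that is missing in the examples above.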
Finally, as an extra WTF, JSON.parse recognizes the non-ASCII characters as valid UTF-8 in a JSON example.

~~~
require 'json'

uri.path = '/utf8/example_json'
res = Net::HTTP.get_response(uri)
res['Content-Type']
# => "application/json; charset=utf-8"
puts res.body
# {"feet":"′","inches":"″"}
res.body.encoding
# => #<Encoding:ASCII-8BIT>
json = JSON.parse(res.body)
# => {"feet"=>"′", "inches"=>"″"}
json.values.map { |v| [v.encoding.to_s, v] }
# => [["UTF-8", "′"], ["UTF-8", "″"]]
~~~
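For comparison, with the opt-in behavior proposed in the pull request above, the call site would look roughly like the sketch below. The `response_body_encoding=` accessor name is taken from that pull request as I read it and is not part of a released net/http at the time of this post, so treat it as an assumption:

~~~
require 'net/http'

uri = URI('https://utf8app.herokuapp.com/utf8/example')

res = Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
  # Assumed opt-in switch from the pull request: `true` asks net/http
  # to pick the body encoding from the Content-Type charset (and HTML
  # meta tags), while an Encoding object would force one explicitly.
  http.response_body_encoding = true
  http.get(uri.request_uri)
end

res.body.encoding # expected: #<Encoding:UTF-8>
~~~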