From: matthew@...
Date: 2019-06-18T09:24:12+00:00
Subject: [ruby-core:93220] [Ruby trunk Bug#15933] OpenURI: Assign default charset for HTTPS as well as HTTP

Issue #15933 has been updated by phluid61 (Matthew Kerwin).


gareth (Gareth Adams) wrote:
>
> The [IANA registry](https://www.iana.org/assignments/media-types/media-types.xhtml#text) isn't in a machine readable format, and so even if it were acceptable to depend on a gem like [mime-types-data](https://github.com/mime-types/mime-types-data) as a curated source of these values (I realise stdlib can't depend on gems), that data isn't currently available.

The entire registry is available as [XML](https://www.iana.org/assignments/media-types/media-types.xml) and each individual registry is available as (ironically) text/csv; e.g. https://www.iana.org/assignments/media-types/text.csv

That said, I agree in principle with pretty much everything else you've said.

> It seems to me that changing the default to UTF-8 and extending the check to match "https" URIs is:
>
> * Correct in all cases except for a minuscule number of edge cases
> * Compatible in all of those other cases
> * Overridable by defining exceptions inline (as opposed to using a dependency like mime-types-data) if anyone raises issues with this default

I would suggest ignoring the scheme altogether. Like:

```diff
diff a/lib/open-uri.rb b/lib/open-uri.rb
--- a/lib/open-uri.rb
+++ b/lib/open-uri.rb
@@ -552,7 +552,6 @@ def charset
       elsif block_given?
         yield
-      elsif type && %r{\Atext/} =~ type &&
-            @base_uri && /\Ahttp\z/i =~ @base_uri.scheme
-        "iso-8859-1" # RFC2616 3.7.1
+      elsif type && %r{\Atext/} =~ type
+        "utf-8" # RFC6838 4.2.1
       else
         nil
```

Cheers

----------------------------------------
Bug #15933: OpenURI: Assign default charset for HTTPS as well as HTTP
https://bugs.ruby-lang.org/issues/15933#change-78672

* Author: gareth (Gareth Adams)
* Status: Assigned
* Priority: Normal
* Assignee: akr (Akira Tanaka)
* Target version:
* ruby -v:
* Backport: 2.4: UNKNOWN, 2.5: UNKNOWN, 2.6: UNKNOWN
----------------------------------------
Using `open-uri` to load a document in the following circumstances:

* The `Content-Type` header is `text/*` and *doesn't* specify a charset, e.g. `Content-Type: text/csv`
* The document is loaded from an `https://` URL

…will cause the resulting string to have `ASCII-8BIT` encoding.

As the [documentation for OpenURI#charset](https://github.com/ruby/ruby/blob/trunk/lib/open-uri.rb#L538-L560) mentions, [RFC2616/3.7.1](https://tools.ietf.org/html/rfc2616#section-3.7.1) says:

> When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP.

OpenURI takes this literally - only assigning ISO-8859-1 if `@base_uri.scheme` is *exactly* "http". This check was written [17 years ago](https://github.com/ruby/ruby/commit/3a20ed532b57da1e58287a5c53abe14400a085f4#diff-0f19cb99597e5fb90bfb937b22143b51R264) in 2002 even before TLS 1.1 was defined, and well before HTTPS was common.

I believe this check should now also match the scheme "https". As [RFC2818/2](https://tools.ietf.org/html/rfc2818#section-2) says:

> Conceptually, HTTP/TLS is very simple. Simply use HTTP over TLS precisely as you would use HTTP over TCP

1. Is this a suitable change to make?
2. I have a patch to fix the functionality (attached). What else do I need to specify in terms of documentation/tests?
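
For illustration, the three fallback rules discussed in this thread can be compared side by side. This is only a minimal sketch: the method names `fallback_current`, `fallback_with_https` and `fallback_ignoring_scheme` are hypothetical and are not part of open-uri, and the sketch models just the final fallback branch of `charset`, not the explicit `charset=` parameter or the block form.

```ruby
# Hypothetical helpers (NOT part of open-uri). Each one models only the
# last-resort branch of the charset lookup discussed in this thread.

# Behaviour described in this report: a default is assigned only when the
# scheme is exactly "http" (case-insensitive).
def fallback_current(type, scheme)
  if type && %r{\Atext/} =~ type && scheme && /\Ahttp\z/i =~ scheme
    "iso-8859-1" # RFC2616 3.7.1
  end
end

# Change requested in this report: treat "https" the same as "http".
def fallback_with_https(type, scheme)
  if type && %r{\Atext/} =~ type && scheme && /\Ahttps?\z/i =~ scheme
    "iso-8859-1"
  end
end

# Alternative suggested in the reply: ignore the scheme and use UTF-8.
def fallback_ignoring_scheme(type)
  "utf-8" if type && %r{\Atext/} =~ type
end

p fallback_current("text/csv", "https")    # => nil
p fallback_with_https("text/csv", "https") # => "iso-8859-1"
p fallback_ignoring_scheme("text/csv")     # => "utf-8"
```

A `nil` result corresponds to the case reported here, where no charset is assigned and the string keeps `ASCII-8BIT` encoding.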

I'm happy to put more work into this, but it's my first contribution to Ruby core and I'd like some pointers. I've read through https://bugs.ruby-lang.org/projects/ruby/wiki/HowToReport

---Files--------------------------------
ruby-changes.patch (1.21 KB)


--
https://bugs.ruby-lang.org/