From: "sam.saffron (Sam Saffron)"
Date: 2022-08-16T00:34:09+00:00
Subject: [ruby-core:109488] [Ruby master Feature#18822] Ruby lack a proper method to percent-encode strings for URIs (RFC 3986)

Issue #18822 has been updated by sam.saffron (Sam Saffron).

Since we just finished working around a nightmare scenario here @byroot, I think it is rather instructive to see a real-world problem.

The problem:

> You get something that is probably a URL from somewhere, and you need to be able to make requests to it.

- It can have a Unicode domain that needs to run through an IDN converter
- It can have Unicode chars that need percent-encoding
- It can be unescaped, or it can be escaped (and in weird cases partly escaped)

Ideally you want to normalize as well, so caching is "stronger" and does not break for identical URLs.

So we ended up with this monster and travesty, partly powered by base classes, partly powered by addressable, 100% hack.

https://github.com/discourse/discourse/blob/main/lib/url_helper.rb#L72-L105

----------------------------------------
Feature #18822: Ruby lack a proper method to percent-encode strings for URIs (RFC 3986)
https://bugs.ruby-lang.org/issues/18822#change-98656

* Author: byroot (Jean Boussier)
* Status: Open
* Priority: Normal
----------------------------------------
### Context

There are two fairly similar encoding methods that are often confused: `application/x-www-form-urlencoded`, which is how form data is encoded, and "percent-encoding" as defined by [RFC 3986](https://www.rfc-editor.org/rfc/rfc3986).

AFAIK, the only way they differ is that "form encoding" escapes space characters as `+`, and RFC 3986 escapes them as `%20`.

Most of the time it doesn't matter, but sometimes it does.

### Ruby form and URL escape methods

- `URI.escape(" ") # => "%20"` but it was deprecated and removed (in 3.0 ?).
- `ERB::Util.url_encode(" ") # => "%20"` but it's implemented with a `gsub` and isn't very performant.
  It's also awkward to have to reach for `ERB`.
- `CGI.escape(" ") # => "+"`
- `URI.encode_www_form_component(" ") # => "+"`

### Unescape methods

For unescaping, it's even more clear-cut since `URI.unescape` was removed. So there's no available method that will treat an unescaped `+` as simply `+`.

e.g. in JavaScript: `decodeURIComponent("foo+bar") #=> "foo+bar"`.

If one were to use `CGI.unescape`, the string might be improperly decoded: `CGI.unescape("foo+bar") #=> "foo bar"`.

### Other languages

- JavaScript `encodeURI` and `encodeURIComponent` use `%20`.
- PHP has `urlencode` using `+` and `rawurlencode` using `%20`.
- Python has `urllib.parse.quote` using `%20` and `urllib.parse.quote_plus` using `+`.

### Proposal

Since `CGI` already has a very performant encoder for `application/x-www-form-urlencoded`, I think it would make sense for it to provide another method for RFC 3986.

I propose:

- `CGI.url_encode(" ") # => "%20"`
  - Or `CGI.encode_url`.
- Alias `CGI.escape` as `CGI.encode_www_form_component`
- Clarify the documentation of `CGI.escape`.
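
To make the difference between the two encodings concrete, here is a minimal sketch of the behaviour described in the issue. The `percent_encode` and `percent_decode` helpers are hypothetical names introduced only for illustration; they are not part of Ruby's standard library or of the proposal itself. The encoder assumes that only the RFC 3986 unreserved characters (`A-Z a-z 0-9 - . _ ~`) should be left unescaped.

```ruby
require "cgi"
require "erb"
require "uri"

# Existing stdlib behaviour discussed in the issue.
puts CGI.escape(" ")                     # prints "+"   (form encoding)
puts URI.encode_www_form_component(" ")  # prints "+"   (form encoding)
puts ERB::Util.url_encode(" ")           # prints "%20" (percent-encoding)
puts CGI.unescape("foo+bar")             # prints "foo bar" ("+" is decoded as a space)

# Hypothetical helper (not a stdlib API): percent-encode every byte that is
# not in the RFC 3986 "unreserved" set. Working on a binary copy of the
# string means multi-byte characters are escaped one byte at a time.
def percent_encode(str)
  str.b.gsub(/[^A-Za-z0-9\-._~]/) { |byte| format("%%%02X", byte.ord) }
end

# Hypothetical helper (not a stdlib API): decode %XX sequences only, so a
# literal "+" stays a "+". The result is assumed to be UTF-8.
def percent_decode(str)
  str.b.gsub(/%\h\h/) { |seq| seq[1, 2].hex.chr }.force_encoding(Encoding::UTF_8)
end

puts percent_encode("foo bar")  # prints "foo%20bar"
puts percent_decode("foo+bar")  # prints "foo+bar"
```

Like `ERB::Util.url_encode`, this sketch is `gsub`-based, so it only illustrates the intended semantics; it is not the performant implementation the proposal asks for.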