From: "neocoin (Sangmin Ryu)"
Date: 2013-04-10T00:07:44+09:00
Subject: [ruby-core:54144] [ruby-trunk - Bug #8241] If uri host-part has underscore ( '_' ), 'URI#parse' raise 'URI::InvalidURIError'

Issue #8241 has been updated by neocoin (Sangmin Ryu).

naruse (Yui NARUSE) wrote:
> neocoin (Sangmin Ryu) wrote:
> > naruse (Yui NARUSE) wrote:
> > > uri.rb is currently based on RFC 2373, and planning fix based on URL spec.
> > > http://url.spec.whatwg.org/
> >
> > Thank for feedback.
> >
> > 'rfc2373' is just ip v6 addressing part. This doen't include whole URI definition.
> > ( http://tools.ietf.org/html/rfc2373 )
> >
> > So rfc3986 based comment in uri/common.rb is right. Check plz.
>
> Oops, it is RFC 2396. http://www.ietf.org/rfc/rfc2396.txt
>
> And on RFC 2396, host of http scheme is defined on 3.2.2. Server-based Naming Authority.
> It says
>
>   server        = [ [ userinfo "@" ] hostport ]
>   userinfo      = *( unreserved | escaped |
>                      ";" | ":" | "&" | "=" | "+" | "$" | "," )
>   hostport      = host [ ":" port ]
>   host          = hostname | IPv4address
>   hostname      = *( domainlabel "." ) toplabel [ "." ]
>   domainlabel   = alphanum | alphanum *( alphanum | "-" ) alphanum
>   toplabel      = alpha | alpha *( alphanum | "-" ) alphanum

Yes, you are right. I also checked RFC 2396 (published in August 1998) via the comments in 'uri/common.rb'. That document is the starting point for the generic URI syntax.

In January 2005, RFC 3986 was published by a co-author of RFC 2396.
(See also http://en.wikipedia.org/wiki/Uniform_Resource_Identifier#Refinement_of_specifications )

As a result, I think RFC 3986 is the current standard. Many web service companies (for example, DDNS services or blog companies that issue private addresses) treat RFC 3986 as the standard.

When I wrote a web crawler in Ruby, I saw that second-level domains (the 'google' part of google.com) generally do not contain an underscore or a tilde. I know that DNS hosting services do not permit an underscore in the second-level domain. But many third-level domains do contain an underscore (the 'hello_world' part of hello_world.google.com).
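
To make the RFC 2396 grammar quoted above concrete, here is a minimal sketch that transcribes its hostname rules into a Ruby regexp. The constant names and the helper rfc2396_hostname? are hypothetical and are not taken from uri/common.rb; the sketch only illustrates that a label containing '_' cannot match that grammar.

```ruby
# Illustrative transcription of RFC 2396 section 3.2.2 (hostname rules only).
# All names below are hypothetical; this is not the code in uri/common.rb.
ALPHA    = "[a-zA-Z]"
ALPHANUM = "[a-zA-Z0-9]"

# domainlabel = alphanum | alphanum *( alphanum | "-" ) alphanum
DOMAINLABEL = "(?:#{ALPHANUM}(?:[a-zA-Z0-9-]*#{ALPHANUM})?)"
# toplabel    = alpha | alpha *( alphanum | "-" ) alphanum
TOPLABEL    = "(?:#{ALPHA}(?:[a-zA-Z0-9-]*#{ALPHANUM})?)"
# hostname    = *( domainlabel "." ) toplabel [ "." ]
RFC2396_HOSTNAME = /\A(?:#{DOMAINLABEL}\.)*#{TOPLABEL}\.?\z/

# True when the whole string is a hostname under the grammar above.
def rfc2396_hostname?(host)
  RFC2396_HOSTNAME.match?(host)
end
```

Under this sketch, rfc2396_hostname?("teststring.hello.com") is true while rfc2396_hostname?("hello_world.google.com") is false, because '_' appears in neither alphanum nor the "-" alternative.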

So I checked the URI spec in RFC 3986 several years ago, and I am now posting this issue.

The following is from http://tools.ietf.org/html/rfc3986#appendix-A

Appendix A. Collected ABNF for URI
...
host = IP-literal / IPv4address / reg-name
...
reg-name = *( unreserved / pct-encoded / sub-delims )
...
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"

See also: the Python urlparse module also refers to RFC 3986.
http://docs.python.org/2/library/urlparse.html

----------------------------------------
Bug #8241: If uri host-part has underscore ( '_' ), 'URI#parse' raise 'URI::InvalidURIError'
https://bugs.ruby-lang.org/issues/8241#change-38397

Author: neocoin (Sangmin Ryu)
Status: Open
Priority: Normal
Assignee: akira (akira yamada)
Category: core
Target version:
ruby -v: ruby 2.0.0p0 (2013-02-24 revision 39474) [x86_64-darwin11.4.2]

First of all, I apologize if filing the issue this way is rude. I did not know where to report this simple but critical issue.

I found this problem a long time ago (one or two years). The problem is very clear and the solution is very simple, so I have just been waiting a long time while using a monkey patch.

If the host part of a URI contains an underscore ( '_' ), 'URI#parse' raises 'URI::InvalidURIError'.

ex)
=begin
>require 'uri'
>URI.parse 'http://test_strin.helo.com'
URI::InvalidURIError: the scheme http does not accept registry part: test_strin.helo.com (or bad hostname?)
from ... /.rbenv/versions/1.9.3-p125/lib/ruby/1.9.1/uri/generic.rb:213:in `initialize'
>
>
> e=URI.parse('http://test_string.hello.com') rescue $!
=> #
> puts e.backtrace
.../.rbenv/versions/2.0.0-p0/lib/ruby/2.0.0/uri/generic.rb:214:in `initialize'
.../.rbenv/versions/2.0.0-p0/lib/ruby/2.0.0/uri/http.rb:84:in `initialize'
.../.rbenv/versions/2.0.0-p0/lib/ruby/2.0.0/uri/common.rb:214:in `new'
.../.rbenv/versions/2.0.0-p0/lib/ruby/2.0.0/uri/common.rb:214:in `parse'
.../.rbenv/versions/2.0.0-p0/lib/ruby/2.0.0/uri/common.rb:747:in `parse'

vs

>URI.parse('http://teststring.hello.com')
>#
=end

This problem is caused by the hostname regex pattern used by 'URI#split' in uri/common.rb
https://bugs.ruby-lang.org/projects/ruby-trunk/repository/entry/lib/uri/common.rb#L368
( https://github.com/ruby/ruby/blob/trunk/lib/uri/common.rb#L368 )

=begin
[26] pry(main)> URI.split('http://teststring.hello.com')
=> ["http", nil, "teststring.hello.com", nil, nil, "", nil, nil, nil] // normal
[27] pry(main)> URI.split('http://test_string.hello.com')
=> ["http", nil, nil, nil, "test_string.hello.com", "", nil, nil, nil] // wrong
=end

Source position:
https://bugs.ruby-lang.org/projects/ruby-trunk/repository/entry/lib/uri/common.rb#L368
( https://github.com/ruby/ruby/blob/trunk/lib/uri/common.rb#L368 )

=begin
# hostname = *( domainlabel "." ) toplabel [ "." ]
# reg-name = *( unreserved / pct-encoded / sub-delims ) # RFC3986
unless hostname
  ret[:HOSTNAME] = hostname = "(?:[a-zA-Z0-9\\-.]|%\\h\\h)+"
end
=end

As you can see in the source comment, 'reg-name' in RFC 3986 is '*( unreserved / pct-encoded / sub-delims )'.

The definition of 'unreserved' in RFC 3986 ( http://tools.ietf.org/html/rfc3986#section-2.3 ) is:
> unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"

But the hostname regex pattern allows only '-' and '.', and leaves out '_' and '~'.

Please check RFC 3986 and extend the hostname pattern for reg-name as shown below.

=begin
ret[:HOSTNAME] = hostname = "(?:[a-zA-Z0-9\\-._~]|%\\h\\h)+"
=end

--
http://bugs.ruby-lang.org/
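
The difference between the two HOSTNAME strings quoted in this issue can be checked in isolation with a minimal sketch like the one below. It assumes each pattern is wrapped in \A and \z so that it must match a whole host string; the anchors and the constant names are additions for illustration only, since in uri/common.rb the string is just one fragment of a larger pattern.

```ruby
# The two character classes from the issue, anchored to match a full host.
# CURRENT_HOSTNAME and PROPOSED_HOSTNAME are hypothetical names for this sketch.
CURRENT_HOSTNAME  = /\A(?:[a-zA-Z0-9\-.]|%\h\h)+\z/    # only '-' and '.' allowed
PROPOSED_HOSTNAME = /\A(?:[a-zA-Z0-9\-._~]|%\h\h)+\z/  # adds '_' and '~'

# Print how each pattern treats a host with and without an underscore.
%w[teststring.hello.com test_string.hello.com].each do |host|
  puts "#{host}: current=#{CURRENT_HOSTNAME.match?(host)} " \
       "proposed=#{PROPOSED_HOSTNAME.match?(host)}"
end
```

Only the pattern with '_' and '~' added accepts test_string.hello.com; both accept teststring.hello.com.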