From: amadanmath@... Date: 2018-06-01T22:45:16+00:00
Subject: [ruby-core:87349] [Ruby trunk Bug#14804] GzipReader cannot read Freebase dump (but gzcat/zless can)

Issue #14804 has been updated by amadan (Goran Topic).

(Note that `f.each_line.count` would return the wrong result anyway, due to https://bugs.ruby-lang.org/issues/14805 , since 3130753066 is outside int32 range; but it never gets the chance to, on account of stopping prematurely.)

----------------------------------------
Bug #14804: GzipReader cannot read Freebase dump (but gzcat/zless can)
https://bugs.ruby-lang.org/issues/14804#change-72338

* Author: amadan (Goran Topic)
* Status: Open
* Priority: Normal
* Assignee:
* Target version:
* ruby -v: ruby 2.4.1p111 (2017-03-22 revision 58053) [x86_64-darwin17]
* Backport: 2.3: UNKNOWN, 2.4: UNKNOWN, 2.5: UNKNOWN
----------------------------------------
This is likely related to https://stackoverflow.com/questions/35354951/gzipstream-quietly-fails-on-large-file-stream-ends-at-2gb (and its accepted answer).

The file in question: http://commondatastorage.googleapis.com/freebase-public/rdf/freebase-rdf-latest.gz (watch out, it's 30 GB compressed!)

Steps to reproduce:

    require "zlib"
    Zlib::GzipReader.open("freebase-rdf-latest.gz") { |f| f.each_line.count }
    # => 14374340

However, the correct answer is different:

    $ gzcat freebase-rdf-latest.gz | wc -l
    3130753066

Another experiment showed that the last `f.tell` was `1945715682`, while there are considerably more bytes in the uncompressed version. This fits well with the Stack Overflow report from C# linked above, which states that the first "substream" contains exactly that many bytes.

If this is a hard constraint of the wrapped library (and thus should be fixed upstream), the documentation should at least mention it.

--
https://bugs.ruby-lang.org/
Unsubscribe:
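[Editor's note] The behavior described above matches `GzipReader` stopping at the end of the first gzip *member*: large dumps are often produced by concatenating several independently compressed gzip streams, which `gzcat` decompresses in full but `GzipReader` does not. Below is a minimal sketch of a workaround that restarts the reader at each member boundary, demonstrated on a synthetic two-member stream built in memory (no Freebase download required). The helper name `zcat_each_line` is made up for illustration; `GzipReader#unused` and `#finish` are real API.

```ruby
require "zlib"
require "stringio"

# Two gzip members concatenated into one byte string, mimicking how
# large dumps like the Freebase file are commonly produced.
data = Zlib.gzip("line1\nline2\n") + Zlib.gzip("line3\n")

# Naive read: GzipReader stops at the end of the first member.
naive = Zlib::GzipReader.new(StringIO.new(data)).each_line.count
# naive is 2, not 3 -- the second member is silently ignored.

# Workaround: restart GzipReader for each member, rewinding the
# underlying IO past the bytes it over-read (GzipReader#unused).
def zcat_each_line(io)
  count = 0
  until io.eof?
    gz = Zlib::GzipReader.new(io)
    gz.each_line { count += 1 }
    unused = gz.unused               # bytes read beyond this member, if any
    gz.finish                        # finish without closing the underlying io
    io.pos -= unused.bytesize if unused
  end
  count
end

full = zcat_each_line(StringIO.new(data))
# full is 3: all members are decompressed.
```

If I recall correctly, this report eventually led to `Zlib::GzipReader.zcat` being added (Ruby 2.6+), which handles multi-member files directly.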