From: merch-redmine@...
Date: 2020-07-22T15:40:32+00:00
Subject: [ruby-core:99268] [Ruby master Bug#14804] GzipReader cannot read Freebase dump (but gzcat/zless can)

Issue #14804 has been updated by jeremyevans0 (Jeremy Evans).

Status changed from Open to Closed

This can now be handled using `Zlib::GzipReader.zcat`, which was recently added to zlib.

----------------------------------------
Bug #14804: GzipReader cannot read Freebase dump (but gzcat/zless can)
https://bugs.ruby-lang.org/issues/14804#change-86655

* Author: amadan (Goran Topic)
* Status: Closed
* Priority: Normal
* ruby -v: ruby 2.4.1p111 (2017-03-22 revision 58053) [x86_64-darwin17]
* Backport: 2.3: UNKNOWN, 2.4: UNKNOWN, 2.5: UNKNOWN
----------------------------------------
This is likely related to https://stackoverflow.com/questions/35354951/gzipstream-quietly-fails-on-large-file-stream-ends-at-2gb (and its accepted answer).

The file in question: http://commondatastorage.googleapis.com/freebase-public/rdf/freebase-rdf-latest.gz (watch out, it's 30 GB compressed!)

Steps to reproduce:

    require "zlib"
    Zlib::GzipReader.open("freebase-rdf-latest.gz") { |f| f.each_line.count }
    # => 14374340

However, the correct count is different:

    $ gzcat freebase-rdf-latest.gz | wc -l
    3130753066

Another experiment showed that the last `f.tell` was `1945715682`, while the uncompressed file contains considerably more bytes. This matches the C# report in the Stack Overflow question linked above, which states that the first "substream" contains exactly that many bytes.

If this is a hard constraint of the wrapped library (and thus should be fixed upstream), the documentation should at least mention it.

--
https://bugs.ruby-lang.org/

Unsubscribe:
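
For context, the underlying issue is that the dump is a multi-member gzip file (several gzip streams concatenated together), and a plain `Zlib::GzipReader` stops at the end of the first member. Below is a minimal sketch of the difference, using two small in-memory streams rather than the 30 GB dump; the `gzip` helper is just a local convenience for building a single-member stream:

    require "zlib"
    require "stringio"

    # Helper: compress a string into one standalone gzip member.
    def gzip(text)
      buf = StringIO.new
      gz = Zlib::GzipWriter.new(buf)
      gz.write(text)
      gz.finish  # end the gzip stream without closing the underlying StringIO
      buf.string
    end

    # Two concatenated members, mimicking the layout of the Freebase dump.
    data = gzip("first\n") + gzip("second\n")

    # Plain GzipReader decompresses only the first member:
    partial = Zlib::GzipReader.new(StringIO.new(data)).read
    # => "first\n"

    # Zlib::GzipReader.zcat (added in Ruby 3.0's zlib) reads every member:
    full = Zlib::GzipReader.zcat(StringIO.new(data))
    # => "first\nsecond\n"

This is why the original reproduction undercounted lines: `each_line` silently ran out of input at the first member boundary at byte 1945715682.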