From: "dim (Dimitrij Denissenko)"
Date: 2021-10-04T08:41:44+00:00
Subject: [ruby-core:105536] [Ruby master Bug#18238] CSV encoding issue with parsing from Zlib::GzipReader stream

Issue #18238 has been reported by dim (Dimitrij Denissenko).

----------------------------------------
Bug #18238: CSV encoding issue with parsing from Zlib::GzipReader stream
https://bugs.ruby-lang.org/issues/18238

* Author: dim (Dimitrij Denissenko)
* Status: Open
* Priority: Normal
* ruby -v: ruby 3.0.1p64 (2021-04-05 revision 0fb782ee38) [x86_64-linux]
* Backport: 2.6: UNKNOWN, 2.7: UNKNOWN, 3.0: UNKNOWN
----------------------------------------
Hi,

I am trying to debug an issue I found when parsing CSVs directly from a `Zlib::GzipReader` IO. Unfortunately, I am not at liberty to share the (proprietary) CSV file, and I couldn't recreate the issue with a simplified/obfuscated version, but maybe you can point me in the right direction.

Here's what's happening:

```
require "csv"
require "zlib"

CSV::VERSION # => "3.1.9"

File.open("file.csv.gz", encoding: 'binary') do |io|
  Zlib::GzipReader.wrap(io) do |rio|
    CSV.new(rio).count
  end
end
```

This results in:

```
~/.rbenv/versions/3.0.1/lib/ruby/3.0.0/csv/parser.rb:346:in `rescue in parse': Invalid byte sequence in UTF-8 in line 38424. (CSV::MalformedCSVError)
    from ~/.rbenv/versions/3.0.1/lib/ruby/3.0.0/csv/parser.rb:329:in `parse'
    from ~/.rbenv/versions/3.0.1/lib/ruby/3.0.0/csv.rb:2345:in `each'
    from ~/.rbenv/versions/3.0.1/lib/ruby/3.0.0/csv.rb:2345:in `each'
    ...
~/.rbenv/versions/3.0.1/lib/ruby/3.0.0/csv/parser.rb:237:in `read_chunk': CSV::Parser::InvalidEncoding (CSV::Parser::InvalidEncoding)
    from ~/.rbenv/versions/3.0.1/lib/ruby/3.0.0/csv/parser.rb:157:in `scan_all'
    from ~/.rbenv/versions/3.0.1/lib/ruby/3.0.0/csv/parser.rb:1009:in `parse_quoted_column_value'
    from ~/.rbenv/versions/3.0.1/lib/ruby/3.0.0/csv/parser.rb:962:in `parse_column_value'
    from ~/.rbenv/versions/3.0.1/lib/ruby/3.0.0/csv/parser.rb:886:in `parse_quotable_robust'
    from ~/.rbenv/versions/3.0.1/lib/ruby/3.0.0/csv/parser.rb:864:in `block in parse_quotable_loose'
    from ~/.rbenv/versions/3.0.1/lib/ruby/3.0.0/csv/parser.rb:127:in `block in each_line'
    from ~/.rbenv/versions/3.0.1/lib/ruby/3.0.0/csv/parser.rb:103:in `each_line'
    from ~/.rbenv/versions/3.0.1/lib/ruby/3.0.0/csv/parser.rb:103:in `each_line'
    from ~/.rbenv/versions/3.0.1/lib/ruby/3.0.0/csv/parser.rb:825:in `parse_quotable_loose'
    from ~/.rbenv/versions/3.0.1/lib/ruby/3.0.0/csv/parser.rb:336:in `parse'
    from ~/.rbenv/versions/3.0.1/lib/ruby/3.0.0/csv.rb:2345:in `each'
    from ~/.rbenv/versions/3.0.1/lib/ruby/3.0.0/csv.rb:2345:in `each'
    from (irb):3:in `count'
```

The following, however, succeeds. It first decompresses the file to disk and then parses the plain file:

```
File.open("file.csv", 'w', encoding: 'binary') do |wio|
  File.open("file.csv.gz", encoding: 'binary') do |io|
    Zlib::GzipReader.wrap(io) do |rio|
      IO.copy_stream rio, wio
    end
  end
end

File.open("file.csv") do |rio|
  CSV.new(rio).count
end
```

I have narrowed it down to https://github.com/ruby/csv/blob/v3.1.9/lib/csv/parser.rb#L235-L237. It looks like reading the chunk truncates the string in the middle of a UTF-8 character, so `chunk.valid_encoding?` returns false.
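
The suspected cause can be illustrated in isolation. The following is only a minimal sketch, not the csv library's code: it uses a `StringIO` as a stand-in for the gzip stream, a made-up one-line payload, and a hypothetical helper named `read_utf8_chunk`. It shows that a fixed-size byte read can end between the two bytes of a UTF-8 character, and that reading a few more bytes makes the chunk valid again.

```ruby
require "stringio"

# "\u00E9" (e with acute accent) is two bytes in UTF-8: 0xC3 0xA9.
# The stream holds raw bytes, mirroring `encoding: 'binary'` in the report.
data = "abc\u00E9,def\n"
io = StringIO.new(data.b)

# A 4-byte read stops after 0xC3, i.e. in the middle of the character.
chunk = io.read(4).force_encoding(Encoding::UTF_8)
split_valid = chunk.valid_encoding? # => false

# Hypothetical helper (an assumption for illustration only): read `size`
# bytes, then pull in up to 3 more bytes until the chunk is valid UTF-8.
# A UTF-8 character is at most 4 bytes long, so 3 extra bytes are enough
# to complete a character that was cut at the chunk boundary.
def read_utf8_chunk(io, size)
  chunk = io.read(size)
  return nil if chunk.nil?

  chunk.force_encoding(Encoding::UTF_8)
  3.times do
    break if chunk.valid_encoding?

    extra = io.read(1)
    break if extra.nil?

    # Concatenate as raw bytes, then re-tag the result as UTF-8.
    chunk = (chunk.b << extra.b).force_encoding(Encoding::UTF_8)
  end
  chunk
end

io.rewind
repaired = read_utf8_chunk(io, 4)
repaired_valid = repaired.valid_encoding? # => true
```

If the real stream behaves like this, it would explain why the error only appears at some line deep in the file (wherever a chunk boundary happens to fall inside a multibyte character) and why a simplified file does not reproduce it.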