[ruby-core:105537] [Ruby master Bug#18238] CSV encoding issue with parsing from Zlib::GzipReader stream
From: "kou (Kouhei Sutou)" <noreply@...>
Date: 2021-10-04 08:43:09 UTC
List: ruby-core #105537
Issue #18238 has been updated by kou (Kouhei Sutou).
Status changed from Open to Third Party's Issue
Could you open this on https://github.com/ruby/csv ? ruby/csv is the upstream of csv.
----------------------------------------
Bug #18238: CSV encoding issue with parsing from Zlib::GzipReader stream
https://bugs.ruby-lang.org/issues/18238#change-93993
* Author: dim (Dimitrij Denissenko)
* Status: Third Party's Issue
* Priority: Normal
* ruby -v: ruby 3.0.1p64 (2021-04-05 revision 0fb782ee38) [x86_64-linux]
* Backport: 2.6: UNKNOWN, 2.7: UNKNOWN, 3.0: UNKNOWN
----------------------------------------
Hi,
I found an issue with parsing CSVs directly from a `Zlib::GzipReader` IO which I am trying to debug. Unfortunately, I am not at liberty to share the (proprietary) CSV file and I couldn't recreate the issue with a simplified/obfuscated version, but maybe you can point me in the right direction. Here's what's happening:
```
require "csv"
require "zlib"

CSV::VERSION # => "3.1.9"
File.open("file.csv.gz", encoding: 'binary') do |io|
  Zlib::GzipReader.wrap(io) do |rio|
    CSV.new(rio).count
  end
end
```
Results in:
```
~/.rbenv/versions/3.0.1/lib/ruby/3.0.0/csv/parser.rb:346:in `rescue in parse': Invalid byte sequence in UTF-8 in line 38424. (CSV::MalformedCSVError)
from ~/.rbenv/versions/3.0.1/lib/ruby/3.0.0/csv/parser.rb:329:in `parse'
from ~/.rbenv/versions/3.0.1/lib/ruby/3.0.0/csv.rb:2345:in `each'
from ~/.rbenv/versions/3.0.1/lib/ruby/3.0.0/csv.rb:2345:in `each'
...
~/.rbenv/versions/3.0.1/lib/ruby/3.0.0/csv/parser.rb:237:in `read_chunk': CSV::Parser::InvalidEncoding (CSV::Parser::InvalidEncoding)
from ~/.rbenv/versions/3.0.1/lib/ruby/3.0.0/csv/parser.rb:157:in `scan_all'
from ~/.rbenv/versions/3.0.1/lib/ruby/3.0.0/csv/parser.rb:1009:in `parse_quoted_column_value'
from ~/.rbenv/versions/3.0.1/lib/ruby/3.0.0/csv/parser.rb:962:in `parse_column_value'
from ~/.rbenv/versions/3.0.1/lib/ruby/3.0.0/csv/parser.rb:886:in `parse_quotable_robust'
from ~/.rbenv/versions/3.0.1/lib/ruby/3.0.0/csv/parser.rb:864:in `block in parse_quotable_loose'
from ~/.rbenv/versions/3.0.1/lib/ruby/3.0.0/csv/parser.rb:127:in `block in each_line'
from ~/.rbenv/versions/3.0.1/lib/ruby/3.0.0/csv/parser.rb:103:in `each_line'
from ~/.rbenv/versions/3.0.1/lib/ruby/3.0.0/csv/parser.rb:103:in `each_line'
from ~/.rbenv/versions/3.0.1/lib/ruby/3.0.0/csv/parser.rb:825:in `parse_quotable_loose'
from ~/.rbenv/versions/3.0.1/lib/ruby/3.0.0/csv/parser.rb:336:in `parse'
from ~/.rbenv/versions/3.0.1/lib/ruby/3.0.0/csv.rb:2345:in `each'
from ~/.rbenv/versions/3.0.1/lib/ruby/3.0.0/csv.rb:2345:in `each'
from (irb):3:in `count'
```
While the following succeeds:
```
File.open("file.csv", 'w', encoding: 'binary') do |wio|
  File.open("file.csv.gz", encoding: 'binary') do |io|
    Zlib::GzipReader.wrap(io) do |rio|
      IO.copy_stream rio, wio
    end
  end
end
File.open("file.csv") do |rio|
  CSV.new(rio).count
end
```
I have narrowed it down to https://github.com/ruby/csv/blob/v3.1.9/lib/csv/parser.rb#L235-L237. It looks like reading the chunk truncates the string in the middle of a multibyte UTF-8 character, so `chunk.valid_encoding?` returns false.
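
To illustrate the failure mode I suspect, here is a minimal sketch with made-up data (not the proprietary file). It does not call into CSV or Zlib at all; it only assumes that a fixed-size byte read can end partway through a multibyte character, and shows what `valid_encoding?` reports in that case. The sample string and the read size of 4 are hypothetical, chosen so the split lands inside "é".

```ruby
require "stringio"

# Hypothetical sample: "é" is two bytes in UTF-8 (0xC3 0xA9),
# so "café,1\n" is 8 bytes long.
data = "caf\u00E9,1\n"
io = StringIO.new(data.b) # expose the raw bytes, like a binary stream

# Read 4 bytes: "caf" plus only the first byte of "é".
chunk = io.read(4).force_encoding(Encoding::UTF_8)
truncated_valid = chunk.valid_encoding?

# Appending the remaining bytes completes the character again.
rest = io.read
whole = (chunk.b + rest).force_encoding(Encoding::UTF_8)
rejoined_valid = whole.valid_encoding?

puts "truncated chunk valid? #{truncated_valid}"
puts "rejoined string valid? #{rejoined_valid}"
```

If this is what happens in my case, the bytes themselves would be fine and only the chunk boundary would be the problem, which would match the plain-file read succeeding.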
--
https://bugs.ruby-lang.org/