[ruby-core:93095] [Ruby trunk Bug#15210] UTF-8 BOM should be removed from String in internal representation
From: nobu@...
Date: 2019-06-13 07:24:44 UTC
List: ruby-core #93095
Issue #15210 has been updated by nobu (Nobuyoshi Nakada).
https://github.com/nobu/ruby/pull/new/feature/15210-detect_bom
----------------------------------------
Bug #15210: UTF-8 BOM should be removed from String in internal representation
https://bugs.ruby-lang.org/issues/15210#change-78517
* Author: foonlyboy (Eike Dierks)
* Status: Open
* Priority: Normal
* Assignee: docs
* Target version:
* ruby -v:
* Backport: 2.3: UNKNOWN, 2.4: UNKNOWN, 2.5: UNKNOWN
----------------------------------------
Hi everyone working on the ruby trunk,
I encountered a problem with a BOM (Byte Order Mark) at the front of UTF-8 string data.
We import CSV files from PayPal.
They now include a BOM at the start of their UTF-8 encoded CSV data.
This BOM is causing trouble.
I believe this to be a bug in how byte data is converted to the ruby internal String representation.
There is a workaround, but this needs to be documented:
```ruby
IO.read(path, mode: 'r:BOM|UTF-8')  # path is the file to read
```
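A minimal sketch of the difference, assuming a throwaway temp file stands in for the PayPal export (the file name and its contents are made up for illustration):

```ruby
require 'tempfile'

# Hypothetical sample data: a UTF-8 CSV that starts with the BOM bytes EF BB BF.
sample = Tempfile.new(['bom_demo', '.csv'])
sample.binmode
sample.write("\uFEFFname,amount\nalice,10\n".b)
sample.close

# Plain UTF-8 read: the BOM survives as the first character (U+FEFF).
plain = File.read(sample.path, encoding: 'UTF-8')

# BOM-aware read: the leading BOM is consumed while opening the file.
stripped = File.read(sample.path, mode: 'r:BOM|UTF-8')

puts plain.start_with?("\uFEFF")    # true
puts stripped.start_with?("\uFEFF") # false
```

Both reads return a UTF-8 String; only the second one starts directly with `name`.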
---
But I'm asking for an improvement to the UTF-8 BOM handling:
- The BOM is only used for transfer encoding at the byte stream level.
- The BOM MUST NOT be part of the String in internal representation.
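
To make the two points above concrete, here is a small sketch (the sample strings are made up) showing that the UTF-8 BOM is just U+FEFF encoded as three bytes, and that it counts as a character of its own once it ends up inside a String:

```ruby
bom = "\uFEFF"

# U+FEFF encodes to the three bytes EF BB BF in UTF-8.
bom_hex = bom.bytes.map { |b| format('%02X', b) }.join(' ')
puts bom_hex                              # EF BB BF

# Once inside a String, the BOM is an invisible but real character.
header = bom + 'name'
puts header.length                        # 5
puts header == 'name'                     # false
puts header.delete_prefix(bom) == 'name'  # true
```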
---
BTW: stdlib::CSV chokes on the BOM
I'd like to add some code for a workaround:
```ruby
class String
  # Delete a leading UTF-8 Byte Order Mark (U+FEFF, bytes EF BB BF) in place.
  # Returns self even if no BOM was found (unlike delete_prefix!, which returns nil).
  # NOTE: use with care; it is better to remove the BOM when reading the file.
  def delete_bom!
    raise 'encoding is not UTF-8' unless encoding == Encoding::UTF_8

    delete_prefix!("\uFEFF")
    self
  end

  # Returns a copy of the string with a leading UTF-8 Byte Order Mark deleted.
  def delete_bom
    dup.delete_bom!
  end
end
```
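
The effect on CSV can be sketched as follows. This is only an illustration under assumptions: the input text is made up, it uses the built-in `String#delete_prefix` (Ruby 2.5+) rather than the helper above, and the exact behaviour for the BOM-carrying input may depend on the csv version in use:

```ruby
require 'csv'

# Hypothetical input: CSV text that still carries the BOM as its first character.
raw = "\uFEFFname,amount\nalice,10\n"

# Parsed as-is, the first header may keep the BOM and then no longer equals "name".
dirty = CSV.parse(raw, headers: true)
p dirty.headers.first

# With the BOM removed first, the headers come out as expected.
clean = CSV.parse(raw.delete_prefix("\uFEFF"), headers: true)
p clean.headers        # ["name", "amount"]
p clean.first['name']  # "alice"
```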
---
~eike
--
https://bugs.ruby-lang.org/