From: nobu@...
Date: 2018-10-07T00:53:12+00:00
Subject: [ruby-core:89300] [Ruby trunk Bug#15210] UTF-8 BOM should be removed from String in internal representation

Issue #15210 has been updated by nobu (Nobuyoshi Nakada).

Description updated
Assignee set to docs

foonlyboy (Eike Dierks) wrote:

> I believe this to be a bug in how byte data is converted to the ruby internal String representation.

Yes, a BOM should be removed at conversion, that is, when reading from a data stream.

> There is a workaround, but this needs to be documented:
> ```ruby
> IO.read(filename, mode: 'r:BOM|UTF-8')
> ```

It is documented at `IO.new`, and you can use it with `CSV.open` too.

rdoc of `CSV.open`:
> You must pass a `filename` and may optionally add a `mode` for Ruby's `open()`.

rdoc of `Kernel.open`:
> See the documentation of `IO.new` for full documentation of the `mode` string directives.

rdoc of `IO.new`:
> If `"BOM|UTF-8"`, `"BOM|UTF-16LE"` or `"BOM|UTF-16BE"` are used, Ruby checks for
> a Unicode BOM in the input document to help determine the encoding. For
> UTF-16 encodings the file open mode must be binary. When present, the BOM
> is stripped and the external encoding from the BOM is used. When the BOM
> is missing the given Unicode encoding is used as `ext_enc`. (The BOM-set
> encoding option is case insensitive, so `"bom|utf-8"` is also valid.)

Documentation improvement patches are welcome.

> But I'm asking to improve the UTF BOM handling:
> - The BOM is only used for transfer encoding at the byte stream level.

This is half true.

https://en.wikipedia.org/wiki/Byte_order_mark#Usage
> If the BOM character appears in the middle of a data stream, Unicode says it should be interpreted as a "zero-width non-breaking space"

At any other position, the character is not called a "BOM".

> - The BOM MUST NOT be part of the String in internal representation.

Yes, it should be removed when reading; that is the only chance to remove a BOM properly.
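To illustrate the point about removing the BOM when reading, here is a minimal sketch. It assumes a hypothetical sample file written to a temporary directory; the helper names (`write_bom_csv`, `read_plain`, `read_bom_aware`, `csv_headers`) are made up for this example and are not part of Ruby or of the CSV library.

```ruby
require 'csv'
require 'tmpdir'

# Hypothetical sample: a UTF-8 CSV file whose first three bytes are the BOM (EF BB BF).
def write_bom_csv(path)
  File.binwrite(path, "\xEF\xBB\xBFid,name\n1,Alice\n".b)
end

# Plain UTF-8 read: the BOM stays in the String as U+FEFF.
def read_plain(path)
  File.read(path, mode: 'r:UTF-8')
end

# BOM-aware read: the BOM is consumed when the stream is opened.
def read_bom_aware(path)
  File.read(path, mode: 'r:BOM|UTF-8')
end

# CSV.open passes the mode string on to open, so the first header comes out clean.
def csv_headers(path)
  CSV.open(path, 'r:BOM|UTF-8', headers: true) { |csv| csv.first.headers }
end

Dir.mktmpdir do |dir|
  path = File.join(dir, 'sample.csv')
  write_bom_csv(path)
  p read_plain(path).start_with?("\uFEFF")     # true
  p read_bom_aware(path).start_with?("\uFEFF") # false
  p csv_headers(path)                          # ["id", "name"]
end
```

Handling the BOM in the mode string keeps the decision at the I/O boundary, so no later code has to know whether the file carried a BOM.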
----------------------------------------
Bug #15210: UTF-8 BOM should be removed from String in internal representation
https://bugs.ruby-lang.org/issues/15210#change-74333

* Author: foonlyboy (Eike Dierks)
* Status: Open
* Priority: Normal
* Assignee: docs
* Target version:
* ruby -v:
* Backport: 2.3: UNKNOWN, 2.4: UNKNOWN, 2.5: UNKNOWN
----------------------------------------
Hi everyone working on the ruby trunk,

I encountered a problem with a BOM (Byte Order Mark) at the front of UTF-8 string data.
We import some CSV from PayPal. They now include a BOM in front of their UTF-8 encoded CSV data.
This BOM is causing some trouble.

I believe this to be a bug in how byte data is converted to the ruby internal String representation.

There is a workaround, but this needs to be documented:
```ruby
IO.read(filename, mode: 'r:BOM|UTF-8')
```

---
But I'm asking to improve the UTF BOM handling:
- The BOM is only used for transfer encoding at the byte stream level.
- The BOM MUST NOT be part of the String in internal representation.

---
BTW: stdlib::CSV chokes on the BOM

I'd like to add some code for a workaround:

```ruby
class String
  # Delete the UTF-8 Byte Order Mark from the start of the string.
  # Returns self (even if no BOM was found, unlike delete_prefix!).
  # NOTE: use with care: it is better to remove the BOM when reading the file.
  def delete_bom!
    raise 'encoding is not UTF-8' unless encoding == Encoding::UTF_8
    delete_prefix!("\xEF\xBB\xBF")
    self
  end

  # Returns a copy of the string with the UTF-8 Byte Order Mark deleted.
  def delete_bom
    dup.delete_bom!
  end
end
```

---
~eike

--
https://bugs.ruby-lang.org/