From: nobu@...
Date: 2018-10-07T00:53:12+00:00
Subject: [ruby-core:89300] [Ruby trunk Bug#15210] UTF-8 BOM should be removed from String in internal representation

Issue #15210 has been updated by nobu (Nobuyoshi Nakada).

Description updated
Assignee set to docs

foonlyboy (Eike Dierks) wrote:
> I believe this to be a bug in how byte data is converted to the ruby internal String representation.

Yes, a BOM should be removed during the conversion, that is, when reading from a data stream.

> There is a workaround, but this needs to be documented:
> ```ruby
> IO.read(path, mode: 'r:BOM|UTF-8')  # path: name of the file to read
> ```

It is documented at `IO.new`, and you can use it at `CSV.open` too.

rdoc of `CSV.open`:
> You must pass a `filename` and may optionally add a `mode` for Ruby's `open()`.

rdoc of `Kernel.open`:
> See the documentation of `IO.new` for full documentation of the `mode` string directives.

rdoc of `IO.new`:
> If `"BOM|UTF-8"`, `"BOM|UTF-16LE"` or `"BOM|UTF-16BE"` are used, Ruby checks for
> a Unicode BOM in the input document to help determine the encoding.  For
> UTF-16 encodings the file open mode must be binary.  When present, the BOM
> is stripped and the external encoding from the BOM is used.  When the BOM
> is missing the given Unicode encoding is used as `ext_enc`.  (The BOM-set
> encoding option is case insensitive, so `"bom|utf-8"` is also valid.)

Documentation improvement patches are welcome.
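
As a minimal sketch of the mode string described in the rdoc above: the file name and CSV contents below are made-up sample data, and the temporary directory is only there to keep the example self-contained. It reads the same bytes once as plain UTF-8 and once with `BOM|UTF-8`, then passes the same mode to `CSV.open`.

```ruby
require 'csv'
require 'tmpdir'

raw = clean = headers = nil

Dir.mktmpdir do |dir|
  # Hypothetical sample file: UTF-8 CSV data preceded by a BOM (EF BB BF).
  path = File.join(dir, 'sample.csv')
  File.binwrite(path, "\xEF\xBB\xBFname,amount\nalice,10\n".b)

  # Plain UTF-8 read: the BOM is kept as a leading U+FEFF character.
  raw = File.read(path, mode: 'r:UTF-8')

  # BOM-aware read: the BOM is consumed while reading.
  clean = File.read(path, mode: 'r:BOM|UTF-8')

  # The same mode string is accepted by CSV.open.
  headers = CSV.open(path, 'r:BOM|UTF-8', headers: true) { |csv| csv.first.headers }
end

puts raw.start_with?("\uFEFF")    # true
puts clean.start_with?("\uFEFF")  # false
p headers
```

With the BOM-aware mode the first header should come back as plain `"name"`, without a leading U+FEFF.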

> But I'm asking to improve the UTF BOM handling:
> - The BOM is only used for transfer encoding at the byte stream level.

This is half true.

https://en.wikipedia.org/wiki/Byte_order_mark#Usage
> If the BOM character appears in the middle of a data stream, Unicode says it should be interpreted as a "zero-width non-breaking space"

The same character at any other position is not called a "BOM".
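
A small sketch of that distinction, assuming a made-up temporary file that contains U+FEFF twice: once at the very start and once between two letters. Reading with `BOM|UTF-8` should only consume the leading one.

```ruby
require 'tmpdir'

text = nil

Dir.mktmpdir do |dir|
  # Hypothetical sample file: U+FEFF at offset 0 (a BOM) and again
  # between "a" and "b" (a zero-width non-breaking space).
  path = File.join(dir, 'zwnbsp.txt')
  File.binwrite(path, "\uFEFFa\uFEFFb".b)

  text = File.read(path, mode: 'r:BOM|UTF-8')
end

# Only the leading U+FEFF is gone; the inner one is ordinary string content.
p text.codepoints.map { |c| format('U+%04X', c) }
# => ["U+0061", "U+FEFF", "U+0062"]
```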


> - The BOM MUST NOT be part of the String in internal representation.

Yes, it should be removed at read time; that is the only chance to remove a BOM properly.


----------------------------------------
Bug #15210: UTF-8 BOM should be removed from String in internal representation
https://bugs.ruby-lang.org/issues/15210#change-74333

* Author: foonlyboy (Eike Dierks)
* Status: Open
* Priority: Normal
* Assignee: docs
* Target version: 
* ruby -v: 
* Backport: 2.3: UNKNOWN, 2.4: UNKNOWN, 2.5: UNKNOWN
----------------------------------------
Hi everyone working on Ruby trunk,

I encountered a problem with a BOM (Byte Order Mark) at the front of UTF-8 string data.

We import some CSV from paypal.
They now include a BOM in front of their UTF-8 encoded CSV data.
This BOM is causing some trouble.

I believe this to be a bug in how byte data is converted to the ruby internal String representation.

There is a workaround, but this needs to be documented:
```ruby
IO.read(path, mode: 'r:BOM|UTF-8')  # path: name of the file to read
```


---

But I'm asking to improve the UTF BOM handling:
- The BOM is only used for transfer encoding at the byte stream level.
- The BOM MUST NOT be part of the String in internal representation.


---

BTW: stdlib::CSV chokes on the BOM

I'd like to add some code for a workaround:


```ruby
class String

    # delete UTF Byte Order Mark from string
    # returns self (even if no bom was found, contrary to delete_prefix!)
    # NOTE: use with care: better remove the bom when reading the file
    def delete_bom!
        raise 'encoding is not UTF-8' unless self.encoding == Encoding::UTF_8
        delete_prefix!("\xEF\xBB\xBF")
        return self
    end


    # returns a copy of string with UTF Byte Order Mark deleted from string
    def delete_bom
        dup.delete_bom!
    end

end
```
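
If reopening `String` is undesirable, the same idea can be sketched as a standalone method. `strip_leading_bom` is a hypothetical name, not part of Ruby; unlike `delete_bom!` it never raises and simply leaves non-UTF-8 strings alone. Removing the BOM while reading is still the better option.

```ruby
# Hypothetical helper, not part of Ruby: returns a copy of +str+ without a
# leading U+FEFF. Strings in other encodings are returned as unchanged copies.
def strip_leading_bom(str)
  return str.dup unless str.encoding == Encoding::UTF_8

  # delete_prefix returns a new string whether or not the prefix is present.
  str.delete_prefix("\uFEFF")
end

p strip_leading_bom("\uFEFFname,amount")  # => "name,amount"
p strip_leading_bom("name,amount")        # => "name,amount"
```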

---
~eike

