From: shevegen@...
Date: 2018-10-06T19:51:04+00:00
Subject: [ruby-core:89299] [Ruby trunk Bug#15210] UTF-8 BOM should be removed from String in internal representation

Issue #15210 has been updated by shevegen (Robert A. Heiler).

> BTW: stdlib::CSV chokes on the BOM

I can't say how common this is, or whether it is a bug at all. But if it is one, and it affects other ruby hackers, then I agree that CSV should probably be able to handle a BOM as well, in one way or another (be it automatically or via another API).

I also agree that it could perhaps be mentioned somewhere, be it in the csv documentation or elsewhere.

As for the workaround: I assume you meant this only as a solution for others who face a similar problem, rather than as a permanent addition to class String, yes?

(I ask this because adding a specific method to class String permanently in ruby may be much harder to do and get approved, whereas an extension to ruby's CSV is most likely easier and possible.)

----------------------------------------
Bug #15210: UTF-8 BOM should be removed from String in internal representation
https://bugs.ruby-lang.org/issues/15210#change-74332

* Author: foonlyboy (Eike Dierks)
* Status: Open
* Priority: Normal
* Assignee:
* Target version:
* ruby -v:
* Backport: 2.3: UNKNOWN, 2.4: UNKNOWN, 2.5: UNKNOWN
----------------------------------------
Hi everyone working on the ruby trunk,

I encountered a problem with a BOM (Byte Order Mark) at the front of UTF-8 string data.

We import some CSV from PayPal. They now include a BOM in front of their UTF-8 encoded CSV data. This BOM is causing some trouble.

I believe this to be a bug in how byte data is converted to the ruby internal String representation.
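
To illustrate the problem, here is a minimal sketch of what happens when BOM-prefixed UTF-8 data is read with a plain UTF-8 encoding. The file contents and the names (`csv_bytes`, `first_header`) are made up for illustration and are not taken from the actual export; the header is split by hand rather than with CSV, so this only shows the String side of the issue.

```ruby
require 'tempfile'

# Hypothetical sample data: a UTF-8 BOM (bytes EF BB BF) followed by a CSV header row.
bom_bytes = "\xEF\xBB\xBF".b
csv_bytes = bom_bytes + "id,amount\n1,9.99\n".b

file = Tempfile.new(['bom_demo', '.csv'])
file.binmode
file.write(csv_bytes)
file.close

# A plain UTF-8 read keeps the BOM as the first character (U+FEFF) of the String.
text = File.read(file.path, encoding: 'UTF-8')
first_header = text.lines.first.chomp.split(',').first

puts text.start_with?("\uFEFF")  # prints true: the BOM is part of the String
puts first_header == 'id'        # prints false: the first header carries the BOM
puts first_header.bytes.length   # prints 5: 3 BOM bytes + 2 bytes for "id"

file.unlink
```

The first header looks like `id` when printed, but it does not compare equal to `'id'`, which is the kind of surprise that makes the BOM hard to spot.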

There is a workaround, but this needs to be documented: `IO.read(path, mode: 'r:BOM|UTF-8')`

---

But I'm asking to improve the UTF BOM handling:

- The BOM is only used for transfer encoding at the byte stream level.
- The BOM MUST NOT be part of the String in internal representation.

---

BTW: stdlib::CSV chokes on the BOM

I'd like to add some code for a workaround:

```ruby
class String
  # Delete the UTF-8 Byte Order Mark from the string, in place.
  # Returns self (even if no BOM was found, contrary to delete_prefix!).
  # NOTE: use with care: better remove the BOM when reading the file
  def delete_bom!
    raise 'encoding is not UTF-8' unless encoding == Encoding::UTF_8
    delete_prefix!("\xEF\xBB\xBF")
    self
  end

  # Returns a copy of the string with the UTF-8 Byte Order Mark deleted.
  def delete_bom
    dup.delete_bom!
  end
end
```

---
~eike

--
https://bugs.ruby-lang.org/