From: foonlyboy@...
Date: 2018-10-12T18:44:59+00:00
Subject: [ruby-core:89391] [Ruby trunk Bug#15210] UTF-8 BOM should be removed from String in internal representation

Issue #15210 has been updated by foonlyboy (Eike Dierks).

I looked into it a bit more closely:

io.c does this in
~~~ c
static int io_strip_bom(VALUE io)
~~~

which is called by:
~~~ c
static void io_set_encoding_by_bom(VALUE io)
~~~

> It is documented at `IO.new`, and you can use it at `CSV.open` too.

Yes, I was aware of this.
I also agree that the conversion has to take place when opening the file.

But with Rails I get an ActionDispatch::Http::UploadedFile
(which returns an ASCII-8BIT byte stream),
and I could find no way to apply io_strip_bom() to it,
not even by going through StringIO.
(But then Ruby is not about applying tricks anyway.)

It sounds to me that nobu also agrees that the BOM should always be removed.

> If the BOM character appears in the middle of a data stream, Unicode says it should be interpreted as a "zero-width non-breaking space"

I don't care so much about this for now
(though I can imagine this happening when concatenating files ...).
But let's fix the simpler problems first.

I think the BOM is used for two reasons in byte streams:
- a magic number for UTF encoded data (which might even apply to UTF-8)
- a magic number to distinguish different UTF byte orderings when using UTF-16, UTF-32, UTF-36?

But in the Ruby world, we have **String**.
We should remove all artefacts of any external encoding.

Impact:
I believe this might need changes in more than just one place in the code,
but I believe this should be fully upward compatible with *most* customers' code.

This should still agree with the Ruby spec,
because nowhere was it ever declared that String keeps the BOM.

---
Please excuse my lengthy writing,
but I thought these encoding problems were a thing of the past.

We might also look at the other languages around.
Makes for a good Rosetta Code ...

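For the uploaded-file case described above, where the data arrives as an ASCII-8BIT byte string, a minimal sketch of stripping the BOM by hand might look like the following. The names `utf8_without_bom` and `UTF8_BOM_BYTES` are hypothetical (they are not part of Ruby or Rails), the `raw` string only stands in for whatever the upload's `read` returns, and the sketch assumes the payload really is UTF-8.

```ruby
# Hypothetical helper, not part of Ruby or Rails.
# The three bytes of a UTF-8 encoded BOM, tagged as binary (ASCII-8BIT).
UTF8_BOM_BYTES = "\xEF\xBB\xBF".b.freeze

# Returns a new UTF-8 tagged String with a leading BOM removed.
# Assumes the bytes are UTF-8; the argument itself is not modified.
def utf8_without_bom(bytes)
  data = bytes.b                                   # binary copy of the input
  data = data.byteslice(3..-1) if data.start_with?(UTF8_BOM_BYTES)
  data.force_encoding(Encoding::UTF_8)             # retag only, no transcoding
end

# Stand-in for the byte stream of an uploaded CSV file.
raw  = "\xEF\xBB\xBFname,amount\n".b
text = utf8_without_bom(raw)
```

The comparison is done on bytes rather than on characters, so it does not depend on the encoding tag the upload happens to carry.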
~eike

----------------------------------------
Bug #15210: UTF-8 BOM should be removed from String in internal representation
https://bugs.ruby-lang.org/issues/15210#change-74431

* Author: foonlyboy (Eike Dierks)
* Status: Open
* Priority: Normal
* Assignee: docs
* Target version:
* ruby -v:
* Backport: 2.3: UNKNOWN, 2.4: UNKNOWN, 2.5: UNKNOWN
----------------------------------------
Hi everyone working on the ruby trunk,

I encountered a problem with a BOM (Byte Order Mark) at the front of UTF-8 string data.

We import some CSV from PayPal.
They now include a BOM in front of their UTF-8 encoded CSV data.
This BOM is causing some trouble.

I believe this to be a bug in how byte data is converted to the Ruby internal String representation.

There is a workaround, but it needs to be documented:
```ruby
# path is a placeholder for the name of the file to read
IO.read(path, mode: 'r:BOM|UTF-8')
```
---
But I'm asking to improve the UTF BOM handling:
- The BOM is only used for transfer encoding at the byte stream level.
- The BOM MUST NOT be part of the String in internal representation.

---
BTW: stdlib::CSV chokes on the BOM

I'd like to add some code for a workaround:

```ruby
class String
  # delete UTF Byte Order Mark from string
  # returns self (even if no bom was found, contrary to delete_prefix!)
  # NOTE: use with care: better remove the bom when reading the file
  def delete_bom!
    raise 'encoding is not UTF-8' unless encoding == Encoding::UTF_8
    delete_prefix!("\xEF\xBB\xBF")
    self
  end

  # returns a copy of string with UTF Byte Order Mark deleted from string
  def delete_bom
    dup.delete_bom!
  end
end
```

---
~eike

--
https://bugs.ruby-lang.org/