From: shyouhei@...
Date: 2019-12-06T02:00:13+00:00
Subject: [ruby-core:96125] [Ruby master Bug#16402] UTF-16LE BOM causing regex match to fail with "invalid byte sequence in UTF-8"

Issue #16402 has been updated by shyouhei (Shyouhei Urabe).

Assignee set to nahi (Hiroshi Nakamura)
Status changed from Feedback to Third Party's Issue

PikachuEXE (Pikachu Leung) wrote:

> Thanks for your answer
> But I actually encounter this when processing text input from remote data source
> And would not be using `File.read`

Well that's... complicated. There are lots of debates as to how to know network remote content's content type. Though not a direct answer, issue #2567 can be interesting to read.

Not sure but maybe HTTPClient provides a way to specify encoding. Can you ask the author?

----------------------------------------
Bug #16402: UTF-16LE BOM causing regex match to fail with "invalid byte sequence in UTF-8"
https://bugs.ruby-lang.org/issues/16402#change-82984

* Author: PikachuEXE (Pikachu Leung)
* Status: Third Party's Issue
* Priority: Normal
* Assignee: nahi (Hiroshi Nakamura)
* Target version:
* ruby -v: ruby 2.6.5p114 (2019-10-01 revision 67812) [x86_64-darwin18]
* Backport: 2.5: UNKNOWN, 2.6: UNKNOWN
----------------------------------------
``` shell
$ ruby -e 'File.binwrite("u.txt", "\xff\xfe\x00\x01")'
$ file u.txt
u.txt: Little-endian UTF-16 Unicode text, with no line terminators
$ ruby -e 'p /\w+/.match?(File.read("u.txt"))'
Traceback (most recent call last):
	1: from -e:1:in `<main>'
-e:1:in `match?': invalid byte sequence in UTF-8 (ArgumentError)
```

No error should be raised, just like when comparing with string without BOM

``` shell
$ ruby -e 'p /\w+/.match?(File.read("u.txt")[2..-1])'
false
```

--
https://bugs.ruby-lang.org/
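
For text that arrives as raw bytes from a remote source rather than through `File.read`, one possible approach is to look for a BOM and retag the string before matching. The following is only a sketch under stated assumptions: `decode_with_bom` is a hypothetical helper (it is not part of Ruby or HTTPClient), it assumes the payload is UTF-8 unless a UTF-8 or UTF-16 BOM is present, and it does not try to recognise UTF-32LE, whose BOM also begins with the bytes FF FE.

``` ruby
# Hypothetical helper, not an existing Ruby or HTTPClient API.
# Assumes: input is UTF-8 unless it starts with a UTF-8/UTF-16 BOM.
def decode_with_bom(bytes)
  data = bytes.b # binary (ASCII-8BIT) copy, so byte comparisons are safe

  if data.start_with?("\xEF\xBB\xBF".b)
    # UTF-8 BOM: drop the 3 BOM bytes and tag the rest as UTF-8
    data.byteslice(3..-1).force_encoding(Encoding::UTF_8)
  elsif data.start_with?("\xFF\xFE".b)
    # UTF-16LE BOM: drop the 2 BOM bytes, then transcode to UTF-8
    data.byteslice(2..-1).force_encoding(Encoding::UTF_16LE).encode(Encoding::UTF_8)
  elsif data.start_with?("\xFE\xFF".b)
    # UTF-16BE BOM: same idea with the opposite byte order
    data.byteslice(2..-1).force_encoding(Encoding::UTF_16BE).encode(Encoding::UTF_8)
  else
    # No BOM: fall back to the assumed default
    data.force_encoding(Encoding::UTF_8)
  end
end

# Same bytes as in the report: UTF-16LE BOM followed by U+0100
text = decode_with_bom("\xff\xfe\x00\x01".b)
p text.valid_encoding?   # the decoded string is valid UTF-8
p(/\w+/.match?(text))    # no ArgumentError is raised
```

Transcoding to UTF-8 up front keeps the regex side unchanged. Note that `encode` can still raise if the bytes after the BOM are not well-formed UTF-16, so a caller may want to rescue `EncodingError` or pass `invalid: :replace`.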