[ruby-core:96125] [Ruby master Bug#16402] UTF-16LE BOM causing regex match to fail with "invalid byte sequence in UTF-8"
From: shyouhei@...
Date: 2019-12-06 02:00:13 UTC
List: ruby-core #96125
Issue #16402 has been updated by shyouhei (Shyouhei Urabe).
Assignee set to nahi (Hiroshi Nakamura)
Status changed from Feedback to Third Party's Issue
PikachuEXE (Pikachu Leung) wrote:
> Thanks for your answer
> But I actually encounter this when processing text input from remote data source
> And would not be using `File.read`
Well, that's... complicated. There is a lot of debate about how to determine the content type of content fetched from a remote source. Though not a direct answer, issue #2567 may be interesting to read.
I'm not sure, but HTTPClient may provide a way to specify the encoding. Can you ask the author?
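In the meantime, if the remote data can be obtained as a raw byte string, one possible workaround is to sniff the BOM yourself and transcode to UTF-8 before matching. The following is only a sketch under stated assumptions: `BOMS` and `decode_with_bom` are hypothetical names (not part of Ruby or HTTPClient), only the UTF-8 and UTF-16 BOMs are handled, and data without a BOM is simply tagged with a fallback encoding.

```ruby
# Hypothetical lookup table: BOM bytes => encoding they announce.
# UTF-32 is deliberately left out of this sketch.
BOMS = {
  "\xEF\xBB\xBF".b => Encoding::UTF_8,
  "\xFF\xFE".b     => Encoding::UTF_16LE,
  "\xFE\xFF".b     => Encoding::UTF_16BE
}

# Hypothetical helper: strip a leading BOM and return the text as UTF-8.
# Raises an Encoding error if the body is not valid in the announced encoding.
def decode_with_bom(bytes, fallback: Encoding::UTF_8)
  data = bytes.b # binary copy, so the caller's string is left untouched
  BOMS.each do |bom, enc|
    next unless data.start_with?(bom)
    body = data.byteslice(bom.bytesize..-1)
    return body.force_encoding(enc).encode(Encoding::UTF_8)
  end
  # No BOM found: assume the fallback encoding (an assumption, not a detection).
  data.force_encoding(fallback)
end

text = decode_with_bom("\xff\xfeh\x00i\x00".b) # UTF-16LE BOM followed by "hi"
p text.encoding
p(/\w+/.match?(text))
```

With this approach the string handed to the regexp is valid UTF-8, so the match no longer raises. Whether guessing from a BOM is acceptable depends on the data source; an encoding declared by the protocol or the library, when available, is more reliable.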
----------------------------------------
Bug #16402: UTF-16LE BOM causing regex match to fail with "invalid byte sequence in UTF-8"
https://bugs.ruby-lang.org/issues/16402#change-82984
* Author: PikachuEXE (Pikachu Leung)
* Status: Third Party's Issue
* Priority: Normal
* Assignee: nahi (Hiroshi Nakamura)
* Target version:
* ruby -v: ruby 2.6.5p114 (2019-10-01 revision 67812) [x86_64-darwin18]
* Backport: 2.5: UNKNOWN, 2.6: UNKNOWN
----------------------------------------
``` shell
$ ruby -e 'File.binwrite("u.txt", "\xff\xfe\x00\x01")'
$ file u.txt
u.txt: Little-endian UTF-16 Unicode text, with no line terminators
$ ruby -e 'p /\w+/.match?(File.read("u.txt"))'
Traceback (most recent call last):
1: from -e:1:in `<main>'
-e:1:in `match?': invalid byte sequence in UTF-8 (ArgumentError)
```
No error should be raised, just as when matching against the same string without the BOM:
``` shell
$ ruby -e 'p /\w+/.match?(File.read("u.txt")[2..-1])'
false
```
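The difference between the two runs can be reproduced without a file. The sketch below assumes the default external encoding is UTF-8, so that `File.read` tags the bytes as UTF-8; `safe_match?` is a hypothetical helper added for illustration, not an existing Ruby method.

```ruby
# The four bytes from the report, tagged as UTF-8 the way File.read would
# tag them when the default external encoding is UTF-8.
raw = "\xff\xfe\x00\x01".b.force_encoding(Encoding::UTF_8)

p raw.valid_encoding?   # the BOM bytes 0xFF and 0xFE never occur in UTF-8

# Each invalid byte counts as one character, so [2..-1] drops exactly the BOM.
tail = raw[2..-1]
p tail.bytes
p tail.valid_encoding?  # "\x00\x01" is two valid (control) characters

# Hypothetical helper: treat a string with a broken encoding as "no match"
# instead of letting the regexp engine raise.
def safe_match?(pattern, str)
  str.valid_encoding? && pattern.match?(str)
end

p safe_match?(/\w+/, raw)
p safe_match?(/\w+/, tail)
```

Guarding with `valid_encoding?` only avoids the exception; it does not decode the UTF-16LE content, so the text still needs to be transcoded if its characters are to be matched.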