[ruby-core:96800] [Ruby master Bug#16497] StringIO#internal_encoding is broken (more severely in 2.7)
From:
zverok.offline@...
Date:
2020-01-12 12:36:12 UTC
List:
ruby-core #96800
Issue #16497 has been updated by zverok (Victor Shepelev).
A bit more discussion on fixing the behavior, after discussing on Reddit.
Basically, it is two ways to fix it:
1. Just return to 2.6 behavior
2. Fully implement `internal_encoding` for StringIO
I believe **2 (make it use `internal_encoding`) is more useful**. One example where current behavior is not enough is substituting `StringIO` (for example, in tests) in code where `File` is usually used. For example, Hunspell's spellchecker data files use `SET <encoding>` tag as their first line. So, the code reading this file, could look this way:
```ruby
def read_aff(io)
encoding = read_encoding(io)
io.set_encoding(encoding, 'UTF-8') # read in whatever encoding it has, recode to internal encoding of the software
# now, I assume
io.read_line
# will be equivalent to
<read_bytes_from_io>.force_encoding(encoding).encode('UTF-8')
# ...which is true for File and false for StringIO
end
```
However, **option 1 (just make it behave like 2.6)** is more **backward-compatible**. Otherwise (full fix) this simple code will change behavior under Rails (which set `default_internal` to `UTF-8`):
```ruby
str = 'foo'.force_encoding('Windows-1251')
io = StringIO.new(str)
io.read
```
In 2.6 it returns string in `Windows-1251` encoding, and some code may rely on it. If we'll make StringIO respect `default_internal`, it will start to return recoded to UTF-8 string.
So, maybe this report should be **split into two:**
1. At least fix the bug (by returning to the previous behavior) in 2.7.1
2. Make `StringIO` respect internal_encoding in 2.8/3.0
WDYT?
----------------------------------------
Bug #16497: StringIO#internal_encoding is broken (more severely in 2.7)
https://bugs.ruby-lang.org/issues/16497#change-83795
* Author: zverok (Victor Shepelev)
* Status: Assigned
* Priority: Normal
* Assignee: nobu (Nobuyoshi Nakada)
* Target version:
* ruby -v:
* Backport: 2.5: DONTNEED, 2.6: DONTNEED, 2.7: REQUIRED
----------------------------------------
To the best of my understanding from [Encoding](https://docs.ruby-lang.org/en/master/Encoding.html) docs, the following is true:
* external encoding (explicitly specified or taken from `Encoding.default_external`) specifies how the IO understands input and stores it internally
* internal encoding (explicitly specified or taken from `Encoding.default_internal`) specifies how the IO converts what it reads.
Demonstration with regular files:
```ruby
# prepare data
File.write('test.txt', 'Україна'.encode('KOI8-U'), encoding: 'KOI8-U') #=> 7
def test(io)
str = io.read
[io.external_encoding, io.internal_encoding, str, str.encoding]
end
# read it:
test(File.open('test.txt', 'r:KOI8-U'))
# => [#<Encoding:KOI8-U>, nil, "\xF5\xCB\xD2\xC1\xA7\xCE\xC1", #<Encoding:KOI8-U>]
# We can specify internal encoding when opening the file:
test(File.open('test.txt', 'r:KOI8-U:UTF-8'))
# => [#<Encoding:KOI8-U>, #<Encoding:UTF-8>, "Україна", #<Encoding:UTF-8>]
# ...or when it is already opened
test(File.open('test.txt').tap { |f| f.set_encoding('KOI8-U', 'UTF-8') })
# => [#<Encoding:KOI8-U>, #<Encoding:UTF-8>, "Україна", #<Encoding:UTF-8>]
# ...or with Encoding.default_internal
Encoding.default_internal = 'UTF-8'
test(File.open('test.txt', 'r:KOI8-U'))
# => [#<Encoding:KOI8-U>, #<Encoding:UTF-8>, "Україна", #<Encoding:UTF-8>]
```
But with StringIO, **internal encoding can't be set** in Ruby **2.6**:
```ruby
require 'stringio'
Encoding.default_internal = nil
str = 'Україна'.encode('KOI8-U')
# Simplest form:
test(StringIO.new(str))
# => [#<Encoding:KOI8-U>, nil, "\xF5\xCB\xD2\xC1\xA7\xCE\xC1", #<Encoding:KOI8-U>]
# Try to set via mode
test(StringIO.new(str, 'r:KOI8-U:UTF-8'))
# => [#<Encoding:KOI8-U>, nil, "\xF5\xCB\xD2\xC1\xA7\xCE\xC1", #<Encoding:KOI8-U>]
# Try to set via set_encoding:
test(StringIO.new(str, 'r:KOI8-U:UTF-8').tap { |f| f.set_encoding('KOI8-U', 'UTF-8') })
# => [#<Encoding:KOI8-U>, nil, "\xF5\xCB\xD2\xC1\xA7\xCE\xC1", #<Encoding:KOI8-U>]
# Try to set via Enoding.default_internal:
Encoding.default_internal = 'UTF-8'
test(StringIO.new(str))
# => [#<Encoding:KOI8-U>, nil, "\xF5\xCB\xD2\xC1\xA7\xCE\xC1", #<Encoding:KOI8-U>]
```
So, in 2.6, any attempt to do something with StringIO's internal encoding are **just ignored**.
In **2.7**, though, matters became much worse:
```ruby
require 'stringio'
Encoding.default_internal = nil
str = 'Україна'.encode('KOI8-U')
# Behaves same as 2.6
test(StringIO.new(str))
# => [#<Encoding:KOI8-U>, nil, "\xF5\xCB\xD2\xC1\xA7\xCE\xC1", #<Encoding:KOI8-U>]
# Try to set via mode: WEIRD behavior starts
test(StringIO.new(str, 'r:KOI8-U:UTF-8'))
# => [#<Encoding:UTF-8>, nil, "\xF5\xCB\xD2\xC1\xA7\xCE\xC1", #<Encoding:UTF-8>]
# Try to set via set_encoding: still just ignored
test(StringIO.new(str, 'r:KOI8-U:UTF-8').tap { |f| f.set_encoding('KOI8-U', 'UTF-8') })
# => [#<Encoding:KOI8-U>, nil, "\xF5\xCB\xD2\xC1\xA7\xCE\xC1", #<Encoding:KOI8-U>]
# Try to set via Enoding.default_internal: WEIRD behavior again
Encoding.default_internal = 'UTF-8'
test(StringIO.new(str))
# => [#<Encoding:UTF-8>, nil, "\xF5\xCB\xD2\xC1\xA7\xCE\xC1", #<Encoding:UTF-8>]
```
So, **2.7** not just ignores attempts to set **internal** encoding, but erroneously sets it to **external** one, so strings are not recoded, but their encoding is forced to change.
I believe it is severe bug (more severe than 2.6's "just ignoring").
[This Reddit thread](https://www.reddit.com/r/ruby/comments/emd6q4/is_this_a_stringio_bug_in_ruby_270/) shows how it breaks existing code:
* the author uses `StringIO` to work with `ASCII-8BIT` strings;
* the code is performed in Rails environment (which sets `internal_encoding` to `UTF-8` by default);
* under **2.7**, `StringIO#read` returns `ASCII-8BIT` content in Strings saying their encoding is `UTF-8`.
--
https://bugs.ruby-lang.org/
Unsubscribe: <mailto:ruby-core-request@ruby-lang.org?subject=unsubscribe>
<http://lists.ruby-lang.org/cgi-bin/mailman/options/ruby-core>