From: nobu@...
Date: 2020-01-10T13:18:31+00:00
Subject: [ruby-core:96763] [Ruby master Bug#16497] StringIO#internal_encoding is broken (more severely in 2.7)

Issue #16497 has been updated by nobu (Nobuyoshi Nakada).

Backport changed from 2.5: UNKNOWN, 2.6: UNKNOWN, 2.7: UNKNOWN to 2.5: DONTNEED, 2.6: DONTNEED, 2.7: REQUIRED
Assignee set to nobu (Nobuyoshi Nakada)
Status changed from Open to Assigned

----------------------------------------
Bug #16497: StringIO#internal_encoding is broken (more severely in 2.7)
https://bugs.ruby-lang.org/issues/16497#change-83757

* Author: zverok (Victor Shepelev)
* Status: Assigned
* Priority: Normal
* Assignee: nobu (Nobuyoshi Nakada)
* Target version:
* ruby -v:
* Backport: 2.5: DONTNEED, 2.6: DONTNEED, 2.7: REQUIRED
----------------------------------------
To the best of my understanding from the [Encoding](https://docs.ruby-lang.org/en/master/Encoding.html) docs, the following is true:

* external encoding (explicitly specified or taken from `Encoding.default_external`) specifies how the IO understands input and stores it internally
* internal encoding (explicitly specified or taken from `Encoding.default_internal`) specifies how the IO converts what it reads.

Demonstration with regular files:

```ruby
# prepare data
File.write('test.txt', 'Україна'.encode('KOI8-U'), encoding: 'KOI8-U') #=> 7

def test(io)
  str = io.read
  [io.external_encoding, io.internal_encoding, str, str.encoding]
end

# read it:
test(File.open('test.txt', 'r:KOI8-U'))
# => [#<Encoding:KOI8-U>, nil, "\xF5\xCB\xD2\xC1\xA7\xCE\xC1", #<Encoding:KOI8-U>]

# We can specify internal encoding when opening the file:
test(File.open('test.txt', 'r:KOI8-U:UTF-8'))
# => [#<Encoding:KOI8-U>, #<Encoding:UTF-8>, "Україна", #<Encoding:UTF-8>]

# ...or when it is already opened
test(File.open('test.txt').tap { |f| f.set_encoding('KOI8-U', 'UTF-8') })
# => [#<Encoding:KOI8-U>, #<Encoding:UTF-8>, "Україна", #<Encoding:UTF-8>]

# ...or with Encoding.default_internal
Encoding.default_internal = 'UTF-8'
test(File.open('test.txt', 'r:KOI8-U'))
# => [#<Encoding:KOI8-U>, #<Encoding:UTF-8>, "Україна", #<Encoding:UTF-8>]
```

But with StringIO, **internal encoding can't be set** in Ruby **2.6**:

```ruby
require 'stringio'
Encoding.default_internal = nil
str = 'Україна'.encode('KOI8-U')

# Simplest form:
test(StringIO.new(str))
# => [#<Encoding:KOI8-U>, nil, "\xF5\xCB\xD2\xC1\xA7\xCE\xC1", #<Encoding:KOI8-U>]

# Try to set via mode
test(StringIO.new(str, 'r:KOI8-U:UTF-8'))
# => [#<Encoding:KOI8-U>, nil, "\xF5\xCB\xD2\xC1\xA7\xCE\xC1", #<Encoding:KOI8-U>]

# Try to set via set_encoding:
test(StringIO.new(str, 'r:KOI8-U:UTF-8').tap { |f| f.set_encoding('KOI8-U', 'UTF-8') })
# => [#<Encoding:KOI8-U>, nil, "\xF5\xCB\xD2\xC1\xA7\xCE\xC1", #<Encoding:KOI8-U>]

# Try to set via Encoding.default_internal:
Encoding.default_internal = 'UTF-8'
test(StringIO.new(str))
# => [#<Encoding:KOI8-U>, nil, "\xF5\xCB\xD2\xC1\xA7\xCE\xC1", #<Encoding:KOI8-U>]
```

So, in 2.6, any attempt to do something with StringIO's internal encoding is **just ignored**.

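Where StringIO ignores the internal encoding, one possible stopgap is to transcode explicitly after reading. The following is only a minimal sketch, assuming the caller knows the real source encoding; `read_transcoded` is a hypothetical helper introduced for illustration, not part of StringIO:

```ruby
require 'stringio'

# Hypothetical helper (not a StringIO API): read the remaining data and
# transcode it explicitly, treating the bytes as `from` regardless of how
# the returned string happens to be tagged.
def read_transcoded(io, from:, to: Encoding::UTF_8)
  io.read.encode(to, from)
end

# The same seven KOI8-U bytes used in the examples above.
koi8 = [0xF5, 0xCB, 0xD2, 0xC1, 0xA7, 0xCE, 0xC1].pack('C*').force_encoding('KOI8-U')

result = read_transcoded(StringIO.new(koi8), from: 'KOI8-U')
result.encoding        # the target encoding (UTF-8 by default)
result.valid_encoding? # true when the bytes really were KOI8-U
```

Passing the source encoding to `String#encode` explicitly means the helper does not rely on the encoding that StringIO attaches to the string it returns.
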
In **2.7**, though, matters became much worse:

```ruby
require 'stringio'
Encoding.default_internal = nil
str = 'Україна'.encode('KOI8-U')

# Behaves same as 2.6
test(StringIO.new(str))
# => [#<Encoding:KOI8-U>, nil, "\xF5\xCB\xD2\xC1\xA7\xCE\xC1", #<Encoding:KOI8-U>]

# Try to set via mode: WEIRD behavior starts
test(StringIO.new(str, 'r:KOI8-U:UTF-8'))
# => [#<Encoding:UTF-8>, nil, "\xF5\xCB\xD2\xC1\xA7\xCE\xC1", #<Encoding:UTF-8>]

# Try to set via set_encoding: still just ignored
test(StringIO.new(str, 'r:KOI8-U:UTF-8').tap { |f| f.set_encoding('KOI8-U', 'UTF-8') })
# => [#<Encoding:KOI8-U>, nil, "\xF5\xCB\xD2\xC1\xA7\xCE\xC1", #<Encoding:KOI8-U>]

# Try to set via Encoding.default_internal: WEIRD behavior again
Encoding.default_internal = 'UTF-8'
test(StringIO.new(str))
# => [#<Encoding:UTF-8>, nil, "\xF5\xCB\xD2\xC1\xA7\xCE\xC1", #<Encoding:UTF-8>]
```

So, **2.7** not only ignores attempts to set the **internal** encoding, but erroneously applies it as the **external** one, so strings are not recoded, but their encoding is forced to change. I believe this is a severe bug (more severe than 2.6's "just ignoring").

[This Reddit thread](https://www.reddit.com/r/ruby/comments/emd6q4/is_this_a_stringio_bug_in_ruby_270/) shows how it breaks existing code:

* the author uses `StringIO` to work with `ASCII-8BIT` strings;
* the code runs in a Rails environment (which sets `Encoding.default_internal` to `UTF-8` by default);
* under **2.7**, `StringIO#read` returns `ASCII-8BIT` content in Strings saying their encoding is `UTF-8`.

-- 
https://bugs.ruby-lang.org/