From: "YO4 (Yoshinao Muramatsu)"
Date: 2022-03-09T14:01:16+00:00
Subject: [ruby-dev:51168] [Ruby master Bug#18588] ruby -e 'p gets' with japanese charactors gets additional invalid leading chars and caught Encoding::InvalidByteSequenceError

Issue #18588 has been updated by YO4 (Yoshinao Muramatsu).

It seems that the ANSI version of PeekConsoleInput reads a multibyte character only partially, so the subsequent ReadFile returns wrong data on newer Windows 10 versions.
I reported this to microsoft/terminal (https://github.com/microsoft/terminal/issues/12626).

To avoid this behavior, we can use the Unicode versions of PeekConsoleInput/ReadConsoleInput.
PR: https://github.com/ruby/ruby/pull/5634

----------------------------------------
Bug #18588: ruby -e 'p gets' with japanese charactors gets additional invalid leading chars and caught Encoding::InvalidByteSequenceError
https://bugs.ruby-lang.org/issues/18588#change-96734

* Author: YO4 (Yoshinao Muramatsu)
* Status: Open
* Priority: Normal
* Backport: 2.6: UNKNOWN, 2.7: UNKNOWN, 3.0: UNKNOWN, 3.1: UNKNOWN
----------------------------------------
### When a line starting with a Japanese character is entered from the console, ruby almost always receives additional invalid leading bytes.

## Steps to reproduce

```
R:\ruby32\bin>ruby -e 'p gets'
あ
-e:1:in `gets': "\\xA0" on Windows-31J (Encoding::InvalidByteSequenceError)
        from -e:1:in `gets'
        from -e:1:in `<main>'
```

## Expected result

```
R:\ruby32\bin>ruby -e 'p gets'
あ
"あ"
```

## Your ruby version (ruby -v)

```
R:\ruby32\bin>ruby -v
ruby 3.2.0dev (2022-02-16T08:57:04Z master 00c7a0d491) [x64-mswin64_140]

R:\ruby32\bin>ver

Microsoft Windows [Version 10.0.19043.1526]
```

## Other observations

### Environment

* In a Command Prompt window with Legacy Console mode, this issue does NOT occur.
* In Windows Terminal, this issue occurs.
* In Windows Sandbox (Japanese locale), this issue occurs.
* RubyInstaller binaries have the same issue.

```
C:\src\git>ruby -v
ruby 3.1.0p0 (2021-12-25 revision fb4df44d16) [x64-mingw-ucrt]

C:\src\git>ruby -Eutf-8 -e 'p gets'
あ
-e:1:in `gets': "\\xA0" on Windows-31J (Encoding::InvalidByteSequenceError)
        from -e:1:in `gets'
        from -e:1:in `<main>'
```

### A line starting with single-byte character(s) yields a valid value.

```
R:\ruby32\bin>ruby -e 'p gets'
:あ
":あ\n" # <= valid
```

### The external encoding affects the behavior

* With Windows-31J, a second Enter key press is required to complete the line input.

```
R:\ruby32\bin>ruby -EWindows-31J -e 'p gets'
あ
             # <= second Enter key press required
"\xA0\xFFあ\n" # <= \xA0\xFF are the additional bytes
```

### Character variations

```
R:\ruby32\bin>ruby -EWindows-31J -e 'p gets.b'
あ # <= \x{82A0}

"\xA0\xFF\x82\xA0\n"

R:\ruby32\bin>ruby -EWindows-31J -e 'p gets.b'
　 # <= \x{8140} fullwidth space

"@\x00\x81@\n"

R:\ruby32\bin>ruby -EWindows-31J -e 'p gets.b'
、 # <= \x{8141}

"A\x00\x81A\n"

R:\ruby32\bin>ruby -EWindows-31J -e 'p gets.b'
。 # <= \x{8142}

"B\x00\x81B\n"
```

### sysread returns a valid value.

```
R:\ruby32\bin>ruby -e 'p STDIN.sysread(1024).force_encoding(Encoding::Windows_31J)'
あ
"\x{82A0}\r\n" # <= valid
```

### STDIN.binmode cannot resolve this.

```
R:\ruby32\bin>ruby -e 'STDIN.binmode; p gets.force_encoding(Encoding::Windows_31J)'
あ
             # <= second Enter key press required
"\xA0\xFF\x{82A0}\r\r\n" # <= invalid
```

### Ruby 3.0 and earlier versions behave differently; in particular, sysread returns an invalid value.

```
C:\src\git>ruby -v
ruby 3.0.3p157 (2021-11-24 revision 3fb7d2cadc) [x64-mingw32]

C:\src\git>ruby -Eutf-8 -e 'p gets'
あ
             # <= second Enter key press required
"\xA0\xFF\x82\xA0\n" # <= no exception occurs, but the value is invalid

C:\src\git>ruby -EWindows-31J -e 'p gets'
あ
             # <= second Enter key press required
"\xA0\xFFあ\n" # <= also an invalid value

C:\src\git>ruby -e 'p STDIN.sysread(1024).force_encoding(Encoding::Windows_31J)'
あ
"\xA0\xFF\x{82A0}\r"
```

## Conclusion

1. In ruby 3.1/3.2dev, gets returns an invalid value whereas sysread returns a valid one.
1. sysread returns a valid value in ruby 3.1/3.2dev but an invalid one in 3.0.
1. The fact that it works fine in the legacy console suggests that Windows has some issue, but from the observations above it looks like ruby can handle it.

-- 
https://bugs.ruby-lang.org/
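
A note on the `gets.b` dumps listed under the character variations: in every reported case the first stray byte is the same as the trailing byte of the entered character's Windows-31J encoding. That is consistent with the partial-read hypothesis, though it does not prove it. The following is a minimal Ruby sketch for checking that pattern. The byte strings are copied from the outputs reported above, and `stray_prefix` is a hypothetical helper introduced only for this illustration (it is not part of Ruby or of the PR).

```ruby
# Byte strings copied from the `gets.b` outputs in the report.
REPORTED = {
  "あ"     => "\xA0\xFF\x82\xA0\n".b,
  "\u3000" => "@\x00\x81@\n".b,      # fullwidth space
  "、"     => "A\x00\x81A\n".b,
  "。"     => "B\x00\x81B\n".b,
}.freeze

# Hypothetical helper: returns the bytes that precede the expected
# Windows-31J encoding of `char` inside the raw input `raw`.
def stray_prefix(char, raw)
  expected = char.encode(Encoding::Windows_31J).b
  raw[0, raw.index(expected)]
end

REPORTED.each do |char, raw|
  expected = char.encode(Encoding::Windows_31J).b
  stray    = stray_prefix(char, raw)
  # Compare the first stray byte with the trailing byte of the character.
  same = stray.getbyte(0) == expected.getbyte(1)
  puts "expected=#{expected.inspect} stray=#{stray.inspect} first stray == trail byte: #{same}"
end
```

The second stray byte differs between the samples (`\xFF` for あ, `\x00` for the others), so this sketch only checks the first one.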