From: thomas@...
Date: 2014-04-07T20:45:06+00:00
Subject: [ruby-core:61901] [ruby-trunk - Bug #9713] __FILE__ return unexpected encoding - breaks Dir.glob

Issue #9713 has been updated by Thomas Thomassen.

Looking at how Ruby determines the filesystem encoding: http://rxr.whitequark.org/mri/source/encoding.c#1267

static int enc_set_filesystem_encoding(void)

~~~
1266 char cp[sizeof(int) * 8 / 3 + 4];
1267 snprintf(cp, sizeof cp, "CP%d", AreFileApisANSI() ? GetACP() : GetOEMCP());
1268 idx = rb_enc_find_index(cp);
1269 if (idx < 0) idx = ENCINDEX_ASCII;
~~~

It chooses between the ANSI code page and the OEM code page - neither of which is Unicode. So under Windows Ruby will always use either the ANSI or the OEM code page?

I can understand the desire for compatibility, but I'd wish for better control, such as compile-time switches that make it possible to set up Ruby under Windows without having to juggle all these different encodings.

For __FILE__ to use UTF-8 selectively when it contains bytes outside of the filesystem code page seems very erratic. And as can be seen with the Dir.glob function, it causes failures that cascade further down the Ruby scripts, because some functions then work with inconsistent encodings.
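To make the selection in the excerpt concrete, here is a rough Ruby sketch of the same decision. The method name `filesystem_encoding_for` and its keyword arguments are hypothetical, the code page numbers are passed in by hand because plain Ruby has no direct equivalent of the Win32 calls, and I am assuming that `ENCINDEX_ASCII` corresponds to ASCII-8BIT.

```ruby
# Sketch only: mirrors the quoted C logic in plain Ruby.
# `filesystem_encoding_for` and its arguments are hypothetical names,
# not part of Ruby's API.
def filesystem_encoding_for(file_apis_ansi:, ansi_cp:, oem_cp:)
  # Same choice as AreFileApisANSI() ? GetACP() : GetOEMCP()
  name = "CP#{file_apis_ansi ? ansi_cp : oem_cp}"
  Encoding.find(name)
rescue ArgumentError
  # Unknown code page name. Assumption: ENCINDEX_ASCII means ASCII-8BIT.
  Encoding::ASCII_8BIT
end

# ANSI file APIs: the ANSI code page is used.
p filesystem_encoding_for(file_apis_ansi: true, ansi_cp: 1252, oem_cp: 437)
# OEM file APIs: the OEM code page is used.
p filesystem_encoding_for(file_apis_ansi: false, ansi_cp: 1252, oem_cp: 437)
# A code page Ruby does not know falls back to the assumed default.
p filesystem_encoding_for(file_apis_ansi: true, ansi_cp: 99999, oem_cp: 437)
```

None of the three branches can produce UTF-8, which is the point I was trying to make with the excerpt.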
----------------------------------------
Bug #9713: __FILE__ return unexpected encoding - breaks Dir.glob
https://bugs.ruby-lang.org/issues/9713#change-46109

* Author: Thomas Thomassen
* Status: Open
* Priority: Normal
* Assignee: cruby-windows
* Category: platform/windows
* Target version: current: 2.2.0
* ruby -v: ruby 2.2.0dev (2014-04-07 trunk 45528) [i386-mswin32_100]
* Backport: 2.0.0: UNKNOWN, 2.1: UNKNOWN
----------------------------------------

**C:/���������/FILE.rb:**

~~~
# encoding: UTF-8
puts "Encoding.find 'filesystem': #{Encoding.find('filesystem').inspect}"
puts "Encoding.find 'locale': #{Encoding.find('locale').inspect}"
puts "Encoding.default internal: #{Encoding.default_internal.inspect}"
puts "Encoding.default external: #{Encoding.default_external.inspect}"
puts "Encoding.locale_charmap: #{Encoding.locale_charmap.inspect}"
puts "__FILE__: #{__FILE__.encoding.inspect}"
puts "'foobar': #{'foobar'.encoding.inspect}"
~~~

**C:/FILE.rb:**

~~~
# encoding: UTF-8
puts "Encoding.find 'filesystem': #{Encoding.find('filesystem').inspect}"
puts "Encoding.find 'locale': #{Encoding.find('locale').inspect}"
puts "Encoding.default internal: #{Encoding.default_internal.inspect}"
puts "Encoding.default external: #{Encoding.default_external.inspect}"
puts "Encoding.locale_charmap: #{Encoding.locale_charmap.inspect}"
puts "__FILE__: #{__FILE__.encoding.inspect}"
puts "'foobar': #{'foobar'.encoding.inspect}"
puts ""
puts "Loading C:/���������/FILE.rb ..."
require "C:/���������/FILE.rb"
~~~

**Results:**

![](media-20140407.png)

~~~
c:\ruby-220\usr\bin>ruby "C:\FILE.rb"
Encoding.find 'filesystem': #
Encoding.find 'locale': #
Encoding.default internal: nil
Encoding.default external: #
Encoding.locale_charmap: "CP437"
__FILE__: #
'foobar': #

Loading C:/???/FILE.rb ...
Encoding.find 'filesystem': #
Encoding.find 'locale': #
Encoding.default internal: nil
Encoding.default external: #
Encoding.locale_charmap: "CP437"
__FILE__: #
'foobar': #

c:\ruby-220\usr\bin>
~~~

Now, let's see how this affects Dir.glob.

Test scenario - a folder structure like this:

~~~
C:/test/
C:/test/foo/
C:/test/てすと/
~~~

**C:/FILE.rb**

~~~
# encoding: UTF-8
puts "Encoding.find 'filesystem': #{Encoding.find('filesystem').inspect}"
puts "Encoding.find 'locale': #{Encoding.find('locale').inspect}"
puts "Encoding.default internal: #{Encoding.default_internal.inspect}"
puts "Encoding.default external: #{Encoding.default_external.inspect}"
puts "Encoding.locale_charmap: #{Encoding.locale_charmap.inspect}"
puts "__FILE__: #{__FILE__.encoding.inspect}"
puts "'foobar': #{'foobar'.encoding.inspect}"
puts ""
pattern = File.join(File.dirname(__FILE__), "test", "*")
puts "pattern.encoding: #{pattern.encoding.inspect}"
result = Dir.glob(pattern)
p result
p result.map { |file| file.encoding }
puts ""
puts "force encoding:"
pattern.force_encoding("UTF-8")
result = Dir.glob(pattern)
p result
p result.map { |file| file.encoding }
~~~

**Result:**

~~~
c:\ruby-220\usr\bin>ruby "C:\FILE.rb"
Encoding.find 'filesystem': #
Encoding.find 'locale': #
Encoding.default internal: nil
Encoding.default external: #
Encoding.locale_charmap: "CP437"
__FILE__: #
'foobar': #

pattern.encoding: #
["C:/test/foo", "C:/test/???"]
[#, #]

force encoding:
["C:/test/foo", "C:/test/\u3066\u3059\u3068"]
[#, #]

c:\ruby-220\usr\bin>
~~~

Observe how Dir.glob, when fed a string based on __FILE__, returns strings in the same encoding, even though the result should include Unicode characters. The Unicode characters are replaced by question marks (actual ASCII question mark bytes, value 63).

Just forcing the input string to UTF-8 makes Dir.glob return the expected strings with the correct Unicode characters.
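As a stopgap, the relabelling step from the second half of the script can be wrapped in a small helper. This is only a sketch of a workaround, not a fix: `glob_utf8` is a hypothetical name, and the demo assumes the temporary directory path is ASCII-only so that it can be tagged as CP437 the way the `__FILE__`-derived pattern was.

```ruby
require "tmpdir"

# Hypothetical helper: glob with a UTF-8 copy of the pattern.
# String#encode returns a new string, so the caller's pattern is untouched.
# For an ASCII-only pattern this amounts to relabelling it as UTF-8;
# otherwise the pattern is transcoded from its current encoding.
def glob_utf8(pattern)
  Dir.glob(pattern.encode(Encoding::UTF_8))
end

Dir.mktmpdir do |root|
  Dir.mkdir(File.join(root, "foo"))
  Dir.mkdir(File.join(root, "\u3066\u3059\u3068"))

  # Stand-in for a pattern derived from __FILE__ that is tagged as CP437.
  # Assumption: root contains only ASCII characters.
  pattern = File.join(root, "*").encode(Encoding::IBM437)

  result = glob_utf8(pattern)
  p result.map { |path| File.basename(path) }.sort
  p result.map { |path| path.encoding }
end
```

Every call site still has to remember to go through the helper, which is exactly the kind of juggling I would like to avoid.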
I'm unsure of where the bug lies, but I would not have expected __FILE__ to return a different encoding depending on whether the path of the executing file contains Unicode characters. All files have been marked as UTF-8 in the file header.

---Files--------------------------------
media-20140407.png (83.1 KB)

--
https://bugs.ruby-lang.org/