From: thomas@...
Date: 2014-04-07T19:46:22+00:00
Subject: [ruby-core:61899] [ruby-trunk - Bug #9713] __FILE__ return unexpected encoding - breaks Dir.glob

Issue #9713 has been updated by Thomas Thomassen.

Usaku NAKAMURA wrote:
> But, the first case, the encoding of __FILE__ should be Windows-1252 (filesystem encoding)
> or UTF-8 (script's encoding), I think.
> It may be a bug.

Since the Windows file system can store Unicode characters, I would expect __FILE__ to be Unicode-encoded, even if the file encoding is different. The file system doesn't store file names as Windows-1252 data; that is just the fallback compatibility code page for programs that don't declare themselves Unicode-capable. Ruby doesn't make that declaration (it doesn't seem to define the UNICODE flag), but it explicitly calls the *W variants of the file functions.

If I need to present a file name in the UI, or write it to a file, in a different encoding, I can do the appropriate transcoding. But I don't see any reason why Ruby's file-related functions under Windows should yield any strings that are not Unicode.
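To illustrate the kind of transcoding I mean, here is a minimal sketch. It is not code from Ruby itself or from this report: `display_name` is a hypothetical helper name, and Windows-1252 is only an example target encoding.

```ruby
# encoding: UTF-8

# Hypothetical helper (an assumption, not an existing API): transcode a UTF-8
# path into a legacy code page for display, substituting "?" for characters
# the target encoding cannot represent.
def display_name(path, target = Encoding::Windows_1252)
  path.encode(target, invalid: :replace, undef: :replace, replace: "?")
end

name   = "C:/test/\u3066\u3059\u3068"
legacy = display_name(name)

puts legacy.encoding  # Windows-1252
puts legacy           # C:/test/???
```

The point is that the lossy step happens explicitly, at the boundary where the legacy encoding is actually needed, and the original UTF-8 string stays intact.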
----------------------------------------
Bug #9713: __FILE__ return unexpected encoding - breaks Dir.glob
https://bugs.ruby-lang.org/issues/9713#change-46107

* Author: Thomas Thomassen
* Status: Open
* Priority: Normal
* Assignee: cruby-windows
* Category: platform/windows
* Target version: current: 2.2.0
* ruby -v: ruby 2.2.0dev (2014-04-07 trunk 45528) [i386-mswin32_100]
* Backport: 2.0.0: UNKNOWN, 2.1: UNKNOWN
----------------------------------------
**C:/てすと/FILE.rb:**

~~~
# encoding: UTF-8
puts "Encoding.find 'filesystem': #{Encoding.find('filesystem').inspect}"
puts "Encoding.find 'locale': #{Encoding.find('locale').inspect}"
puts "Encoding.default internal: #{Encoding.default_internal.inspect}"
puts "Encoding.default external: #{Encoding.default_external.inspect}"
puts "Encoding.locale_charmap: #{Encoding.locale_charmap.inspect}"
puts "__FILE__: #{__FILE__.encoding.inspect}"
puts "'foobar': #{'foobar'.encoding.inspect}"
~~~

**C:/FILE.rb:**

~~~
# encoding: UTF-8
puts "Encoding.find 'filesystem': #{Encoding.find('filesystem').inspect}"
puts "Encoding.find 'locale': #{Encoding.find('locale').inspect}"
puts "Encoding.default internal: #{Encoding.default_internal.inspect}"
puts "Encoding.default external: #{Encoding.default_external.inspect}"
puts "Encoding.locale_charmap: #{Encoding.locale_charmap.inspect}"
puts "__FILE__: #{__FILE__.encoding.inspect}"
puts "'foobar': #{'foobar'.encoding.inspect}"
puts ""
puts "Loading C:/てすと/FILE.rb ..."
require "C:/てすと/FILE.rb"
~~~

**Results:**

![](media-20140407.png)

~~~
c:\ruby-220\usr\bin>ruby "C:\FILE.rb"
Encoding.find 'filesystem': #
Encoding.find 'locale': #
Encoding.default internal: nil
Encoding.default external: #
Encoding.locale_charmap: "CP437"
__FILE__: #
'foobar': #

Loading C:/???/FILE.rb ...
Encoding.find 'filesystem': #
Encoding.find 'locale': #
Encoding.default internal: nil
Encoding.default external: #
Encoding.locale_charmap: "CP437"
__FILE__: #
'foobar': #

c:\ruby-220\usr\bin>
~~~

Now, let's see how this affects Dir.glob.

Test scenario - a folder structure like this:

~~~
C:/test/
C:/test/foo/
C:/test/てすと/
~~~

**C:/FILE.rb**

~~~
# encoding: UTF-8
puts "Encoding.find 'filesystem': #{Encoding.find('filesystem').inspect}"
puts "Encoding.find 'locale': #{Encoding.find('locale').inspect}"
puts "Encoding.default internal: #{Encoding.default_internal.inspect}"
puts "Encoding.default external: #{Encoding.default_external.inspect}"
puts "Encoding.locale_charmap: #{Encoding.locale_charmap.inspect}"
puts "__FILE__: #{__FILE__.encoding.inspect}"
puts "'foobar': #{'foobar'.encoding.inspect}"
puts ""
pattern = File.join(File.dirname(__FILE__), "test", "*")
puts "pattern.encoding: #{pattern.encoding.inspect}"
result = Dir.glob(pattern)
p result
p result.map { |file| file.encoding }
puts ""
puts "force encoding:"
pattern.force_encoding("UTF-8")
result = Dir.glob(pattern)
p result
p result.map { |file| file.encoding }
~~~

**Result:**

~~~
c:\ruby-220\usr\bin>ruby "C:\FILE.rb"
Encoding.find 'filesystem': #
Encoding.find 'locale': #
Encoding.default internal: nil
Encoding.default external: #
Encoding.locale_charmap: "CP437"
__FILE__: #
'foobar': #

pattern.encoding: #
["C:/test/foo", "C:/test/???"]
[#, #]

force encoding:
["C:/test/foo", "C:/test/\u3066\u3059\u3068"]
[#, #]

c:\ruby-220\usr\bin>
~~~

Observe that when Dir.glob is fed a string derived from __FILE__, it returns strings in that same encoding, even though the results should contain Unicode characters. The Unicode characters are replaced by question marks (actual ASCII question-mark bytes, value 63).

Simply forcing the input string to UTF-8 makes Dir.glob return the expected strings with the correct Unicode characters.
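For anyone hitting the same problem, the workaround can be wrapped up roughly like this. This is only a sketch under assumptions: `glob_utf8` is a hypothetical helper name, and it transcodes the pattern with `String#encode` instead of relabelling it with `force_encoding` as the test script does. The two are equivalent for ASCII-only patterns, but `encode` also converts any non-ASCII bytes. The demo uses a temporary directory so it can run anywhere; it shows the shape of the workaround and does not reproduce the Windows behaviour itself.

```ruby
require "tmpdir"
require "fileutils"

# Hypothetical helper (not part of Ruby): transcode the pattern to UTF-8
# before globbing, so that the results come back tagged as UTF-8.
def glob_utf8(pattern)
  Dir.glob(pattern.encode(Encoding::UTF_8))
end

# Build a throwaway copy of the test/ folder structure and glob it.
names = Dir.mktmpdir do |root|
  FileUtils.mkdir_p(File.join(root, "test", "foo"))
  FileUtils.mkdir_p(File.join(root, "test", "\u3066\u3059\u3068"))
  glob_utf8(File.join(root, "test", "*")).map { |path| File.basename(path) }.sort
end

p names
p names.map(&:encoding)
```

Note that `encode` raises an exception if the pattern contains bytes that are invalid in its declared encoding, which seems preferable to silently globbing for the wrong name.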
I'm unsure where the bug lies, but I would not have expected __FILE__ to return a different encoding depending on whether the path of the executing file contains Unicode characters. All files have been marked as UTF-8 in the file header.

---Files--------------------------------
media-20140407.png (83.1 KB)

--
https://bugs.ruby-lang.org/