From: "NARUSE, Yui" <naruse@...>
Date: 2012-11-09T21:38:15+09:00
Subject: [ruby-core:49165] Re: [ruby-trunk - Bug #7267] Dir.glob on Mac OS X returns unexpected string encodings for unicode file names

2012/11/9 kennygrant (Kenny Grant) <kennygrant@gmail.com>:
> Thanks for the comments on this issue. I'm not clear on what the UTF8-MAC encoding represents, are there docs on this Ruby behaviour and the problems involved somewhere?

See the last several lines of enc/utf_8.c.

> It may return a filename marked UTF-8 which is NFD, or NFC, depending on the glob pattern you call it with (see writer.rb attachment to this issue). That's a small issue though and just indicates a wider complex problem.

The two puts calls in writer.rb output the same result.
What do you mean?

>> One issue is that people may write decomposed filenames. An imaginary use case is a program that makes a filename from the name of a track exported from iTunes. iTunes manages text as UTF8-MAC, so people will be confused.
>
> OK, so in this case someone is unwittingly using a mix of UTF-8 NFC (any strings they create in ruby with legible accents) and UTF-8 NFD (any strings they get from itunes say) in their script, which could lead to issues even before writing file names. If they get NFD from itunes, then try to match on a track name with a regexp, it won't work unless they convert to NFC or explicitly create an NFD string will it?

It will work unless the regexp depends heavily on composed strings.
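
A minimal sketch of that point. The track title and variable names are made up for illustration. A pattern that avoids the accented character matches a decomposed string, while a pattern written with the precomposed character does not:

```ruby
# Hypothetical track title in decomposed form: "e" followed by U+0301.
decomposed_title = "Cafe\u0301 del Mar"

# A pattern that does not touch the accented character still matches.
plain_pattern = /del Mar/
plain_hit = !(decomposed_title =~ plain_pattern).nil?        # true

# A pattern written with the precomposed U+00E9 does not match,
# because the string holds "e" + U+0301 instead.
composed_pattern = /Caf\u00e9/
composed_hit = !(decomposed_title =~ composed_pattern).nil?  # false

puts plain_hit
puts composed_hit
```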

> One thing I don't understand though, is that you say there are both in normal use - in use of Ruby ignoring file systems, if you create a string or regexp, NFC is the default isn't it?

No, NFC is not the default.
The fact is that many IMEs output composed characters.
Once a decomposed character gets into a string, it stays as it is.
It won't be normalized.
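
For illustration only (the variable names are mine), the two spellings of "é" stay distinct in Ruby, and concatenating them keeps each one exactly as it was written:

```ruby
# Two spellings of "é": precomposed U+00E9, and "e" + combining acute U+0301.
composed   = "\u00e9"
decomposed = "e\u0301"

# Ruby compares the stored code points; neither side is normalized.
same = (composed == decomposed)          # false

composed_cps   = composed.codepoints     # [233]
decomposed_cps = decomposed.codepoints   # [101, 769]

# Mixing both forms in one string keeps each character as it is.
mixed_cps = (composed + decomposed).codepoints   # [233, 101, 769]

puts same
p composed_cps, decomposed_cps, mixed_cps
```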

> So Ruby has chosen one default for UTF-8 strings created in Ruby (as it must), but has to interact with lots of systems which might or might not be using NFC. At present we seem to have a de-facto default normalization of NFC, but nothing is translated to it when it comes from the OS. That might be a very hard problem, but in principle it would be nice to have one normalization blessed as the default so that all strings in a given encoding are comparable. The results of leaving them as they are supplied are really unexpected, and people using Ruby are not going to want to manually convert every string they touch from outside Ruby to NFC in case it was touched by HFS or created as NFD.

Ruby doesn't normalize characters.
It treats them as they are.
Windows, Linux, and other file systems also don't normalize.

Moreover, normalizing to NFC or NFD loses information.
If a filename on Windows or Linux contains decomposed characters,
converting it to NFC loses that fact.
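
A rough sketch of the loss, assuming a Ruby that provides String#unicode_normalize (2.2 or later, which is newer than the Ruby discussed in this thread). The filenames are hypothetical. Two names that a non-normalizing file system keeps apart become the same string after NFC:

```ruby
# Two hypothetical filenames that differ only in how "é" is spelled.
name_composed   = "\u00e9.txt"
name_decomposed = "e\u0301.txt"

# As stored, they are different strings (and different files on a
# file system that does not normalize).
distinct_before = (name_composed != name_decomposed)   # true

# After NFC both collapse to the same string, so which spelling the
# original used can no longer be recovered.
nfc_a = name_composed.unicode_normalize(:nfc)
nfc_b = name_decomposed.unicode_normalize(:nfc)
same_after = (nfc_a == nfc_b)                          # true

puts distinct_before
puts same_after
```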

>> At first, Ruby 1.9.0 tagged strings derived from filenames as UTF8-MAC.
>> But some people reported that if filenames are UTF8-MAC, they are hard
>> to compare with normal UTF-8 strings.
>
> This is interesting as it's exactly the behaviour I expected (if it's not possible to cleanly translate to NFC) - if strings are coming through as UTF-8 NFD, I'd expect them to be marked as such somehow (for example by being marked as encoding UTF8-MAC) - is there any indication?

A not-so-simple point is that a UTF8-MAC string is also valid as UTF-8.
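
As a sketch (variable names are mine, and this assumes the UTF8-MAC to UTF-8 transcoder bundled with Ruby is available), the same bytes are valid under either tag, so the bytes alone cannot tell you which one a string "really" is. Only the tag makes comparison and conversion behave differently:

```ruby
# A decomposed "é", tagged as plain UTF-8.
decomposed = "e\u0301"

# The same bytes, re-tagged as UTF8-MAC without changing them.
tagged = decomposed.dup.force_encoding("UTF8-MAC")

same_bytes = (tagged.bytes == decomposed.bytes)   # true
valid_utf8 = decomposed.valid_encoding?           # true
valid_mac  = tagged.valid_encoding?               # true

# With different encoding tags and non-ASCII content, == is false
# even though the bytes are identical.
equal_across_tags = (tagged == decomposed)        # false

# Transcoding from UTF8-MAC to UTF-8 composes the character.
converted = tagged.encode("UTF-8")
converted_cps = converted.codepoints              # [233]

puts same_bytes, valid_utf8, valid_mac, equal_across_tags
p converted_cps
```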

> Then at least it is clear that they are not comparable or compatible with the NFC ruby strings I get when creating a string s = "détente".

Even if that string happens to be composed, there is no guarantee
that a string is always composed.

>> If the translation from UTF8-MAC -> UTF-8 is entirely non-lossy and would do no harm to other UTF-8 strings
>> Yes, as long as every part of the string being converted is truly UTF8-MAC.
>
> I assumed from others' comments that UTF8-MAC was purely a sub-encoding used to indicate the use of decomposed strings, but would appreciate some more detail (if anyone has a link) on what exactly it involves, and if translation from UTF8-MAC to UTF8 can lose information that implies other differences. If the only difference is the decomposition (patterns which do not occur in NFC), I'd expect re-encoding to be idempotent and not affect NFC strings and thus harmless to apply to NFC strings or strings containing a mix. Re the file-system example, I had assumed that if you ask HFS to write to a file on a mounted file system HFS would normalize all names to NFD (as it does for any HFS files), but perhaps that is incorrect.

A UTF-8 string is not always in NFC.

> I suppose the above boils down to this question:
>
> Is there a correct way to handle this situation, and never fail when comparing a default Ruby string (NFC) against a file from any file system which may be NFD?

There is no way.
And again, Ruby strings are not NFC.

-- 
NARUSE, Yui  <naruse@airemix.jp>