[ruby-core:30123] Suggestion regarding m17n

From: Czarek <cezary.baginski@...>
Date: 2010-05-09 20:50:38 UTC
List: ruby-core #30123
Hello fellow Rubyists and Ruby core developers! 


First of all, a disclaimer: I am not an expert on m17n or Ruby's m17n
implementation. I might be making fundamental errors, so please do not
hesitate to point them out in a direct way.

This is probably a long read but I would really appreciate any time
spent on feedback or in general - helping others make the most of what
Ruby provides regarding multilingualization.


  M17N FROM A SIMPLE DEVELOPER'S PERSPECTIVE


After many hours spent on learning about m17n in Ruby and encoding
issues, I have been banging my head against the wall, trying to figure
out how to help developers having the same issues while attempting to
write robust applications and frameworks.

This is difficult for a few reasons:

  1. Correct understanding of encoding issues requires years of
  experience and an incredible amount of knowledge.

  2. Many developers do not experience such issues until their users
  do.

  3. Developers under pressure try to work around problems by encoding
  everything to UTF-8 or ASCII-8BIT.

  4. Tracking down root causes of encoding incompatibilities is
  difficult.

  5. Ruby 1.8 doesn't support encodings, which makes
  backward-compatible workarounds look quite ugly.

  6. Ruby allows joining ascii compatible strings as a special case
  which is great for backward compatibility, but makes development and
  debugging harder, especially since things "work most of the time".

  7. There are great articles about m17n features and changes in Ruby
  1.9, but most of them present only the available functionality
  instead of describing how to fix problems, how to avoid them and
  what to look out for.

    Notable exceptions:

    * obviously James Edward Gray's "Shades of Gray" blog

    * "Ruby Best Practices" by Gregory Brown (covers James's regexp
      idea used in CSV, and how to deal with encoding problems)

    * Yehuda's recent, in-depth article on his blog:

        "Ruby 1.9 Encodings: A Primer and the Solution for Rails"

    * comments on Redmine, where core Ruby developers share their
      knowledge and more importantly - do it in a very clear and
      concise way.
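
To make point 6 concrete, here is a minimal sketch of the ASCII special
case, assuming Ruby 1.9 or later. The variable names are only for
illustration. Joining works as long as the data happens to be pure
ASCII, and fails once both sides carry non-ASCII characters:

```ruby
# Two ASCII-only strings in different (ASCII-compatible) encodings.
latin1_ascii = "abc".encode("ISO-8859-1")
utf8_ascii   = "def".encode("UTF-8")

# Works: Ruby allows the join because both strings are ASCII-only.
joined = latin1_ascii + utf8_ascii

# The same code with real-world data in both encodings.
latin1_text = "caf\u00E9".encode("ISO-8859-1")
utf8_text   = "na\u00EFve".encode("UTF-8")

join_error = begin
  latin1_text + utf8_text
  nil
rescue Encoding::CompatibilityError => e
  e # only raised once non-ASCII characters show up on both sides
end
```

This is why such code "works most of the time": the failure depends on
the data, not on the code path.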


Encoding is really a very difficult concept to implement correctly and
Ruby does a great job providing a CSI approach, while minimizing the
drawbacks.  Sometimes I think it is a pity that people are unaware how
much effort has been put into Ruby to achieve this.


  MY QUESTIONS


1. Does it make sense to expect libraries and frameworks for Ruby to
work with an ASCII incompatible internal encoding, e.g. "-E :UTF-16BE"?
Could we consider failing to do so a bug?


2. If so, would it be reasonable to expect the same of Ruby's standard
  library?


3. If so, which Ruby version should people test against?

  * 1.9.1-p376 (as presented on ruby-lang.org)

  * 1.9.1-p378

  * 1.9.1-head

  * a specific 1.9.2 revision?


4. Would trying to add support to applications running in ASCII
incompatible environments be useful to developers in the long run
(taking into account future versions of Ruby) or just be an
unnecessary activity (e.g. because of UTF-8 which is ASCII
compatible)?


  EXPLANATION


Now, before I am misunderstood (or probably even laughed at for
proposing something so unthinkable as globally setting an
ASCII-incompatible encoding and expecting it to be supported), I want
to make a few things clear:

  - I am not trying to be an "encoding purist" suggesting this idea

  - I am not carelessly "adding more work" for other people by
    "expecting UTF-16 to work" or "get fixed"

  - I am not proposing a UCS model in place of the existing CSI one

  - I am not suggesting people change their default encodings

  - I am not suggesting changing anything in m17n handling in Ruby


I am evaluating whether *testing* encoding support in applications with
an ASCII-incompatible default_internal makes sense.
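
One way to approximate this kind of test without touching the global
default is to feed ASCII-incompatible input straight into the code
under test. The sketch below assumes Ruby 1.9 or later;
`survives_encoding?`, `naive_join` and `careful_join` are hypothetical
names used only for illustration, and this is a narrower check than
running a whole program with "-E :UTF-16BE":

```ruby
# Hypothetical helper: run a block with the input transcoded to an
# ASCII-incompatible encoding and report whether it gets through.
def survives_encoding?(input, encoding = Encoding::UTF_16BE)
  yield input.encode(encoding)
  true
rescue Encoding::CompatibilityError
  false
end

# Hypothetical code under test: mixes its argument with a literal.
naive_join = lambda { |dir, name| "#{dir}/#{name}" }

# Same operation, but every piece is encoded to match the first argument.
careful_join = lambda do |dir, name|
  enc = dir.encoding
  dir + "/".encode(enc) + name.encode(enc)
end

naive_ok   = survives_encoding?("usr") { |dir| naive_join.call(dir, "bin") }
careful_ok = survives_encoding?("usr") { |dir| careful_join.call(dir, "bin") }
```

With UTF-8 input both lambdas behave the same, so the difference only
shows up when the test deliberately uses an encoding like UTF-16BE.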


  REASONS


This would help developers:

  - find encoding problems before they occur and are reported

  - reproduce problems without creating specific test cases, or just
    write fewer encoding-specific test cases

  - discover the root causes much faster

  - prevent a lot of encoding related regressions

  - make sure, for performance reasons, that no encoding happens implicitly

  - smooth out some possibly hidden and unreported encoding issues in Ruby

  - possibly write more robust applications that might gracefully
    handle changes in Ruby's m17n implementation

  - respect the fact that not everyone can "just use UTF-8" and truly
    globalize their applications with less effort and issues in the
    long run

  - in the worst case scenario where this is too much effort, either
    specific components can be supported or incompatible encoding
    support can be discarded on a case-by-case basis


The downside would be:

  - many more problems will surface than would normally occur in
    reality

  - people interested only in ASCII compatible encodings would have to
    do more work or give up on encoding support

  - developers may have to become aware of regexp, 'filesystem',
    'locale' and other encodings - even if they want things to "just
    work"

  - problems would have to be solved near the cause, which requires time
    and knowledge to do properly - adding ".encoding" calls in random
    places won't be enough

  - handling all the issues in the most correct and
    backward-compatible way would make the code uglier, especially
    with regexps

  - effort needed to persuade developers to pro-actively test their
    software this way, instead of expecting test cases and patches.


  CASE STUDY:


As an experiment, I tried to get irb and then tests to run without
warnings, starting with the following:

  Rubygems:

     % ruby -v -E :UTF-16BE -e 'puts "hello"' 
    ruby 1.9.2dev (2010-02-04 trunk 26559) [x86_64-linux]
    Error loading gem paths on load path in gem_prelude
    incompatible character encodings: UTF-16BE and ASCII-8BIT
    <internal:gem_prelude>:76:in `split'
    <internal:gem_prelude>:76:in `set_paths'
    <internal:gem_prelude>:47:in `path'
    <internal:gem_prelude>:228:in `push_all_highest_version_gems_on_load_path'
    <internal:gem_prelude>:294:in `<compiled>'
    hello

And ended up with:

  http://github.com/e2/ruby/compare/trunk...utf16_fix

Although getting all the tests running will still require touching
quite a few files.

NOTE: I tried my best when choosing each workaround, but I believe
there are much cleaner solutions out there. I also admit that I gave
up on trying to get irb to actually do something other than just
start.


It turns out there are usually just a few tiny patches required in
every library, that a developer with commit access could do on his own
in a short while. Once this is done, the remaining issues can be
discovered while using the standard library. And it would be a great
example of how to correctly use the m17n functionality in Ruby.


The problems I encountered were usually related to:

  - regexp handling

  - ENV variables (reading and writing) in locale which were not
    compatible with default_internal

  - using concat operations like File.join, split, interpolation 

  - comparing strings with different encodings silently returns false 
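
A minimal sketch of two of these problems, assuming Ruby 1.9 or later
(the variable names are only for illustration). The comparison gives no
hint that anything is wrong, while the regexp match fails loudly:

```ruby
utf8  = "hello".encode("UTF-8")
utf16 = utf8.encode("UTF-16BE")

# Same text, different encodings: the comparison quietly returns false.
same_text = (utf8 == utf16)

# Transcoding one side first makes the comparison meaningful again.
round_trip = (utf8 == utf16.encode("UTF-8"))

# A plain regexp literal cannot be matched against UTF-16BE data.
regexp_error = begin
  utf16 =~ /l+/
  nil
rescue Encoding::CompatibilityError => e
  e
end
```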


Some other issues that may become more apparent as a result of trying
to get things to work:

  - filesystem encoding / locale mismatches

  - program argument encodings

  - file system encoding differences for different mount points 

  - environment variables containing non-ascii or non-utf data

  - lack of encoding info from stdlib

  - stdin/stdout issues

  - gracefully handling corrupt data

  - combinations of the above and others


As a side note, I was wondering if regexp support couldn't be extended
to better support what was done in the CSV library - encoding the
regexp to match the default/input. Perhaps with a cleaner syntax or
making the existing notation more flexible, since regexp are used like
a Swiss Army Knife for many different things. Then again, I may be
missing something important.
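
As a rough sketch of that idea (not the CSV library's actual code),
assuming Ruby 1.9 or later: the pattern is escaped while it is still
ASCII and then transcoded to the encoding of the data before the
regexp is built. `encoded_regexp` is a hypothetical helper name:

```ruby
# Hypothetical helper: build a regexp for a literal pattern in the
# same encoding as the data it will be matched against.
def encoded_regexp(pattern, encoding)
  Regexp.new(Regexp.escape(pattern).encode(encoding))
end

line      = "a,b".encode("UTF-16BE")
separator = encoded_regexp(",", line.encoding)

# The regexp now carries the data's encoding, so matching is allowed.
match_index = (separator =~ line)
```

The cost is that every regexp has to be built at run time, once the
encoding of the input is known, which is part of what makes this
approach look clumsy compared to plain regexp literals.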

I understand what I propose may be crazy and require an insane amount
of work to support. Although I do *not* believe that the flexibility
Ruby provides and the support of vague cases (e.g. ASCII-7BIT safe
concat, default encodings) is meant to be treated as a standard for
new applications now and in the future.

Instead, IMHO this flexibility should be used appropriately: for easy
transitions, backward compatibility, performance and other instances
that are genuinely useful. In every other case, I believe putting
additional effort into avoiding transcoding where possible and
generally honoring the user's selection of encoding, whatever he or
she may choose, will provide more benefit for everyone in the long
run.

Please correct me if I am wrong.



Thank you for your kind interest and precious time,


    Cezary Bagiński


-- 
Cezary Baginski
