ruby-core

Hi,

First of all, String#encode should be a simple API.  For complex uses, 
Encoding::Converter or something similar is suitable.  So the question is 
where the border between simple and complex lies.

Martin Duerst wrote:
> I'm now looking for comments on how to name these and further options.
> 
> invalid: What to do for an invalid byte (sequence) in the input

Compared with iconv(3) in SUSv3:
http://www.unix.org/single_unix_specification/
http://www.opengroup.org/onlinepubs/000095399/functions/iconv.html

"invalid" corresponds to the following two cases.
* "If a sequence of input bytes does not form a valid character in the 
specified codeset"
* "If the input buffer ends with an incomplete character or shift sequence"

A spec in which String#encode doesn't distinguish them seems reasonable.  
When this difference is important, another, more complex method is suitable.
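
To make the two cases concrete, here is a minimal sketch.  It assumes the 
Encoding::Converter#primitive_convert interface as found in current Ruby, 
which reports the two cases with different result symbols.  The helper name 
first_problem is hypothetical.

```ruby
# Hypothetical helper: report how the converter classifies its first stop.
# Assumes Encoding::Converter#primitive_convert as in current Ruby.
def first_problem(bytes, from: "UTF-8", to: "UTF-16BE")
  ec  = Encoding::Converter.new(from, to)
  src = bytes.b      # mutable binary copy; primitive_convert consumes it
  dst = String.new   # output buffer
  ec.primitive_convert(src, dst)
end

first_problem("\xFF")      # a byte that can never appear in UTF-8
first_problem("\xE3\x81")  # first two bytes of a three-byte character
first_problem("abc")       # well-formed input
```

The first call should report :invalid_byte_sequence, the second 
:incomplete_input, and the third :finished.  String#encode can hide this 
difference while the lower-level API keeps it.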

This could be named "decoder fallback" or something like that, following 
other UCS-based converters.

> unknown: What to do if the target encoding doesn't include the character

"unknown" corresponds to "If iconv() encounters a character in the 
input buffer that is valid, but for which an identical character does 
not exist in the target codeset".

This could be named "encoder fallback" or something like that, following 
other UCS-based converters.
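
As a rough sketch of the invalid/unknown split, the following assumes the 
exception classes that current Ruby raises from String#encode 
(Encoding::InvalidByteSequenceError and Encoding::UndefinedConversionError).  
The helper name problem_kind is hypothetical.

```ruby
# Hypothetical helper: classify why a conversion fails, if it does.
def problem_kind(str, target)
  str.encode(target)
  :ok
rescue Encoding::InvalidByteSequenceError
  :invalid   # the input bytes are not a valid character ("decoder" side)
rescue Encoding::UndefinedConversionError
  :unknown   # valid character, but absent from the target ("encoder" side)
end

problem_kind("a\xFFb", "UTF-16BE")    # broken UTF-8 input
problem_kind("\u3042", "ISO-8859-1")  # HIRAGANA LETTER A has no Latin-1 form
problem_kind("abc", "ISO-8859-1")     # converts cleanly
```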

> ???: We may need a third option, to indicate a combination of invalid
>      and unknown.

The difference between an illegal byte sequence and an incomplete character 
or shift sequence could become a third option.  But I don't think there 
is a need to distinguish them in String#encode.

> Values for each of the above options could include:
> 
> :ignore - Ignore/drop the problem data.

:ignore has some security issues.
http://support.microsoft.com/kb/940521

This behavior is also available via :substitute with an empty string.
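
One way that dropping data can matter, as an illustration of my own rather 
than something taken from the linked article: removing a character joins the 
text on either side of it.  The sketch assumes the undef: and replace: 
options of String#encode in current Ruby.

```ruby
# Dropping an unconvertible character can join the text around it.
input = "<scr\u3042ipt>"   # U+3042 has no US-ASCII equivalent

# Substituting an empty string behaves like the proposed :ignore.
dropped = input.encode("US-ASCII", undef: :replace, replace: "")
puts dropped   # prints "<script>"
```

A filter that inspected the original string would not have seen "<script>".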

> :substitute (or :subst or so to be shorter) - Use an
>           (encoding-dependent) substitution character.

:substitute is needed and can be the default behavior.  It could also be 
named :replacement.

cf. EncoderReplacementFallback
http://msdn2.microsoft.com/en-us/library/system.text.encoderreplacementfallback.aspx
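
For reference, a small sketch of an encoding-dependent substitution 
character.  It assumes the behavior of current Ruby, where the replacement 
defaults to "?" for non-Unicode targets and to U+FFFD for Unicode targets.

```ruby
# Non-Unicode target: the default replacement is "?".
ascii = "caf\u00E9".encode("US-ASCII", undef: :replace)
puts ascii   # prints "caf?"

# Unicode target: the default replacement is U+FFFD.
utf16 = "a\xFFb".encode("UTF-16LE", invalid: :replace)
back  = utf16.encode("UTF-8")   # convert back only to inspect the result
puts back == "a\uFFFDb"         # prints "true"
```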

> :warn   - Produce a warning, helpful for debugging.

Is this really needed?

> :error  - The current behavior, available just for completeness.

:exception seems better than :error.  This raises not an error but an 
exception.

> :stop   - Stop transcoding, for encode! this will mean
>           loosing the rest of the string.

Is this really needed?

> :x_escape - add problem data to the output using \x escapes
> 
> :u_escape - add problem characters to the output using \u escapes
>             (unknown: only)
> 
> :hex_ncr - add problem characters to the output using XML/HTML
>            hex escapes (&#xhhhh;, unknown: only)
> 
> :dec_ncr - add problem characters to the output using XML/HTML
>            dec escapes (&#ddddd;, unknown: only)
> 
> :uri_escape - add problem characters to the output using
>            UTF-8->URI %-encoding conversion (for IRI->URI
>            conversion and similar things, unknown: only)

These are needed for performance.
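
As one example of a built-in escape, current Ruby offers xml: :text on 
String#encode, which is roughly the :hex_ncr idea.  This sketch assumes that 
option; it also escapes "&", "<" and ">", which goes beyond what :hex_ncr 
describes.

```ruby
# Characters missing from the target become hex numeric character
# references; "&", "<" and ">" become named entities.
text = "caf\u00E9 & \u3042"
ncr  = text.encode("US-ASCII", xml: :text)
puts ncr
```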

> :block - Use result of block, with interface to be worked out
>          (only needed to indicate that a block is used for
>           one case but not for the other)

As Gary said, passing a block seems better.  Or simply pass a proc or lambda.  
But the block's parameters need more discussion.
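
A minimal sketch of the proc approach, assuming the fallback: option of 
String#encode in current Ruby, where the callable receives the unconvertible 
character as a one-character string.  The lambda names are hypothetical, and 
the escapes mirror :u_escape, :dec_ncr and :uri_escape from the list above.

```ruby
# Each fallback receives the unconvertible character as a string.
u_escape   = ->(ch) { format("\\u%04X", ch.ord) }
dec_ncr    = ->(ch) { format("&#%d;", ch.ord) }
uri_escape = ->(ch) { ch.encode("UTF-8").bytes.map { |b| format("%%%02X", b) }.join }

src = "a\u3042"
as_u   = src.encode("US-ASCII", fallback: u_escape)    # a\u3042 (literal backslash)
as_dec = src.encode("US-ASCII", fallback: dec_ncr)     # a&#12354;
as_uri = src.encode("US-ASCII", fallback: uri_escape)  # a%E3%81%82
```

A single string parameter is the simplest interface; whether the block should 
also see the source and target encodings is the part that needs discussion.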

> 'string' - Replace by string (have to work out details about
>            encoding,...)

The encoding of the replacement string will be that of the target.  But how 
to treat replaced characters during conversion is a problem.  (Give them a 
special codepoint, a byte array, or a struct?)
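
A small sketch of a replacement string supplied in the target encoding, 
assuming the replace: option of String#encode in current Ruby.  The choice of 
Shift_JIS and of U+3013 (GETA MARK) as the replacement is only for 
illustration.

```ruby
target = "Shift_JIS"

# Prepare the replacement in the target encoding before converting.
geta = "\u3013".encode(target)

# U+1F600 has no Shift_JIS equivalent, so the replacement is used.
out = "a\u{1F600}b".encode(target, undef: :replace, replace: geta)
puts out.encoding                     # prints "Shift_JIS"
puts out.encode("UTF-8") == "a\u3013b"  # prints "true"
```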

-- 
NARUSE, Yui  <naruse@airemix.com>
DBDB A476 FDBD 9450 02CD 0EFC BCE3 C388 472E C1EA
