ruby-core

Hi,

In message "Re: The face of Unicode support in the future"
    on Wed, 19 Jan 2005 19:34:41 +0900, Wes Nakamura <wknaka@pobox.com> writes:

|Will this be efficient enough?  When using a non-fixed-width encoding,
|String#[] won't run in constant time.

Right.  But trust me, it's efficient enough for most cases.  If you
really care about efficiency, you can choose a fixed-width encoding,
which won't be slow even under M17N Ruby.
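
To make the trade-off concrete, here is a small sketch.  It assumes the
String API of released Ruby (1.9 and later), which postdates this
message and may differ from the prototype discussed here; the variable
names are only illustrative.  It shows the same text in a variable-width
encoding (UTF-8) and a fixed-width one (UTF-32LE).

```ruby
# Assumption: Ruby 1.9+ String API, not the M17N prototype of this thread.
utf8  = "a\u3042b"               # 'a', HIRAGANA LETTER A, 'b' in UTF-8
utf32 = utf8.encode("UTF-32LE")  # same characters, transcoded

# Bytes used by each character: varies in UTF-8, constant in UTF-32LE.
utf8_widths  = utf8.each_char.map(&:bytesize)
utf32_widths = utf32.each_char.map(&:bytesize)

# Indexing is by character in both cases; only the lookup cost differs.
same_char = utf8[1] == utf32[1].encode("UTF-8")
```

With a fixed width, the byte offset of character n can be computed
directly; with a variable width, the string generally has to be scanned.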

|1. This method is mentioned:
|
|     String#encoding, returns a string specifying the encoding
|   
|   But I haven't seen this, is there also:
|
|     String#encoding=
|   
|   I assume that setting the encoding would do nothing to the internal
|   representation of the string (based on char *), it would just affect
|   how methods that work on strings deal with characters, etc.


|2. What is the default encoding for strings?   What encoding would
|   String.new("") have #encoding set to?

The encoding of the script file.  See [ruby-core:04192].

|3. Are literal strings assumed to be a certain encoding, (encoding of
|   the script?) or can you specify an encoding at the time of creation?

The encoding of the script.
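
As a hedged illustration, released Ruby (1.9 and later) exposes the
script encoding as __ENCODING__, so the rule can be checked directly.
This sketch assumes that later API, not the prototype discussed here.

```ruby
# Assumption: Ruby 1.9+; __ENCODING__ did not exist when this was written.
literal_encoding = "".encoding              # encoding given to a literal
script_encoding  = __ENCODING__             # encoding of this source file
copy_encoding    = String.new("").encoding  # String.new(str) keeps str's encoding
```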

|3a. If there is a way of creating literal strings in other encodings,
|   is there also a way of creating literal regex's in other encodings?

If the encoding of the script is ascii (or binary, which is an alias
to ascii), you can do it by using octal (or hexadecimal) escape
sequences + specifying the encoding explicitly, e.g.

  # my family name in Japanese in euc-jp encoding
  "\244\336\244\304\244\342\244\310".encoding="euc-jp"

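For comparison, a sketch of the same idea in released Ruby (1.9 and
later), where relabeling is done with String#force_encoding rather than
the encoding= setter proposed in this thread.  The variable names are
illustrative.

```ruby
# Assumption: Ruby 2.0+ (String#b); force_encoding relabels, bytes stay the same.
raw  = "\244\336\244\304\244\342\244\310".b  # eight raw bytes, tagged binary
euc  = raw.dup.force_encoding("EUC-JP")      # same bytes, read as EUC-JP
utf8 = euc.encode("UTF-8")                   # transcoding: here the bytes change

same_bytes = euc.bytes == raw.bytes          # relabeling left the bytes alone
char_count = euc.length                      # four two-byte characters
```
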
|3b. In \x{xxxx}, does the number have to be a 4-digit (hex) number?
|   How would you specify a utf-8 character, which can be more than 2 bytes?
|   Is the \x{} syntax basically \x{byte byte byte..}?

No, that is the very reason for braces around digits.  You can put
an arbitrary hexadecimal number in the braces.
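
For reference, released Ruby (1.9 and later) ended up spelling the
braced codepoint escape in string literals as \u{...}, not the \x{...}
form under discussion here.  A small sketch, assuming a UTF-8 script:

```ruby
# Assumption: Ruby 1.9+ with a UTF-8 source file.
two_byte   = "\u{e9}"     # U+00E9
three_byte = "\u{3042}"   # U+3042
four_byte  = "\u{1F600}"  # U+1F600

# The braces take a codepoint number, not a byte sequence, so the
# encoded size depends on the codepoint.
sizes   = [two_byte, three_byte, four_byte].map(&:bytesize)
lengths = [two_byte, three_byte, four_byte].map(&:length)
```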

|4. Will String#explode return an array of Fixnums, basically a byte array,
|   of the raw char * values?
|   
|   This would mean that s.explode.size is not necessarily == s.size

String#explode (the name might change) returns an array of Fixnums,
one per character, which means s.explode.length == s.length.
(String#size now returns the byte length of the string under the
current M17N prototype, but I consider that a wrong decision, and it
will be fixed in 1.9.)
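
A sketch of the same invariant in released Ruby (1.9 and later), where
the closest equivalent of the proposed explode is String#codepoints.
That mapping is an assumption made for illustration; the method was
still unnamed at the time of this message.

```ruby
# Assumption: String#codepoints stands in for the proposed String#explode.
s = "a\u3042b"             # three characters, five bytes in UTF-8

codepoints = s.codepoints  # one Integer per character
char_count = s.length      # counts characters
byte_count = s.bytesize    # counts bytes
```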

|5. When using String#[idx]= to set a single character, it must take as
|   an argument a string which has a size of 1 (i.e.  one codepoint) but
|   internally (i.e. #explode) doesn't necessarily have a size of 1?

It doesn't have to have a size of 1 anyway.  See [ruby-core:04276].
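
A minimal sketch of what that looks like in released Ruby; it shows the
behavior of String#[]= there and is not taken from the referenced
message.

```ruby
# Replace one character with a string of a different length.
s = "abc".dup
s[1] = "XYZ"   # one character replaced by three

t = "abc".dup
t[1] = ""      # one character replaced by none
```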

|6. Right now there is Fixnum#chr. Will there be Array#chr(encoding) or
|   something similar?  So you could do something like:
|
|     [ 0x30, 0xb9 ].chr("utf-16")

I think it will be

  Integer#chr(encoding=script's_default)

to get a string corresponding to a codepoint.  This is a place where
I haven't made a design decision yet.  But you will have something
like this.
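
For what it's worth, released Ruby (1.9 and later) does have
Integer#chr with an optional encoding argument.  The sketch below
assumes that later API, and uses Array#pack plus force_encoding for the
byte-array case from the question.

```ruby
# Assumption: Ruby 1.9+; this design was still undecided in this message.
from_codepoint = 0x30b9.chr("UTF-8")   # KATAKANA LETTER SU from a codepoint

# The two bytes from the question, read as big-endian UTF-16.
from_bytes = [0x30, 0xb9].pack("C*").force_encoding("UTF-16BE").encode("UTF-8")

same = from_codepoint == from_bytes
```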

|7. Will strings that, when converted to the same encoding, are identical,
|   give different results for #intern when left in different encodings?
|
|   What happens to an interned string with a binary encoding?  Is it interned
|   based on the internal bytes of the string rather than the characters?

This is another place where I haven't made a design decision.
Possible options are:

  * restrict symbols to 7bit ascii
  * embed encoding info in Symbols
  * symbols just use byte sequence
  * something else I haven't thought of yet.
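
For readers checking how this turned out: in released Ruby (1.9 and
later), Symbols carry an encoding, which corresponds to the second
option.  A sketch assuming that later behavior:

```ruby
# Assumption: Ruby 1.9+ with a UTF-8 source file.
utf8_sym = "\u3042".to_sym                   # interned from UTF-8 bytes
euc_sym  = "\u3042".encode("EUC-JP").to_sym  # same character, EUC-JP bytes

# Same character, different encodings: the symbols are distinct.
distinct = utf8_sym != euc_sym

# ASCII-only symbols are tagged US-ASCII.
ascii_sym_encoding = :abc.encoding
```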

							matz.
