ruby-core

In article <6.0.0.20.2.20071231173234.0a28c170@localhost>,
  Martin Duerst <duerst@it.aoyama.ac.jp> writes:

> I think it's a bit overkill to claim that we better use
> ASCII-8BIT than BINARY if out of a file that can easily
> be 100KB or more, and where virtually all byte values
> that look like ASCII characters are not characters at all,
> just because of three bytes at the start of the file.

GIF is just an example.

Another example is Internet mail.

Since MIME, a mail may contain two or more texts in
different encodings.  So the mail as a whole is BINARY.  But
the header is basically ASCII, and line-oriented, ASCII-based
processing is required to decompose a multipart mail:
extracting the "Content-Type" field body and its "boundary"
parameter.  The extracted ASCII parameter is then used to
search the BINARY body.
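As a rough sketch of that processing in present-day Ruby (not
the 1.9 snapshot discussed here): the mail below is made up,
and multipart_bodies is a hypothetical helper, not part of any
MIME library.  It assumes a quoted boundary parameter and CRLF
line ends, and ignores everything else RFC 2046 allows.

```ruby
# Hypothetical two-part mail.  The parts use different charsets
# (ISO-8859-1 and UTF-8), so the message as a whole is just bytes.
MAIL = (
  "Content-Type: multipart/mixed; boundary=\"XYZ\"\r\n" \
  "\r\n" \
  "--XYZ\r\n" \
  "Content-Type: text/plain; charset=iso-8859-1\r\n" \
  "\r\n" \
  "caf\xE9\r\n" \
  "--XYZ\r\n" \
  "Content-Type: text/plain; charset=utf-8\r\n" \
  "\r\n" \
  "caf\xC3\xA9\r\n" \
  "--XYZ--\r\n"
).b

# Hypothetical helper: returns the raw body of each part.
def multipart_bodies(mail)
  # ASCII-based step: split off the top-level header.
  header, body = mail.split("\r\n\r\n", 2)
  # ASCII-based step: pull the boundary parameter out of the header.
  boundary = header[/boundary="([^"]+)"/, 1]
  # The ASCII parameter is now used to search the byte body.
  segments = body.split("--#{boundary}")
  # Drop the preamble (first) and the closing "--" marker (last).
  segments[1..-2].map do |segment|
    segment.split("\r\n\r\n", 2).last.chomp("\r\n")
  end
end

bodies = multipart_bodies(MAIL)
p bodies.map(&:bytes)
```

The ASCII-only patterns work here only because the byte
string is labelled ASCII-8BIT, which is the point in question.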

How should a MIME library accept a mail?  As ASCII-8BIT or
as BINARY?  It should be BINARY in theory.  But that is not
intuitive.

It seems Python-3000 faced this problem.
[Python-3000] Questions about email bytes/str (python 3000)
http://mail.python.org/pipermail/python-3000/2007-August/009503.html

Python has some experience in distinguishing bytes from
strings.  I think we should study Python in this area.

Another example is RFC 1468.

RFC 1468 (ISO-2022-JP) describes escape sequences as
"ESC ( B", etc.  The authors don't distinguish ASCII
characters from octets.

JIS X 0202 (the Japanese version of ISO 2022) distinguishes
them.  In the style of JIS X 0202, they would be written as
"ESC 2/8 4/2", etc.
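To make the two notations concrete, here is a small sketch
that prints the octets of "ESC ( B" in column/row form, where
the high four bits are the column and the low four bits the
row.  The method names column_row and iso2022_notation are
made up for this example.

```ruby
# Format one octet as column/row: high nibble / low nibble.
def column_row(byte)
  "#{byte >> 4}/#{byte & 0x0F}"
end

# Render a byte sequence in the octet-oriented style, keeping
# the conventional name ESC for octet 0x1B.
def iso2022_notation(sequence)
  sequence.bytes.map { |b| b == 0x1B ? "ESC" : column_row(b) }.join(" ")
end

puts iso2022_notation("\e(B")  # prints "ESC 2/8 4/2"
```

"(" is octet 0x28 and "B" is octet 0x42 only because the
ASCII-style notation is read through ASCII.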

There is a culture that doesn't distinguish ASCII from
BINARY, especially around Unix and the Internet.  It may be
easy to distinguish them in most cases.  But sometimes it is
not simple.

> Count uses just a very tiny part of the Regexp syntax,
> so see below.

count, delete and squeeze have no \xHH notation.  So we
cannot specify a byte in ASCII notation with them.  That is
different from Regexp.

> We of course can't do this. But it's not necessary. There
> are, simply put, two strings participating in a regexp
> operation: The regular expression and the 'target string'.
> The regular expression needs a 'real' encoding, in most
> cases ASCII-8BIT will be sufficient. The target string
> can be just bytes. We just need a few conventions to
> do the right things, e.g. we can agree that '.' matches
> one byte (rather than one character). That's very easy
> to implement. We can also limit non-meta characters to
> e.g. just \xHH notation, to make clear that we are just
> matching byte values and not actual characters. But
> implementing that is probably quite a bit of a hassle
> for little benefit.

Of course it is easy to implement if we consider that /A/
matches BINARY 0x41.  That is what Ruby 1.9 does now with
ASCII-8BIT.

But why is BINARY required if we have to treat LATIN
CAPITAL LETTER A as equal to BINARY 0x41?

That seems to be ASCII-8BIT.

Note that BINARY is an alias for ASCII-8BIT now.

% ./ruby -e 'p Encoding.find("BINARY")'
#<Encoding:ASCII-8BIT>
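The following sketch checks both points in a present-day
Ruby, which may print encoding names differently from the
1.9 snapshot used above.  The constant BYTES is just an
example value.

```ruby
# Two bytes, 0x41 and 0xFF, labelled ASCII-8BIT (String#b).
BYTES = "\x41\xFF".b

# BINARY and ASCII-8BIT name the same Encoding object.
p Encoding.find("BINARY") == Encoding.find("ASCII-8BIT")
p BYTES.encoding == Encoding.find("BINARY")

# /A/ matches the byte 0x41 at offset 0.
p(/A/ =~ BYTES)
```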

> Even currently, we can use an ASCII-8BIT regexp with
> many, many other encodings, so using it with BINARY
> isn't anything much new.

An ASCII-8BIT regexp is usable with other encodings if the
regexp contains only ASCII characters, or if the target
string contains only ASCII characters.

If the regexp and the string both have non-ASCII characters,
a match is possible only if their encodings are the same.

Since BINARY has no ASCII characters, every non-empty BINARY
string has a non-ASCII character.  So an ASCII-8BIT regexp
is not applicable in general.  This follows from the current
principle.
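A sketch of that principle as a present-day Ruby applies it,
using a UTF-8 regexp against ASCII-8BIT strings (the rule is
symmetric).  try_match is a hypothetical helper that turns
the exception into a symbol.  Note that Ruby still treats
bytes below 0x80 in an ASCII-8BIT string as ASCII, which a
separate BINARY as discussed here would not.

```ruby
# Hypothetical helper: true/false for match or no match,
# :incompatible when Ruby refuses to compare the encodings.
def try_match(regexp, string)
  regexp.match?(string)
rescue Encoding::CompatibilityError
  :incompatible
end

# ASCII-only regexp, non-ASCII byte string: usable.
p try_match(/a/, "\xE9a".b)

# Non-ASCII regexp, ASCII-only byte string: usable, no match.
p try_match(/\u00E9/, "abc".b)

# Non-ASCII on both sides, different encodings: rejected.
p try_match(/\u00E9/, "\xE9".b)

# Non-ASCII on both sides, same encoding: usable.
p try_match(/\u00E9/, "caf\u00E9")
```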

> Well, it's actually not so difficult, and it will be needed.
> We can't label a String as e.g. UTF-16, and claim that
> UTF-16 is some kind of ASCII-compatible encoding.

I'm not sure when matz will introduce UTF-16.  I hope it is
not just before a release.  When that happens, every
dereference of char* should be examined.
-- 
Tanaka Akira
