ruby-core

Eric Mahurin wrote:

> Nikolai Weibull wrote:

> > Eric Mahurin wrote:

> > > They could also be easily extended to Bignum when you need more
> > > from the encoding - an arbitrary sequence of bytes could be put
> > > into a Bignum.

> > Why would you want to do this?  What gives you the idea that a
> > Bignum would be more efficient than a String anyway?

> In general, you'd put a character in an Integer, not necessarily a
> Bignum.  For most common cases the character should fit in a Fixnum
> (ascii and unicode?).

Where would the encoding be stored?  The reason for indexation to return
a substring (as I see it) is that it's more uniform and you don't need
to treat a substring of length one as a special case and worry about
whether to treat is as a Fixnum (seriously, it's never going to be a
Bignum), a new Char class, or something else that just adds complexity.
The String#[](n) returning a Fixnum is sooo C anyway and I really don't
see why it has to stay that way.  Sure, it'll be less efficient, but how
often is this going to be a problem (excluding parsing - which everyone
seems to want to do using regular expressions anyway, so there's really
no problem with that either ;-).  Anyway, we're fucked as it is, as
every String will have an encoding associated with it, and as lookup is
O(n) for many encodings anyway, there's never going to be a reason to
optimize String#[](n) by saving some allocation work.

The end result is that String#[](n) will become equivalent to
String#[](n..n).  One can already get the "f"[0][0][0]… == "f" behavior
that someone pointed out by using 0..0 instead.

Nonetheless, one possible solution woul be to have some sort of
BaseString class (like Lisp has) that can only store ASCII strings, and
perhaps a UnicodeString class that stores Unicode-encoded strings.  This
would hopefully be easy enough to add, provided that the whole string
module of Ruby 2.0 is flexible enough in this respect (see below about
user-provided character encodings).

> > > The main exception to that I can think of is when you have an
> > > excess of zero bytes.  "\000\000\000\000" and "\000" couldn't be
> > > distinguished when you put them into an Integer.

> > So perhaps they shouldn't be?,

> I'm not sure what character encodings are going to be used.  As
> long as there are not both some encodings that start with zero
> bytes and some that end with zero bytes, you should be OK.

Why should strings not be able to contain a NUL?  Again, that's a C
remnant that makes no sense in a programming language like Ruby, where
strings shouldn't necessarily be terminated by a '\0'.  I really hope
that Ruby 2.0 won't continue to store Strings as C zero-terminated
strings any more.  This is an issue that has to be dealt with in the C
interface somehow, but seeing as the Ruby side of String is being
changed this is as good an opportunity as ever to change the C side
(hoho, what a pun) of it as well.

Furthermore, judging by what matz has said so far, users will be able to
add their own character encoding handlers, so this again has to be very
flexible.  Keeping everything uniform makes this a lot easier,
        nikolai

-- 
Nikolai Weibull: now available free of charge at http://bitwi.se/!
Born in Chicago, IL USA; currently residing in Gothenburg, Sweden.
main(){printf(&linux["\021%six\012\0"],(linux)["have"]+"fun"-97);}

Thread

Prev Next

In This Thread

Prev Next