From: Joshua Ballanco
Date: 2012-04-30T01:50:11+09:00
Subject: [ruby-core:44759] Re: [ruby-trunk - Feature #6361] Bitwise string operations

On Saturday, April 28, 2012 at 8:52 AM, KOSAKI Motohiro wrote:
> On Fri, Apr 27, 2012 at 8:53 PM, MartinBosslet (Martin Bosslet)
> wrote:
> >
> > Issue #6361 has been updated by MartinBosslet (Martin Bosslet).
> >
> >
> > nobu (Nobuyoshi Nakada) wrote:
> > > Then what kind of methods should Blob have?
> > >
> > > And does it need to be built-in?
> >
> > A real advantage of having it built-in could be
> > that this gives us the chance to fix #5741 at
> > the same time. I could imagine that we have two
> > kinds of "byte array" classes - one, mutable,
> > that shares COW semantics and all the other
> > optimizations with String, but with no notion of
> > encoding and a yet-to-be-defined interface.
> >
> > And then a second class, which is basically the
> > immutable version of the first one. By sharing
> > only a reference we could ensure that the content
> > would not be proliferated and we could securely
> > erase its contents after use.
> >
>
> I don't dislike a bult-in idea. But you haven't show a detailed spec
> and I don't think I clearly understand your idea. Can you spend a
> few time for writing a spec? (probably rough a few line explanation
> is enough)
>

If I may intrude for a moment… I think the advantage of having a built-in Data/Blob library would be that it could be used in all places where a data class is more appropriate than a string. For example, the Socket library currently returns Strings for data read in from a socket. I think a Data class is more appropriate here since the socket itself does not contain encoding information (i.e. either an arbitrary default encoding needs to be set, a heuristic can be used to guess the encoding, or the encoding is set by a previously agreed-upon convention; but you cannot ask a socket for its encoding).

As for a spec, I think it should be kept relatively simple. The one interesting optimization from NSData that might be useful is the option of copying bytes on instantiation. Copying is the default, but it is also possible to create a Data object that merely points at the storage of another live object and allows byte-wise manipulation. This is particularly interesting for the case of strings, since I would guess that String and Data would have identical storage layout, allowing one to optimize the case of creating a Data from a String with no copying.

A quick attempt at a spec:

-----
Data.new #=> New, dynamically resizable container to store some bytes
Data.new('Test') #=> Can be created from any object that responds to #bytes with an enumerator
Data.new('Hello', copy_bytes: false) #=> Creates the Data from the String by merely pointing to the same storage

Data.open('./foo/test.txt') #=> Create a Data object from a File
Data.open('./bar/test.txt', copy_bytes: false) #=> Same as open above, but manipulates IO#pos for access
Data.write('./baz/test.txt') #=> Writes the bytes to disk.

d = Data.new(a_string)
d[2] #=> Returns the third byte, same as a_string.bytes.to_a[2]
d[2] = 42 #=> Same as a_string.setbyte(2, 42)
d.each #=> Equivalent to a_string.each_byte
d.length #=> Number of bytes currently being stored
d.slice(2, 4) #=> Similar to String#slice
d.slice(2, 4, copy_bytes: false) #=> New data object from slice shares storage with the original

d << other_data #=> Appends bytes from other_data
d.to_s #=> Returns a string using the default internal encoding
d.string_with_encoding('UTF-16') #=> Returns a string using the encoding passed
-----

I know it seems like this class is just wrapping String and always defaulting to byte-wise operations, but it's more fundamental than that. Because there is no encoding on the bytes, there will never be an encoding error when working with them. This could be extremely useful for applications that combine bytes from multiple sources (e.g. Socket data + a file on disk + immediate strings in code) that could potentially have different encodings. By operating on bytes, you can defer the encoding checks until later, if at all.
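To illustrate the point about encoding errors, here is a minimal sketch in plain Ruby. It is not part of the proposal: `join_bytes` is a hypothetical helper, and it relies on `String#b`, which is only available in Ruby 2.0 and later, to take an encoding-free copy of each string.

```ruby
# Hypothetical helper: concatenate strings as raw bytes, ignoring encodings.
def join_bytes(*parts)
  # String.new with no arguments yields an empty binary (ASCII-8BIT) buffer.
  parts.each_with_object(String.new) { |part, out| out << part.b }
end

utf8   = "caf\u00E9"  # UTF-8, 5 bytes
latin1 = [0x63, 0x61, 0x66, 0xE9].pack("C*").force_encoding(Encoding::ISO_8859_1)  # 4 bytes

begin
  utf8 + latin1       # both contain non-ASCII data in different encodings
rescue Encoding::CompatibilityError => e
  puts "String concatenation failed: #{e.message}"
end

raw = join_bytes(utf8, latin1)
puts raw.bytesize     # 9 (5 + 4); no encoding check was involved
```

The encoding decision is deferred rather than avoided: a slice of `raw` can still be tagged later with `force_encoding` once the caller knows what the bytes mean.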
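The spec above could be prototyped in plain Ruby roughly as follows. This is only a sketch under several assumptions: the class is called `ByteData` (a hypothetical name) because Ruby 3.2 and later ship an unrelated built-in `Data` class; bytes are always copied, since the `copy_bytes: false` storage sharing would presumably need interpreter support; and the file-related methods are left out.

```ruby
# Hypothetical prototype of the proposed byte container (copying only).
class ByteData
  include Enumerable

  # source: any object whose #bytes yields integers 0..255 (e.g. a String).
  def initialize(source = nil)
    @bytes = String.new                       # empty binary (ASCII-8BIT) buffer
    @bytes << source.bytes.to_a.pack("C*") if source
  end

  def [](index)
    @bytes.getbyte(index)
  end

  def []=(index, byte)
    @bytes.setbyte(index, byte)
  end

  def each(&block)
    return enum_for(:each) unless block
    @bytes.each_byte(&block)
    self
  end

  def length
    @bytes.bytesize
  end

  # Returns a new, independent ByteData holding `count` bytes from `start`.
  def slice(start, count)
    ByteData.new(@bytes.byteslice(start, count))
  end

  def <<(other)
    @bytes << other.to_a.pack("C*")
    self
  end

  def to_s
    string_with_encoding(Encoding.default_internal || Encoding.default_external)
  end

  # Tags a copy of the bytes with the given encoding; no transcoding happens.
  def string_with_encoding(encoding)
    @bytes.dup.force_encoding(encoding)
  end
end

d = ByteData.new("Test")
d[2] = 42                              # overwrite the third byte with "*"
d << ByteData.new("!")
puts d.string_with_encoding("UTF-8")   # prints "Te*t!"
```

Because the buffer is always binary, none of these operations consult an encoding; only `to_s` and `string_with_encoding` attach one, and they do so on a copy.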

- Josh