From: duerst@...
Date: 2017-01-09T09:35:28+00:00
Subject: [ruby-core:79025] [Ruby trunk Feature#13110] Byte-based operations for String

Issue #13110 has been updated by Martin Dürst.


Shugo Maeda wrote:
> Let me clarify my intention.
> 
> I'd like to handle not only singlebyte characters but multibyte
> characters efficiently by byte-based operations.

What about using UTF-32? It will use some additional memory, but give you the speed you want.
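
To make the trade-off concrete, here is a rough sketch using only standard String methods. The sample text is made up for illustration; it just shows that in UTF-32 a character index turns into a byte offset by multiplication, at the price of more memory.

```ruby
# Sketch: in UTF-32 every character occupies exactly 4 bytes, so a
# character index converts to a byte offset without scanning.
utf8  = "aあb"                      # illustrative sample text
utf32 = utf8.encode("UTF-32LE")

p utf8.bytesize    # 5  (1 + 3 + 1 bytes)
p utf32.bytesize   # 12 (3 characters * 4 bytes)

# Fetch the character at index 1 directly by its byte offset.
ch = utf32.byteslice(1 * 4, 4)
p ch.encode("UTF-8")   #=> "あ"
```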


> Once a string is scanned, we have a byte offset, so we don't need
> scan the string from the beginning, but we are forced to do it by
> the current API.

One way to improve this is to somehow cache the last used character and byte index for a string. I think Perl does something like this.

This could be expanded to a string with several character index/byte index pairs cached, which could be searched by binary search. All this could (should!) be totally opaque to the Ruby programmer (except for the speedup).
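
As a rough illustration of the cached-pairs idea, here is a sketch written in Ruby itself. `OffsetCache` is a hypothetical name, not an existing class, and a real implementation would live inside String and would have to drop the cache whenever the string is modified.

```ruby
# Hypothetical helper, not part of Ruby: remembers (character index,
# byte offset) checkpoints so later lookups need not rescan from the start.
class OffsetCache
  def initialize(str)
    @str = str
    @checkpoints = [[0, 0]]   # kept sorted by character index
  end

  # Byte offset of the character at char_index (assumed to be >= 0).
  def byte_offset(char_index)
    # Binary search for the first checkpoint beyond char_index.
    pos = @checkpoints.bsearch_index { |c, _| c > char_index } || @checkpoints.size
    c, b = @checkpoints[pos - 1]   # nearest checkpoint at or before the target
    # Scan only the gap between that checkpoint and the target.
    b += @str.byteslice(b..-1)[0, char_index - c].bytesize
    @checkpoints.insert(pos, [char_index, b]) if c != char_index
    b
  end
end

cache = OffsetCache.new("あいうabcえお")
p cache.byte_offset(3)   #=> 9
p cache.byte_offset(7)   #=> 15
p cache.byte_offset(6)   #=> 12 (scans only from the checkpoint at character 3)
```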

Another way would be to return an Index object that keeps the character and byte indices opaque, but can be used in a general way where speedups are needed.
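
A very rough sketch of what such an Index object might look like follows. `StringIndex`, `find` and `rest_of` are names invented here for illustration; nothing like this exists in Ruby. The point is only that the object carries both indices but exposes neither.

```ruby
# Hypothetical API sketch: a position that knows both its character and
# byte index, but offers no way to read either number.
class StringIndex
  def initialize(char = 0, byte = 0)
    @char = char
    @byte = byte
  end

  # Index of the first match of pattern in str at or after this position,
  # or nil. Only the part of str after this position is scanned.
  def find(str, pattern)
    rest = str.byteslice(@byte..-1)
    rel = rest.index(pattern)
    return nil unless rel
    StringIndex.new(@char + rel, @byte + rest[0, rel].bytesize)
  end

  # The part of str starting at this position.
  def rest_of(str)
    str.byteslice(@byte..-1)
  end
end

s = "abcあいうabc"
i = StringIndex.new.find(s, "い")
p i.rest_of(s)   #=> "いうabc"
j = i.find(s, "b")
p j.rest_of(s)   #=> "bc"
```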


> In the following example, the byteindex version is much faster than
> the index version.

Of course it is. (Usually programs in C are faster than programs in Ruby, and this is just moving closer to C, and thus getting faster.)

But what I'm wondering is whether a single string is a good representation for the data in an editor buffer at all; it may still be quite inefficient. Adding or deleting a character in the middle of the buffer will be slow, even if you know the exact position in bytes. Changing the representation, e.g. to an array of lines, will make most of the efficiency problem go away. (After all, editors need only be as fast as humans can type :-).
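
To show what I mean by an array of lines, here is a minimal sketch. `LineBuffer` is a hypothetical class made up for this mail, not a proposal; it only shows that an edit touches one short line rather than the whole buffer.

```ruby
# Hypothetical editor buffer: one String per line instead of one big String.
class LineBuffer
  def initialize(text)
    @lines = text.lines(chomp: true)
  end

  # Insert str at character column col of line row.
  # Only that one (short) line is scanned and rebuilt.
  def insert(row, col, str)
    @lines[row].insert(col, str)
  end

  def to_s
    @lines.join("\n")
  end
end

buf = LineBuffer.new("first line\nsecond line\nthird line")
buf.insert(1, 7, "(edited) ")
puts buf.to_s
```

Character indexing within a line is cheap because lines are short, so byte offsets are not needed here.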


More generally, what I'm afraid of is that with this, we start to expose more and more String internals. That can easily lead to problems.

Some people may copy a Ruby snippet using byteindex, then add 1 to that index because they think that's how to get to the next character. Others may start to use byteindex everywhere, even if it's absolutely not necessary. Others may demand byte- versions of more and more operations on strings. We have seen all of this in other contexts.
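
The "add 1" mistake can already be shown with the existing `byteslice`, so here is a small sketch with a made-up sample string:

```ruby
s = "あいう"                   # each character is 3 bytes in UTF-8
first = s.byteslice(0, 3)      # "あ" occupies bytes 0..2

# Adding 1 to a byte offset lands in the middle of the same character.
broken = s.byteslice(1, 3)
p broken.valid_encoding?       #=> false

# Advancing by the character's own byte length reaches the next character.
nxt = s.byteslice(first.bytesize, 3)
p nxt                          #=> "い"
```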


----------------------------------------
Feature #13110: Byte-based operations for String
https://bugs.ruby-lang.org/issues/13110#change-62433

* Author: Shugo Maeda
* Status: Open
* Priority: Normal
* Assignee: 
* Target version: 
----------------------------------------
How about adding byte-based operations for String?

```ruby
s = "あああいいいあああ"
p s.byteindex(/ああ/, 4) #=> 18
x, y = Regexp.last_match.byteoffset(0) #=> [18, 24]
s.bytesplice(x...y, "おおお")
p s #=> "あああいいいおおおあ"
```


---Files--------------------------------
byteindex.diff (2.83 KB)


-- 
https://bugs.ruby-lang.org/
