From: shugo@...
Date: 2017-01-20T12:26:22+00:00
Subject: [ruby-core:79196] [Ruby trunk Feature#13110] Byte-based operations for String

Issue #13110 has been updated by Shugo Maeda.


> The "buffer gap" technique is very well known; I've been familiar with it since the early 90s. I was thinking about it, but I think it won't work with UTF-8. If you have figured out how you would make it work with UTF-8, then please tell us.
> 
> Here is why I think it won't work with UTF-8. The problem is that you can't move characters from before the gap to after or the other way round and change them when there are edits. If some characters are changed, they might change their byte length. But if you want to keep the string as valid UTF-8, you have to constantly fix the content of the gap. One could imagine using two separate String objects, one for before the gap and one for after. For before the gap, it actually might work quite well (as long as Ruby doesn't shorten the memory allocated to a string when the string contents is truncated), but for after the gap, it won't work, because every insertion or deletion at the end of the gap will make the string contents shift around.

In my implementation, the gap is kept filled with NUL bytes, and it is moved to the end of
the buffer when a regular expression search is needed.
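
A rough sketch of the idea, not the actual implementation: the class name
`GapBuffer`, the constant `GAP_SIZE`, and the methods `insert`, `search`, and
`move_gap` are hypothetical names chosen for illustration. The buffer is a binary
string, all positions are byte offsets, and the gap is pushed to the end before
the text is handed to the regexp engine as UTF-8.

```ruby
# Hypothetical minimal gap buffer over a binary (ASCII-8BIT) string.
class GapBuffer
  GAP_SIZE = 16 # assumed minimum number of NUL bytes added when the gap grows

  def initialize(text)
    @contents = text.b                          # binary copy of the text
    @gap_start = @gap_end = @contents.bytesize  # empty gap at the end
  end

  # Insert str at byte offset pos (an offset into the text, ignoring the gap).
  def insert(pos, str)
    bytes = str.b
    move_gap(pos)
    grow_gap(bytes.bytesize) if gap_size < bytes.bytesize
    @contents[@gap_start, bytes.bytesize] = bytes
    @gap_start += bytes.bytesize
  end

  # Move the gap to the end, then search the text as UTF-8.
  # Returns the byte offset of the match, or nil.
  def search(re)
    move_gap(text_size)
    text = @contents.byteslice(0, text_size).force_encoding(Encoding::UTF_8)
    m = re.match(text)
    m && m.pre_match.bytesize
  end

  def to_s
    move_gap(text_size)
    @contents.byteslice(0, text_size).force_encoding(Encoding::UTF_8)
  end

  private

  def gap_size
    @gap_end - @gap_start
  end

  def text_size
    @contents.bytesize - gap_size
  end

  def move_gap(pos)
    size = gap_size
    if pos < @gap_start
      # Shift the bytes between pos and the gap to just after the gap.
      len = @gap_start - pos
      moved = @contents[pos, len]
      @contents[@gap_end - len, len] = moved
    elsif pos > @gap_start
      # Shift the bytes just after the gap to just before it.
      len = pos - @gap_start
      moved = @contents[@gap_end, len]
      @contents[@gap_start, len] = moved
    end
    @gap_start = pos
    @gap_end = pos + size
    @contents[@gap_start, size] = "\0" * size   # keep the gap filled with NUL
  end

  def grow_gap(min)
    n = [min, GAP_SIZE].max
    @contents[@gap_start, 0] = "\0" * n         # insert extra NUL bytes
    @gap_end += n
  end
end
```

Because the buffer is binary, moving bytes across the gap never has to care about
character boundaries; only the caller has to pass offsets that fall on them.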

> > > More generally, what I'm afraid of is that with this, we start to more and more expose String internals. That can easily lead to problems.
> > > 
> > > Some people may copy a Ruby snippet using byteindex, then add 1 to that index because they think that's how to get to the next character. Others may start to use byteindex everywhere, even if it's absolutely not necessary. Others may demand byte- versions of more and more operations on strings. We have seen all of this in other contexts.
> > 
> > Doesn't this concern apply to `byteslice`?
> 
> Yes, it does. The less we have of such kinds of methods, the better.
> 
> Anyway, one more question: Are you really having performance problems, or are you just worried about performance? Compared to today's hardware speed, human editing is extremely slow, and for most operations, there should be no delay whatsoever.

Using character indices was slow, but my current implementation uses ASCII-8BIT strings
whose contents are encoded in UTF-8, so there's no performance problem while editing
Japanese text whose size is over 10MB.
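
To illustrate what that means (a small sketch; the variable names are made up for
this example): in an ASCII-8BIT string every index is a byte offset, so slicing
does not have to scan for character boundaries, and a slice is re-tagged as
UTF-8 only when it is read as text.

```ruby
text = "日本語のテキスト"          # 8 characters, 24 bytes in UTF-8
buf = text.b                       # binary copy; the bytes are still UTF-8

buf_encoding = buf.encoding        # Encoding::ASCII_8BIT
buf_size = buf.size                # counts bytes, same as buf.bytesize

# Indexing by byte offset: bytes 9...24 hold the last five characters.
slice = buf[9, 15].force_encoding(Encoding::UTF_8)

# Converting a byte offset back to a character offset needs a scan of the prefix.
char_offset = buf[0...9].force_encoding(Encoding::UTF_8).size
```

The last line is the expensive direction: it has to decode everything before the
offset, which is why the editor keeps byte offsets as its primary positions.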

However, the implementation has the following terrible method:

```ruby
    def byteindex(forward, re, pos)
      @match_offsets = []
      method = forward ? :index : :rindex
      adjust_gap(0, point_max)
      if @binary
        offset = pos
      else
        offset = @contents[0...pos].force_encoding(Encoding::UTF_8).size
        @contents.force_encoding(Encoding::UTF_8)
      end
      begin
        i = @contents.send(method, re, offset)
        if i
          m = Regexp.last_match
          if m.nil?
            # A bug of rindex
            @match_offsets.push([pos, pos])
            pos
          else
            b = m.pre_match.bytesize
            e = b + m.to_s.bytesize
            if e <= bytesize
              @match_offsets.push([b, e])
              match_beg = m.begin(0)
              match_str = m.to_s
              (1 .. m.size - 1).each do |j|
                cb, ce = m.offset(j)
                if cb.nil?
                  @match_offsets.push([nil, nil])
                else
                  bb = b + match_str[0, cb - match_beg].bytesize
                  be = b + match_str[0, ce - match_beg].bytesize
                  @match_offsets.push([bb, be])
                end
              end
              b
            else
              nil
            end
          end
        else
          nil
        end
      ensure
        @contents.force_encoding(Encoding::ASCII_8BIT)
      end
    end
```

As long as copy-on-write works, the performance of the code would not be so bad, but it
looks terrible.

A text editor is just an example, and my take is that ways to get byte offsets should be
provided because we already have byteslice.  Otherwise, byteslice is not so useful.
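
For comparison, here is a stripped-down sketch of how byte offsets can be obtained
with existing methods only. `byte_offsets_of_match` is a hypothetical helper, not
an existing String method, and it assumes that `byte_pos` falls on a character
boundary.

```ruby
# Hypothetical helper: returns [begin, end] byte offsets of the first match of
# re in str at or after byte_pos, or nil if there is no match.
def byte_offsets_of_match(str, re, byte_pos = 0)
  # Regexp#match takes a character position, so convert the byte offset first.
  char_pos = str.byteslice(0, byte_pos).size
  m = re.match(str, char_pos)
  return nil unless m
  b = m.pre_match.bytesize          # bytes before the match
  [b, b + m[0].bytesize]
end

s = "日本語 Ruby 日本語"
first = byte_offsets_of_match(s, /Ruby/)
second = byte_offsets_of_match(s, /日本/, 3)
matched = s.byteslice(first[0], first[1] - first[0])
```

Both conversions (byte offset to character position, and `pre_match` back to a
byte count) create temporary strings just to compute offsets that byteslice then
needs.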


----------------------------------------
Feature #13110: Byte-based operations for String
https://bugs.ruby-lang.org/issues/13110#change-62620

* Author: Shugo Maeda
* Status: Open
* Priority: Normal
* Assignee: 
* Target version: 
----------------------------------------
How about adding byte-based operations for String?

```ruby
s = "あああいいいあああ"
p s.byteindex(/ああ/, 4) #=> 18
x, y = Regexp.last_match.byteoffset(0) #=> [18, 24]
s.bytesplice(x...y, "おおお")
p s #=> "あああいいいおおおあ"
```


---Files--------------------------------
byteindex.diff (2.83 KB)


-- 
https://bugs.ruby-lang.org/
