From: shugo@...
Date: 2017-01-20T12:26:22+00:00
Subject: [ruby-core:79196] [Ruby trunk Feature#13110] Byte-based operations for String

Issue #13110 has been updated by Shugo Maeda.

> The "buffer gap" technique is very well known, I'm familiar with it since the early '90s. I was thinking about it, but I think it won't work with UTF-8. If you have figured out how you would make it work with UTF-8, then please tell us.
>
> Here is why I think it won't work with UTF-8. The problem is that you can't move characters from before the gap to after or the other way round and change them when there are edits. If some characters are changed, they might change their byte length. But if you want to keep the string as valid UTF-8, you have to constantly fix the content of the gap. One could imagine using two separate String objects, one for before the gap and one for after. For before the gap, it actually might work quite well (as long as Ruby doesn't shorten the memory allocated to a string when the string contents are truncated), but for after the gap, it won't work, because every insertion or deletion at the end of the gap will make the string contents shift around.

In my implementation, the gap is kept filled with NUL, and moved to the end of the buffer when regular expression search is needed.

> > > More generally, what I'm afraid of is that with this, we start to more and more expose String internals. That can easily lead to problems.
> > >
> > > Some people may copy a Ruby snippet using byteindex, then add 1 to that index because they think that's how to get to the next character. Others may start to use byteindex everywhere, even if it's absolutely not necessary. Others may demand byte- versions of more and more operations on strings. We have seen all of this in other contexts.
> >
> > Doesn't this concern apply to `byteslice`?
>
> Yes, it does. The less we have of such kinds of methods, the better.
>
> Anyway, one more question: Are you really having performance problems, or are you just worried about performance? Compared to today's hardware speed, human editing is extremely slow, and for most operations, there should be no delay whatsoever.

Using character indices was slow, but my current implementation uses ASCII-8BIT strings whose contents are encoded in UTF-8, so there's no performance problem while editing Japanese text whose size is over 10MB.
However, the implementation has the following terrible method:

```ruby
def byteindex(forward, re, pos)
  @match_offsets = []
  method = forward ? :index : :rindex
  # Move the gap to the end of the buffer before searching.
  adjust_gap(0, point_max)
  if @binary
    offset = pos
  else
    # Convert the byte position to a character offset.
    offset = @contents[0...pos].force_encoding(Encoding::UTF_8).size
    @contents.force_encoding(Encoding::UTF_8)
  end
  begin
    i = @contents.send(method, re, offset)
    if i
      m = Regexp.last_match
      if m.nil?
        # A bug of rindex
        @match_offsets.push([pos, pos])
        pos
      else
        # Byte offsets of the whole match.
        b = m.pre_match.bytesize
        e = b + m.to_s.bytesize
        if e <= bytesize
          @match_offsets.push([b, e])
          match_beg = m.begin(0)
          match_str = m.to_s
          # Convert each capture group's character offsets to byte offsets.
          (1 .. m.size - 1).each do |j|
            cb, ce = m.offset(j)
            if cb.nil?
              @match_offsets.push([nil, nil])
            else
              bb = b + match_str[0, cb - match_beg].bytesize
              be = b + match_str[0, ce - match_beg].bytesize
              @match_offsets.push([bb, be])
            end
          end
          b
        else
          nil
        end
      end
    else
      nil
    end
  ensure
    # Always restore the binary encoding of the buffer.
    @contents.force_encoding(Encoding::ASCII_8BIT)
  end
end
```

As long as copy-on-write works, the performance of the code would not be so bad, but it looks terrible.

A text editor is just an example, and my take is that ways to get byte offsets should be provided because we already have byteslice. Otherwise, byteslice is not so useful.

----------------------------------------
Feature #13110: Byte-based operations for String
https://bugs.ruby-lang.org/issues/13110#change-62620

* Author: Shugo Maeda
* Status: Open
* Priority: Normal
* Assignee:
* Target version:
----------------------------------------
How about adding byte-based operations for String?
```ruby
s = "あああいいいあああ"
p s.byteindex(/ああ/, 4) #=> 18
x, y = Regexp.last_match.byteoffset(0) #=> [18, 24]
s.bytesplice(x...y, "おおお")
p s #=> "あああいいいおおおあ"
```

---Files--------------------------------
byteindex.diff (2.83 KB)

-- 
https://bugs.ruby-lang.org/
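
For comparison, here is a minimal sketch of how byte offsets can be obtained without the proposed methods, using only a character-based search plus `bytesize` and `byteslice`. This is the kind of conversion the message argues should not be necessary. `byte_offsets` is a hypothetical helper introduced for illustration; it is not part of byteindex.diff, and it assumes the starting byte offset falls on a character boundary.

```ruby
# Hypothetical helper (not part of the proposal): search `str` for `re`,
# starting at byte offset `byte_pos`, and return the [begin, end) byte
# offsets of the match, or nil if there is no match.
# Assumption: `byte_pos` lies on a character boundary.
def byte_offsets(str, re, byte_pos = 0)
  # Byte offset -> character index. This counts every character before
  # `byte_pos`, which is the cost a native byte-based search would avoid.
  char_pos = str.byteslice(0, byte_pos).size
  m = re.match(str, char_pos)
  return nil unless m

  # Character-based match -> byte offsets.
  b = m.pre_match.bytesize
  [b, b + m[0].bytesize]
end

s = "あああいいいあああ"              # sample text, 3 bytes per character
x, y = byte_offsets(s, /ああ/, 6)     # x == 18, y == 24

# Replace the matched byte range using byteslice only.
replaced = s.byteslice(0, x) + "おおお" + s.byteslice(y, s.bytesize - y)
```

Each call walks the string twice (once to turn the byte offset into a character index, once inside `pre_match.bytesize`), so the cost grows with the position in the buffer rather than with the size of the match.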