From: "ioquatix (Samuel Williams)" <noreply@...>
Date: 2022-09-26T12:45:17+00:00
Subject: [ruby-core:110087] [Ruby master Feature#14900] Extra allocation in String#byteslice

Issue #14900 has been updated by ioquatix (Samuel Williams).


Okay, so @byroot and I discussed this issue at length.

The simplest way to get the behaviour you want is to call `freeze` on the source string.

```ruby
require "objspace"

string = "a" * 100_000

GC.start
GC.disable
generation = GC.count

string.freeze # ADD THIS LINE

ObjectSpace.trace_object_allocations do
  string.byteslice(50_000..-1)

  ObjectSpace.each_object(String) do |string|
    p string.bytesize if ObjectSpace.allocation_generation(string) == generation
  end
end
```

This has the desired result: the source string is not copied, but is used internally as the shared buffer for the result of `string.byteslice`. The trade-off is that you can no longer modify the source string.

It works this way because, as @funny_falcon pointed out, many people slice the front off a buffer repeatedly. If the buffer is huge, copying the remainder each time would result in many large `memcpy` calls.

Freezing the string yourself avoids this `memcpy`. If you don't, `byteslice` sets up the shared buffer for you internally, at the cost of one internal copy of the source. It's a bit more nuanced than that, because small strings are always copied, so sharing only kicks in for larger strings.
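To see which case a given slice falls into, one option is to look at the flags that `ObjectSpace.dump` reports. This is only a rough sketch: `storage_flags` is a hypothetical helper, `ObjectSpace.dump` is CRuby-specific, and I'm assuming the `"embedded"` and `"shared"` keys appear in its JSON output only when they are true. The exact results will vary between Ruby versions.

```ruby
require "objspace"
require "json"

# Hypothetical helper: extract the storage-related flags that
# ObjectSpace.dump reports for a string (CRuby implementation details).
def storage_flags(string)
  info = JSON.parse(ObjectSpace.dump(string))

  {
    bytesize: string.bytesize,
    embedded: info["embedded"] == true, # bytes stored inline in the object
    shared: info["shared"] == true      # bytes borrowed from another string
  }
end

large = ("a" * 100_000).freeze
small = ("a" * 10).freeze

# Tail slice of a large frozen string: a candidate for sharing.
p storage_flags(large.byteslice(50_000..-1))

# Tail slice of a small string: expected to be copied instead.
p storage_flags(small.byteslice(5..-1))
```

I have deliberately not listed expected output here, since the flags depend on the interpreter version and build.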

So I wanted to think a bit about how to do this efficiently - I think something like this can be pretty good:

``` ruby
buffer = String.new # allocation
while true
	# Efficiently read into the buffer (`io.read` returns nil at EOF):
	if buffer.empty?
		break unless io.read(1024, buffer)
	else
		chunk = io.read(1024) or break
		buffer << chunk
	end

	# Freeze the buffer so it will be shared during processing:
	buffer.freeze

	# Consume the buffer in chunks:
	while size = consume(buffer)
		buffer = buffer.byteslice(size..-1) # shared root string - no memcpy or allocation
	end

	# Unfreeze the buffer if needed.
	buffer = +buffer
end
```
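For anyone who wants to try the loop above, here is a self-contained variant of it. It is a sketch under some assumptions of mine: `read_lines` and this particular `consume` (which treats `"\n"` as the record separator) are hypothetical, a `StringIO` stands in for the real IO, and the buffer is binary (`String.new` with no arguments), so `index` returns a byte offset.

```ruby
require "stringio"

# Hypothetical consumer: if the buffer starts with a complete line, record it
# and return the number of bytes consumed (including "\n"); otherwise nil.
def consume(buffer, lines)
  index = buffer.index("\n") or return nil
  lines << buffer.byteslice(0, index)
  index + 1
end

# Hypothetical driver: reads the IO in chunks and returns all lines.
def read_lines(io, chunk_size = 1024)
  lines = []
  buffer = String.new # binary (ASCII-8BIT) buffer

  loop do
    if buffer.empty?
      break unless io.read(chunk_size, buffer) # nil at EOF
    else
      chunk = io.read(chunk_size) or break
      buffer << chunk
    end

    # Freeze so the slices below can share the buffer:
    buffer.freeze

    while (size = consume(buffer, lines))
      buffer = buffer.byteslice(size..-1)
    end

    # Returns self if not frozen, otherwise an unfrozen copy:
    buffer = +buffer
  end

  lines << buffer unless buffer.empty? # trailing data without "\n"
  lines
end

p read_lines(StringIO.new("alpha\nbeta\ngamma"), 4)
```

The small chunk size is only there to force lines to span several reads. Whether the slices are actually shared depends on their size, as noted earlier.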

The proposed PR basically skips the internal sharing mechanism unless you call `buffer.freeze`. In current Ruby, freezing is optional: if you don't freeze the string, Ruby is forced to create an internal dup, which is what you are seeing.
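To compare the two cases side by side, the measurement from the original report can be wrapped in a helper. This is a minimal sketch: `string_sizes_allocated` is a hypothetical name, and which sizes show up (in particular whether the 100000-byte dup appears for the unfrozen string) depends on the Ruby version you run it on.

```ruby
require "objspace"

# Hypothetical helper: returns the byte sizes of all Strings allocated
# while the given block runs, using the same approach as the report.
def string_sizes_allocated
  GC.start
  GC.disable
  generation = GC.count
  sizes = []

  ObjectSpace.trace_object_allocations do
    yield

    ObjectSpace.each_object(String) do |object|
      sizes << object.bytesize if ObjectSpace.allocation_generation(object) == generation
    end
  end

  sizes
ensure
  GC.enable
end

unfrozen = "a" * 100_000
frozen = ("a" * 100_000).freeze

# Unfrozen source: may include an internal dup of the whole string.
p string_sizes_allocated { unfrozen.byteslice(50_000..-1) }

# Frozen source: the slice can share the source's buffer directly.
p string_sizes_allocated { frozen.byteslice(50_000..-1) }
```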

We should investigate the performance of the typical IO usage, to see which way is better.


----------------------------------------
Feature #14900: Extra allocation in String#byteslice
https://bugs.ruby-lang.org/issues/14900#change-99342

* Author: janko (Janko Marohnić)
* Status: Open
* Priority: Normal
----------------------------------------
When executing `String#byteslice` with a range, I noticed that sometimes the original string is allocated again. When I run the following script:

~~~ ruby
require "objspace"

string = "a" * 100_000

GC.start
GC.disable
generation = GC.count

ObjectSpace.trace_object_allocations do
  string.byteslice(50_000..-1)

  ObjectSpace.each_object(String) do |string|
    p string.bytesize if ObjectSpace.allocation_generation(string) == generation
  end
end
~~~

it outputs

~~~
50000
100000
6
5
~~~

The one with 50000 bytes is the result of `String#byteslice`, but the one with 100000 bytes is the duplicated original string. I expected only the result of `String#byteslice` to be amongst new allocations.

If instead of the last 50000 bytes I slice the *first* 50000 bytes, the extra duplication doesn't occur.

~~~ ruby
# ...
  string.byteslice(0, 50_000)
# ...
~~~

~~~
50000
5
~~~

It's definitely ok if `String#byteslice` allocates extra strings as part of its implementation, but it would be nice if they were deallocated before returning the result.

EDIT: It seems that `String#slice` has the same issue.


