From: eregontp@...
Date: 2018-12-28T17:11:22+00:00
Subject: [ruby-core:90775] [Ruby trunk Feature#14975] String#append without changing receiver's encoding

Issue #14975 has been updated by Eregon (Benoit Daloze).


duerst (Martin Dürst) wrote:
> In any way, when adding stuff to a (BINARY) buffer, the right thing conceptually is to change all the pieces to BINARY, not to rely on some of the pieces (be it the first or another) to be BINARY.

If the LHS is not BINARY then it can't reasonably be called a "binary buffer", so let's assume RHS is binary.

I think we all agree `binary_string << whatever` should leave `binary_string.encoding` as `BINARY` (even if `binary_string` initially contains only bytes < 128).
I wonder what would break if we changed this specific behavior.
Maybe somebody could try and report what failures we get from the test suites?
Things like `binary << utf8` changing `binary`'s encoding to `UTF-8` seem nonsensical and are never what anyone wants (UTF-8 is not a superset of BINARY: some byte sequences are invalid in UTF-8).

OTOH `usascii << utf8` changing the LHS's encoding to `UTF-8` seems much more reasonable (it's still "text", not binary data, and UTF-8 is a clear superset of US-ASCII).
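
To make the cases above concrete, here is a minimal sketch of what `<<` does today, as far as I understand it. `encoding_after_append` is a hypothetical helper written only for this illustration; it is not part of Ruby.

```ruby
# Hypothetical helper (not part of Ruby): appends `rhs` to a copy of `lhs`
# and returns the resulting encoding, or the error class if the append raises.
def encoding_after_append(lhs, rhs)
  (lhs.dup << rhs).encoding
rescue Encoding::CompatibilityError => e
  e.class
end

utf8 = "\u00E9" # a non-ASCII UTF-8 string

# Binary receiver holding only ASCII bytes: the receiver's encoding changes.
p encoding_after_append("abc".b, utf8)

# Binary receiver holding a byte >= 128: the append raises.
p encoding_after_append("\xFF".b, utf8)

# US-ASCII receiver: the encoding is widened to the argument's.
p encoding_after_append("abc".encode(Encoding::US_ASCII), utf8)
```

The first case is the one I would call nonsense; the third is the one that looks reasonable.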

Maybe negotiating a compatible encoding by finding a superset (although I'd raise on `ISO-8859-1 << UTF-8` rather than use BINARY, because it cannot work well if both have non-US-ASCII characters),
and only considering the String#encoding and not the coderange (the range of the bytes) would be a way to have clear semantics for appends.
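
A rough sketch of such a rule, looking only at the two encodings and never at the bytes. The method name and the exact set of cases are assumptions made for illustration, not a worked-out proposal; in particular, anything that is not covered by "same encoding", "binary receiver" or "US-ASCII on one side" simply raises here.

```ruby
# Hypothetical: picks the receiver's encoding after `lhs << rhs` from the
# two encodings alone, ignoring the coderange of either string.
def negotiated_append_encoding(lhs_enc, rhs_enc)
  return lhs_enc if lhs_enc == rhs_enc
  # A binary receiver stays binary, whatever is appended.
  return lhs_enc if lhs_enc == Encoding::BINARY
  # US-ASCII is a subset of every ASCII-compatible encoding,
  # so the other side is the superset.
  return rhs_enc if lhs_enc == Encoding::US_ASCII && rhs_enc.ascii_compatible?
  return lhs_enc if rhs_enc == Encoding::US_ASCII && lhs_enc.ascii_compatible?
  # No superset relation (e.g. ISO-8859-1 << UTF-8): raise, do not guess.
  raise Encoding::CompatibilityError,
        "no superset encoding for #{lhs_enc} << #{rhs_enc}"
end
```

With this, the result depends only on `String#encoding` of both sides, so the same code cannot succeed for ASCII-only data and fail later for other data.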

> The problem with that is that it either changes the appended string's encoding (with `.force_encoding 'BINARY'`) or needs another copy (with `.b`).

If #append is changed or a new method is added, there is no need for any extra copying of course: the bytes are just copied into the receiver's buffer.
But the allocation should be pretty cheap (and it can be escape-analyzed), and no copy or scan should be needed for `super(string.b)`.
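
For comparison, the two workarounds from the quote could look roughly like this. `append_bytes!` is a hypothetical name, and this is only a sketch of the trade-off, not a recommendation.

```ruby
# Hypothetical helper: appends the raw bytes of `string` to a binary `buffer`.
def append_bytes!(buffer, string)
  # A frozen argument cannot be re-tagged, so fall back to a binary copy.
  return buffer << string.b if string.frozen?

  original = string.encoding
  begin
    # Temporarily relabel the argument as binary; its bytes are untouched.
    string.force_encoding(Encoding::BINARY)
    buffer << string
  ensure
    # Restore the caller's encoding even if the append raised.
    string.force_encoding(original)
  end
end

buffer = "\xFF".b
text = "\u00E9".dup
append_bytes!(buffer, text)
```

The `force_encoding` path avoids the extra String but briefly mutates the argument, which is not safe if another thread can see it; the `.b` path is the allocation discussed above.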

----------------------------------------
Feature #14975: String#append without changing receiver's encoding
https://bugs.ruby-lang.org/issues/14975#change-75949

* Author: ioquatix (Samuel Williams)
* Status: Open
* Priority: Normal
* Assignee: ioquatix (Samuel Williams)
* Target version: 2.7
----------------------------------------
I'm not sure where this fits in, but in order to avoid garbage and superfluous function calls, is it possible that `String#<<`, `String#concat` or the (proposed) `String#append` can avoid changing the encoding of the receiver?

Right now it's very tricky to do this in a way that doesn't require extra allocations. Here is what I do:

```ruby
class Buffer < String
	BINARY = Encoding::BINARY
	
	def initialize
		super
		
		force_encoding(BINARY)
	end
	
	def << string
		if string.encoding == BINARY
			super(string)
		else
			super(string.b) # Requires extra allocation.
		end
		
		return self
	end
	
	alias concat <<
end
```

When the receiver is binary but contains non-ASCII byte sequences, appending a non-ASCII UTF-8 string can fail:

```
> "Foobar".b << "Føøbar"
=> "FoobarFøøbar"

> "Føøbar".b << "Føøbar"
Encoding::CompatibilityError: incompatible character encodings: ASCII-8BIT and UTF-8
```

So, it's not possible to append data, generally, and then call `force_encoding(Encoding::BINARY)`. One must ensure the string is binary before appending it.

It would be nice if there was a solution which didn't require additional allocations/copies/linear scans for what should basically be a `memcpy`.

See also: https://bugs.ruby-lang.org/issues/14033 and https://bugs.ruby-lang.org/issues/13626#note-3

There are two options to fix this:

1/ Don't change receiver encoding in any case.
2/ Apply 1, but only when receiver is using `Encoding::BINARY`

