From: "Eregon (Benoit Daloze)" <noreply@...>
Date: 2022-08-23T13:44:19+00:00
Subject: [ruby-core:109642] [Ruby master Bug#18972] String#byteslice should return BINARY (aka ASCII-8BIT) Strings

Issue #18972 has been updated by Eregon (Benoit Daloze).


I think the current behavior is better: `String#byteslice` is not only used for BINARY strings.
In fact, for binary strings (and other fixed-width encodings), there is no point in using `byteslice` over `slice`/`[]`.
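
A small sketch of that point, using an illustrative string chosen only for this example: on a binary string the two methods return the same result, while on a UTF-8 string the same numeric arguments mean bytes for `byteslice` and characters for `[]`.

```ruby
bin = "fée".b                              # 4 bytes, binary encoding
same = bin.byteslice(1, 2) == bin[1, 2]    # byte and character indexing coincide here

utf8 = "fée"                               # 3 characters, 4 bytes in UTF-8
by_bytes = utf8.byteslice(1, 2)            # bytes 1..2, i.e. "é"
by_chars = utf8[1, 2]                      # characters 1..2, i.e. "ée"
```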

For instance, one might work with a UTF-8 string and get a byte index (instead of a character index) from e.g. `String#byteindex` or `MatchData#byteoffset`, and then use `byteslice` to avoid two extra byte offset <-> character offset conversions, which are expensive for non-7-bit UTF-8.
What I just described is close to the motivation for #13110 which added `String#byteindex`.
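
A minimal sketch of that use case, assuming a UTF-8 source file. The helper `byte_offset_of` and the sample text are hypothetical, introduced only for illustration. The helper falls back to a character-based lookup on Rubies where `String#byteindex` is not available.

```ruby
# Hypothetical helper: byte offset of `needle` in `haystack`, or nil.
def byte_offset_of(haystack, needle)
  if haystack.respond_to?(:byteindex)
    haystack.byteindex(needle)                       # byte-based lookup, no conversion
  else
    char_index = haystack.index(needle)              # character index
    char_index && haystack[0, char_index].bytesize   # convert to a byte offset
  end
end

text = "héllo wörld"                                 # illustrative UTF-8 string
start = byte_offset_of(text, "wörld")                # byte offset, not character offset
word = text.byteslice(start, "wörld".bytesize)

# The slice starts and ends on character boundaries, so it is a valid
# UTF-8 string and can be used like any other text.
word.encoding         # Encoding::UTF_8
word.valid_encoding?  # true
```

If `byteslice` returned a BINARY string, `word` would need a `force_encoding` call before it could be treated as text again.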

So I think we cannot change this for compatibility reasons, and the current behavior is intended AFAIK.

----------------------------------------
Bug #18972: String#byteslice should return BINARY (aka ASCII-8BIT) Strings
https://bugs.ruby-lang.org/issues/18972#change-98866

* Author: byroot (Jean Boussier)
* Status: Open
* Priority: Normal
* Backport: 2.7: UNKNOWN, 3.0: UNKNOWN, 3.1: UNKNOWN
----------------------------------------
While working on implementing https://bugs.ruby-lang.org/issues/13626, I noticed that `byteslice` assigns the receiver's encoding to the returned String.

I believe this is incorrect: since you are doing a byte-based operation, you expect a binary string in return. Otherwise, if you call it on a UTF-8 string, you are likely to get a string with an invalid encoding.

I read the original feature request and there's no mention of what the returned encoding should be: https://bugs.ruby-lang.org/issues/4447


### Current behavior

```ruby
>> "fée".byteslice(1).valid_encoding?
=> false
>> "fée".byteslice(1).encoding
=> #<Encoding:UTF-8>
```

### Expected behavior

```ruby
>> "fée".byteslice(1).valid_encoding?
=> true
>> "fée".byteslice(1).encoding
=> #<Encoding:ASCII-8BIT>
```

### Backward compatibility concerns

I'm honestly not quite sure what the backward incompatibility impact may be.

From my point of view, if you are calling `byteslice`, it's to use the result with other binary strings, but it's indeed
possible that there is existing code mixing UTF-8 and BINARY that somewhat works and would be broken by this change.

Especially since binary strings can silently be promoted from BINARY to UTF-8:

```ruby
buffer = "".b
buffer << "fée" # buffer was promoted to Encoding::UTF-8 silently
buffer << "fée".byteslice(1)
```

The above currently "works", but would raise `Encoding::CompatibilityError` with this change.
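
The failure can be approximated on current Ruby by calling `String#b` on the slice, as a stand-in for the proposed return value. This is only a sketch of the scenario above, not the actual patch. The `safe` buffer is an assumed workaround, not something proposed in this issue.

```ruby
buffer = "".b
buffer << "fée"                    # empty binary buffer takes the UTF-8 encoding
promoted = buffer.encoding         # Encoding::UTF_8

slice = "fée".byteslice(1).b       # stand-in for a BINARY-returning byteslice

error = begin
  buffer << slice                  # UTF-8 (non-ASCII) + BINARY (non-ASCII)
  nil
rescue Encoding::CompatibilityError => e
  e
end

# Assumed workaround: append only binary strings so the buffer stays binary.
safe = "".b
safe << "fée".b
safe << slice
```

Here `error` holds the `Encoding::CompatibilityError`, while `safe` stays binary and ends up 5 bytes long.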




