From: "Eregon (Benoit Daloze)"
Date: 2022-08-23T13:44:19+00:00
Subject: [ruby-core:109642] [Ruby master Bug#18972] String#byteslice should return BINARY (aka ASCII-8BIT) Strings

Issue #18972 has been updated by Eregon (Benoit Daloze).

I think the current behavior is better: `String#byteslice` is not only used for BINARY strings.
In fact, for binary strings (and other fixed-width encodings), there is no point in using `byteslice` over `slice`/`[]`.

For instance, one might work with UTF-8 and get a byte index (instead of a character index), e.g. from `String#byteindex` or from `MatchData#byteoffset`, and then one would use `byteslice` to avoid 2 extra byte offset<->character offset conversions, which are expensive for e.g. (non-7-bit) UTF-8.
What I just described is close to the motivation for #13110, which added `String#byteindex`.

So I think we cannot change this for compatibility, and it is intended AFAIK.

----------------------------------------
Bug #18972: String#byteslice should return BINARY (aka ASCII-8BIT) Strings
https://bugs.ruby-lang.org/issues/18972#change-98866

* Author: byroot (Jean Boussier)
* Status: Open
* Priority: Normal
* Backport: 2.7: UNKNOWN, 3.0: UNKNOWN, 3.1: UNKNOWN
----------------------------------------
While working on implementing https://bugs.ruby-lang.org/issues/13626, I noticed that `byteslice` assigns the receiver's encoding to the returned String.

I believe this is incorrect: since you are doing a byte-based operation, you expect a binary string in return. Otherwise, if you call it on a UTF-8 string, you'd likely get a string with an invalid encoding.

I read the original feature request and there's no mention of what the returned encoding should be: https://bugs.ruby-lang.org/issues/4447

### Current behavior

```ruby
>> "fée".byteslice(1).valid_encoding?
=> false
>> "fée".byteslice(1).encoding
=> #<Encoding:UTF-8>
```

### Expected behavior

```ruby
>> "fée".byteslice(1).valid_encoding?
=> true
>> "fée".byteslice(1).encoding
=> #<Encoding:ASCII-8BIT>
```

### Backward compatibility concerns

I'm honestly not quite sure what the backward incompatibility impact may be. From my point of view, if you are calling `byteslice`, it's to use the result with other binary strings, but it's indeed possible that there is existing code mixing UTF-8 and BINARY that somewhat works and would be broken by this change.

Especially since binary strings can silently be promoted from BINARY to UTF-8:

```ruby
buffer = "".b
buffer << "fée" # buffer was promoted to Encoding::UTF-8 silently
buffer << "fée".byteslice(1)
```

The above currently "works", but would raise `Encoding::CompatibilityError` with this change.

--
https://bugs.ruby-lang.org/
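
For readers who want to reproduce the behaviour discussed in this thread, here is a minimal sketch. It assumes UTF-8 string literals and uses only `String#byteslice`, `String#b`, and `String#bytesize`. The variable names are illustrative, and `String#b` merely stands in for what the issue proposes `byteslice` should return; it is not the proposed implementation.

```ruby
# The issue's example string: "f", e-acute, "e". The e-acute takes two bytes
# in UTF-8, so the bytes are 66 C3 A9 65.
str = "f\u00E9e"

# A byte offset pointing at the final "e". On Ruby 3.2+ this could come from
# String#byteindex or MatchData#byteoffset instead.
offset = "f\u00E9".bytesize # => 3

# Cutting on a character boundary: the result keeps the receiver's encoding
# and stays valid, with no byte<->character offset conversion needed.
tail = str.byteslice(offset, str.bytesize - offset)

# Cutting inside the two-byte character: still tagged UTF-8, but invalid.
broken = str.byteslice(1)

# Re-tagging a copy as BINARY, roughly what the issue asks byteslice to return.
binary = broken.b

# The compatibility concern: an empty binary buffer silently takes on UTF-8
# once a non-ASCII UTF-8 string is appended to it.
buffer = "".b
buffer << str
promoted_encoding = buffer.encoding

# Appending the UTF-8-tagged slice succeeds today (same encoding on both sides).
buffer << broken

# Appending a BINARY slice with a non-ASCII byte to a non-ASCII UTF-8 buffer
# is rejected.
fresh = "".b
fresh << str
append_error =
  begin
    fresh << binary
    nil
  rescue Encoding::CompatibilityError => e
    e.class
  end
```

Code that wants a binary result today can call `.b` on the slice explicitly, as the sketch does, at the cost of one extra String allocation.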