From: "Eregon (Benoit Daloze)" <noreply@...>
Date: 2022-02-10T14:07:44+00:00
Subject: [ruby-core:107552] [Ruby master Feature#18579] Concatenation of ASCII-8BIT strings shouldn't behave differently depending on string contents

Issue #18579 has been updated by Eregon (Benoit Daloze).

In short, it's not specific to the binary encoding at all. It's the same behavior with e.g. UTF-8 + ISO-8859-1.
Another way to express the rules: if the String only has 7-bit ASCII characters, treat it as if it were US-ASCII.
Appending US-ASCII characters to any ASCII-compatible string is of course fine, and vice versa.

So basically concatenation currently works whenever it can, and that's the current model for users.
It only raises if the encodings are not the same and (both sides have non-7-bit characters OR one side is ascii-incompatible).
The most specific encoding is picked if there is a choice (US-ASCII + UTF-8 -> UTF-8 is natural, isn't it?), and all other things being equal it is the LHS encoding.

There is a weird edge case, which @nirvdrum mentions: empty strings are considered CR_7BIT even if their encoding is not ascii-compatible. I think we should try to remove that special case.

The bigger question is whether Encoding::BINARY should be ascii-compatible.
It always was, and I think changing that would break the world, but it might still be worth an experiment.
Maybe a command-line flag to debug bad BINARY usages would be helpful, and that could e.g. raise on concatenating a string of any other encoding with a BINARY String.

---

I think part of the issue is that:
```ruby
"abc".b + "été" # => UTF-8 and not BINARY
# same for "abc".b << "été"
```
which relates to #14975.
Maybe `anything + BINARY -> BINARY` and `BINARY + anything -> BINARY` would be more intuitive/safer, but it would also just make the rules more complicated, wouldn't it?
Also that would accept anything on the other side, while currently it raises, which seems more helpful (`"abé".b + "été" # => Encoding::CompatibilityError`).

---

Avoiding binary Strings in core, for example for exception messages (using US-ASCII or UTF-8 instead), is I think something worth doing independently.
TruffleRuby already uses UTF-8 Strings for exception messages.

----------------------------------------
Feature #18579: Concatenation of ASCII-8BIT strings shouldn't behave differently depending on string contents
https://bugs.ruby-lang.org/issues/18579#change-96464

* Author: tenderlovemaking (Aaron Patterson)
* Status: Rejected
* Priority: Normal
----------------------------------------
Currently strings tagged with ASCII-8BIT will behave differently when concatenating depending on the string contents.

When concatenating strings the resulting string has the encoding of the LHS. For example:

```
z = a + b
```

`z` will have the encoding of `a` (if the encodings are compatible).

However `ASCII-8BIT` behaves differently. If `b` has "ASCII-8BIT" encoding, then the encoding of `z` will sometimes be the encoding of `a`, sometimes it will be the encoding of `b`, and sometimes it will be an exception.

Here is an example program:

```ruby
def concat a, b
  str = a + b
  str
end

concat "bar", "foo".encode("US-ASCII")  # Return value encoding is LHS, UTF-8
concat "bar".encode("US-ASCII"), "foo".b # Return value encoding is LHS, US-ASCII
concat "ほげ", "foo".b                   # Return value encoding is LHS, UTF-8
concat "bar", "bad\376\377str".b         # Return value encoding is RHS, ASCII-8BIT. Why?
concat "ほげ", "bad\376\377str".b        # Exception
```

This behavior is too hard to understand. Usually we think LHS encoding will win, or there will be an exception. Even worse is that string concatenation can "infect" strings.
For example:

```ruby
def concat a, b
  str = a + b
  str
end

str = concat "bar", "bad\376\377str".b # this worked
p str
str = concat "ほげ", str # exception
p str
```

The first concatenation succeeded, but the second one failed. As a developer it is difficult to find where the "bad string" was introduced. In the above example, the string may have been read from the network, but by the time an exception is raised it is far from where the "bad string" originated. In the above example, the bad data came from line 6, but the exception was raised on line 8.

I propose that ASCII-8BIT strings raise an exception if they cannot be converted into the LHS encoding. So the above program would become like this:

```ruby
def concat a, b
  str = a + b
  str
end

concat "bar", "foo".encode("US-ASCII")  # Return value encoding is LHS, UTF-8
concat "bar".encode("US-ASCII"), "foo".b # Return value encoding is LHS, US-ASCII
concat "ほげ", "foo".b                   # Return value encoding is LHS, UTF-8
concat "bar", "bad\376\377str".b         # Exception <--- NEW!!
concat "ほげ", "bad\376\377str".b        # Exception
```

I'm open to other solutions, but the underlying issue is that concatenating an ASCII-8BIT string with a non-ASCII-8BIT string is usually a bug, and by the time an exception is raised, it is very far from the origin of the string.
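
The current behavior discussed in this thread can be reproduced with a short script. The following is only a sketch: `concat_encoding` and `concat_raises?` are hypothetical helper names, not part of Ruby. `concat_raises?` encodes the rule stated in the reply ("it only raises if the encodings are not the same and both sides have non-7-bit characters OR one side is ascii-incompatible") and deliberately ignores the empty-string edge case mentioned there.

```ruby
# Returns the encoding of a + b, or nil if the concatenation raises.
def concat_encoding(a, b)
  (a + b).encoding
rescue Encoding::CompatibilityError
  nil
end

# Hypothetical helper (not a Ruby API): predicts whether a + b raises,
# following the rule quoted from the reply. Assumes both strings are non-empty.
def concat_raises?(a, b)
  return false if a.encoding == b.encoding
  return true unless a.encoding.ascii_compatible? && b.encoding.ascii_compatible?
  # Different ASCII-compatible encodings: only a problem when both sides
  # contain non-7-bit characters.
  !a.ascii_only? && !b.ascii_only?
end

utf8_text  = "\u00E9t\u00E9"      # non-ASCII UTF-8 text
binary_raw = "bad\376\377str".b   # binary string with non-7-bit bytes

cases = [
  ["bar", "foo".encode("US-ASCII")],
  ["bar".encode("US-ASCII"), "foo".b],
  [utf8_text, "foo".b],
  ["bar", binary_raw],
  [utf8_text, binary_raw],
]

cases.each do |a, b|
  actual = concat_encoding(a, b)
  outcome = actual ? actual.name : "raises"
  puts "#{a.encoding.name} + #{b.encoding.name} => #{outcome} " \
       "(predicted raise: #{concat_raises?(a, b)})"
end
```

`Encoding.compatible?(a, b)` gives the same answer without performing the concatenation: it returns the resulting encoding, or `nil` when `a + b` would raise.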