From: "tenderlovemaking (Aaron Patterson)" <noreply@...>
Date: 2022-02-10T16:34:56+00:00
Subject: [ruby-core:107554] [Ruby master Feature#18579] Concatenation of ASCII-8BIT strings shouldn't behave differently depending on string contents

Issue #18579 has been updated by tenderlovemaking (Aaron Patterson).


duerst (Martin Dürst) wrote in #note-8:
> I agree that concatenating an ASCII-8BIT string with a non-ASCII-8BIT string is usually a bug. That's because ASCII-8BIT usually stands for BINARY. But if it stands for BINARY, then also
> ```Ruby
> concat "bar".encode("US-ASCII"),  "foo".b                     # Return value encoding is LHS, US-ASCII
> concat "ほげ",                    "foo".b                     # Return value encoding is LHS, UTF-8
> ```
> should produce an error, because "foo" in that case is just a way to show some byte values, and doesn't actually represent the three letters "f", "o", and "o".
> 
> So I think we either have to keep the idea that the first 128 values of ASCII-8BIT/BINARY mean ASCII characters, or we have to abandon that idea, but we have to be consistent.

I agree those two cases should raise an exception, but I thought a less aggressive proposal would have a greater than 0 chance of being accepted.

> Maybe a command-line flag to debug bad BINARY usages would be helpful, and that could e.g. raise on concatenating a string of any other encoding with a BINARY String.

This could probably satisfy my needs.  As I said, the underlying problem is that a binary string can enter the application, and due to the "infectious nature" of the encoding system, an exception can get raised but it is *far* from the origin of the binary string.

The binary string could have come from reading a file, reading from the network, decoding a URI or an HTML form (arbitrary binary data can be percent encoded), etc.  In a large system it's hard to tell what the origin of the binary string is, and the correct fix is to patch the code at the origin of the string.
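
To illustrate what "patching the code at the origin" could look like, here is a minimal sketch. `read_text` is a hypothetical helper (not something from this thread or from Ruby itself), and it assumes the incoming bytes are supposed to be UTF-8 text. The idea is to tag and validate the bytes where they enter the program, so a failure points at the source rather than at some later concatenation:

```ruby
require "stringio"

# Hypothetical helper: read bytes, tag them as UTF-8 at the boundary,
# and fail immediately if they are not valid UTF-8.
def read_text(io, length)
  bytes = io.read(length)                        # read with a length returns a binary (ASCII-8BIT) string
  text  = bytes.force_encoding(Encoding::UTF_8)  # relabel the bytes; no transcoding happens
  raise ArgumentError, "invalid UTF-8 at input boundary: #{text.inspect}" unless text.valid_encoding?
  text
end

good = read_text(StringIO.new("caf\u00e9"), 5)   # 5 bytes: c, a, f, 0xC3, 0xA9
combined = "\u00e9t\u00e9 " + good               # safe: both sides are valid UTF-8

begin
  read_text(StringIO.new("bad\376\377str".b), 8)
rescue ArgumentError => e
  warn e.message                                 # raised at the origin, not at a later concatenation
end
```

Whether UTF-8 is the right encoding to assume depends on the data source; truly binary payloads should stay binary and never be concatenated with text.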


----------------------------------------
Feature #18579: Concatenation of ASCII-8BIT strings shouldn't behave differently depending on string contents
https://bugs.ruby-lang.org/issues/18579#change-96466

* Author: tenderlovemaking (Aaron Patterson)
* Status: Rejected
* Priority: Normal
----------------------------------------
Currently strings tagged with ASCII-8BIT will behave differently when concatenating depending on the string contents.

When concatenating strings the resulting string has the encoding of the LHS.  For example:

```ruby
z = a + b
```

`z` will have the encoding of `a` (if the encodings are compatible).


However, `ASCII-8BIT` behaves differently.  If `b` has `ASCII-8BIT` encoding, then the encoding of `z` will sometimes be the encoding of `a`, sometimes the encoding of `b`, and sometimes the concatenation will raise an exception.

Here is an example program:

```ruby
def concat a, b
  str = a + b
  str
end

concat "bar",                     "foo".encode("US-ASCII")    # Return value encoding is LHS, UTF-8
concat "bar".encode("US-ASCII"),  "foo".b                     # Return value encoding is LHS, US-ASCII
concat "ほげ",                    "foo".b                     # Return value encoding is LHS, UTF-8
concat "bar",                     "bad\376\377str".b          # Return value encoding is RHS, ASCII-8BIT.  Why?
concat "ほげ",                    "bad\376\377str".b          # Exception
```
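
The outcomes above can be predicted without concatenating anything. As far as I know, `Encoding.compatible?` returns the encoding the result would have, or `nil` when the concatenation would raise, and the deciding factor is whether each side is ASCII-only. A small sketch (the `result_encoding` helper name is mine, purely for illustration):

```ruby
# Hypothetical helper: the encoding `a + b` would have, or nil if it would raise.
def result_encoding(a, b)
  Encoding.compatible?(a, b)
end

ascii_utf8 = "bar"                # UTF-8, ASCII-only
wide_utf8  = "caf\u00e9"          # UTF-8, contains a non-ASCII character
ascii_bin  = "foo".b              # binary, ASCII-only
high_bin   = "bad\376\377str".b   # binary, contains bytes above 0x7F

[[ascii_utf8, ascii_bin],
 [wide_utf8,  ascii_bin],
 [ascii_utf8, high_bin],
 [wide_utf8,  high_bin]].each do |a, b|
  puts "lhs ascii_only=#{a.ascii_only?} rhs ascii_only=#{b.ascii_only?} " \
       "=> #{result_encoding(a, b).inspect}"
end
```

The third pair is the surprising one: an ASCII-only LHS gives way to the RHS encoding, so the result is binary. The fourth pair yields `nil`, which corresponds to the exception.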

This behavior is too hard to understand.  Usually we expect that the LHS encoding will win, or that there will be an exception.  Even worse, string concatenation can "infect" strings.  For example:


```ruby
def concat a, b
  str = a + b
  str
end

str = concat "bar", "bad\376\377str".b # this worked
p str
str = concat "ほげ", str               # exception
p str
```

The first concatenation succeeded, but the second one failed.  As a developer it is difficult to find where the "bad string" was introduced.  The string may have been read from the network, but by the time an exception is raised it is far from where the "bad string" originated.  In the above example, the bad data came from line 6, but the exception was raised on line 8.

I propose that ASCII-8BIT strings raise an exception if they cannot be converted into the LHS encoding.  So the above program would become like this:

```ruby
def concat a, b
  str = a + b
  str
end

concat "bar",                     "foo".encode("US-ASCII")    # Return value encoding is LHS, UTF-8
concat "bar".encode("US-ASCII"),  "foo".b                     # Return value encoding is LHS, US-ASCII
concat "ほげ",                    "foo".b                     # Return value encoding is LHS, UTF-8
concat "bar",                     "bad\376\377str".b          # Exception <--- NEW!!
concat "ほげ",                    "bad\376\377str".b          # Exception
```
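
For anyone who wants to experiment with this behavior today, the proposed rule can be approximated in plain Ruby. This is only a sketch: `strict_concat` is a hypothetical helper, and it reads "cannot be converted into the LHS encoding" conservatively as "the binary RHS contains non-ASCII bytes", which is an assumption on my part rather than a settled definition:

```ruby
# Hypothetical helper approximating the proposed behavior; not part of Ruby.
# Assumption: a binary RHS is acceptable only when it is ASCII-only.
def strict_concat(a, b)
  rhs_binary = b.encoding == Encoding::BINARY
  lhs_binary = a.encoding == Encoding::BINARY
  if rhs_binary && !lhs_binary && !b.ascii_only?
    raise Encoding::CompatibilityError,
          "non-ASCII binary string appended to #{a.encoding} string"
  end
  a + b
end

ok = strict_concat("bar", "foo".b)        # LHS encoding wins (UTF-8)

begin
  strict_concat("bar", "bad\376\377str".b) # plain `+` would silently return a binary string here
rescue Encoding::CompatibilityError => e
  warn e.message
end
```

With this rule the failure is reported at the first concatenation that involves the binary data, instead of at a later one that happens to have a non-ASCII LHS.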


I'm open to other solutions, but the underlying issue is that concatenating an ASCII-8BIT string with a non-ASCII-8BIT string is usually a bug and by the time an exception is raised, it is very far from the origin of the string.


