ruby-core

Issue #18931 has been reported by nirvdrum (Kevin Menard).

----------------------------------------
Bug #18931: Inconsistent handling of invalid codepoints in String#lstrip and String#rstrip
https://bugs.ruby-lang.org/issues/18931

* Author: nirvdrum (Kevin Menard)
* Status: Open
* Priority: Normal
* ruby -v: ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [arm64-darwin21]
* Backport: 2.7: UNKNOWN, 3.0: UNKNOWN, 3.1: UNKNOWN
----------------------------------------
When attempting to strip a string, there are three basic options when an invalid code point is encountered:

1) Ignore the code point
2) Strip the code point
3) Raise an exception

For background, Ruby does not consider the string's code range for `lstrip` or `rstrip`. It permits stripping a string whose code range is `ENC_CODERANGE_BROKEN`, so long as no invalid code point is encountered in the loop that removes whitespace. What it does when such a code point is encountered, however, is not consistent between `lstrip` and `rstrip`.
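As a small illustration of the background above: at the Ruby level, a broken code range shows up as `valid_encoding?` returning `false`. The variable names here are only for illustration, and `force_encoding` is used so the example does not depend on the source file's encoding.

```ruby
# "\x80" is a lone continuation byte, so it can never be valid UTF-8 by itself.
broken = " \x80abc".dup.force_encoding(Encoding::UTF_8)
clean  = " abc"

p broken.valid_encoding?  # => false
p clean.valid_encoding?   # => true

# Checking validity does not raise and does not alter the bytes.
p broken.bytes            # => [32, 128, 97, 98, 99]
```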

`String#lstrip` will raise an invalid byte sequence error whenever it encounters an invalid code point:

```
> ruby -v
ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [arm64-darwin21]

> ruby -e 'p " \x80abc".lstrip'
-e:1:in `lstrip': invalid byte sequence in UTF-8 (ArgumentError)
	from -e:1:in `<main>'

> ruby -e 'p " \x80 abc".lstrip'
-e:1:in `lstrip': invalid byte sequence in UTF-8 (ArgumentError)
	from -e:1:in `<main>'

> ruby -e 'p "\x80 abc".lstrip'
-e:1:in `lstrip': invalid byte sequence in UTF-8 (ArgumentError)
	from -e:1:in `<main>'

> ruby -e 'p "\x80".lstrip'
-e:1:in `lstrip': invalid byte sequence in UTF-8 (ArgumentError)
	from -e:1:in `<main>'

> ruby -e ' p " a\x80bc".lstrip'
"a\x80bc"   # This one is okay because the broken code point appears after a non-whitespace code point.
```

Things get a lot messier with `String#rstrip`, however. Depending on context, `rstrip` may raise an exception, treat the broken code point as a non-whitespace boundary and stop processing, or treat the broken code point as if it were whitespace and remove it.

`String#rstrip` will ignore the invalid code point if it immediately follows a non-whitespace code point:

```
> ruby -v
ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [arm64-darwin21]

> ruby -e 'p "abc\x80 ".rstrip'
"abc\x80"

> ruby -e 'p "abc\x80".rstrip'
"abc\x80"
```

`String#rstrip` will remove the invalid code point if it is preceded by whitespace, whether or not whitespace also follows it:

```
> ruby -v
ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [arm64-darwin21]

> ruby -e 'p "abc \x80".rstrip'
"abc"

> ruby -e 'p "abc \x80 ".rstrip'
"abc"

> ruby -e 'p " \x80 ".rstrip'
""
```

`String#rstrip` will raise an exception if no valid, non-whitespace code points appear before the invalid code point:

```
> ruby -v
ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [arm64-darwin21]

> ruby -e 'p "\x80 ".rstrip'
-e:1:in `rstrip': invalid byte sequence in UTF-8 (ArgumentError)
	from -e:1:in `<main>'

> ruby -e 'p "\x80".rstrip'
-e:1:in `rstrip': invalid byte sequence in UTF-8 (ArgumentError)
	from -e:1:in `<main>'
```

It looks to me like the current behavior is a byproduct of the functions chosen for finding code point boundaries, rather than something deliberately chosen. E.g., `rb_str_lstrip` will call `rb_enc_codepoint_len`, which raises on invalid code points, while `rb_str_rstrip` calls `rb_enc_prev_char`, which doesn't perform the same code point validation. I think it'd make for a better user experience if `lstrip` and `rstrip` behaved consistently with each other, which would also unify the differing behaviors within `rstrip`. What that behavior should be needs to be decided, and I'm hoping to reach consensus on the semantics in this issue.
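To make the options concrete, here is a rough pure-Ruby sketch of what two of the three candidate semantics could look like. It is not how `rb_str_lstrip` or `rb_str_rstrip` are implemented. The method names are hypothetical, the whitespace set (NUL plus ASCII whitespace) is an assumption, and the byte-wise approach is only safe for encodings such as UTF-8 where ASCII bytes never occur inside a multibyte sequence.

```ruby
# Assumed whitespace set: NUL, tab, LF, VT, FF, CR, and space.
LEADING_WS  = /\A[\x00\x09-\x0d\x20]+/n
TRAILING_WS = /[\x00\x09-\x0d\x20]+\z/n

# Option 1 (ignore): treat an invalid byte as a non-whitespace boundary.
# Works on a binary copy so that no code point validation takes place.
def lenient_lstrip(str)
  str.b.sub(LEADING_WS, "").force_encoding(str.encoding)
end

def lenient_rstrip(str)
  str.b.sub(TRAILING_WS, "").force_encoding(str.encoding)
end

# Option 3 (raise): reject any string with invalid bytes, wherever they are.
# Note this is stricter than the current lstrip, which accepts " a\x80bc".
def strict_lstrip(str)
  unless str.valid_encoding?
    raise ArgumentError, "invalid byte sequence in #{str.encoding}"
  end
  str.lstrip
end

p lenient_lstrip(" \x80abc")  # => "\x80abc"
p lenient_rstrip("abc \x80")  # => "abc \x80"
p lenient_rstrip("\x80 ")     # => "\x80"
```

Option 2 (strip the invalid code point) is not sketched here, because it first requires deciding how many bytes an invalid code point spans.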





