From: "nirvdrum (Kevin Menard)"
Date: 2022-07-20T16:25:04+00:00
Subject: [ruby-core:109264] [Ruby master Bug#18931] Inconsistent handling of invalid codepoints in String#lstrip and String#rstrip

Issue #18931 has been reported by nirvdrum (Kevin Menard).

----------------------------------------
Bug #18931: Inconsistent handling of invalid codepoints in String#lstrip and String#rstrip
https://bugs.ruby-lang.org/issues/18931

* Author: nirvdrum (Kevin Menard)
* Status: Open
* Priority: Normal
* ruby -v: ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [arm64-darwin21]
* Backport: 2.7: UNKNOWN, 3.0: UNKNOWN, 3.1: UNKNOWN
----------------------------------------
When attempting to strip a string, there are three basic options when an invalid code point is encountered:

1) Ignore the code point
2) Strip the code point
3) Raise an exception

For background, Ruby does not consider the string's code range for `lstrip` or `rstrip`. It permits stripping strings with an `ENC_CODERANGE_BROKEN` code range so long as no invalid code point is encountered while performing the loop to remove whitespace. What it does when such a code point is encountered, however, is not consistent between `lstrip` and `rstrip`.

`String#lstrip` will unconditionally raise an invalid byte sequence error:

```
> ruby -v
ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [arm64-darwin21]

> ruby -e 'p " \x80abc".lstrip'
-e:1:in `lstrip': invalid byte sequence in UTF-8 (ArgumentError)
	from -e:1:in `<main>'

> ruby -e 'p " \x80 abc".lstrip'
-e:1:in `lstrip': invalid byte sequence in UTF-8 (ArgumentError)
	from -e:1:in `<main>'

> ruby -e 'p "\x80 abc".lstrip'
-e:1:in `lstrip': invalid byte sequence in UTF-8 (ArgumentError)
	from -e:1:in `<main>'

> ruby -e 'p "\x80".lstrip'
-e:1:in `lstrip': invalid byte sequence in UTF-8 (ArgumentError)
	from -e:1:in `<main>'

> ruby -e ' p " a\x80bc".lstrip'
"a\x80bc"

# This one is okay because the broken code point appears after a non-whitespace code point.
```

Things get a lot messier with `String#rstrip`, however. Depending on context, `rstrip` may raise an exception, treat the broken code point as a non-whitespace boundary and stop processing, or treat the broken code point as if it were whitespace and remove it.

`String#rstrip` will ignore the invalid code point if it immediately follows a non-whitespace code point:

```
> ruby -v
ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [arm64-darwin21]

> ruby -e 'p "abc\x80 ".rstrip'
"abc\x80"

> ruby -e 'p "abc\x80".rstrip'
"abc\x80"
```

`String#rstrip` will remove the invalid code point if it is surrounded by whitespace:

```
> ruby -v
ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [arm64-darwin21]

> ruby -e 'p "abc \x80".rstrip'
"abc"

> ruby -e 'p "abc \x80 ".rstrip'
"abc"

> ruby -e 'p " \x80 ".rstrip'
""
```

`String#rstrip` will raise an exception if no valid, non-whitespace code points appear before it:

```
> ruby -v
ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [arm64-darwin21]

> ruby -e 'p "\x80 ".rstrip'
-e:1:in `rstrip': invalid byte sequence in UTF-8 (ArgumentError)
	from -e:1:in `<main>'

> ruby -e 'p "\x80".rstrip'
-e:1:in `rstrip': invalid byte sequence in UTF-8 (ArgumentError)
	from -e:1:in `<main>'
```

It looks to me like the current behavior is a byproduct of the functions chosen for finding code point boundaries, rather than something deliberately chosen. E.g., `rb_str_lstrip` will call `rb_enc_codepoint_len`, which raises on invalid code points, while `rb_str_rstrip` calls `rb_enc_prev_char`, which doesn't perform the same code point validation.

I think it'd make for a better user experience if `lstrip` and `rstrip` behaved consistently with each other, which would also unify the behavior within `rstrip`. What that behavior should be needs to be decided, and I'm hoping to reach consensus on the semantics in this issue.