[ruby-core:109264] [Ruby master Bug#18931] Inconsistent handling of invalid codepoints in String#lstrip and String#rstrip

From: "nirvdrum (Kevin Menard)" <noreply@...>
Date: 2022-07-20 16:25:04 UTC
List: ruby-core #109264
Issue #18931 has been reported by nirvdrum (Kevin Menard).

----------------------------------------
Bug #18931: Inconsistent handling of invalid codepoints in String#lstrip and String#rstrip
https://bugs.ruby-lang.org/issues/18931

* Author: nirvdrum (Kevin Menard)
* Status: Open
* Priority: Normal
* ruby -v: ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [arm64-darwin21]
* Backport: 2.7: UNKNOWN, 3.0: UNKNOWN, 3.1: UNKNOWN
----------------------------------------
When attempting to strip a string, there are three basic options when an invalid code point is encountered:

1) Ignore the code point
2) Strip the code point
3) Raise an exception
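
To make the three options concrete, here is a minimal pure-Ruby sketch of a left strip with an explicit policy. The helper `lstrip_with_policy`, the `POLICIES` and `WHITESPACE` constants, and the policy names are hypothetical and only illustrate the choices; they are not how `String#lstrip` is implemented, and the whitespace set is an assumption based on the documented one (null, tab, LF, VT, FF, CR, space).

```ruby
# Hypothetical illustration only; none of these names exist in Ruby itself.
POLICIES   = %i[ignore strip raise].freeze
WHITESPACE = ["\0", "\t", "\n", "\v", "\f", "\r", " "].freeze

def lstrip_with_policy(str, policy)
  raise ArgumentError, "unknown policy: #{policy.inspect}" unless POLICIES.include?(policy)

  offset = 0 # number of leading bytes to drop
  str.each_char do |c|
    if c.valid_encoding?
      # A valid non-whitespace character always ends the scan.
      break unless WHITESPACE.include?(c)
    elsif policy == :ignore
      # Option 1: treat the invalid sequence as a boundary and stop.
      break
    elsif policy == :raise
      # Option 3: refuse to continue.
      raise ArgumentError, "invalid byte sequence in #{str.encoding}"
    end
    # Whitespace, or an invalid sequence under option 2 (:strip): drop it.
    offset += c.bytesize
  end
  str.byteslice(offset, str.bytesize - offset)
end

p lstrip_with_policy(" \x80 abc", :ignore) # "\x80 abc"
p lstrip_with_policy(" \x80 abc", :strip)  # "abc"
```

Working on byte offsets keeps the result in the original encoding, including when everything is stripped.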

For background, Ruby does not consider the string's code range for `lstrip` or `rstrip`. It permits stripping strings with an `ENC_CODERANGE_BROKEN` code range so long as no invalid code points are encountered while performing the loop to remove whitespace. What it does when such a code point is encountered, however, is not consistent between `lstrip` and `rstrip`.

`String#lstrip` will unconditionally raise an invalid byte sequence error:

```
> ruby -v
ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [arm64-darwin21]

> ruby -e 'p " \x80abc".lstrip'
-e:1:in `lstrip': invalid byte sequence in UTF-8 (ArgumentError)
	from -e:1:in `<main>'

> ruby -e 'p " \x80 abc".lstrip'
-e:1:in `lstrip': invalid byte sequence in UTF-8 (ArgumentError)
	from -e:1:in `<main>'

> ruby -e 'p "\x80 abc".lstrip'
-e:1:in `lstrip': invalid byte sequence in UTF-8 (ArgumentError)
	from -e:1:in `<main>'

> ruby -e 'p "\x80".lstrip'
-e:1:in `lstrip': invalid byte sequence in UTF-8 (ArgumentError)
	from -e:1:in `<main>'

> ruby -e ' p " a\x80bc".lstrip'
"a\x80bc"   # This one is okay because the broken code point appears after a non-whitespace code point.
```

Things get a lot messier with `String#rstrip`, however. Depending on context, `rstrip` may raise an exception, treat the broken code point as a non-whitespace boundary and stop processing, or treat the broken code point as if it were whitespace and remove it.

`String#rstrip` will ignore the invalid code point if it immediately follows a non-whitespace code point:

```
> ruby -v
ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [arm64-darwin21]

> ruby -e 'p "abc\x80 ".rstrip'
"abc\x80"

> ruby -e 'p "abc\x80".rstrip'
"abc\x80"
```

`String#rstrip` will remove the invalid code point if it is surrounded by whitespace:

```
> ruby -v
ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [arm64-darwin21]

> ruby -e 'p "abc \x80".rstrip'
"abc"

> ruby -e 'p "abc \x80 ".rstrip'
"abc"

> ruby -e 'p " \x80 ".rstrip'
""
```

`String#rstrip` will raise an exception if no valid, non-whitespace code points appear before it:

```
> ruby -v
ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [arm64-darwin21]

> ruby -e 'p "\x80 ".rstrip'
-e:1:in `rstrip': invalid byte sequence in UTF-8 (ArgumentError)
	from -e:1:in `<main>'

> ruby -e 'p "\x80".rstrip'
-e:1:in `rstrip': invalid byte sequence in UTF-8 (ArgumentError)
	from -e:1:in `<main>'
```

It looks to me like the current behavior is a byproduct of the functions chosen for finding code point boundaries, rather than something deliberately chosen. E.g., `rb_str_lstrip` calls `rb_enc_codepoint_len`, which raises on invalid code points, while `rb_str_rstrip` calls `rb_enc_prev_char`, which doesn't perform the same code point validation. I think it'd make for a better user experience if `lstrip` and `rstrip` behaved consistently with each other, which would also unify the behavior within `rstrip`. What that behavior should be needs to be decided, and I'm hoping to reach consensus on the semantics in this issue.
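
For comparison, two of the candidate semantics can already be approximated from user code with existing methods, `String#valid_encoding?` and `String#scrub`. The sketch below is only an illustration of what "always raise" and "always strip" could look like; the helper names `strict_strip_side` and `lenient_strip_side` are hypothetical, and note that `scrub("")` removes every invalid sequence in the string, not just those at the edge being stripped.

```ruby
# Hypothetical helpers; they are not part of Ruby's String API.

# "Always raise": reject any string containing an invalid byte sequence.
def strict_strip_side(str, side)
  unless str.valid_encoding?
    raise ArgumentError, "invalid byte sequence in #{str.encoding}"
  end
  side == :left ? str.lstrip : str.rstrip
end

# "Always strip": drop invalid sequences first, then strip whitespace.
# Caveat: this also removes invalid bytes in the middle of the string.
def lenient_strip_side(str, side)
  cleaned = str.scrub("")
  side == :left ? cleaned.lstrip : cleaned.rstrip
end

p lenient_strip_side("\x80 abc", :left)  # "abc"
p lenient_strip_side("abc \x80", :right) # "abc"
```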





-- 
https://bugs.ruby-lang.org/

Unsubscribe: <mailto:ruby-core-request@ruby-lang.org?subject=unsubscribe>
<http://lists.ruby-lang.org/cgi-bin/mailman/options/ruby-core>
