[ruby-core:109265] [Ruby master Bug#18931] Inconsistent handling of invalid codepoints in String#lstrip and String#rstrip

From: "nirvdrum (Kevin Menard)" <noreply@...>
Date: 2022-07-20 16:38:37 UTC
List: ruby-core #109265
Issue #18931 has been updated by nirvdrum (Kevin Menard).


My own take on the three options, in no particular order:

**Ignore the code point**

The documentation for `lstrip` is "Returns a copy of the receiver with leading whitespace removed." It seems fairly straightforward and there's no mention of string validation; raising an exception might violate user expectations. 
Treating broken code points the same as any other non-whitespace code point would be logically consistent. An additional benefit is the method could be implemented more efficiently as the whitespace check can be done without calculating code point boundaries. Only ASCII whitespace code points are stripped and those, by definition, are only one byte wide. However, if `lstrip` and `rstrip` ever evolve to handle non-ASCII whitespace we'll be back to calculating code point boundaries.
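
To make the "ignore" option concrete, here is a minimal sketch of a byte-wise left strip. The method name `lstrip_ignoring_invalid` and the constant `ASCII_WHITESPACE_BYTES` are hypothetical. The whitespace set is deliberately simplified and the approach assumes an ASCII-compatible encoding such as UTF-8. It is not how `rb_str_lstrip` is implemented.

```ruby
# Simplified ASCII whitespace set, chosen for illustration only.
ASCII_WHITESPACE_BYTES = "\t\n\v\f\r ".bytes.freeze

# Hypothetical helper: strips leading ASCII whitespace one byte at a time.
# Any other byte, including an invalid one such as \x80, ends the scan,
# so no code point boundaries are ever computed.
def lstrip_ignoring_invalid(str)
  offset = 0
  offset += 1 while offset < str.bytesize &&
                    ASCII_WHITESPACE_BYTES.include?(str.getbyte(offset))
  str.byteslice(offset, str.bytesize - offset)
end

p lstrip_ignoring_invalid(" \x80abc")
p lstrip_ignoring_invalid("  abc")
```

Because every stripped byte is a complete one-byte code point, the scan never lands in the middle of a multi-byte sequence in UTF-8.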


**Strip the code point**

Despite `rstrip` doing it in some cases, I don't think removing invalid code points is what an end user would expect, and it runs counter to the method's documentation.

**Raise an exception**

Given `lstrip`'s behavior, raising in all cases would be the most backward-compatible and is consistent with equivalent expressions (e.g., `" \x80 abc".sub(/^\s+/, "")` will raise an error on the invalid byte sequence). While the documentation makes no mention of string validation, encountering an invalid code point is arguably an exceptional condition.
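
One way to picture "raising in all cases" is a wrapper that validates the whole string up front. The name `strict_lstrip` is hypothetical, and this sketch is stricter than today's `lstrip`, since it also rejects strings whose broken code point sits after the first non-whitespace character.

```ruby
# Hypothetical helper: refuses to strip any string with an invalid byte
# sequence, wherever that sequence appears.
def strict_lstrip(str)
  unless str.valid_encoding?
    raise ArgumentError, "invalid byte sequence in #{str.encoding}"
  end
  str.lstrip
end

p strict_lstrip("  abc")

begin
  strict_lstrip(" a\x80bc")
rescue ArgumentError => e
  puts "raised: #{e.message}"
end
```

Whether validation should cover the whole string or only the bytes the strip loop actually visits is one of the semantics questions this issue would need to settle.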

----------------------------------------
Bug #18931: Inconsistent handling of invalid codepoints in String#lstrip and String#rstrip
https://bugs.ruby-lang.org/issues/18931#change-98397

* Author: nirvdrum (Kevin Menard)
* Status: Open
* Priority: Normal
* ruby -v: ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [arm64-darwin21]
* Backport: 2.7: UNKNOWN, 3.0: UNKNOWN, 3.1: UNKNOWN
----------------------------------------
When attempting to strip a string, there are three basic options when an invalid code point is encountered:

1) Ignore the code point
2) Strip the code point
3) Raise an exception

For background, Ruby does not consider the string's code range for `lstrip` or `rstrip`. It permits stripping strings with an `ENC_CODERANGE_BROKEN` code range so long as no invalid code points are encountered while performing the loop to remove whitespace. What it does when such a code point is encountered, however, is not consistent between `lstrip` and `rstrip`.
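
From Ruby code, a broken code range is observable through `String#valid_encoding?`. The sketch below uses a hypothetical helper named `broken?` to show which of the sample strings in this report fall into that category. It assumes the script is read as UTF-8, which is Ruby's default source encoding.

```ruby
# Hypothetical helper: true when the string contains an invalid byte
# sequence for its encoding (the Ruby-level view of a broken code range).
def broken?(str)
  !str.valid_encoding?
end

[" abc", " \x80abc", " a\x80bc", "abc \x80 "].each do |sample|
  puts "#{sample.inspect}: broken=#{broken?(sample)}"
end
```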

`String#lstrip` will raise an invalid byte sequence error whenever it reaches an invalid code point while scanning the leading whitespace:

```
> ruby -v
ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [arm64-darwin21]

> ruby -e 'p " \x80abc".lstrip'
-e:1:in `lstrip': invalid byte sequence in UTF-8 (ArgumentError)
	from -e:1:in `<main>'

> ruby -e 'p " \x80 abc".lstrip'
-e:1:in `lstrip': invalid byte sequence in UTF-8 (ArgumentError)
	from -e:1:in `<main>'

> ruby -e 'p "\x80 abc".lstrip'
-e:1:in `lstrip': invalid byte sequence in UTF-8 (ArgumentError)
	from -e:1:in `<main>'

> ruby -e 'p "\x80".lstrip'
-e:1:in `lstrip': invalid byte sequence in UTF-8 (ArgumentError)
	from -e:1:in `<main>'

> ruby -e ' p " a\x80bc".lstrip'
"a\x80bc"   # This one is okay because the broken code point appears after a non-whitespace code point.
```

Things get a lot messier with `String#rstrip`, however. Depending on context, `rstrip` may raise an exception, treat the broken code point as a non-whitespace boundary and stop processing, or treat the broken code point as if it were whitespace and remove it.

`String#rstrip` will ignore the invalid code point if it immediately follows a non-whitespace code point:

```
> ruby -v
ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [arm64-darwin21]

> ruby -e 'p "abc\x80 ".rstrip'
"abc\x80"

> ruby -e 'p "abc\x80".rstrip'
"abc\x80"
```

`String#rstrip` will remove the invalid code point if it is surrounded by whitespace:

```
> ruby -v
ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [arm64-darwin21]

> ruby -e 'p "abc \x80".rstrip'
"abc"

> ruby -e 'p "abc \x80 ".rstrip'
"abc"

> ruby -e 'p " \x80 ".rstrip'
""
```

`String#rstrip` will raise an exception if no valid, non-whitespace code points appear before it:

```
> ruby -v
ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [arm64-darwin21]

> ruby -e 'p "\x80 ".rstrip'
-e:1:in `rstrip': invalid byte sequence in UTF-8 (ArgumentError)
	from -e:1:in `<main>'

> ruby -e 'p "\x80".rstrip'
-e:1:in `rstrip': invalid byte sequence in UTF-8 (ArgumentError)
	from -e:1:in `<main>'
```

It looks to me like the current behavior is a byproduct of the functions chosen for finding code point boundaries, rather than something deliberately chosen. E.g., `rb_str_lstrip` will call `rb_enc_codepoint_len`, which raises on invalid code points, while `rb_str_rstrip` calls `rb_enc_prev_char`, which doesn't perform the same code point validation. I think it'd make for a better user experience if `lstrip` and `rstrip` behaved consistently with each other, which would also make `rstrip` consistent with itself. What that behavior should be needs to be decided, and I'm hoping to reach consensus on the semantics in this issue.




