[#111565] [Ruby master Bug#19293] The new Time.new(String) API is nice... but we still need a stricter version of this — "matsuda (Akira Matsuda) via ruby-core" <ruby-core@...>

Issue #19293 has been reported by matsuda (Akira Matsuda).

8 messages 2023/01/01

[#111572] [Ruby master Bug#19297] Don't download content from internet to execute Ruby test suite — "vo.x (Vit Ondruch) via ruby-core" <ruby-core@...>

Issue #19297 has been reported by vo.x (Vit Ondruch).

12 messages 2023/01/02

[#111579] [Ruby master Feature#19300] Move public objects from Kernel to Object — "zverok (Victor Shepelev) via ruby-core" <ruby-core@...>

Issue #19300 has been reported by zverok (Victor Shepelev).

15 messages 2023/01/02

[#111581] [Ruby master Bug#19301] Fix Data class to report keyrest instead of rest parameters — "bkuhlmann (Brooke Kuhlmann) via ruby-core" <ruby-core@...>

Issue #19301 has been reported by bkuhlmann (Brooke Kuhlmann).

8 messages 2023/01/02

[#111604] [Ruby master Misc#19304] Kernel vs Object documentation — "zverok (Victor Shepelev) via ruby-core" <ruby-core@...>

Issue #19304 has been reported by zverok (Victor Shepelev).

8 messages 2023/01/03

[#111674] [Ruby master Feature#19314] String#bytesplice should support partial copy — "shugo (Shugo Maeda) via ruby-core" <ruby-core@...>

Issue #19314 has been reported by shugo (Shugo Maeda).

8 messages 2023/01/06

[#111678] [Ruby master Feature#19315] Lazy substrings in CRuby — "Eregon (Benoit Daloze) via ruby-core" <ruby-core@...>

Issue #19315 has been reported by Eregon (Benoit Daloze).

11 messages 2023/01/06

[#111693] [Ruby master Bug#19316] YJIT crash in 3.2.0 — "jdashton (J Daniel Ashton) via ruby-core" <ruby-core@...>

Issue #19316 has been reported by jdashton (J Daniel Ashton).

12 messages 2023/01/06

[#111696] [Ruby master Feature#19317] Unicode ICU Full case mapping — "noraj (Alexandre ZANNI) via ruby-core" <ruby-core@...>

Issue #19317 has been reported by noraj (Alexandre ZANNI).

7 messages 2023/01/06

[#111712] [Ruby master Feature#19322] Support spawning "private" child processes — "kjtsanaktsidis (KJ Tsanaktsidis) via ruby-core" <ruby-core@...>

Issue #19322 has been reported by kjtsanaktsidis (KJ Tsanaktsidis).

14 messages 2023/01/07

[#111739] [Ruby master Feature#19324] Enumerator.product => Enumerable#product — "zverok (Victor Shepelev) via ruby-core" <ruby-core@...>

Issue #19324 has been reported by zverok (Victor Shepelev).

18 messages 2023/01/08

[#111740] [Ruby master Bug#19325] Windows support lacking. — "dsisnero (Dominic Sisneros) via ruby-core" <ruby-core@...>

Issue #19325 has been reported by dsisnero (Dominic Sisneros).

11 messages 2023/01/08

[#111742] [Ruby master Feature#19326] Please add a better API for passing a Proc to a Ractor — sdwolfz via ruby-core <ruby-core@...>

Issue #19326 has been reported by sdwolfz.

13 messages 2023/01/08

[#111789] [Ruby master Feature#19333] Setting (Fiber Local|Thread Local|Fiber Storage) to nil should delete value in order to avoid memory leaks. — "ioquatix (Samuel Williams) via ruby-core" <ruby-core@...>

Issue #19333 has been reported by ioquatix (Samuel Williams).

10 messages 2023/01/11

[#111792] [Ruby master Bug#19334] Defining many instance variables and accessing them is slow in Ruby 3.2.0 — "mame (Yusuke Endoh) via ruby-core" <ruby-core@...>

Issue #19334 has been reported by mame (Yusuke Endoh).

12 messages 2023/01/12

[#111812] [Ruby master Bug#19340] Ruby master 'make install' not installing rbs gem — "MSP-Greg (Greg L) via ruby-core" <ruby-core@...>

Issue #19340 has been reported by MSP-Greg (Greg L).

8 messages 2023/01/14

[#111842] [Ruby master Feature#19347] Add Dir.fchdir — "jeremyevans0 (Jeremy Evans) via ruby-core" <ruby-core@...>

Issue #19347 has been reported by jeremyevans0 (Jeremy Evans).

9 messages 2023/01/16

[#111873] [Ruby master Bug#19351] Promote bundled gems at Ruby 3.3 — "hsbt (Hiroshi SHIBATA) via ruby-core" <ruby-core@...>

Issue #19351 has been reported by hsbt (Hiroshi SHIBATA).

26 messages 2023/01/18

[#111890] [Ruby master Bug#19352] Feature #17391 was not such a good idea because now Ruby 3.2 can not install Rails v5 or v6 if they use webpacker. — "Milella@... (Scott Milella) via ruby-core" <ruby-core@...>

Issue #19352 has been reported by Milella@Hotmail.com (Scott Milella).

16 messages 2023/01/19

[#111953] [Ruby master Bug#19362] #dup on Proc doesn't call initialize_dup — "zverok (Victor Shepelev) via ruby-core" <ruby-core@...>

Issue #19362 has been reported by zverok (Victor Shepelev).

8 messages 2023/01/21

[#111956] [Ruby master Bug#19363] Fix rb_transient_heap_mark: wrong header (T_STRUCT) segfault — "bkuhlmann (Brooke Kuhlmann) via ruby-core" <ruby-core@...>

Issue #19363 has been reported by bkuhlmann (Brooke Kuhlmann).

9 messages 2023/01/21

[#111988] [Ruby master Feature#19370] Anonymous parameters for blocks? — "zverok (Victor Shepelev) via ruby-core" <ruby-core@...>

Issue #19370 has been reported by zverok (Victor Shepelev).

10 messages 2023/01/23

[#112041] [Ruby master Feature#19377] Rename Fiber#storage to Fiber.storage — "zverok (Victor Shepelev) via ruby-core" <ruby-core@...>

Issue #19377 has been reported by zverok (Victor Shepelev).

8 messages 2023/01/25

[#112045] [Ruby master Bug#19378] Windows: Use less syscalls for faster require of big gems — "aidog (Andi Idogawa) via ruby-core" <ruby-core@...>

Issue #19378 has been reported by aidog (Andi Idogawa).

7 messages 2023/01/26

[#112048] [Ruby master Bug#19379] Regex: "end pattern with unmatched parenthesis" with Ruby 3.2 and interpolation — "renchap (Renaud Chaput) via ruby-core" <ruby-core@...>

Issue #19379 has been reported by renchap (Renaud Chaput).

8 messages 2023/01/26

[#112058] [Ruby master Bug#19383] Time.now.zone encoding for German display language in Windows is incorrect — "stringsn88keys (Thomas Powell) via ruby-core" <ruby-core@...>

Issue #19383 has been reported by stringsn88keys (Thomas Powell).

11 messages 2023/01/26

[#112072] [Ruby master Bug#19386] `test_hmac.rb` of openssl is timeout on RHEL9 — "hsbt (Hiroshi SHIBATA) via ruby-core" <ruby-core@...>

Issue #19386 has been reported by hsbt (Hiroshi SHIBATA).

14 messages 2023/01/27

[#112091] [Ruby master Bug#19387] Issue with ObjectSpace.each_objects not returning IO objects after starting a ractor — "luke-gru (Luke Gruber) via ruby-core" <ruby-core@...>

Issue #19387 has been reported by luke-gru (Luke Gruber).

9 messages 2023/01/27

[#112119] [Ruby master Bug#19392] Endless method vs and/or — "zverok (Victor Shepelev) via ruby-core" <ruby-core@...>

Issue #19392 has been reported by zverok (Victor Shepelev).

20 messages 2023/01/30

[#112146] [Ruby master Bug#19394] cvars in instance of cloned class point to source class's cvars even after class_variable_set on clone — "jamescdavis (James Davis) via ruby-core" <ruby-core@...>

Issue #19394 has been reported by jamescdavis (James Davis).

8 messages 2023/01/31

[ruby-core:111697] [Ruby master Feature#18949] Deprecate and remove replicate and dummy encodings

From: "Eregon (Benoit Daloze) via ruby-core" <ruby-core@...>
Date: 2023-01-06 15:19:16 UTC
List: ruby-core #111697
Issue #18949 has been updated by Eregon (Benoit Daloze).



Target version set to 3.3



This is all done now; only https://github.com/ruby/ruby/pull/7079 is left, and I'll merge that when it passes CI.



Overall:

* We deprecated and removed `Encoding#replicate`

* We removed `get_actual_encoding()`

* We limited the number of encodings to 256 and kept `rb_define_dummy_encoding()` with that constraint.

* There is a single flat array to look up encodings, so `rb_enc_from_index()` is fast now.



Since the limit is 256 and not 128, though, `ENCODING_GET` is not just `RB_ENCODING_GET_INLINED`; it still has the check and the slow fallback.



Thank you for the discussion, and thank you @ko1 for implementing the fixed-size table. Let's close this.

Of course for all builtin encodings the cost is just the extra check.

Maybe the limit could be changed later to 128 if this optimization is wanted.



----------------------------------------

Feature #18949: Deprecate and remove replicate and dummy encodings

https://bugs.ruby-lang.org/issues/18949#change-101095



* Author: Eregon (Benoit Daloze)

* Status: Open

* Priority: Normal

* Assignee: Eregon (Benoit Daloze)

* Target version: 3.3

----------------------------------------

Ruby has a lot of accidental complexity.

Sometimes it becomes clear that some features bring a lot of complexity and yet provide little value or are used very rarely.

Also most Ruby users do not even know about these features.

Replicate and dummy encodings seem to clearly fall into this category: almost nobody uses them, but they add significant complexity and significant performance overhead.

Notably, the existence of those means the number of encodings in a Ruby runtime is actually variable and not fixed.

That means extra synchronization, hashtable lookups, indirections, function calls, etc.



## Replicate Encodings



Replicate encodings are created using `Encoding#replicate(name)`.

It almost sounds like an alias, but in fact it is more than that: it creates a new Encoding object, which can be used by a String:

```ruby
e = Encoding::US_ASCII.replicate('MY-US-ASCII')
s = "abc".force_encoding(e)
p s.encoding # => e
p s.encoding.name # => 'MY-US-ASCII'
```



This seems completely useless.

There is an obvious first step here, which is to change `Encoding#replicate` to return the receiver, and just install an alias for it.

That avoids creating more encoding instances needlessly.



I think we should also deprecate and remove this method though; it is never a good idea to have a global mutable map like this.

If someone wants extra aliases for encodings, they can easily do so by having their own Hash: `{ alias => encoding }.fetch(name) { Encoding.find(name) }`.
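As a minimal sketch of the Hash-based approach suggested above: the constant `ENCODING_ALIASES` and the helper `find_encoding` are hypothetical names chosen for illustration, not part of Ruby. An application could keep its own alias table and fall back to `Encoding.find` for everything else:

```ruby
# Application-local alias table (hypothetical name); nothing global is mutated.
ENCODING_ALIASES = {
  "MY-US-ASCII" => Encoding::US_ASCII,
}.freeze

# Resolve a name through the local aliases first, then through Ruby's registry.
def find_encoding(name)
  ENCODING_ALIASES.fetch(name) { Encoding.find(name) }
end

p find_encoding("MY-US-ASCII") # alias resolved locally
p find_encoding("UTF-8")       # falls back to Encoding.find
```

Because the alias maps to an existing Encoding object, no new Encoding instance is created and the set of encodings in the runtime stays unchanged.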



## Dummy Encodings



Dummy encodings are not real encodings. They are artificial encodings designed to look like encodings, but don't function as encodings in Ruby.

From the docs:

```

enc.dummy? -> true or false

------------------------------------------------------------------------

Returns true for dummy encodings. A dummy encoding is an encoding for

which character handling is not properly implemented. It is used for

stateful encodings.

```



I wonder why we have those half-implemented encodings in core; it sounds to me like unfinished work which should not have been merged.



The "codepoints" of dummy encodings are just "bytes" and so they behave the same as `Encoding::BINARY`, with the exception of the UTF-16 and UTF-32 dummy encodings.



### UTF-16 and UTF-32 dummy encodings



These two are special dummy encodings.

What they do is scan the first 2 or 4 bytes of the String, and if those bytes are a byte-order mark (BOM), the true "actual" encoding is resolved to UTF-16BE/UTF-16LE or UTF-32BE/UTF-32LE.

Otherwise, `Encoding::BINARY` is returned.

This logic is done by `get_actual_encoding()`.



What is weird is that this check is not done on String creation; it is done *every time* the encoding of that String is accessed (and the result is not stored on the String).

That is a needless overhead and really unreliable semantics.

Do we really want a String which automagically changes between UTF-16LE and UTF-16BE based on mutating its bytes? I think nobody wants that:

```ruby
s = "\xFF\xFEa\x00b\x00c\x00d\x00".force_encoding("UTF-16")
p s # => "\uFEFFabcd"
s.setbyte 0, 254
s.setbyte 1, 255
p s # => "\uFEFF\u6100\u6200\u6300\u6400"
```



I think the path is clear: we should deprecate and then remove Encoding::UTF_16 and Encoding::UTF_32 (dummy encodings).

And then we no longer need `get_actual_encoding()` and the overhead it adds to every String method.



We could also keep those constants and make them refer to the native-endian UTF-16/32.

But that could cause confusing errors, as we would change their meaning.

We could add `Encoding::UTF_16NE` / `Encoding::UTF_16_NATIVE_ENDIAN` if that is useful.



Another possibility would be to resolve these encodings on String creation, like:

```
"\xFF\xFE".force_encoding("UTF-16").encoding # => UTF-16LE
String.new("\xFF\xFE", encoding: Encoding::UTF_16).encoding # => UTF-16LE
"ab".force_encoding("UTF-16").encoding # exception, not a BOM
String.new("ab", encoding: Encoding::UTF_16).encoding # exception, not a BOM
```

I think it is unnecessary to keep such complexity though.

A class method on String or Encoding, e.g. `Encoding.find_from_bom(string)`, is much clearer and more efficient (no need to special-case those encodings in String.new, #force_encoding, etc).
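To illustrate what such an explicit lookup might look like, here is a rough sketch in plain Ruby. `Encoding.find_from_bom` is only a proposed name in this issue, so the sketch uses a standalone, hypothetical `encoding_from_bom` helper and a hypothetical `BOM_ENCODINGS` table instead, and it returns `nil` when no BOM is present (raising would be an equally plausible design):

```ruby
# BOM table, longest marks first so the UTF-32LE BOM (FF FE 00 00) is not
# mistaken for the UTF-16LE BOM (FF FE). Names here are hypothetical.
BOM_ENCODINGS = [
  [[0x00, 0x00, 0xFE, 0xFF], Encoding::UTF_32BE],
  [[0xFF, 0xFE, 0x00, 0x00], Encoding::UTF_32LE],
  [[0xFE, 0xFF], Encoding::UTF_16BE],
  [[0xFF, 0xFE], Encoding::UTF_16LE],
].freeze

# Inspect only the leading bytes once; the caller decides what to do next.
def encoding_from_bom(string)
  BOM_ENCODINGS.each do |bom, encoding|
    return encoding if string.byteslice(0, bom.size).bytes == bom
  end
  nil
end

s = [0xFF, 0xFE, 0x61, 0x00].pack("C*") # BOM followed by "a" in UTF-16LE
enc = encoding_from_bom(s)
s.force_encoding(enc) if enc
p s.encoding
```

The check runs once, at a point the caller chooses, and the result is stored on the String through `force_encoding`. One known ambiguity of this sketch: UTF-16LE data that starts with a BOM followed by U+0000 would be reported as UTF-32LE.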



FWIW JRuby seems to use `getActualEncoding()` only in 2 places (scanForCodeRange, inspect), which is an indication those dummy UTF encodings are barely used if ever. Similarly, TruffleRuby only has 4 usages of `GetActualEncodingNode`.



### Existing dummy encodings



```
> Encoding.list.select(&:dummy?)
[#<Encoding:UTF-16 (dummy)>,  #<Encoding:UTF-32 (dummy)>,
 #<Encoding:IBM037 (dummy)>, #<Encoding:UTF-7 (dummy)>,
 #<Encoding:ISO-2022-JP (dummy)>, #<Encoding:ISO-2022-JP-2 (dummy)>, #<Encoding:ISO-2022-JP-KDDI (dummy)>,
 #<Encoding:CP50220 (dummy)>, #<Encoding:CP50221 (dummy)>]
```



So besides UTF-16/UTF-32 dummy, it's only 7 encodings.

Does anyone use one of these 7 dummy encodings?



What is interesting to note is that these encodings are exactly the ones that are also not ASCII-compatible, with the exception of UTF-16BE/UTF-16LE/UTF-32BE/UTF-32LE (non-dummy).

As a note, UTF-{16,32}{BE,LE} are ASCII-compatible in codepoints but not in bytes, and Ruby uses the bytes definition of ASCII-compatible.

There is potential to simplify encoding compatibility rules and encoding compatibility checks based on that.

So what this means is that if we removed dummy encodings, all encodings except UTF-{16,32}{BE,LE} would be ASCII-compatible, which would lead to significant simplifications for many string operations which currently need to handle dummy encodings specially.

Unicode encodings like UTF-{16,32}{BE,LE} already have special behavior for some Ruby methods, so those are already handled specially in some places (they are the only encodings with minLength > 1).



```
> Encoding.list.reject(&:ascii_compatible?)
[#<Encoding:UTF-16BE>, #<Encoding:UTF-16LE>,
 #<Encoding:UTF-32BE>, #<Encoding:UTF-32LE>,
 #<Encoding:UTF-16 (dummy)>, #<Encoding:UTF-32 (dummy)>,
 #<Encoding:IBM037 (dummy)>, #<Encoding:UTF-7 (dummy)>,
 #<Encoding:ISO-2022-JP (dummy)>, #<Encoding:ISO-2022-JP-2 (dummy)>, #<Encoding:ISO-2022-JP-KDDI (dummy)>,
 #<Encoding:CP50220 (dummy)>, #<Encoding:CP50221 (dummy)>]
```
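The difference between the codepoint and byte definitions of ASCII-compatible noted above can be seen with a one-character string. This is only a small illustration using standard String and Encoding methods:

```ruby
s = "A".encode("UTF-16LE")

p s.codepoints                         # => [65]    same codepoint as ASCII "A"
p s.bytes                              # => [65, 0] but two bytes, not the single ASCII byte
p Encoding::UTF_16LE.ascii_compatible? # => false   Ruby's definition is byte-based
p Encoding::UTF_8.ascii_compatible?    # => true
```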



What can we do with such a dummy non-ASCII-compatible encoding?

Almost nothing useful:

```ruby
> s = "abc".encode("IBM037")
=> "\x81\x82\x83"
> s.bytes
=> [129, 130, 131]
> s.codepoints
=> [129, 130, 131]
> s == "abc"
=> false
> "été".encode("IBM037")
=> "\x51\xA3\x51"
```



So about the only thing that works with them is `String#encode`.



I think we could preserve that functionality, if actually used (does anyone use one of these 7 dummy encodings?), through:

```ruby
> "été".encode("IBM037")
=> "\x51\xA3\x51" (.encoding == BINARY)
> "\x51\xA3\x51".encode("UTF-8", "IBM037") # encode from IBM037 to UTF-8
=> "été" (.encoding == UTF-8)
```



That way there is no need for those to be Encoding instances; we would only need the conversion tables.
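The proposed behavior can be approximated on a Ruby where IBM037 is still an Encoding by discarding the dummy encoding with `String#b` right after transcoding. This is only a sketch of the idea, assuming the IBM037 transcoder is installed; it is not how the proposal would be implemented:

```ruby
# Encode to IBM037, then drop the dummy Encoding and keep only the raw bytes.
raw = "abc".encode("IBM037").b

p raw.encoding # the bytes are now tagged BINARY (ASCII-8BIT)
p raw.bytes

# Decode by naming the source encoding explicitly, so the String itself
# never has to carry a dummy Encoding.
text = raw.encode("UTF-8", "IBM037")
p text
p text.encoding
```

Here the String only ever holds BINARY or UTF-8; the name `"IBM037"` is used purely to select a conversion.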



It is even better if we can remove them, so the notion of "dummy encodings" can disappear completely and nobody needs to understand or implement them.



### rb_define_dummy_encoding(name)



The C-API has `rb_define_dummy_encoding(const char *name)`.

This creates a new Encoding instance with `dummy?=true`, and it is also non-ASCII-compatible.

There seems to be no purpose to this besides storing the metadata of an encoding which does not exist in Ruby.

This seems a really expensive/complex way to handle that from the VM point of view (because it dynamically creates an Encoding and adds it to lists/maps/etc).

A simple replacement would be to mark the String as BINARY and save the encoding name as an instance variable of that String.

Since Ruby can't understand anything about that String anyway, it's just raw bytes in Ruby's eyes.
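The replacement described above is phrased in terms of the C-API, but the same idea can be sketched at the Ruby level. The helper name `tag_foreign_bytes`, the instance variable `@foreign_encoding_name`, and the encoding name used below are all hypothetical:

```ruby
# Hypothetical helper: keep the bytes as BINARY and remember the foreign
# encoding's name on the String itself instead of defining a dummy Encoding.
def tag_foreign_bytes(bytes, encoding_name)
  str = bytes.b # new, unfrozen BINARY copy of the input
  str.instance_variable_set(:@foreign_encoding_name, encoding_name)
  str
end

s = tag_foreign_bytes("\x01\x02\x03", "X-VENDOR-ENCODING")
p s.encoding
p s.instance_variable_get(:@foreign_encoding_name)
```

No Encoding object is created, so the global encoding table is untouched. A limitation of this sketch is that the name lives on one String object only; code that derives new strings from it (for example by slicing) would have to carry the name over itself.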



## Summary



I suggest we deprecate replicate and dummy encodings in Ruby 3.2.

And then we remove them in the next version.



This will significantly simplify string-related methods, and the behavior exposed to Ruby users.



It will also significantly speed up encoding lookup in CRuby (and other Ruby implementations).

With a fixed number of encodings we can ensure all encoding indices fit in 7 bits, and `ENCODING_GET` can be simply `RB_ENCODING_GET_INLINED`.

`get_actual_encoding()` will be gone and its overhead as well.

`rb_enc_from_index()` would be just `return global_enc_table->list[index].enc;`, instead of the expensive behavior currently with `GLOBAL_ENC_TABLE_EVAL` which takes a lock and more when there are multiple Ractors.

Many checks in these methods would be removed as well.

Yet another improvement would be to load all encodings eagerly; that is small and fast in my experience (what is slow and big is the conversion tables), and it would simplify `must_encindex()` further.

These changes would affect most String methods, which use

```
STR_ENC_GET->get_encoding which does:
  get_actual_encoding->rb_enc_from_index and possibly ->enc_from_index
  ENCODING_GET->RB_ENCODING_GET_INLINED and possibly ->rb_enc_get_index->enc_get_index_str->rb_attr_get
```

Some of these details are mentioned in https://github.com/ruby/ruby/pull/6095#discussion_r915149708.

The overhead is so large that it is worth handling some hardcoded encoding indices directly in String methods.

This feels wrong; getting the encoding from a String should be simple, straightforward and fast.



Further optimizations will be unlocked as the encoding list becomes fixed and immutable.

For example, the name-to-Encoding map is then immutable and could use perfect hashing.

Inline caching those lookups also becomes easier as the map cannot change.

Also that map would no longer need synchronization, etc.



## To Decide



Each item is independent. I think 1 & 2 are very important; 3 is less so, but would be nice.



1. Deprecate and then remove `Encoding#replicate` and `rb_define_dummy_encoding()`. With that there is a fixed number of encodings, and a lot of simplifications and many optimizations become available. They are used respectively in only 1 gem and 5 gems, see https://bugs.ruby-lang.org/issues/18949#note-4

2. Deprecate and then remove the dummy UTF-16 and UTF-32 encodings. This removes the need for `get_actual_encoding()` which is expensive. This functionality seems rarely used in practice, and it only works when such strings have a BOM, which is very rare.

3. Deprecate and then remove other dummy encodings, so there are no more dummy "half-implemented" encodings and all encodings are ASCII-compatible in terms of codepoints.







-- 

https://bugs.ruby-lang.org/

 ______________________________________________
 ruby-core mailing list -- ruby-core@ml.ruby-lang.org
 To unsubscribe send an email to ruby-core-leave@ml.ruby-lang.org
 ruby-core info -- https://ml.ruby-lang.org/mailman3/postorius/lists/ruby-core.ml.ruby-lang.org/
