[ruby-core:117472] [Ruby master Feature#20415] Precompute literal String hash code during compilation
From:
"Eregon (Benoit Daloze) via ruby-core" <ruby-core@...>
Date:
2024-04-09 12:53:57 UTC
List:
ruby-core #117472
Issue #20415 has been updated by Eregon (Benoit Daloze).
FWIW TruffleRuby already does this, since frozen string literals need to be=
deduplicated, the hash needs to be computed, so might as well save it whil=
e doing so (the only downside being footprint).
```
truffleruby 24.0.0, like ruby 3.2.2, Oracle GraalVM Native [x86_64-linux]
Calculating -------------------------------------
symbol 107.376M (=B1 0.7%) i/s (9.31 ns/i) - 541.038=
M in 5.038971s
dyn_symbol 106.989M (=B1 0.7%) i/s (9.35 ns/i) - 543.771=
M in 5.082698s
small_lit 88.014M (=B1 0.6%) i/s (11.36 ns/i) - 442.996=
M in 5.033433s
frozen_lit 88.174M (=B1 0.3%) i/s (11.34 ns/i) - 444.293=
M in 5.038895s
Comparison:
symbol: 107376115.9 i/s
dyn_symbol: 106989494.3 i/s - same-ish: difference falls within e=
rror
frozen_lit: 88173794.6 i/s - 1.22x slower
small_lit: 88013579.7 i/s - 1.22x slower
```
Strings are still slower than Symbol keys, I suspect because `eql?` is quit=
e a bit more expensive for Strings.
Even if two Strings are interned it's not correct to compare them by identi=
ty, because they could still be `eql?` with the same bytes but different en=
codings. That case does not exist for Symbols.
----------------------------------------
Feature #20415: Precompute literal String hash code during compilation
https://bugs.ruby-lang.org/issues/20415#change-107860
* Author: byroot (Jean Boussier)
* Status: Open
----------------------------------------
I worked on a proof of concept with @etienne which I think has some potenti=
al, but I'm looking for feedback on what would be the best implementation.
The proof of concept is here: https://github.com/Shopify/ruby/pull/553
### Idea
Most string literals are relatively short, hence embedded, and have some wa=
sted bytes at the end of their slot. We could use that wasted space to stor=
e the string hash.
The goal being to make **looking up a literal String key in a hash, as fast=
as a Symbol key**. The goal isn't to memoize the hash code of all strings,=
but to **only selectively precompute the hash code of literal strings
in the compiler**. The compiler could even selectively do this when we lite=
ral string is used to lookup a hash (`opt_aref`).
Here's the benchmark we used:
```ruby
hash =3D 10.times.to_h do |i|
[i, i]
end
dyn_sym =3D "dynamic_symbol".to_sym
hash[:some_symbol] =3D 1
hash[dyn_sym] =3D 1
hash["small"] =3D 2
hash["frozen_string_literal"] =3D 2
Benchmark.ips do |x|
x.report("symbol") { hash[:some_symbol] }
x.report("dyn_symbol") { hash[:some_symbol] }
x.report("small_lit") { hash["small"] }
x.report("frozen_lit") { hash["frozen_string_literal"] }
x.compare!(order: :baseline)
end
```
On Ruby 3.3.0, looking up a String key is a bit slower based on the key siz=
e:
```
Calculating -------------------------------------
symbol 24.175M (=B1 1.7%) i/s - 122.002M in 5.048306s
dyn_symbol 24.345M (=B1 1.6%) i/s - 122.019M in 5.013400s
small_lit 21.252M (=B1 2.1%) i/s - 107.744M in 5.072042s
frozen_lit 20.095M (=B1 1.3%) i/s - 100.489M in 5.001681s
Comparison:
symbol: 24174848.1 i/s
dyn_symbol: 24345476.9 i/s - same-ish: difference falls within er=
ror
small_lit: 21252403.2 i/s - 1.14x slower
frozen_lit: 20094766.0 i/s - 1.20x slower
```
With the proof of concept performance is pretty much identical:
```
Calculating -------------------------------------
symbol 23.528M (=B1 6.9%) i/s - 117.584M in 5.033231s
dyn_symbol 23.777M (=B1 4.7%) i/s - 120.231M in 5.071734s
small_lit 23.066M (=B1 2.9%) i/s - 115.376M in 5.006947s
frozen_lit 22.729M (=B1 1.1%) i/s - 115.693M in 5.090700s
Comparison:
symbol: 23527823.6 i/s
dyn_symbol: 23776757.8 i/s - same-ish: difference falls within er=
ror
small_lit: 23065535.3 i/s - same-ish: difference falls within er=
ror
frozen_lit: 22729351.6 i/s - same-ish: difference falls within er=
ror
```
### Possible implementation
The reason I'm opening this issue early is to get feedback on which would b=
e the best implementation.
#### Store hashcode after the string terminator
Right now the proof of concept simply stores the `st_index_t` after the str=
ing null terminator, and only when the string is embedded and as enough lef=
t over space.
Strings with a precomputed hash are marked with an user flag.
Pros:
- Very simple implementation, no need to change a lot of code, and very e=
asy to strip out if we want to.
- Doesn't use any extra memory. If the string doesn't have enough left ov=
er bytes, the optimization simply isn't applied.
- The worst case overhead is a single `FL_TEST_RAW` in `rb_str_hash`.
Cons:
- The optimization won't apply to certain string sizes. e.g. strings betw=
een `17` and `23` bytes won't have a precomputed hash code.
- Extracting the hash code requires some not so nice pointer arithmetic.
#### Create another RString union
Another possibility would be to add another entry in the `RString` struct u=
nion, such as we'd have:
```c
struct RString {
struct RBasic basic;
long len;
union {
// ... existing members
struct {
st_index_t hash;
char ary[1];
} embded_literal;
} as;
};
```
Pros:
- The optimization can now be applied to all string sizes.
- The hashcode is always at the same offset and properly aligned.
Cons:
- Some strings would be bumped by one slot size, so would use marginally =
more memory.
- Complexify the code base more, need to modify a lot more string related=
code (e.g. `RSTRING_PTR` and many others)
- When compiling such string, if an equal string already exists in the `f=
string` table, we'd need to replace it, we can't just mutate it in place to=
add the hashcode.
### Prior art
[Feature #15331] is somewhat similar in its idea, but it does it lazily for=
all strings. Here it's much simpler because limited to string literals, wh=
ich are the ones likely to be used as Hash keys, and the overhead is on com=
pilation, not runtime (aside from a single flag check). So I think most of =
the caveats of that original implementation don't apply here.
--=20
https://bugs.ruby-lang.org/
______________________________________________
ruby-core mailing list -- ruby-core@ml.ruby-lang.org
To unsubscribe send an email to ruby-core-leave@ml.ruby-lang.org
ruby-core info -- https://ml.ruby-lang.org/mailman3/postorius/lists/ruby-c=
ore.ml.ruby-lang.org/