[ruby-core:114520] [Ruby master Bug#17263] Fiber context switch degrades with number of fibers, limit on number of fibers
From:
"ioquatix (Samuel Williams) via ruby-core" <ruby-core@...>
Date:
2023-08-25 00:13:51 UTC
List:
ruby-core #114520
Issue #17263 has been updated by ioquatix (Samuel Williams).
It looks like there are roughly 3 page faults per fiber: if I run `x` fibers, I get `3x` page faults. That is proportional to the number of fibers, but I'm not sure how expensive it is in practice.
The CPU time is also significant: for `x` fibers I see roughly `50000x` kernel-side CPU cycles, so there is certainly some overhead there. It's hard to separate setup cost from runtime cost, though.
```
> sudo perf stat -e page-faults,cpu-cycles:u,cpu-cycles:k (which ruby) fiber.rb 100000
fibers: 100000 count: 1000000 rate: 2993069.28

 Performance counter stats for '/home/samuel/.rubies/ruby-3.2.1/bin/ruby fiber.rb 100000':

           323,536      page-faults
     2,249,302,328      cpu-cycles:u
     4,642,691,199      cpu-cycles:k

       1.302578116 seconds time elapsed

       0.439950000 seconds user
       0.838409000 seconds sys

> sudo perf stat -e page-faults,cpu-cycles:u,cpu-cycles:k (which ruby) fiber.rb 1000000
fibers: 1000000 count: 1000000 rate: 3029874.53

 Performance counter stats for '/home/samuel/.rubies/ruby-3.2.1/bin/ruby fiber.rb 1000000':

         3,210,454      page-faults
     5,704,584,641      cpu-cycles:u
    49,187,058,607      cpu-cycles:k

      10.457917495 seconds time elapsed

       1.122311000 seconds user
       9.121725000 seconds sys
```
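A minimal sketch for reproducing the per-fiber fault count without perf, assuming Linux and the standard proc(5) layout (`min_flt` is the 10th field of `/proc/self/stat`):
```ruby
# Minor page faults incurred by this process so far (Linux only).
# min_flt is the 10th field of /proc/self/stat; see proc(5).
def minor_faults
  File.read("/proc/self/stat").split[9].to_i
end

n = 10_000
before = minor_faults
fibers = Array.new(n) { Fiber.new { Fiber.yield } }
fibers.each(&:resume) # the first resume allocates and touches each stack
puts "minor faults per fiber: #{(minor_faults - before).fdiv(n)}"
```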
----------------------------------------
Bug #17263: Fiber context switch degrades with number of fibers, limit on number of fibers
https://bugs.ruby-lang.org/issues/17263#change-104316
* Author: ciconia (Sharon Rosner)
* Status: Open
* Priority: Normal
* ruby -v: 2.7.1
* Backport: 2.5: UNKNOWN, 2.6: UNKNOWN, 2.7: UNKNOWN
----------------------------------------
I'm developing [Polyphony](https://github.com/digital-fabric/polyphony), a Ruby gem for writing
highly concurrent Ruby programs with fibers. In the course of my work I have
run into two problems with Ruby fibers:
1. Fiber context-switching performance seems to degrade as the number of fibers
increases. This happens both with `Fiber#transfer` and with
`Fiber#resume`/`Fiber.yield`.
2. The number of concurrent fibers that can exist at any time seems to be
limited. Once a certain number is reached (on my system this seems to be
31744 fibers), calling `Fiber#transfer` raises a `FiberError` with the
message `can't set a guard page: Cannot allocate memory`. This is not due
to RAM being saturated: with 10000 fibers, my test program hovers at around
150MB RSS (on Ruby 2.7.1). See the note after this list.
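The `can't set a guard page` message points at `mprotect` failing rather than at memory exhaustion. A plausible culprit (my inference, not something verified in this report) is the kernel's `vm.max_map_count` limit: on Linux each fiber stack typically costs two VM mappings (the stack plus its inaccessible guard page), so the default limit of 65530 caps a process at roughly 32k fibers, which matches the numbers above. A minimal check, assuming Linux:
```ruby
# Linux only: each fiber stack usually consumes two VM mappings
# (stack + guard page), so vm.max_map_count bounds the live fiber count.
limit = Integer(File.read("/proc/sys/vm/max_map_count"))
puts "max_map_count: #{limit} (~#{limit / 2} fibers before mprotect fails)"

# If this is the limit being hit, raising it (as root) should move the
# ceiling accordingly: sysctl -w vm.max_map_count=1000000
```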
Here's a program for testing the performance of `Fiber#transfer`:
```ruby
# frozen_string_literal: true

require 'fiber'

class Fiber
  attr_accessor :next
end

def run(num_fibers)
  count = 0
  GC.start
  GC.disable

  first = nil
  last = nil
  supervisor = Fiber.current

  # Build a ring of fibers, each holding a reference to the next one.
  num_fibers.times do
    fiber = Fiber.new do
      loop do
        count += 1
        if count == 1_000_000
          supervisor.transfer
        else
          Fiber.current.next.transfer
        end
      end
    end

    first ||= fiber
    last.next = fiber if last
    last = fiber
  end
  last.next = first

  t0 = Time.now
  first.transfer
  elapsed = Time.now - t0

  rss = `ps -o rss= -p #{Process.pid}`.to_i
  puts "fibers: #{num_fibers} rss: #{rss} count: #{count} rate: #{count / elapsed}"
rescue Exception => e
  puts "Stopped at #{count} fibers"
  p e
end

run(100)
run(1000)
run(10000)
run(100000)
```
With Ruby 2.6.5 I'm getting:
```
fibers: 100 rss: 23212 count: 1000000 rate: 3357675.1688139187
fibers: 1000 rss: 31292 count: 1000000 rate: 2455537.056439736
fibers: 10000 rss: 127388 count: 1000000 rate: 954251.1674325482
Stopped at 22718 fibers
#<FiberError: can't set a guard page: Cannot allocate memory>
```
With Ruby 2.7.1 I'm getting:
```
fibers: 100 rss: 23324 count: 1000000 rate: 3443916.967616508
fibers: 1000 rss: 34676 count: 1000000 rate: 2333315.3862491543
fibers: 10000 rss: 151364 count: 1000000 rate: 916772.1008060966
Stopped at 31744 fibers
#<FiberError: can't set a guard page: Cannot allocate memory>
```
With ruby-head I get an almost identical result to that of 2.7.1.
As you can see, the performance degradation is similar in all three versions
of Ruby, going from ~3.4M context switches per second with 100 fibers to less
than 1M context switches per second with 10000 fibers. Running with 100000
fibers fails to complete.
Here's a program for testing the performance of `Fiber#resume/Fiber.yield`:
```ruby
# frozen_string_literal: true

require 'fiber'

class Fiber
  attr_accessor :next
end

# This program shows how the performance of Fiber#resume/Fiber.yield
# degrades as the fiber count increases.
def run(num_fibers)
  count = 0
  GC.start
  GC.disable

  fibers = []
  num_fibers.times do
    fibers << Fiber.new { loop { Fiber.yield } }
  end

  t0 = Time.now

  while count < 1000000
    fibers.each do |f|
      count += 1
      f.resume
    end
  end

  elapsed = Time.now - t0

  puts "fibers: #{num_fibers} count: #{count} rate: #{count / elapsed}"
rescue Exception => e
  puts "Stopped at #{count} fibers"
  p e
end

run(100)
run(1000)
run(10000)
run(100000)
```
With Ruby 2.7.1 I'm getting the following output:
```
fibers: 100 count: 1000000 rate: 3048230.049946255
fibers: 1000 count: 1000000 rate: 2362235.6455160403
fibers: 10000 count: 1000000 rate: 950251.7621725246
Stopped at 21745 fibers
#<FiberError: can't set a guard page: Cannot allocate memory>
```
As I understand it, at least in theory, switching between fibers should have
a constant cost in terms of CPU cycles, irrespective of the number of fibers
currently existing in memory. I am completely ignorant of the implementation
details of Ruby fibers, so at least for now I have no idea where this
problem is coming from.
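One way to probe that intuition (a sketch of mine, not part of the report) is to run the same million switches between just two fibers, so both stacks stay hot in the CPU caches and TLB. Comparing this rate with the many-fiber runs above separates the O(1) switch itself from memory-hierarchy effects:
```ruby
# frozen_string_literal: true

require 'fiber'

main = Fiber.current
b = nil
a = Fiber.new do
  500_000.times { b.transfer } # each iteration is two switches (a->b->a)
  main.transfer
end
b = Fiber.new { loop { a.transfer } }

t0 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
a.transfer
elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - t0
puts "2 fibers, ~1M switches, rate: #{(1_000_000 / elapsed).round}/s"
```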
--
https://bugs.ruby-lang.org/