From: samuel@...
Date: 2018-04-30T01:24:32+00:00
Subject: [ruby-core:86768] [Ruby trunk Feature#13618] [PATCH] auto fiber schedule for rb_wait_for_single_fd and rb_waitpid

Issue #13618 has been updated by ioquatix (Samuel Williams).

> Using a background thread is your mistake.

Don't assume I made this design. It was made by other people; I merely tested it because I was interested in the performance overhead. And yes, there is significant overhead. Let's be generous: people who invested their time and effort to make such a thing for Ruby deserve our appreciation. Knowing that the path they chose to explore was not good is equally important.

> Multiple foreground threads safely use epoll_wait or kevent on the SAME epoll or kqueue fd. It's perfectly safe to do that.

Sure, that's reasonable. If you want to share those data structures across threads, you can dispatch your work in different threads too. I liked what you did with https://yhbt.net/yahns/yahns.txt and it's an interesting design.

The biggest single benefit of this design is that blocking operations in an individual "task" or "worker" won't block any other "task" or "worker", up to the limit of the thread pool you allocate, at which point things WILL start blocking. So you can't avoid blocking even with this design.

The major downside of such a design is that workers have to assume they could be running on different threads, so shared data structures need locking and will cause contention. In addition, the current state of the Ruby GIL means that any such design will generally have poor performance.
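To make the locking point concrete, here is a minimal illustration (my own sketch, not code from yahns): once workers may run on different threads, any shared state has to be guarded by a Mutex, and every update pays for lock acquisition.

```ruby
# Workers on different threads sharing a counter: without the Mutex the
# increments could interleave; with it, every worker contends for the lock.
counter = 0
lock = Mutex.new

threads = 8.times.map do
  Thread.new do
    10_000.times do
      lock.synchronize { counter += 1 } # the contention point all workers share
    end
  end
end
threads.each(&:join)

puts counter # => 80000
```

In the per-thread fiber model discussed below, this Mutex disappears entirely, because all workers that touch `counter` run on one thread.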
Here is an almost identical code path running, once with 8 forked processes and once with 8 threads, on Ruby 2.5:

```
> falcon serve --threaded
> wrk -t8 -c8 -d10 http://localhost:9292
Running 10s test @ http://localhost:9292
  8 threads and 8 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    54.67ms   25.39ms 189.02ms   72.29%
    Req/Sec    18.50      7.18    40.00     53.38%
  1483 requests in 10.04s, 174.88MB read
Requests/sec:    147.74
Transfer/sec:     17.42MB

> falcon serve --forked
> wrk -t8 -c8 -d10 http://localhost:9292
Running 10s test @ http://localhost:9292
  8 threads and 8 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    29.77ms   66.90ms 571.70ms   93.71%
    Req/Sec    71.50     19.54   128.00     83.42%
  5442 requests in 10.10s, 641.61MB read
Requests/sec:    538.90
Transfer/sec:     63.54MB
```

This test is actually against a fresh Rails website (Rails performance isn't great to begin with), on macOS, which has pretty bad IO performance. Running the same thing on Linux gives:

```
% falcon serve --threaded
% wrk -t8 -c8 -d10 http://localhost:9292
Running 10s test @ http://localhost:9292
  8 threads and 8 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    26.41ms   13.74ms 123.01ms   69.85%
    Req/Sec    38.53     11.26    80.00     63.38%
  3082 requests in 10.01s, 363.36MB read
Requests/sec:    307.99
Transfer/sec:     36.31MB

% falcon serve --forked
% wrk -t8 -c8 -d10 http://localhost:9292
Running 10s test @ http://localhost:9292
  8 threads and 8 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     9.78ms   24.91ms 309.70ms   97.59%
    Req/Sec   168.68     49.75   262.00     63.89%
  13203 requests in 10.02s, 1.52GB read
Requests/sec:   1318.05
Transfer/sec:    155.39MB
```

So, I think it's safe to say that in an end-to-end test, the GIL is a MAJOR performance issue. Feel free to correct me if you think I'm wrong. I'm sure this story is more complicated than the above benchmarks, but I felt it was a useful comparison.
Therefore, right now, for highly concurrent IO with Ruby, what you actually want is the following:

- One process per CPU core.
- One IO thread per process.
- Multiple fibers, one per worker.

Blocking operations that cause performance issues should use a thread pool. For things like launching an external process, or making a syscall, and waiting for it to finish, threads are ideal.

The major benefit of such a design is that individual fibers all run on the same thread. You ultimately have similar issues w.r.t. blocking as yahns. However, because all workers run concurrently on the same thread, you don't have any locking/concurrency/mutability issues. To me, this is a massive benefit, as it makes writing code with this model super easy.

> Typical reactor is not designed to handle that :P

Yes, but it's by design, not by accident. If you need to scale up, just fork more reactors. On the Linux desktop above, `async-http` can handle 100,000+ requests per second using 8 cores for trivial benchmarks. So, performance is something which can scale.

The next question, then, is design. There is some elegance in the design you propose. Your proposal requires some kind of "Task" or "Worker", which is a fiber that yields when IO would block and resumes when IO is ready. Based on what you've said, do you mind explaining whether the "Task" or "Worker" is resumed on the same thread or a different one? Do you maintain a thread pool? If it's always resumed on the same thread, how do you manage that? e.g. perhaps you can show me how the following would work:

```
Thread.new do
  Worker.new do
    # .. blocking IO
  end

  Worker.new do
    # .. blocking IO
  end

  # implicitly waits for all workers to complete?
end
```

If you follow this model, the thread must be calling into `epoll` or `kqueue` in order to resume work.
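For contrast, the per-thread reactor model I'm describing can be sketched in a few lines. This is my own minimal illustration, not async's actual API (the `Reactor` class and its method names are hypothetical): fibers park in a table keyed by IO, and a single select loop on the owning thread resumes them when their descriptor becomes readable, so a worker is always resumed on the thread it started on.

```ruby
# Minimal per-thread reactor sketch (hypothetical API, using IO.select in
# place of epoll/kqueue): parked fibers are resumed on this same thread.
class Reactor
  def initialize
    @waiting = {} # IO => fiber parked on it
  end

  # Called from inside a fiber: park until `io` is readable.
  def wait_readable(io)
    @waiting[io] = Fiber.current
    Fiber.yield
  end

  # Run on the owning thread until no fibers remain parked.
  def run
    until @waiting.empty?
      ready, = IO.select(@waiting.keys)
      ready.each { |io| @waiting.delete(io).resume }
    end
  end
end

r, w = IO.pipe
reactor = Reactor.new
result = nil

reader = Fiber.new do
  reactor.wait_readable(r) # yields; resumed later by the reactor, same thread
  result = r.gets
end

reader.resume  # runs the fiber until it parks on the pipe
w.puts "hello" # make the descriptor readable
reactor.run    # select wakes, resumes the fiber, which reads the line
puts result
```

Because the fiber and the select loop share one thread, `result` needs no lock; scaling comes from forking more processes, each with its own reactor.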
But based on what you've said, if you have several of the above threads running, and a thread invoking `epoll_wait` receives events belonging to a different thread, how does that work? Do you send the events to the other thread? If you do, what is the overhead? If you don't, do you move workers between threads? Then, why not consider a model similar to async, which uses per-thread reactors? The workers do not move between threads, and the reactor does not need to send events to other threads.

Thanks for your continued time and patience discussing these interesting issues.

----------------------------------------
Feature #13618: [PATCH] auto fiber schedule for rb_wait_for_single_fd and rb_waitpid
https://bugs.ruby-lang.org/issues/13618#change-71723

* Author: normalperson (Eric Wong)
* Status: Assigned
* Priority: Normal
* Assignee: normalperson (Eric Wong)
* Target version: 
----------------------------------------
```
auto fiber schedule for rb_wait_for_single_fd and rb_waitpid

Implement automatic Fiber yield and resume when running
rb_wait_for_single_fd and rb_waitpid.

The Ruby API changes for Fiber are named after existing Thread methods.

main Ruby API:

    Fiber#start -> enable auto-scheduling and run Fiber until it
                   automatically yields (due to EAGAIN/EWOULDBLOCK)

The following behave like their Thread counterparts:

    Fiber.start - Fiber.new + Fiber#start (prelude.rb)
    Fiber#join  - run internal scheduler until Fiber is terminated
    Fiber#value - ditto
    Fiber#run   - like Fiber#start (prelude.rb)

Right now, it takes over the rb_wait_for_single_fd() and rb_waitpid()
functions if the running Fiber is auto-enabled
(cont.c::rb_fiber_auto_sched_p).

Changes to existing functions are minimal.

New files (all new structs and relations should be documented):

    iom.h - internal API for the rest of RubyVM (incomplete?)
    iom_internal.h - internal header for iom_(select|epoll|kqueue).h
    iom_epoll.h - epoll-specific pieces
    iom_kqueue.h - kqueue-specific pieces
    iom_select.h - select-specific pieces
    iom_pingable_common.h - common code for iom_(epoll|kqueue).h
    iom_common.h - common footer for iom_(select|epoll|kqueue).h

Changes to existing data structures:

    rb_thread_t.afrunq - list of fibers to auto-resume
    rb_vm_t.iom - Ruby I/O Manager (rb_iom_t) :)

Besides rb_iom_t, all the new structs are stack-only and rely
extensively on ccan/list for branch-less, O(1) insert/delete. As usual,
understanding the data structures first should help you understand the
code.

Right now, I reuse some static functions in thread.c, so thread.c
includes iom_(select|epoll|kqueue).h.

TODO: hijack other blocking functions (IO.select, ...)

I am using "double" for timeout since it is more convenient for
arithmetic, like parts of thread.c. Most platforms have good FP, I
think. Also, all "blocking" functions (rb_iom_wait*) will have timeout
support.

./configure gains a new --with-iom=(select|epoll|kqueue) switch.

libkqueue:

libkqueue support is incomplete; corner cases are not handled well:

    1) multiple fibers waiting on the same FD
    2) waiting for both read and write events on the same FD

Bugfixes to libkqueue may be necessary to support all corner cases.
Supporting these corner cases for native kqueue was challenging, even.
See comments on iom_kqueue.h and iom_epoll.h for nuances.
Limitations

Test script I used to download a file from my server:
----8<---
require 'net/http'
require 'uri'
require 'digest/sha1'
require 'fiber'

url = 'http://80x24.org/git-i-forgot-to-pack/objects/pack/pack-97b25a76c03b489d4cbbd85b12d0e1ad28717e55.idx'
uri = URI(url)
use_ssl = "https" == uri.scheme
fibs = 10.times.map do
  Fiber.start do
    cur = Fiber.current.object_id
    # XXX getaddrinfo() and connect() are blocking
    # XXX resolv/replace + connect_nonblock
    Net::HTTP.start(uri.host, uri.port, use_ssl: use_ssl) do |http|
      req = Net::HTTP::Get.new(uri)
      http.request(req) do |res|
        dig = Digest::SHA1.new
        res.read_body do |buf|
          dig.update(buf)
          #warn "#{cur} #{buf.bytesize}\n"
        end
        warn "#{cur} #{dig.hexdigest}\n"
      end
    end
    warn "done\n"
    :done
  end
end

warn "joining #{Time.now}\n"
fibs[-1].join(4)
warn "joined #{Time.now}\n"
all = fibs.dup

warn "1 joined, wait for the rest\n"
until fibs.empty?
  fibs.each(&:join)
  fibs.keep_if(&:alive?)
  warn fibs.inspect
end
p all.map(&:value)

Fiber.new do
  puts 'HI'
end.run.join
```

---Files--------------------------------
0001-auto-fiber-schedule-for-rb_wait_for_single_fd-and-rb.patch (82.8 KB)

-- 
https://bugs.ruby-lang.org/
Unsubscribe: