[ruby-core:81664] [Ruby trunk Feature#12589] VM performance improvement proposal
From: vmakarov@...
Date: 2017-06-13 15:04:28 UTC
List: ruby-core #81664
Issue #12589 has been updated by vmakarov (Vladimir Makarov).
normalperson (Eric Wong) wrote:
> Eric Wong <normalperson@yhbt.net> wrote:
>
> Ah, I noticed you've removed "restrict" from your branch.
> Technically, wouldn't that be a regression from an optimization
> standpoint? (of course you know far more about compiler
> optimization than I).
>
It was just an attempt to achieve the desired aliasing, but that is hard to do.
There are too many VALUE * pointers in the MRI VM. Removing the restrict I added
does not worsen the code. Aliasing is a weak point of C, which is why many
HPC developers still prefer Fortran in many cases.
I think changing the type of pc might be more productive for achieving
the necessary aliasing.
> > Perhaps -Werror=incompatible-pointer-types can be made a
> > standard warning flag for building Ruby, too...
>
> That removal was fine by me.
>
> Not a particularly focused review, just random stuff I'm
> spotting while taking breaks from other projects.
>
> Mostly just mundane systems stuff, nothing about the actual
> mjit changes.
>
Even if it is random, it still took your time, and
it is valuable to me. Thank you.
> * I noticed mjit.c uses its own custom doubly-linked list for
> rb_mjit_batch_list. For me, that places a little extra burden
> in having extra code to review. Any particular reason ccan/list
> isn't used?
>
> Fwiw, the doubly linked list implementation in compile.c
> predated ccan/list; and I didn't want to:
>
I remember the MRI lists from when I worked on changing compile.c. Uniformity
of the code is important. I'll put it on my TODO list.
> a) risk throwing away known-working code
>
> b) introduce a teeny performance regression for loop-heavy
> code:
>
> ccan/list is faster for insert/delete, but slightly
> slower iteration for loops from what I could tell.
>
>
> * The pthread_* stuff can probably use portable stuff defined in
> thread.c and thread_*.h. (Unfortunately for me) Ruby needs to
> support non-Free platforms :<
>
>
> * fopen should probably be replaced by something which sets
> cloexec; since the "e" flag of fopen is non-portable.
>
> Perhaps rb_cloexec_open() + fdopen().
>
>
> * It looks like you meant to use fflush instead of fsync; fflush is
> all that's needed to ensure other processes can see the file
> changes (and it's done transparently by fclose). fsync is to
> ensure the file is committed to stable storage, and some folks
> still use stable storage for /tmp. fsync before the final
> fflush is wrong, even, as the kernel may not have all the
> data from userspace
>
>
Yes, my mistake. I'll correct this. fsync is also worse from
the performance point of view.
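The distinction Eric describes can be illustrated from Ruby, whose `IO#flush` and `IO#fsync` play the roles of C's `fflush` and `fsync`. This is only a sketch, not the mjit.c code: the helper name `write_visible` and the file contents are made up for the example.

```ruby
require "tempfile"

# Hypothetical helper: write data and make it visible to other readers
# without forcing it to stable storage.
def write_visible(path, data)
  File.open(path, "w") do |f|
    f.write(data)
    f.flush       # hand Ruby's userspace buffer to the kernel (like fflush)
    # f.fsync     # would additionally wait for stable storage (like fsync)
    yield if block_given?   # the file is still open at this point
  end
end

# Read the file through a second handle while the writer is still open.
def demo_flush
  seen = nil
  Tempfile.create("flush_demo") do |tmp|
    write_visible(tmp.path, "int x;\n") { seen = File.read(tmp.path) }
  end
  seen
end
```

After `flush` the second reader already sees the data, which is all a compiler process reading the generated file would need; `fsync` only adds the wait for the storage device.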
> * get_uniq_fname should respect alternate tmpdirs, like Dir.tmpdir does
> (in lib/dir.rb)
>
>
I'll investigate this. For JIT performance, the temp files used should be
on an in-memory FS. If alternative tmpdirs provide this, I should switch to them.
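For reference, this is roughly what respecting `Dir.tmpdir` looks like on the Ruby side. `Dir.tmpdir` consults environment variables such as `TMPDIR` before falling back to a system default. The helper below is a hypothetical Ruby counterpart of `get_uniq_fname`; its name, the `_mjit` prefix and the random suffix are assumptions for illustration, not what mjit.c does.

```ruby
require "tmpdir"
require "securerandom"

# Hypothetical helper: build a unique file name under the user's
# preferred temporary directory instead of a hard-coded /tmp.
def uniq_fname(suffix, dir: Dir.tmpdir)
  File.join(dir, "_mjit_p#{Process.pid}_#{SecureRandom.hex(4)}#{suffix}")
end
```

Whether the directory is memory-backed is not something `Dir.tmpdir` knows; that still depends on how the user or the system configured it.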
> * we can use vfork + execve instead of fork to speed up process
> creation; just need to move the fopen (which can call malloc)
> into the parent. We've already used vfork for Process.spawn,
> system(), ``, IO.popen for a few years.
>
Yes, it can be a performance win, although probably a small one.
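As a Ruby-level analogy of the suggested structure: do all file preparation in the parent, and let the child do nothing but exec the command. `Process.spawn` is one of the calls Eric lists as already using vfork. The helper name `run_child` is hypothetical, and the Ruby interpreter stands in for the C compiler here only so that the sketch runs anywhere.

```ruby
require "rbconfig"

# Hypothetical helper: start a child process and return its exit status.
# Anything that may allocate (opening files, building strings) happens
# in the parent before the spawn.
def run_child(*argv)
  pid = Process.spawn(*argv)
  _, status = Process.wait2(pid)
  status.exitstatus
end

# Stand-in for invoking a compiler on an already prepared source file.
def demo_spawn
  run_child(RbConfig.ruby, "-e", "exit 7")
end
```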
>
> None of these are super important; and I can eventually take
> some time to send you patches or pull requests (via
> email/redmine)
>
Only if it is not a burden for you. You have already taken a fresh look
at the code and proposed valuable improvements.
So far I have focused only on Linux and, a bit, on MacOS. I ignored other OSes,
e.g. Windows. My major goal was to justify the approach from the
performance point of view and then work more on MJIT portability.
Now I can say it works, although a lot of performance improvements still
can and should be done. I think the portability work could already
start.
> rb_mjit_min_header-2.5.0.h takes forever to build...
>
Yes, it is slow (about 75 sec on an i3-7100). It is a Ruby script that tries
to remove unnecessary C definitions/declarations. After removing some
C code, it calls the C compiler to check that the code is still valid.
I tried many things to speed it up, e.g. checking whether the header will
be the same, removing several declarations at once, and using special C
compiler options to speed up the check. But I got your message that it
is still slow.
I'll think about speeding it up further. Maybe I'll try running a few C
compilations in parallel, or generating a bigger header, as
loading/reading a pre-compiled header takes a tiny part of even
a small method compilation.
> Thanks again for taking your time to work on Ruby!
Eric, thank you for your time reviewing my code.
----------------------------------------
Feature #12589: VM performance improvement proposal
https://bugs.ruby-lang.org/issues/12589#change-65359
* Author: vmakarov (Vladimir Makarov)
* Status: Open
* Priority: Normal
* Assignee:
* Target version:
----------------------------------------
Hello. I'd like to start a big MRI project, but I don't want to
disrupt somebody else's plans. Therefore I'd like to have MRI
developers' opinions on the proposed project, or information on whether
somebody is already working on an analogous project.
Basically I want to improve overall MRI VM performance:
* First of all, I'd like to change VM insns and move from
**stack-based** insns to **register transfer** ones. The idea behind
it is to decrease VM dispatch overhead, as approximately 2 times
fewer RTL insns than stack-based insns are necessary for the same
program (for Ruby it is probably even fewer, as a typical Ruby program
contains a lot of method calls and the arguments are passed through
the stack).
But *decreasing memory traffic* is an even more important advantage
of RTL insns, as an RTL insn can address temporaries (stack) and
local variables in any combination. So there is no need to
put an insn result on the stack and then move it to a local
variable, or to put a variable value on the stack and then use it as an
insn operand. Insns that do more also provide a bigger scope for C
compiler optimizations.
The biggest changes will be in files compile.c and insns.def (they
will be basically rewritten). **So the project is not a new VM.
MRI VM is much more than these 2 files.**
The disadvantage of RTL insns is a bigger insn memory footprint
(which can be up to 30% more), although, as I wrote, there are
fewer RTL insns.
Another disadvantage of RTL insns *specifically* for Ruby is that
insns for call sequences will be basically the same stack based
ones but only bigger as they address the stack explicitly.
* Secondly, I'd like to **combine some frequent insn sequences** into
bigger insns. Again, it decreases insn dispatch overhead and
memory traffic even more. It also permits removing some type
checks.
The first thing on my mind is combining a compare insn and a
branch, and using immediate operands besides temporaries (stack) and
local variables. However, it is not a trivial task for Ruby, as the
compare can be implemented as a method.
I already did some experiments. RTL insns and combining insns speed up
the following micro-benchmark by more than 2 times:
```
i = 0
while i<30_000_000 # benchmark loop 1
  i += 1
end
```
The generated RTL insns for the benchmark are
```
== disasm: #<ISeq:<main>@while.rb>======================================
== catch table
| catch type: break st: 0007 ed: 0020 sp: 0000 cont: 0020
| catch type: next st: 0007 ed: 0020 sp: 0000 cont: 0005
| catch type: redo st: 0007 ed: 0020 sp: 0000 cont: 0007
|------------------------------------------------------------------------
local table (size: 2, temp: 1, argc: 0 [opts: 0, rest: -1, post: 0, block: -1, kw: -1@-1, kwrest: -1])
[ 2] i
0000 set_local_val 2, 0 ( 1)
0003 jump 13 ( 2)
0005 jump 13
0007 plusi <callcache>, 2, 2, 1, -1 ( 3)
0013 btlti 7, <callcache>, -1, 2, 30000000, -1 ( 2)
0020 local_ret 2, 0 ( 3)
```
In this experiment I ignored trace insns (that is another story) and the
complication that an integer compare insn can be re-implemented as a
Ruby method. Insn btlti is a combination of an LT immediate compare and
a branch-if-true.
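The effect on dispatch counts can be seen with a toy interpreter. This is only a sketch: the insn names are borrowed from the listing above or invented (`push`, `getlocal`, `lt`, ...), the operand layout is simplified, and nothing here reflects how MRI actually encodes or executes insns. It runs the same counting loop once as stack-based code and once as RTL-style code and counts dispatched insns.

```ruby
# Toy interpreter; insn names and encodings are illustrative only.
def run(program)
  locals = {}
  stack = []
  pc = 0
  dispatched = 0
  loop do
    op, a, b, c = program[pc]
    pc += 1
    dispatched += 1
    case op
    # stack-based insns: operands travel through the stack
    when :push          then stack.push(a)
    when :getlocal      then stack.push(locals[a])
    when :setlocal      then locals[a] = stack.pop
    when :add
      rhs = stack.pop
      stack.push(stack.pop + rhs)
    when :lt
      rhs = stack.pop
      stack.push(stack.pop < rhs)
    when :branchif      then pc = a if stack.pop
    # RTL-style insns: operands are addressed directly
    when :set_local_val then locals[a] = b
    when :plusi         then locals[a] = locals[b] + c
    when :btlti         then pc = a if locals[b] < c
    # shared
    when :jump          then pc = a
    when :halt          then return [locals, dispatched]
    end
  end
end

def stack_program(limit)
  [
    [:push, 0], [:setlocal, :i], [:jump, 7],                # 0-2: i = 0
    [:getlocal, :i], [:push, 1], [:add], [:setlocal, :i],   # 3-6: i += 1
    [:getlocal, :i], [:push, limit], [:lt], [:branchif, 3], # 7-10: i < limit?
    [:halt]                                                 # 11
  ]
end

def rtl_program(limit)
  [
    [:set_local_val, :i, 0],  # 0: i = 0
    [:jump, 3],               # 1
    [:plusi, :i, :i, 1],      # 2: i = i + 1
    [:btlti, 2, :i, limit],   # 3: goto 2 if i < limit
    [:halt]                   # 4
  ]
end
```

For a limit of 1000, `run(stack_program(1000))` dispatches 8008 toy insns and `run(rtl_program(1000))` dispatches 2004, with the same final value of `i`. The ratio in this toy says nothing about real MRI timings; it only shows where the dispatch savings come from.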
A modification of the fib benchmark is sped up by 1.35 times:
```
def fib_m n
  if n < 1
    1
  else
    fib_m(n-1) * fib_m(n-2)
  end
end
fib_m(40)
```
The RTL code of fib_m looks like
```
== disasm: #<ISeq:fib_m@fm.rb>==========================================
local table (size: 2, temp: 3, argc: 1 [opts: 0, rest: -1, post: 0, block: -1, kw: -1@-1, kwrest: -1])
[ 2] n<Arg>
0000 bflti 10, <callcache>, -1, 2, 1, -1 ( 2)
0007 val_ret 1, 16
0010 minusi <callcache>, -2, 2, 1, -2 ( 5)
0016 simple_call_self <callinfo!mid:fib_m, argc:1, FCALL|ARGS_SIMPLE>, <callcache>, -1
0020 minusi <callcache>, -3, 2, 2, -3
0026 simple_call_self <callinfo!mid:fib_m, argc:1, FCALL|ARGS_SIMPLE>, <callcache>, -2
0030 mult <callcache>, -1, -1, -2, -1
0036 temp_ret -1, 16
```
In reality, the improvement of most programs will probably be about
10%. That is because of the very dynamic nature of Ruby (a lot of calls,
checks for redefinition of basic type operations, checking overflows
to switch to GMP numbers). For example, integer addition cannot take
fewer than about 17 x86-64 insns out of the current 50 insns on the
fast path. So even if you make the remaining 33 insns 2 times faster,
the improvement will be only about 30%.
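The arithmetic behind that estimate can be spelled out as follows. This sketch assumes that every machine insn costs the same, which is a simplification, and reads the "30%" as the reduction in executed insn-equivalents; the method name is made up for the example.

```ruby
# Hypothetical helper: estimate the gain when only part of a fast path
# can be sped up (all insns assumed to cost the same).
def fast_path_estimate(total:, fixed:, factor:)
  rest = total - fixed                  # insns that can be improved
  after = fixed + rest / factor.to_f    # cost after the improvement
  { after: after,
    saved: (total - after) / total,     # fraction of the work removed
    speedup: total / after }
end

# Figures from the paragraph above: 50 insns, 17 of them irreducible,
# the remaining 33 made 2 times faster.
ESTIMATE = fast_path_estimate(total: 50, fixed: 17, factor: 2)
```

With these figures the cost drops from 50 to 33.5 insn-equivalents, i.e. about a third less work, or a speedup of a little under 1.5 times, in line with the rough figure quoted above.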
A very important part of MRI performance improvement is to make calls
fast, because there are a lot of them in Ruby. But, as I read in some
of Koichi Sasada's presentations, he pays a lot of attention to it. So I
don't want to touch it.
* Thirdly, I want to implement the insns as small inline functions
for a future AOT compiler, provided, of course, that the projects
described above are successful. It will permit easy AOT generation of
C code, which will be basically calls of these functions.
I'd like to implement an AOT compiler which will generate C code for
a Ruby method, call a C compiler to produce a binary shared object,
and load it into MRI for subsequent calls. The key is to minimize
the compilation time. There are many approaches to doing this, but I
don't want to discuss them right now.
C generation is the easiest and most portable implementation of AOT, but
in the future it is possible to use the GCC JIT plugin or LLVM IR to
decrease the overhead of the C scanner/parser.
The C compiler will see a bigger scope (all method insns) for
optimizations. I think using AOT can give another 10%
improvement. Again, it is not that big, because of the dynamic nature
of Ruby and because no C compiler is smart enough to figure out
aliasing for a typical generated C program.
Life would be easier from the performance point of view if Ruby
did not permit redefining basic operations for basic types,
e.g. plus for integers. In that case we could evaluate the types of
operands and results using some data flow analysis and generate
faster specialized insns. Still, gradual typing, if it is
introduced in future versions of Ruby, would help to generate such
faster insns.
Again, I wrote this proposal for discussion, as I don't want to be in
a position of competing with somebody else's ongoing big project. It
might be counterproductive for MRI development. I especially don't
want that because the project is big and long, will probably have a
lot of technical obstacles, and has a possibility of being a failure.