From: eregontp@...
Date: 2020-06-21T09:54:43+00:00
Subject: [ruby-core:98903] [Ruby master Feature#16254] MRI internal: Define built-in classes in Ruby with `__intrinsic__` syntax

Issue #16254 has been updated by Eregon (Benoit Daloze).

Great to see `Primitive.name` is used now :)
Somehow Redmine didn't send me any notification for the issue being closed.

@k0kubun agreed for the first one.

----------------------------------------
Feature #16254: MRI internal: Define built-in classes in Ruby with `__intrinsic__` syntax
https://bugs.ruby-lang.org/issues/16254#change-86286

* Author: ko1 (Koichi Sasada)
* Status: Closed
* Priority: Normal
* Assignee: ko1 (Koichi Sasada)
----------------------------------------
# Abstract

MRI defines most built-in classes in C with C-APIs like `rb_define_method()`.
However, there are several issues with using C-APIs.

A few methods are defined in Ruby, in `prelude.rb`.
However, we cannot define all classes this way because we cannot touch deep data structures from Ruby.
Furthermore, there are performance issues if we write all of them in Ruby.

To solve this situation, I want to suggest writing them in Ruby with C intrinsic functions.
This proposal is the same as my RubyKaigi 2019 talk.

# Terminology

* C-methods: methods defined in C (defined with `rb_define_method()`, etc).
* Ruby-methods: methods defined in Ruby.
* ISeq: The body of a `RubyVM::InstructionSequence` object, which represents bytecode for the VM.

# Background / Problem / Idea

## Written in C

As you MRI developers know, most methods are written in C with C-APIs.
However, there are several issues.

### (1) Annotation issues (compare with Ruby methods)

For example, C-methods defined by C-APIs don't have the `parameters` information (such as parameter names) that `Method#parameters` returns for Ruby-methods, because there is no way to define parameters for C-methods.
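
As a small illustration of the difference described above, the following sketch compares what `Method#parameters` reports for a Ruby-method and for a built-in method. `Greeter` and `greet` are hypothetical names used only for this example, and it is an assumption that `String#center` is still defined in C in the MRI you run it on.

```ruby
# Hypothetical class, used only to show reflection on a Ruby-method.
class Greeter
  def greet(name, greeting = "Hello", punctuation: "!")
    "#{greeting}, #{name}#{punctuation}"
  end
end

# A Ruby-method exposes both the kind and the name of each parameter.
ruby_params = Greeter.instance_method(:greet).parameters
p ruby_params
# => [[:req, :name], [:opt, :greeting], [:key, :punctuation]]

# Assumption: String#center is a C-method in this MRI build.
# A C-method's parameter list carries no names; the exact output depends
# on the Ruby version, so no expected value is shown here.
c_params = String.instance_method(:center).parameters
p c_params
```

Tools that rely on parameter names (documentation generators, debuggers, and so on) can only use the first kind of result.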
There are proposals to add parameter name information for C-methods; however, I think they would introduce new, complex C-APIs and additional overhead at boot time.

-> Idea: Writing methods in Ruby will solve this issue.

### (2) Annotation issues (for further optimization)

It is useful to know a method's attributes, for example, that the method causes no side effects (a pure method).
Labeling all methods, including methods in user programs, doesn't seem to be a good idea (not the Ruby way).
But I think annotating built-in methods is a good way because we can manage them (and we can remove the annotations once we can make a good analyzer).

There is no way to annotate this kind of attribute.

-> Idea: Writing methods in Ruby will make it easy to introduce new annotations.

### (3) Performance issue

There are several features which are slower in C than in Ruby.

* Exception handling (`rb_ensure()`, etc.), because we need to capture the context with `setjmp` in C-methods. Ruby-methods don't need to capture any context for exception handling.
* Passing keyword parameters, because Ruby-methods don't need to make a Hash object to pass the keyword parameters if they are passed as explicit keyword parameters (`foo(k1: v1, k2: v2)`).

-> Idea: Writing methods in Ruby makes them faster.

### (4) Productivity

It is tough to write some features in C.

For example, it is easy to write `rescue` syntax in Ruby:

```ruby
# in Ruby
def dummy_func_rescue
  nil
rescue
  nil
end
```

But it is difficult to write/read in C:

```C
/* Body of the protected region. */
static VALUE
dummy_body(VALUE self)
{
    return Qnil;
}

/* Rescue handler; rb_rescue() passes the raised exception as the second argument. */
static VALUE
dummy_rescue(VALUE self, VALUE exc)
{
    return Qnil;
}

static VALUE
dummy_func_rescue(VALUE self)
{
    return rb_rescue(dummy_body, self,
                     dummy_rescue, self);
}
```

(trained MRI developers can say it is not tough, though :p)

-> Idea: Writing methods in Ruby makes them easy.

### (5) API change

To introduce `Guild`, I want to pass a "context" parameter (as the first parameter) to each C-function, like `mrb_state` in mruby.
This is because getting it from TLS (thread-local storage) is a high-cost operation in a dynamic library (libruby).

Maybe nobody will allow me to change the specification of the functions used by `rb_define_method()`.

-> Idea: But by introducing a new method definition framework, we can move to it and change the specification, I hope.

Of course, we can keep the current `rb_define_method()` APIs (with additional cost on a `Guild`-enabled MRI).

## Written in Ruby in `prelude.rb`

There is a file, `prelude.rb`, which is loaded at boot time.
This file is used to define several methods, for example to reduce keyword parameter overhead (`IO#read_nonblock`, `TracePoint#enable`).

However, writing all methods in Ruby is not possible because of:

* (1) feasibility issue (we cannot access internal data structures)
* (2) performance issue (slow in general, of course)
* (3) atomicity issue (GVL/GIL)

To solve (1), we can provide low-level C-methods to implement high-level (normal built-in) methods. However, issues (2) and (3) are not solved.
(From a CS researcher's perspective, making a clever compiler will solve them, like the JVM, etc. But we don't have it yet.)

-> Idea: Writing the method body in C is feasible.

# Proposal

(1) Introduce an `intrinsic` mechanism to define built-in methods in Ruby.

(2) Load from binary format to reduce startup time.

## (1) Intrinsic function

### Calling intrinsic function syntax in Ruby

To define built-in methods, introduce the special Ruby syntax `__intrinsic__.func(args)`.
In this case, the registered intrinsic function `func()` is called with `args`.

In a normal Ruby program, `__intrinsic__` is a local variable or a method.
However, when running in the special mode, it is parsed as an intrinsic function call.

Intrinsic functions cannot be called with:

* block
* keyword arguments
* splat arguments

### Development step with intrinsic functions

(1) Write a class/module in Ruby with intrinsic function.
```ruby
# string.rb
class String
  def length
    __intrinsic__.str_length
  end
end
```

(2) Implement intrinsic functions

It is almost the same as the functions used by `rb_define_method()`.
However, it accepts a context parameter as the first parameter.

(`rb_execution_context_t` is too long, so we can rename it, `rb_state` for example)

```C
static VALUE
str_length(rb_execution_context_t *ec, VALUE self)
{
    return LONG2NUM(RSTRING_LEN(self));
}
```

(3) Define an intrinsic function table and load the `.rb` file with the table.

```C
void
Init_String(void)
{
    ...
    static const rb_export_intrinsic_t table[] = {
        RB_EXPORT_INTRINSIC(str_length, 0), // 0 is arity
        ...
    };
    rb_vm_builtin_load("string", table);
}
```

### Example

There are two examples:

(1) Comparable module: https://gist.github.com/ko1/7f18e66d1ae25bb30c7e823aa57f0d31
(2) TracePoint class: https://gist.github.com/ko1/969e5690cda6180ed989eb79619ca612

## (2) Load from binary file with lazy loading

Loading many ".rb" files slows down startup time.

We have the `ISeq#to_binary` method to generate compiled binary data so that we can eliminate parse/compile time.
Fortunately, [Feature #16163] makes the binary data small.
Furthermore, enabling the "lazy loading" feature improves startup time because we don't need to generate complete ISeqs. `USE_LAZY_LOAD` in vm_core.h enables this feature.

We need to combine the binary data with `ruby` (`libruby`). There are several ways (converting it into a C array, concatenating with objcopy if available, and so on).

# Evaluation

Evaluations are written in my RubyKaigi 2019 presentation.

Points:

* Calling overhead of Ruby methods with intrinsic functions
  * In the normal case, it is almost the same as C-methods, using optimized VM instructions.
  * With keyword parameters, it is faster than C-methods.
  * With optional parameters, it is x2 slower, so this should be solved (*1).
* Loading overhead
  * Requiring ".rb" files is about x15 slower than defining C-methods.
  * Loading binary data with the lazy loading technique is about x2 slower than C-methods. Not so bad a result.
  * At RubyKaigi 2019, the binary data was very large, but [Feature #16163] reduces the size of the binary data.

[*1] Introducing a special "overloading" specifier can solve it because we don't need to assign optional parameters.
The first method lookup can be slowed down, but we can cache the method lookup results (with arity).

```ruby
# example syntax
overload def foo(a)
  __intrinsic__.foo1(a)
end

overload def foo(a, b)
  __intrinsic__.foo2(a, b)
end
```

# Implementation

Done:

* Compile calling intrinsic functions (.rb)
* Exporting intrinsic function table (.c)

Not yet:

* Loading from binary mechanism
* Attribute syntax
* Replacement of most built-in classes

Now, miniruby and ruby (libruby) load '*.rb' files directly. However, ruby (libruby) should load the compiled binary file.

# Discussion

## Do we rewrite all of built-in classes at once?

No. We can try and migrate them.

## Do we support intrinsic mechanism for C-extension libraries?

Maybe in the future. For now we can try it on the MRI core.

## `__intrinsic__` keyword

In my RubyKaigi 2019 talk, I proposed `__C__`, but I think `__intrinsic__` is more descriptive (but a bit long).
Another idea is `RubyVM::intrinsic.func(...)`.

I have no strong opinion. We can change this syntax until we expose it to C-extensions.

## Can we support `__intrinsic__` in normal Ruby script?

No. This feature is only for built-in features.
As I described, the intrinsic function call syntax has several restrictions compared with normal method calls, so I think it should not be exposed to normal Ruby programs, IMO.

## Should we maintain intrinsic function table?

Now, yes. And we need to generate this table automatically because manual operations can introduce mistakes very easily.

The corresponding ".rb" file (`trace_point.rb`, for example) knows which intrinsic functions are needed.
Parsing the ".rb" file can generate the table automatically.
However, we need the latest version of Ruby to parse the scripts if they use syntax which is supported only by the latest version of Ruby.
For example, we need Ruby 2.7 master to parse a script which uses pattern matching syntax.
However, the system's ruby (`BASE_RUBY`) is expected to be an older version. This is a bootstrap problem, a "chicken-and-egg" problem.

There are several ideas.

(1) Parse a ".c" file to generate a table using a function attribute.

```C
INTRINSIC_FUNCTION static VALUE
str_length(...)
...
```

(2) Build another ruby parser with source code, "parse-ruby".

* 1. generate parse-ruby with C code.
* 2. run parse-ruby to generate tables by parsing ".rb" files. This process is written in C.
* 3. build miniruby and ruby with the generated table.

We can make it, but it introduces a new, complex build process.

(3) Restrict ".rb" syntax

Restrict the syntax of built-in ".rb" files to what `BASE_RUBY` can parse.
It is easy to list up intrinsic functions using Ripper or AST or `ISeq#to_a`.

(3) is the easiest but not so cool.
(2) is flexible, but it needs implementation cost and increases build complexity.

## Path of '*.rb' files and install or not

The path of `prelude.rb` is ``. We have several options.

* (1) Don't install ".rb" files and make these path ``, for example.
* (2) Install ".rb" and make these paths non-existing paths such as `/installdir/lib/builtin/trace_point.rb`.
* (3) Install ".rb" and make these paths real paths.

We will translate ".rb" files into binary data and link them into `ruby` (`libruby`).
So modifying the installed ".rb" files does not affect the behavior. That can introduce confusion, which is why I wrote (1) and (2).

For (3), it is possible to check whether the ".rb" files were modified (maybe detected by modification date) and load from them. But it will introduce an overhead (disk access overhead).

## Compatibility issue?

There are several compatibility issues. For example, `TracePoint` `c-call` events are changed to `call` events.
And there are more incompatibilities. We need to check them carefully.

## Bootstrap issue?

Yes, there are.
Loading `.rb` files at interpreter boot time can cause problems. For example, before the String class is initialized, the class of a String literal is 0 (because the String class has not been created yet).
I introduced several workarounds, but we need to modify more.

# Conclusion

How about introducing this mechanism and trying it in Ruby 2.7?
We can revert these changes if we find any trouble, as long as we don't expose this mechanism and keep the changes internal.

--
https://bugs.ruby-lang.org/

Unsubscribe: