From: tenderlove@... Date: 2018-02-06T00:36:37+00:00 Subject: [ruby-core:85424] [Ruby trunk Feature#14370] Directly mark instruction operands and avoid mark_ary usage on rb_iseq_constant_body Issue #14370 has been updated by tenderlovemaking (Aaron Patterson). File direct_marking.diff added ko1 (Koichi Sasada) wrote: > >���Approximately 60% of the instruction sequences have 0 markable objects. Those 60% account for 35% of the total iseq_encoded that needs to be walked: > > It seems many extra data we are holding. If we shrink such extra space (capa) just after compiling, it can reduce memory. I tried this approach to compare, here is the patch I used: https://gist.githubusercontent.com/tenderlove/f7fe6e09d6f9cacd3299abdd39fb65b7/raw/9b7dd3718278f7be0eb18bedf94590573e313ab2/resize_array.diff This results in slightly more memory usage than directly marking iseq in my test (57300896 bytes vs 57112281 bytes). Unfortunately this patch also made more calls to malloc / free in order to allocate the smaller array. Maybe there is a way to do this in-place? > Of course, Aaron's approach helps more (because we don't need to keep `mark_ary` any more for referencing objects). But it is simple and no computation cost on marking phase. > > Thoughts? I wanted to see what it takes to remove `mark_ary` and what the benefits would be, so I did that too. I found removing the mark array resulted in a 6.6% memory reduction compared to trunk (vs 3% from direct marking). To remove the mark array, first I started marking once results: https://github.com/github/ruby/commit/b85d83613ad37d8c56277a7ac7d9bda69e1d8f67.diff The `once` instruction declares its operand is an inline cache struct, but it's actually a `iseq_inline_storage_entry`, so I had to add a conditional to cast the IC but only for `once` and `trace_once` instructions. Next I was able to remove the mark array: https://github.com/github/ruby/commit/29c95eec9c5ce6a9735169c5d6f1d08343d643d1.diff I added a flag to the iseq to indicate whether or not the iseq has markable objects: https://github.com/github/ruby/commit/fcb70543fcf73216e01634af1cc6e161eec3d2b6.diff After that patch the only iseqs that have performance impact are ones that actually need to be disassembled for marking. Finally, I added a new operand for `once` instructions, then removed the conditionals from the mark / compile function: https://github.com/github/ruby/commit/78ce485d9880e8694d586e93a48429f4aa907c18.diff I ran the same benchmark to compare all four versions: 1. trunk 2. Directly marking 3. Removing the mark array 4. Resizing the mark array ![Memory usage](https://user-images.githubusercontent.com/3124/35835276-9d8b37cc-0a8e-11e8-8a46-5ba391c0e666.png) From the graph you can see that resizing the mark array ends up with the same memory usage as direct marking, but takes more `malloc` calls to do it (the X axis is process size at each malloc, so the more samples the more malloc calls). Out of curiosity, I printed `GC.stat` after booting the Rails app for trunk, resizing the array, and removal: ~~~ key trunk resize remove count 46 47 49 heap_allocated_pages 664 656 505 heap_sorted_length 664 656 505 heap_allocatable_pages 0 0 0 heap_available_slots 270637 267387 205847 heap_live_slots 184277 184275 157995 heap_free_slots 86360 83112 47852 heap_final_slots 0 0 0 heap_marked_slots 184194 184192 157912 heap_eden_pages 664 656 505 heap_tomb_pages 0 0 0 total_allocated_pages 664 656 505 total_freed_pages 0 0 0 total_allocated_objects 986460 1005514 951828 total_freed_objects 802183 821239 793833 malloc_increase_bytes 1168 1168 1168 malloc_increase_bytes_limit 16777216 16777216 16777216 minor_gc_count 36 36 39 major_gc_count 10 11 10 remembered_wb_unprotected_objects 1640 1639 1660 remembered_wb_unprotected_objects_limit 3280 3278 3320 old_objects 177699 177482 153681 old_objects_limit 355398 354964 307362 oldmalloc_increase_bytes 1616 1616 2522576 oldmalloc_increase_bytes_limit 16777216 16777216 16777216 ~~~ As expected, removing the mark array decreased the number of allocated pages and the number of live slots. Anyway, I really like the idea of removing the mark array, though I realize this patch is a little complex. I think the savings are worth the change. I've attached an updated diff with all of the changes I've made so far. As a side note, I think we could remove `original_iseq` from the struct / mark array. AFAICT the `original_iseq` slot is just there because `rb_iseq_original_iseq` returns the underlying pointer and something needs to keep it alive. Maybe we could use a callback and keep the reference on the stack? I will send a patch for it so we can discuss on a different ticket. :) ---------------------------------------- Feature #14370: Directly mark instruction operands and avoid mark_ary usage on rb_iseq_constant_body https://bugs.ruby-lang.org/issues/14370#change-70206 * Author: tenderlovemaking (Aaron Patterson) * Status: Open * Priority: Normal * Assignee: ko1 (Koichi Sasada) * Target version: ---------------------------------------- Hi, I've attached a patch that changes rb_iseq_mark to directly mark instruction operands rather than adding them to a mark array. I observed a ~3% memory reduction by directly marking operands, and I didn't observe any difference in GC time. To test memory usage, I used a basic Rails application, logged all malloc / free calls to a file, then wrote a script that would sum the live memory at each sample (each sample being a call to malloc). I graphed these totals so that I could see the memory usage as malloc calls were made: ![memory usage graph](https://user-images.githubusercontent.com/3124/35020270-1b0ded20-fae0-11e7-9cbd-1d028a6c9484.png) The red line is trunk, the blue line is trunk + the patch I've attached. Since the X axis is sample number (not time), the blue line is not as long as the red line because the blue line calls `malloc` fewer times. The Y axis in the graph is the total number of "live" bytes that have been allocated (all allocations minus their corresponding frees). You can see from the graph that memory savings start adding up as more code gets loaded. I was concerned that this patch might impact GC time, but `make gcbench-rdoc` didn't seem to show any significant difference in GC time between trunk and this patch. If it turns out there is a performance impact, I think I could improve the time while still keeping memory usage low by generating a bitmap during iseq compilation. There is a bit more information where I've been working, but I think I've summarized everything here. https://github.com/github/ruby/pull/39 ---Files-------------------------------- iseq_mark.diff (6.28 KB) iseq_mark.diff (6.28 KB) iseq_mark.diff (7.26 KB) benchmark_methods.diff (1.23 KB) bench.rb (3.01 KB) direct_marking.diff (18.4 KB) -- https://bugs.ruby-lang.org/ Unsubscribe: