From: mame@... Date: 2018-09-26T23:33:38+00:00 Subject: [ruby-core:89184] [Ruby trunk Feature#15166][Assigned] 2.5 times faster implementation than current gcd implmentation Issue #15166 has been updated by mame (Yusuke Endoh). Status changed from Open to Assigned Assignee set to watson1978 (Shizuo Fujita) Thanks. Assigned to @watson1978. It would be very helpful if you could provide us a patch and perform the benchmark with Ruby implementation, not a toy benchmark program. Note that `__builtin_ctzl` is not available on some compilers. You need to check if it is available or not. ---------------------------------------- Feature #15166: 2.5 times faster implementation than current gcd implmentation https://bugs.ruby-lang.org/issues/15166#change-74211 * Author: jzakiya (Jabari Zakiya) * Status: Assigned * Priority: Normal * Assignee: watson1978 (Shizuo Fujita) * Target version: ---------------------------------------- This is to be more explicit (and accurate) than https://bugs.ruby-lang.org/issues/15161 This is my modified gcd benchmarks code, originally presented by Daniel Lemire (see 15161). https://gist.github.com/jzakiya/44eae4feeda8f6b048e19ff41a0c6566 Ruby's current implementation of Stein's gcd algorithm is only slightly faster than the code posted on the wikepedia page, and over 2.5 times slower than the fastest implementation in the benchmarks. ``` [jzakiya@localhost ~]$ ./gcdbenchmarks gcd between numbers in [1 and 2000] gcdwikipedia7fast32 : time = 99 gcdwikipedia4fast : time = 121 gcdFranke : time = 126 gcdwikipedia3fast : time = 134 gcdwikipedia2fastswap : time = 136 gcdwikipedia5fast : time = 139 gcdwikipedia7fast : time = 138 gcdwikipedia2fast : time = 136 gcdwikipedia6fastxchg : time = 144 gcdwikipedia2fastxchg : time = 156 gcd_iterative_mod : time = 210 gcd_recursive : time = 215 basicgcd : time = 211 rubygcd : time = 267 gcdwikipedia2 : time = 321 gcd between numbers in [1000000001 and 1000002000] gcdwikipedia7fast32 : time = 100 gcdwikipedia4fast : time = 121 gcdFranke : time = 126 gcdwikipedia3fast : time = 134 gcdwikipedia2fastswap : time = 136 gcdwikipedia5fast : time = 138 gcdwikipedia7fast : time = 138 gcdwikipedia2fast : time = 136 gcdwikipedia6fastxchg : time = 144 gcdwikipedia2fastxchg : time = 156 gcd_iterative_mod : time = 210 gcd_recursive : time = 215 basicgcd : time = 211 rubygcd : time = 269 gcdwikipedia2 : time = 323 ``` This is Ruby's code per: https://github.com/ruby/ruby/blob/3abbaab1a7a97d18f481164c7dc48749b86d7f39/rational.c#L285-L307 which is basically the wikepedia implementation. ``` inline static long i_gcd(long x, long y) { unsigned long u, v, t; int shift; if (x < 0) x = -x; if (y < 0) y = -y; if (x == 0) return y; if (y == 0) return x; u = (unsigned long)x; v = (unsigned long)y; for (shift = 0; ((u | v) & 1) == 0; ++shift) { u >>= 1; v >>= 1; } while ((u & 1) == 0) u >>= 1; do { while ((v & 1) == 0) v >>= 1; if (u > v) { t = v; v = u; u = t; } v = v - u; } while (v != 0); return (long)(u << shift); } ``` This is the fastest implementation from the benchmarks. (I originally, wrongly, cited the implementation in the article, which is 4|5th fastest in benchmarks, but still almost 2x faster than the Ruby implementation.) ``` // based on wikipedia's article, // fixed by D. Lemire, K. Willets unsigned int gcdwikipedia7fast32(unsigned int u, unsigned int v) { int shift, uz, vz; if ( u == 0) return v; if ( v == 0) return u; uz = __builtin_ctz(u); vz = __builtin_ctz(v); shift = uz > vz ? vz : uz; u >>= uz; do { v >>= vz; int diff = v; diff -= u; if ( diff == 0 ) break; vz = __builtin_ctz(diff); if ( v < u ) u = v; v = abs(diff); } while( 1 ); return u << shift; } ``` The key to speeding up all the algorithms is using the ``__builtin_ctz(x)`` directive to determine the number of trailing binary '0's. -- https://bugs.ruby-lang.org/ Unsubscribe: