From: duerst@... Date: 2021-07-14T11:59:02+00:00 Subject: [ruby-core:104598] [Ruby master Feature#17837] Add support for Regexp timeouts Issue #17837 has been updated by duerst (Martin D�rst). nobu (Nobuyoshi Nakada) wrote in #note-22: > I made a patch for `Regexp#backtrack_limit=`. > This seems no significant performance difference. > > https://github.com/ruby/ruby/compare/master...nobu:Regexp.backtrack_limit?expand=1 I have looked at this patch. I think this is the general direction to go. I also think that the interface/API looks good, maybe having a keyword argument on Regexp.new, too, would be a good addition. I installed the patch and ran some very experiments. I started with a very slow Regexp that I use to show my students. It can be made of any length n, but gets really slow when n grows, to the order of O(2**n). I realized that it actually may be some kind of worst case, because it's not really doing much except for backtracking. That means that the overhead of counting the backtracks will show very clearly. Any more 'average' example should slow down quite a bit less. So here is the program I used: ```Ruby HAS_BACKTRACK_LIMIT = Regexp.respond_to? :backtrack_limit def slow_find (n) s = 'a' * n r = Regexp.compile('a?' * n + s) m = nil t1 = Time.now 10.times { m = s.match(r) } t2 = Time.now print "n: #{n}, time: #{t2-t1}/10" print ", backtrack_count: #{m.backtrack_count}" if HAS_BACKTRACK_LIMIT puts end (25..29).each do |i| slow_find i end ``` You can easily adjust this by changing the part (25..29) (the range of s n's used) and the two instances of '10' (the number of times a match is run in a row). Here are the results for the patch: ``` duerst@Kloentalersee:~/nobuRuby$ ./ruby try_backtrack_limit.rb n: 25, time: 2.8453695/10, backtrack_count: 33554431 n: 26, time: 5.6392941/10, backtrack_count: 67108863 n: 27, time: 11.3532755/10, backtrack_count: 134217727 n: 28, time: 24.1388335/10, backtrack_count: 268435455 n: 29, time: 49.084651/10, backtrack_count: 536870911 duerst@Kloentalersee:~/nobuRuby$ ./ruby try_backtrack_limit.rb n: 25, time: 2.7971486/10, backtrack_count: 33554431 n: 26, time: 5.9609293/10, backtrack_count: 67108863 n: 27, time: 12.126138/10, backtrack_count: 134217727 n: 28, time: 24.7895166/10, backtrack_count: 268435455 n: 29, time: 49.6923646/10, backtrack_count: 536870911 duerst@Kloentalersee:~/nobuRuby$ ./ruby try_backtrack_limit.rb n: 25, time: 2.8213545/10, backtrack_count: 33554431 n: 26, time: 6.1295964/10, backtrack_count: 67108863 n: 27, time: 12.1948968/10, backtrack_count: 134217727 n: 28, time: 24.6284841/10, backtrack_count: 268435455 n: 29, time: 48.6898231/10, backtrack_count: 536870911 ``` And here are the results without the patch: ``` duerst@Kloentalersee:~/ruby3$ ./ruby ../nobuRuby/try_backtrack_limit.rb n: 25, time: 2.6384167/10 n: 26, time: 5.2395088/10 n: 27, time: 11.3225276/10 n: 28, time: 23.289667/10 n: 29, time: 45.9637488/10 duerst@Kloentalersee:~/ruby3$ ./ruby ../nobuRuby/try_backtrack_limit.rb n: 25, time: 2.5845849/10 n: 26, time: 5.2094378/10 n: 27, time: 10.5159888/10 n: 28, time: 22.5549276/10 n: 29, time: 45.600226/10 duerst@Kloentalersee:~/ruby3$ ./ruby ../nobuRuby/try_backtrack_limit.rb n: 25, time: 2.5993792/10 n: 26, time: 5.2897985/10 n: 27, time: 11.2203586/10 n: 28, time: 23.1157868/10 n: 29, time: 45.0094087/10 ``` These results where obtained on a WSL2/Ubuntu install on Windows 10. All other user programs were switched off, which on Windows doesn't mean there's nothing else running, of course. It should be clear from the above results that the difference is around 5%, maybe a bit higher, but not 10%. As I already said, that's for what I think is pretty much the worst case. All this Regexp does in backtrack in a binary tree of depth n, testing out all the combinations of choosing 'a' or not 'a' in the first half of the Regexp (which is just "a?a?a?a?...."). Every time it looks for an 'a' in that part, it finds one. But then (except for the very very last try) it cannot match the second part of the Regexp (just n "a"s) to the rest of the string (which is also just n "a"s). For that, it actually doesn't need time, because this part is optimized with a Boyer-Moor algorithm, which means it just checks that the last "a" in the Regexp is beyond the actual string and so there's no match. This can be seen from the debug output when compiling Ruby with ``` #define ONIG_DEBUG_PARSE_TREE #define ONIG_DEBUG_COMPILE ``` in regint.h, which results in the following: ``` $ ./ruby -e 'Regexp.new("a?a?a?aaa")' `RubyGems' were not loaded. `error_highlight' was not loaded. `did_you_mean' was not loaded. PATTERN: /a?a?a?aaa/ (US-ASCII) {0,1} a {0,1} a {0,1} a aaa optimize: EXACT_BM anchor: [] sub anchor: [] exact: [aaa]: length: 3 code length: 26 0:[push:(+2)] 5:[exact1:a] 7:[push:(+2)] 12:[exact1:a] 14:[push:(+2)] 19:[exact1:a] 21:[exact3:aaa] 25:[end] ``` So this should give everybody some indication of the worst slowdown with this new backtrack_limit feature. Results for some more "average" scenarios would also help. ---------------------------------------- Feature #17837: Add support for Regexp timeouts https://bugs.ruby-lang.org/issues/17837#change-92884 * Author: sam.saffron (Sam Saffron) * Status: Open * Priority: Normal ---------------------------------------- ### Background ReDoS are a very common security issue. At Discourse we have seen a few through the years. https://owasp.org/www-community/attacks/Regular_expression_Denial_of_Service_-_ReDoS In a nutshell there are 100s of ways this can happen in production apps, the key is for an attacker (or possibly innocent person) to supply either a problematic Regexp or a bad string to test it with. ``` /A(B|C+)+D/ =~ "A" + "C" * 100 + "X" ``` Having a problem Regexp somewhere in a large app is a universal constant, it will happen as long as you are using Regexps. Currently the only feasible way of supplying a consistent safeguard is by using `Thread.raise` and managing all execution. This kind of pattern requires usage of a third party implementation. There are possibly issues with jRuby and Truffle when taking approaches like this. ### Prior art .NET provides a `MatchTimeout` property per: https://docs.microsoft.com/en-us/dotnet/api/system.text.regularexpressions.regex.matchtimeout?view=net-5.0 Java has nothing built in as far as I can tell: https://stackoverflow.com/questions/910740/cancelling-a-long-running-regex-match Node has nothing built in as far as I can tell: https://stackoverflow.com/questions/38859506/cancel-regex-match-if-timeout Golang and Rust uses RE2 which is not vulnerable to DoS by limiting features (available in Ruby RE2 gem) ``` irb(main):003:0> r = RE2::Regexp.new('A(B|C+)+D') => # irb(main):004:0> r.match("A" + "C" * 100 + "X") => nil ``` ### Proposal Implement `Regexp.timeout` which allow us to specify a global timeout for all Regexp operations in Ruby. Per Regexp would require massive application changes, almost all web apps would do just fine with a 1 second Regexp timeout. If `timeout` is set to `nil` everything would work as it does today, when set to second a "monitor" thread would track running regexps and time them out according to the global value. ### Alternatives I recommend against a "per Regexp" API as this decision is at the application level. You want to apply it to all regular expressions in all the gems you are consuming. I recommend against a move to RE2 at the moment as way too much would break ### See also: https://people.cs.vt.edu/davisjam/downloads/publications/Davis-Dissertation-2020.pdf https://levelup.gitconnected.com/the-regular-expression-denial-of-service-redos-cheat-sheet-a78d0ed7d865 -- https://bugs.ruby-lang.org/ Unsubscribe: