From: "jirkamarsik (Jirka Marsik) via ruby-core" Date: 2024-02-29T17:21:18+00:00 Subject: [ruby-core:117014] [Ruby master Bug#20316] Regexp quantifier behavior changes in different syntactic context. Issue #20316 has been reported by jirkamarsik (Jirka Marsik). ---------------------------------------- Bug #20316: Regexp quantifier behavior changes in different syntactic context. https://bugs.ruby-lang.org/issues/20316 * Author: jirkamarsik (Jirka Marsik) * Status: Open * ruby -v: ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-linux-gnu] * Backport: 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN, 3.3: UNKNOWN ---------------------------------------- In the example below, adding a vertical bar to the end of a regular expression changes what is being matched by the preceding regular expression. ``` irb(main):001:0> /(|a){3}b/.match("aab") => # irb(main):002:0> /(|a){3}b|/.match("aab") => # ``` This is because the `{3}` quantifier is compiled into a `repeat` loop which uses the `OP_NULL_CHECK_END_MEMST` operation to perform a capture-group sensitive null-check after every loop iteration. The logic behind the `OP_NULL_CHECK_END_MEMST` operation is implemented using the `STACK_NULL_CHECK_MEMST` macro in `regexec.c`. The `STACK_NULL_CHECK_MEMST` macro checks whether any capture groups have been updated inside the last loop iteration and it does so by searching the stack for `STK_MEM_START` entries. However, such entries are not used for all capture groups. They are only used by capture groups which are listed in `bt_mem_start`. A capture group is marked as needing such bookkeeping only when it is either referenced by a back-reference or it appears in certain syntactic contexts (see e.g. around line 4096 of `regcomp.c`). This looks like an optimization that tries to avoid polluting the stack with `STK_MEM_START` entries in cases in which they are not needed. However, in this ca se, not putting `STK_MEM_START` entries on the stack leads to a different match result. In the example above, by adding a vertical bar to the end of the regexp, we have placed the group `(|a)` inside an alternation. This means that a different operation for tracking the state of the capture group is emitted in the compiled bytecode and this leads to a different result. Incidentally, this result should be the correct result, as the null-check ends up respecting the state of capture groups, as it tries to do in Ruby. This is the compilation and execution of `/(|a){3}b/.match("aab")` with ONIG_DEBUG_PARSE_TREE, ONIG_DEBUG_COMPILE and ONIG_DEBUG_MATCH. Note that `mem-start:1` is used for tracking the state of capture group 1. ``` PATTERN: /(|a){3}b/ (US-ASCII) {3,3} memory:1 a b optimize: EXACT anchor: [] sub anchor: [] exact: [b]: length: 1 code length: 37 0:[repeat:0:27] 7:[null-check-start:0] 10:[mem-start:1] 13:[push:(+5)] 18:[jump:(+2)] 23:[exact1:a] 25:[mem-end:1] 28:[null-check-end-memst:0] 31:[repeat-inc:0] 34:[exact1:b] 36:[end] match_at: str: 139809364021904 (0x7f27e77a9290), end: 139809364021906 (0x7f27e77a9292), start: 139809364021904 (0x7f27e77a9290), sprev: 0 ((nil)) size: 2, start offset: 0 ofs> str stk:type addr:opcode 0> "ab" 0:Alt 0:[repeat:0:27] 0> "ab" 1:Rep 7:[null-check-start:0] 0> "ab" 2:NulChS 10:[mem-start:1] 0> "ab" 2:NulChS 13:[push:(+5)] 0> "ab" 3:Alt 18:[jump:(+2)] 0> "ab" 3:Alt 25:[mem-end:1] 0> "ab" 3:Alt 28:[null-check-end-memst:0] NULL_CHECK_END_MEMST: skip id:0, s:139809364021904 (0x7f27e77a9290) 0> "ab" 3:Alt 34:[exact1:b] 0> "ab" 2:NulChS 23:[exact1:a] 1> "b" 2:NulChS 25:[mem-end:1] 1> "b" 2:NulChS 28:[null-check-end-memst:0] 1> "b" 2:NulChS 31:[repeat-inc:0] 1> "b" 3:RepInc 7:[null-check-start:0] 1> "b" 4:NulChS 10:[mem-start:1] 1> "b" 4:NulChS 13:[push:(+5)] 1> "b" 5:Alt 18:[jump:(+2)] 1> "b" 5:Alt 25:[mem-end:1] 1> "b" 5:Alt 28:[null-check-end-memst:0] NULL_CHECK_END_MEMST: skip id:0, s:139809364021905 (0x7f27e77a9291) 1> "b" 5:Alt 34:[exact1:b] 2> "" 5:Alt 36:[end] ``` This is the compilation and execution of `/(|a){3}b|/.match("aab")` (where the group `(|a)` appears inside an alternation) with ONIG_DEBUG_PARSE_TREE, ONIG_DEBUG_COMPILE and ONIG_DEBUG_MATCH. Note that `mem-start-push:1` is used for tracking the state of capture group 1, not `mem-start:1`. ``` PATTERN: /(|a){3}b|/ (US-ASCII) {3,3} memory:1 a b optimize: NONE anchor: [] code length: 47 0:[push:(+41)] 5:[repeat:0:27] 12:[null-check-start:0] 15:[mem-start-push:1] 18:[push:(+5)] 23:[jump:(+2)] 28:[exact1:a] 30:[mem-end:1] 33:[null-check-end-memst:0] 36:[repeat-inc:0] 39:[exact1:b] 41:[jump:(+0)] 46:[end] match_at: str: 140072854131304 (0x7f6540b69268), end: 140072854131306 (0x7f6540b6926a), start: 140072854131304 (0x7f6540b69268), sprev: 0 ((nil)) size: 2, start offset: 0 ofs> str stk:type addr:opcode 0> "ab" 0:Alt 0:[push:(+41)] 0> "ab" 1:Alt 5:[repeat:0:27] 0> "ab" 2:Rep 12:[null-check-start:0] 0> "ab" 3:NulChS 15:[mem-start-push:1] 0> "ab" 4:MemS 18:[push:(+5)] 0> "ab" 5:Alt 23:[jump:(+2)] 0> "ab" 5:Alt 30:[mem-end:1] 0> "ab" 5:Alt 33:[null-check-end-memst:0] 0> "ab" 5:Alt 36:[repeat-inc:0] 0> "ab" 6:RepInc 12:[null-check-start:0] 0> "ab" 7:NulChS 15:[mem-start-push:1] 0> "ab" 8:MemS 18:[push:(+5)] 0> "ab" 9:Alt 23:[jump:(+2)] 0> "ab" 9:Alt 30:[mem-end:1] 0> "ab" 9:Alt 33:[null-check-end-memst:0] NULL_CHECK_END_MEMST: skip id:0, s:140072854131304 (0x7f6540b69268) 0> "ab" 9:Alt 39:[exact1:b] 0> "ab" 8:MemS 28:[exact1:a] 1> "b" 8:MemS 30:[mem-end:1] 1> "b" 8:MemS 33:[null-check-end-memst:0] 1> "b" 8:MemS 36:[repeat-inc:0] 1> "b" 9:RepInc 12:[null-check-start:0] 1> "b" 10:NulChS 15:[mem-start-push:1] 1> "b" 11:MemS 18:[push:(+5)] 1> "b" 12:Alt 23:[jump:(+2)] 1> "b" 12:Alt 30:[mem-end:1] 1> "b" 12:Alt 33:[null-check-end-memst:0] 1> "b" 12:Alt 36:[repeat-inc:0] 1> "b" 13:RepInc 39:[exact1:b] 2> "" 13:RepInc 41:[jump:(+0)] 2> "" 13:RepInc 46:[end] ``` -- https://bugs.ruby-lang.org/ ______________________________________________ ruby-core mailing list -- ruby-core@ml.ruby-lang.org To unsubscribe send an email to ruby-core-leave@ml.ruby-lang.org ruby-core info -- https://ml.ruby-lang.org/mailman3/postorius/lists/ruby-core.ml.ruby-lang.org/