Regex UpdateBumpalong logic is flawed and likely costing us perf #62676

stephentoub · 2021-12-11T21:00:56Z

We have an optimization that inserts an "UpdateBumpalong" node into the graph after a starting char loop:

runtime/src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexNode.cs

Lines 373 to 384 in 12a8819

	// Optimization: unnecessary re-processing of starting loops.
	// If an expression is guaranteed to begin with a single-character unbounded loop that isn't part of an alternation (in which case it
	// wouldn't be guaranteed to be at the beginning) or a capture (in which case a back reference could be influenced by its length), then we
	// can update the tree with a temporary node to indicate that the implementation should use that node's ending position in the input text
	// as the next starting position at which to start the next match. This avoids redoing matches we've already performed, e.g. matching
	// "\[email protected]" against "is this a valid [email protected]", the \w+ will initially match the "is" and then will fail to match the "@".
	// Rather than bumping the scan loop by 1 and trying again to match at the "s", we can instead start at the " ". For functional correctness
	// we can only consider unbounded loops, as to be able to start at the end of the loop we need the loop to have consumed all possible matches;
	// otherwise, you could end up with a pattern like "a{1,3}b" matching against "aaaabc", which should match, but if we pre-emptively stop consuming
	// after the first three a's and re-start from that position, we'll end up failing the match even though it should have succeeded. We can also
	// apply this optimization to non-atomic loops. Even though backtracking could be necessary, such backtracking would be handled within the processing
	// of a single starting position.

The code handling this node currently sets base.runtextpos to the current position, the idea being that the normal bumpalong of +1 would otherwise end up just processing the same position we've already tried. However, it looks like our current logic for handling this is flawed, in particular for non-atomic greedy loops. The current logic (in all engines) does the equivalent of:

base.runtextpos = pos;

but when backtracking kicks in for a greedy loop, we'll end up setting base.runtextpos to the last match position, and then again to that position -1, and then again to that position -2, all the way back to the beginning, which means it's not actually buying us anything in this case (in contrast, for atomic loops where there isn't backtracking, it will successfully track the last matching position). It seems like the logic should actually be more like:

if (base.runtextpos < pos)
{
    base.runtextpos = pos;
}

ghost · 2021-12-11T21:01:00Z

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

We have an optimization that inserts an "UpdateBumpalong" node into the graph after a starting char loop:

runtime/src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexNode.cs

Lines 373 to 384 in 12a8819

	// Optimization: unnecessary re-processing of starting loops.
	// If an expression is guaranteed to begin with a single-character unbounded loop that isn't part of an alternation (in which case it
	// wouldn't be guaranteed to be at the beginning) or a capture (in which case a back reference could be influenced by its length), then we
	// can update the tree with a temporary node to indicate that the implementation should use that node's ending position in the input text
	// as the next starting position at which to start the next match. This avoids redoing matches we've already performed, e.g. matching
	// "\[email protected]" against "is this a valid [email protected]", the \w+ will initially match the "is" and then will fail to match the "@".
	// Rather than bumping the scan loop by 1 and trying again to match at the "s", we can instead start at the " ". For functional correctness
	// we can only consider unbounded loops, as to be able to start at the end of the loop we need the loop to have consumed all possible matches;
	// otherwise, you could end up with a pattern like "a{1,3}b" matching against "aaaabc", which should match, but if we pre-emptively stop consuming
	// after the first three a's and re-start from that position, we'll end up failing the match even though it should have succeeded. We can also
	// apply this optimization to non-atomic loops. Even though backtracking could be necessary, such backtracking would be handled within the processing
	// of a single starting position.

The code handling this node currently sets base.runtextpos to the current position, the idea being that the normal bumpalong of +1 would otherwise end up just processing the same position we've already tried. However, it looks like our current logic for handling this is flawed, in particular for non-atomic greedy loops. The current logic (in all engines) does the equivalent of:

base.runtextpos = pos;

but when backtracking kicks in for a greedy loop, we'll end up setting base.runtextpos to the last match position, and then again to that position -1, and then again to that position -2, all the way back to the beginning, which means it's not actually buying us anything in this case (in contrast, for atomic loops where there isn't backtracking, it will successfully track the last matching position). It seems like the logic should actually be more like:

if (base.runtextpos < pos)
{
    base.runtextpos = pos;
}

Author:	stephentoub
Assignees:	-
Labels:	`area-System.Text.RegularExpressions`, `tenet-performance`
Milestone:	7.0.0

stephentoub added area-System.Text.RegularExpressions tenet-performance Performance related issue labels Dec 11, 2021

stephentoub added this to the 7.0.0 milestone Dec 11, 2021

joperezr mentioned this issue Dec 14, 2021

System.Text.RegularExpressions work planned for .NET 7 #62758

Closed

48 tasks

stephentoub self-assigned this Jan 3, 2022

stephentoub mentioned this issue Jan 5, 2022

Improve Regex UpdateBumpalong optimization for non-atomic and lazy loops #63398

Merged

ghost added the in-pr There is an active PR which will close this issue when it is merged label Jan 5, 2022

stephentoub closed this as completed in #63398 Jan 15, 2022

ghost removed the in-pr There is an active PR which will close this issue when it is merged label Jan 15, 2022

ghost locked as resolved and limited conversation to collaborators Feb 14, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Regex UpdateBumpalong logic is flawed and likely costing us perf #62676

Regex UpdateBumpalong logic is flawed and likely costing us perf #62676

stephentoub commented Dec 11, 2021

ghost commented Dec 11, 2021

Regex UpdateBumpalong logic is flawed and likely costing us perf #62676

Regex UpdateBumpalong logic is flawed and likely costing us perf #62676

Comments

stephentoub commented Dec 11, 2021

ghost commented Dec 11, 2021