Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regex UpdateBumpalong logic is flawed and likely costing us perf #62676

Closed
stephentoub opened this issue Dec 11, 2021 · 1 comment · Fixed by #63398
Closed

Regex UpdateBumpalong logic is flawed and likely costing us perf #62676

stephentoub opened this issue Dec 11, 2021 · 1 comment · Fixed by #63398
Assignees
Milestone

Comments

@stephentoub
Copy link
Member

We have an optimization that inserts an "UpdateBumpalong" node into the graph after a starting char loop:

// Optimization: unnecessary re-processing of starting loops.
// If an expression is guaranteed to begin with a single-character unbounded loop that isn't part of an alternation (in which case it
// wouldn't be guaranteed to be at the beginning) or a capture (in which case a back reference could be influenced by its length), then we
// can update the tree with a temporary node to indicate that the implementation should use that node's ending position in the input text
// as the next starting position at which to start the next match. This avoids redoing matches we've already performed, e.g. matching
// "\[email protected]" against "is this a valid [email protected]", the \w+ will initially match the "is" and then will fail to match the "@".
// Rather than bumping the scan loop by 1 and trying again to match at the "s", we can instead start at the " ". For functional correctness
// we can only consider unbounded loops, as to be able to start at the end of the loop we need the loop to have consumed all possible matches;
// otherwise, you could end up with a pattern like "a{1,3}b" matching against "aaaabc", which should match, but if we pre-emptively stop consuming
// after the first three a's and re-start from that position, we'll end up failing the match even though it should have succeeded. We can also
// apply this optimization to non-atomic loops. Even though backtracking could be necessary, such backtracking would be handled within the processing
// of a single starting position.

The code handling this node currently sets base.runtextpos to the current position, the idea being that the normal bumpalong of +1 would otherwise end up just processing the same position we've already tried. However, it looks like our current logic for handling this is flawed, in particular for non-atomic greedy loops. The current logic (in all engines) does the equivalent of:

base.runtextpos = pos;

but when backtracking kicks in for a greedy loop, we'll end up setting base.runtextpos to the last match position, and then again to that position -1, and then again to that position -2, all the way back to the beginning, which means it's not actually buying us anything in this case (in contrast, for atomic loops where there isn't backtracking, it will successfully track the last matching position). It seems like the logic should actually be more like:

if (base.runtextpos < pos)
{
    base.runtextpos = pos;
}
@ghost
Copy link

ghost commented Dec 11, 2021

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

We have an optimization that inserts an "UpdateBumpalong" node into the graph after a starting char loop:

// Optimization: unnecessary re-processing of starting loops.
// If an expression is guaranteed to begin with a single-character unbounded loop that isn't part of an alternation (in which case it
// wouldn't be guaranteed to be at the beginning) or a capture (in which case a back reference could be influenced by its length), then we
// can update the tree with a temporary node to indicate that the implementation should use that node's ending position in the input text
// as the next starting position at which to start the next match. This avoids redoing matches we've already performed, e.g. matching
// "\[email protected]" against "is this a valid [email protected]", the \w+ will initially match the "is" and then will fail to match the "@".
// Rather than bumping the scan loop by 1 and trying again to match at the "s", we can instead start at the " ". For functional correctness
// we can only consider unbounded loops, as to be able to start at the end of the loop we need the loop to have consumed all possible matches;
// otherwise, you could end up with a pattern like "a{1,3}b" matching against "aaaabc", which should match, but if we pre-emptively stop consuming
// after the first three a's and re-start from that position, we'll end up failing the match even though it should have succeeded. We can also
// apply this optimization to non-atomic loops. Even though backtracking could be necessary, such backtracking would be handled within the processing
// of a single starting position.

The code handling this node currently sets base.runtextpos to the current position, the idea being that the normal bumpalong of +1 would otherwise end up just processing the same position we've already tried. However, it looks like our current logic for handling this is flawed, in particular for non-atomic greedy loops. The current logic (in all engines) does the equivalent of:

base.runtextpos = pos;

but when backtracking kicks in for a greedy loop, we'll end up setting base.runtextpos to the last match position, and then again to that position -1, and then again to that position -2, all the way back to the beginning, which means it's not actually buying us anything in this case (in contrast, for atomic loops where there isn't backtracking, it will successfully track the last matching position). It seems like the logic should actually be more like:

if (base.runtextpos < pos)
{
    base.runtextpos = pos;
}
Author: stephentoub
Assignees: -
Labels:

area-System.Text.RegularExpressions, tenet-performance

Milestone: 7.0.0

@stephentoub stephentoub self-assigned this Jan 3, 2022
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Jan 5, 2022
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Jan 15, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Feb 14, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant