Special-case alternation with anchors in Regex analysis and code gen #64097
Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
Issue Details
From looking at our corpus of real-world regex patterns, it looks reasonably common for developers to write patterns like:
with the same end anchor at the end of each branch. In such cases, we could refactor this into:
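making the $ anchor then available to an optimization like #62697. For this particular pattern, we could also augment our optimization that factors out common prefixes from branches, and turn it into: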
It also appears to be reasonably common to look for something at the beginning or end of the input, e.g.
Our FindFirstChar optimizations don't help with such a construct because of the alternation, but we could special-case such a pattern in code generation. For example, if we restricted it to patterns rooted in an alternation with exactly two branches, one with a beginning anchor and one with an end anchor, and we can compute a fixed length for the second branch, we could generate code along the lines of:
FindFirstChar() => return true iff at the beginning;
Go()
{
    // Try to match first branch. If it succeeds, match.
    // Otherwise, jump to input.Length - ComputedFixedLength(secondBranch) and update bumpalong to that position.
    // Try to match second branch. If it succeeds, match. Else, fail.
}
This could be generalized to alternations of more than two branches, as long as every branch is rooted with an anchor: iterate through each branch, jump to the beginning for a beginning anchor or to end - fixed length for an ending anchor, and run the match.
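The generalization described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `AnchorKind`, `AnchoredBranch`, and `AnchoredAlternation` are hypothetical names that do not exist in System.Text.RegularExpressions, the `Example` branches model a made-up pattern (`^abc|xyz\z`), the end anchor is treated as the absolute end of the input, and the sketch only answers "is there a match" rather than reporting which match a real engine would return.

```csharp
using System;

// Hypothetical model of an alternation branch that is rooted in an anchor.
// None of these types exist in System.Text.RegularExpressions.
public enum AnchorKind { Beginning, End }

public sealed class AnchoredBranch
{
    public AnchoredBranch(AnchorKind anchor, int fixedLength, Func<string, int, bool> matchesAt)
    {
        Anchor = anchor;
        FixedLength = fixedLength;
        MatchesAt = matchesAt;
    }

    public AnchorKind Anchor { get; }
    public int FixedLength { get; }                    // only consulted for End branches
    public Func<string, int, bool> MatchesAt { get; }  // (input, start) => branch body matches at start
}

public static class AnchoredAlternation
{
    // Tries each branch only at the single position where its anchor could succeed:
    // index 0 for a beginning anchor, input.Length - FixedLength for an end anchor.
    public static bool IsMatch(string input, params AnchoredBranch[] branches)
    {
        foreach (AnchoredBranch branch in branches)
        {
            int start = branch.Anchor == AnchorKind.Beginning
                ? 0
                : input.Length - branch.FixedLength;

            // A negative start means the input is shorter than the fixed-length branch.
            if (start >= 0 && branch.MatchesAt(input, start))
            {
                return true;
            }
        }

        return false;
    }

    // Illustrative branches for the made-up pattern ^abc|xyz\z (an assumption, not from the issue).
    public static readonly AnchoredBranch[] Example =
    {
        new AnchoredBranch(AnchorKind.Beginning, 3, (s, i) => string.CompareOrdinal(s, i, "abc", 0, 3) == 0),
        new AnchoredBranch(AnchorKind.End, 3, (s, i) => string.CompareOrdinal(s, i, "xyz", 0, 3) == 0),
    };
}
```

With these branches, `AnchoredAlternation.IsMatch("abc123", AnchoredAlternation.Example)` examines only index 0 and index 3 of the input, no matter how long the input is, which is the point of the special case.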
Do we have a Go optimization (in the source generator) for patterns like this, so that we detect when we could easily eliminate branches by just checking one character in the input where both branches are different? For example, in this particular case we could easily discard (and avoid backtracking) the second branch with a check like
If we were to add the transformation I mentioned that factored:
into
then it does basically what you suggested (though I'm not sure what you mean about eliminating backtracking... it still needs to try with the empty branch and then, if that fails, try with the dash). Here's what the source generator produces today for the relevant portion of that latter expression:
// Match a Unicode digit exactly 5 times.
{
    if ((uint)slice.Length < 5 ||
        !char.IsDigit(slice[0]) ||
        !char.IsDigit(slice[1]) ||
        !char.IsDigit(slice[2]) ||
        !char.IsDigit(slice[3]) ||
        !char.IsDigit(slice[4]))
    {
        goto NoMatch;
    }
}

// Match with 2 alternative expressions.
//{
    int alternation_starting_pos = pos;

    // Branch 0
    //{
        StackPush2(ref base.runstack!, ref stackpos, 0, alternation_starting_pos);
        pos += 5;
        slice = slice.Slice(5);
        goto AlternationMatch;

        AlternationBranch:
        pos = alternation_starting_pos;
        slice = inputSpan.Slice(pos, end - pos);
    //}

    // Branch 1
    //{
        if ((uint)slice.Length < 10 ||
            slice[5] != '-' || // Match '-'.
            !char.IsDigit(slice[6]) || // Match a Unicode digit exactly 4 times.
            !char.IsDigit(slice[7]) ||
            !char.IsDigit(slice[8]) ||
            !char.IsDigit(slice[9]))
        {
            goto NoMatch;
        }

        StackPush2(ref base.runstack!, ref stackpos, 1, alternation_starting_pos);
        pos += 10;
        slice = slice.Slice(10);
        goto AlternationMatch;
    //}

    AlternationBacktrack:
    alternation_starting_pos = base.runstack![--stackpos];
    switch (base.runstack![--stackpos])
    {
        case 0: goto AlternationBranch;
        case 1: goto NoMatch;
    }

    AlternationMatch:;
//}
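As a rough sanity check of the prefix-factoring idea, the following sketch compares an unfactored and a factored pattern using the public `Regex.IsMatch` API. The two pattern strings are assumptions reconstructed from the comments in the generated code above (five digits, then either nothing or a dash and four digits); they are not quoted from the issue, and `PrefixFactoringCheck` is a hypothetical helper name.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PrefixFactoringCheck
{
    // Assumed patterns, inferred from the generated-code comments; not taken verbatim from the issue.
    public const string Original = @"^\d{5}$|^\d{5}-\d{4}$";
    public const string Factored = @"^\d{5}(?:|-\d{4})$";

    // True when both patterns give the same IsMatch answer for the input.
    public static bool Agree(string input) =>
        Regex.IsMatch(input, Original) == Regex.IsMatch(input, Factored);
}
```

In the factored form the empty branch is tried first; when the trailing anchor then fails on an input such as "12345-6789", the engine backtracks into the dash branch, which corresponds to the `AlternationBacktrack` path in the generated code.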
From looking at our corpus of real-world regex patterns, it looks reasonably common for developers to write patterns like:
with the same end anchor at the end of each branch. In such cases, we could refactor this into:
making the $ anchor then available to an optimization like #62697. For this particular pattern, we could also augment our optimization that factors out common prefixes from branches, and turn it into:
It also appears to be reasonably common to look for something at the beginning or end of the input, e.g.
Our FindFirstChar optimizations don't help with such a construct because of the alternation, but we could special-case such a pattern in code generation. For example, if we restricted it to just patterns rooted in an alternation with just two branches, one with a beginning anchor and one with an end anchor, and we can compute a fixed length for the second branch, we could generate code along the lines of:
This could be generalized to alternations of more than two branches, as long as every branch is rooted with an anchor, just iterating through each branch, jumping to either the beginning for a beginning anchor or to end - fixed length for an ending anchor, and running the match.