Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Special-case alternation with anchors in Regex analysis and code gen #64097

Open
stephentoub opened this issue Jan 21, 2022 · 3 comments
Open

Comments

@stephentoub
Copy link
Member

From looking at our corpus of real-world regex patterns, it looks reasonably common for developers to write patterns like:

\d{5}$|\d{5}-\d{4}$

with the same end anchor at the end of each branch. In such cases, we could refactor this into:

(?:\d{5}|\d{5}-\d{4})$

making the $ anchor then available to an optimization like #62697. For this particular pattern, we could also augment our optimization that factors out common prefixes from branches, and turn it into:

\d{5}(?:|-\d{4})$

It also appears to be reasonably common to look for something at the beginning or end of the input, e.g.

^\d{15}|\d{18}$

Our FindFirstChar optimizations don't help with such a construct because of the alternation, but we could special-case such a pattern in code generation. For example, if we restricted it to just patterns rooted in an alternation with just two branches, one with a beginning anchor and one with an end anchor, and we can compute a fixed length for the second branch, we could generate code along the lines of:

FindFirstChar() => return true iff at the beginning;
Go()
{
   // Try to match first branch.  If it succeeds, match.
   // Otherwise, jump to and update bumpalong to input.Length - ComputedFixedLength(secondBranch).
   // Try to match second branch.  If it succeeds, match. Else, fail.
}

This could be generalized to alternations of more than two branches, as long as every branch is rooted with an anchor, just iterating through each branch, jumping to either the beginning for a beginning anchor or to end - fixed length for an ending anchor, and running the match.

@ghost
Copy link

ghost commented Jan 21, 2022

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

From looking at our corpus of real-world regex patterns, it looks reasonably common for developers to write patterns like:

\d{5}$|\d{5}-\d{4}$

with the same end anchor at the end of each branch. In such cases, we could refactor this into:

(?:\d{5}|\d{5}-\d{4})$

making the $ anchor then available to an optimization like #62697. For this particular pattern, we could also augment our optimization that factors out common prefixes from branches, and turn it into:

\d{5}(?:|-\d{4})$

It also appears to be reasonably common to look for something at the beginning or end of the input, e.g.

^\d{15}|\d{18}$

Our FindFirstChar optimizations don't help with such a construct because of the alternation, but we could special-case such a pattern in code generation. For example, if we restricted it to just patterns rooted in an alternation with just two branches, one with a beginning anchor and one with an end anchor, and we can compute a fixed length for the second branch, we could generate code along the lines of:

FindFirstChar() => return true iff at the beginning;
Go()
{
   // Try to match first branch.  If it succeeds, match.
   // Otherwise, jump to and update bumpalong to input.Length - ComputedFixedLength(secondBranch).
   // Try to match second branch.  If it succeeds, match. Else, fail.
}

This could be generalized to alternations of more than two branches, as long as every branch is rooted with an anchor, just iterating through each branch, jumping to either the beginning for a beginning anchor or to end - fixed length for an ending anchor, and running the match.

Author: stephentoub
Assignees: -
Labels:

area-System.Text.RegularExpressions

Milestone: 7.0.0

@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Jan 21, 2022
@joperezr joperezr removed the untriaged New issue has not been triaged by the area owner label Jan 21, 2022
@joperezr
Copy link
Member

\d{5}$|\d{5}-\d{4}$

Do we have a Go optimization (in the source generator) for patterns like this so that we detect that we could easily eliminate branches by just checking one character in the input where both branches are different? For example, in this particular case we could easily discard (and avoid backtracking) the second branch with a check like if (input.Length < 10 || input[input.Length-5] != '-').

@stephentoub stephentoub self-assigned this Jan 21, 2022
@stephentoub
Copy link
Member Author

Do we have a Go optimization (in the source generator) for patterns like this so that we detect that we could easily eliminate branches by just checking one character in the input where both branches are different? For example, in this particular case we could easily discard (and avoid backtracking) the second branch with a check like if (input.Length < 10 || input[input.Length-5] != '-').

If we were to add the transformation I mentioned that factored:

\d{5}$|\d{5}-\d{4}$

into

\d{5}(?:|-\d{4})$

then it does basically what you suggested (though I'm not sure what you mean about eliminating backtracking... it still needs to try with the empty branch and then if that fails try with the dash). Here's what the source generator produces today for the relevant portion of that latter expression:

                    // Match a Unicode digit exactly 5 times.
                    {
                        if ((uint)slice.Length < 5 ||
                            !char.IsDigit(slice[0]) ||
                            !char.IsDigit(slice[1]) ||
                            !char.IsDigit(slice[2]) ||
                            !char.IsDigit(slice[3]) ||
                            !char.IsDigit(slice[4]))
                        {
                            goto NoMatch;
                        }
                    }
                    
                    // Match with 2 alternative expressions.
                    //{
                        int alternation_starting_pos = pos;
                        
                        // Branch 0
                        //{
                            
                            StackPush2(ref base.runstack!, ref stackpos, 0, alternation_starting_pos);
                            pos += 5;
                            slice = slice.Slice(5);
                            goto AlternationMatch;
                            
                            AlternationBranch:
                            pos = alternation_starting_pos;
                            slice = inputSpan.Slice(pos, end - pos);
                        //}
                        
                        // Branch 1
                        //{
                            if ((uint)slice.Length < 10 ||
                                slice[5] != '-' || // Match '-'.
                                !char.IsDigit(slice[6]) || // Match a Unicode digit exactly 4 times.
                                !char.IsDigit(slice[7]) ||
                                !char.IsDigit(slice[8]) ||
                                !char.IsDigit(slice[9]))
                            {
                                goto NoMatch;
                            }
                            
                            StackPush2(ref base.runstack!, ref stackpos, 1, alternation_starting_pos);
                            pos += 10;
                            slice = slice.Slice(10);
                            goto AlternationMatch;
                        //}
                        
                        AlternationBacktrack:
                        alternation_starting_pos = base.runstack![--stackpos];
                        switch (base.runstack![--stackpos])
                        {
                            case 0: goto AlternationBranch;
                            case 1: goto NoMatch;
                        }
                        
                        AlternationMatch:;
                    //}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants