Special-case alternation with anchors in Regex analysis and code gen #64097

stephentoub · 2022-01-21T14:45:22Z

From looking at our corpus of real-world regex patterns, it looks reasonably common for developers to write patterns like:

\d{5}$|\d{5}-\d{4}$

with the same end anchor at the end of each branch. In such cases, we could refactor this into:

(?:\d{5}|\d{5}-\d{4})$

making the $ anchor then available to an optimization like #62697. For this particular pattern, we could also augment our optimization that factors out common prefixes from branches, and turn it into:

\d{5}(?:|-\d{4})$

It also appears to be reasonably common to look for something at the beginning or end of the input, e.g.

^\d{15}|\d{18}$

Our FindFirstChar optimizations don't help with such a construct because of the alternation, but we could special-case such a pattern in code generation. For example, if we restricted it to just patterns rooted in an alternation with just two branches, one with a beginning anchor and one with an end anchor, and we can compute a fixed length for the second branch, we could generate code along the lines of:

FindFirstChar() => return true iff at the beginning;
Go()
{
   // Try to match first branch.  If it succeeds, match.
   // Otherwise, jump to and update bumpalong to input.Length - ComputedFixedLength(secondBranch).
   // Try to match second branch.  If it succeeds, match. Else, fail.
}

This could be generalized to alternations of more than two branches, as long as every branch is rooted with an anchor, just iterating through each branch, jumping to either the beginning for a beginning anchor or to end - fixed length for an ending anchor, and running the match.

The text was updated successfully, but these errors were encountered:

ghost · 2022-01-21T14:45:25Z

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

From looking at our corpus of real-world regex patterns, it looks reasonably common for developers to write patterns like:

\d{5}$|\d{5}-\d{4}$

with the same end anchor at the end of each branch. In such cases, we could refactor this into:

(?:\d{5}|\d{5}-\d{4})$

making the $ anchor then available to an optimization like #62697. For this particular pattern, we could also augment our optimization that factors out common prefixes from branches, and turn it into:

\d{5}(?:|-\d{4})$

It also appears to be reasonably common to look for something at the beginning or end of the input, e.g.

^\d{15}|\d{18}$

Our FindFirstChar optimizations don't help with such a construct because of the alternation, but we could special-case such a pattern in code generation. For example, if we restricted it to just patterns rooted in an alternation with just two branches, one with a beginning anchor and one with an end anchor, and we can compute a fixed length for the second branch, we could generate code along the lines of:

FindFirstChar() => return true iff at the beginning;
Go()
{
   // Try to match first branch.  If it succeeds, match.
   // Otherwise, jump to and update bumpalong to input.Length - ComputedFixedLength(secondBranch).
   // Try to match second branch.  If it succeeds, match. Else, fail.
}

This could be generalized to alternations of more than two branches, as long as every branch is rooted with an anchor, just iterating through each branch, jumping to either the beginning for a beginning anchor or to end - fixed length for an ending anchor, and running the match.

Author:	stephentoub
Assignees:	-
Labels:	`area-System.Text.RegularExpressions`
Milestone:	7.0.0

joperezr · 2022-01-21T18:44:07Z

\d{5}$|\d{5}-\d{4}$

Do we have a Go optimization (in the source generator) for patterns like this so that we detect that we could easily eliminate branches by just checking one character in the input where both branches are different? For example, in this particular case we could easily discard (and avoid backtracking) the second branch with a check like if (input.Length < 10 || input[input.Length-5] != '-').

stephentoub · 2022-01-23T02:12:52Z

Do we have a Go optimization (in the source generator) for patterns like this so that we detect that we could easily eliminate branches by just checking one character in the input where both branches are different? For example, in this particular case we could easily discard (and avoid backtracking) the second branch with a check like if (input.Length < 10 || input[input.Length-5] != '-').

If we were to add the transformation I mentioned that factored:

\d{5}$|\d{5}-\d{4}$

into

\d{5}(?:|-\d{4})$

then it does basically what you suggested (though I'm not sure what you mean about eliminating backtracking... it still needs to try with the empty branch and then if that fails try with the dash). Here's what the source generator produces today for the relevant portion of that latter expression:

                    // Match a Unicode digit exactly 5 times.
                    {
                        if ((uint)slice.Length < 5 ||
                            !char.IsDigit(slice[0]) ||
                            !char.IsDigit(slice[1]) ||
                            !char.IsDigit(slice[2]) ||
                            !char.IsDigit(slice[3]) ||
                            !char.IsDigit(slice[4]))
                        {
                            goto NoMatch;
                        }
                    }
                    
                    // Match with 2 alternative expressions.
                    //{
                        int alternation_starting_pos = pos;
                        
                        // Branch 0
                        //{
                            
                            StackPush2(ref base.runstack!, ref stackpos, 0, alternation_starting_pos);
                            pos += 5;
                            slice = slice.Slice(5);
                            goto AlternationMatch;
                            
                            AlternationBranch:
                            pos = alternation_starting_pos;
                            slice = inputSpan.Slice(pos, end - pos);
                        //}
                        
                        // Branch 1
                        //{
                            if ((uint)slice.Length < 10 ||
                                slice[5] != '-' || // Match '-'.
                                !char.IsDigit(slice[6]) || // Match a Unicode digit exactly 4 times.
                                !char.IsDigit(slice[7]) ||
                                !char.IsDigit(slice[8]) ||
                                !char.IsDigit(slice[9]))
                            {
                                goto NoMatch;
                            }
                            
                            StackPush2(ref base.runstack!, ref stackpos, 1, alternation_starting_pos);
                            pos += 10;
                            slice = slice.Slice(10);
                            goto AlternationMatch;
                        //}
                        
                        AlternationBacktrack:
                        alternation_starting_pos = base.runstack![--stackpos];
                        switch (base.runstack![--stackpos])
                        {
                            case 0: goto AlternationBranch;
                            case 1: goto NoMatch;
                        }
                        
                        AlternationMatch:;
                    //}

stephentoub added the area-System.Text.RegularExpressions label Jan 21, 2022

stephentoub added this to the 7.0.0 milestone Jan 21, 2022

dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Jan 21, 2022

joperezr removed the untriaged New issue has not been triaged by the area owner label Jan 21, 2022

stephentoub self-assigned this Jan 21, 2022

stephentoub mentioned this issue Jan 24, 2022

Improve Regex handling of anchors #64177

Merged

stephentoub removed their assignment Feb 22, 2022

stephentoub mentioned this issue Mar 3, 2022

Use RegexGenerator in System.Private.Xml #66142

Merged

stephentoub modified the milestones: 7.0.0, 8.0.0 Jul 12, 2022

joperezr modified the milestones: 8.0.0, Future Sep 30, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Special-case alternation with anchors in Regex analysis and code gen #64097

Special-case alternation with anchors in Regex analysis and code gen #64097

stephentoub commented Jan 21, 2022

ghost commented Jan 21, 2022

joperezr commented Jan 21, 2022

stephentoub commented Jan 23, 2022

Special-case alternation with anchors in Regex analysis and code gen #64097

Special-case alternation with anchors in Regex analysis and code gen #64097

Comments

stephentoub commented Jan 21, 2022

ghost commented Jan 21, 2022

joperezr commented Jan 21, 2022

stephentoub commented Jan 23, 2022