Improve Regex's knowledge of captures and atomicity in the tree #62451

stephentoub · 2021-12-06T18:13:12Z

Handling captures requires additional code, e.g. uncapturing after a failed match attempt. For both `RegexOptions.Compiled` and the source generator, we emit code that tracks the position to uncapture back to and then performs that uncapture whenever captures may be involved. Today the analysis behind this is fairly limited: we mark each node in the tree as to whether it contains captures, which lets us skip the boilerplate uncapture code after failures in child nodes that contain none. But we don't track whether any nodes after a given node involve captures, and that matters for backtracking. For example, for the expression `(a*)b*[bc]`, the codegen for the `b*` loop includes the uncapture boilerplate because it sees that the whole expression contains captures, even though if backtracking occurs because `[bc]` fails to match, there is provably nothing to uncapture.

We should change how we annotate the tree to include the idea of "is a node followed by a capture". This could be done, for example, by walking the tree backwards while tracking whether we've seen a capture, and once we have, marking every node we subsequently visit.
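To make the "followed by a capture" idea concrete, here is a minimal sketch in Python over a toy tree. The `Node` type, its `kind` strings, and both function names are hypothetical and are not the actual `RegexNode` API. The treatment of alternations and loops is a conservative assumption made for this sketch.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    # Hypothetical toy node for illustration, not the real RegexNode type.
    kind: str  # "concat", "alternate", "capture", "loop", "set", "char"
    children: List["Node"] = field(default_factory=list)
    contains_capture: bool = False
    followed_by_capture: bool = False


def compute_contains_capture(node: Node) -> bool:
    """Bottom-up pass: does this subtree contain a capture group?"""
    # Build a list (not a generator) so that every child is visited.
    child_results = [compute_contains_capture(c) for c in node.children]
    node.contains_capture = node.kind == "capture" or any(child_results)
    return node.contains_capture


def mark_followed_by_capture(node: Node, capture_after: bool = False) -> None:
    """Top-down pass that visits sequential children from right to left."""
    node.followed_by_capture = capture_after
    if node.kind == "alternate":
        # Branches are alternatives, not a sequence: each branch is followed
        # only by whatever follows the alternation as a whole.
        for branch in node.children:
            mark_followed_by_capture(branch, capture_after)
        return
    if node.kind == "loop" and node.contains_capture:
        # Assumption: a later iteration of the loop may capture again.
        capture_after = True
    seen = capture_after
    for child in reversed(node.children):
        mark_followed_by_capture(child, seen)
        seen = seen or child.contains_capture


# (a*)b*[bc]: nothing after b* captures.
b_loop = Node("loop", [Node("char")])
tree = Node("concat", [Node("capture", [Node("loop", [Node("char")])]),
                       b_loop, Node("set")])
compute_contains_capture(tree)
mark_followed_by_capture(tree)

# b*(a*)[bc]: here b* is followed by a capture.
b_loop2 = Node("loop", [Node("char")])
tree2 = Node("concat", [b_loop2,
                        Node("capture", [Node("loop", [Node("char")])]),
                        Node("set")])
compute_contains_capture(tree2)
mark_followed_by_capture(tree2)
```

With these flags, a code generator could skip the uncapture boilerplate for `b_loop` (the tree as a whole contains a capture, but `b_loop.followed_by_capture` is false) and keep it for `b_loop2`. Each pass touches every node once, so the analysis stays linear in the size of the tree.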

We have a similar issue for atomicity. Codegen can be much better for a variety of constructs when they're atomic. Today we walk up the tree from a node to see whether anything in its parent hierarchy makes it atomic, but in the worst case that is an O(N^2) algorithm (paid during construction, not matching). We could instead do a single O(N) pass over the tree to compute this for every node.
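A single top-down pass of that shape might look like the following Python sketch. The `Node` type and `mark_atomic` are hypothetical, and the propagation rules are simplified assumptions chosen for illustration: an atomic group makes its child atomic, only the last child of a concatenation inherits, alternation branches and capture bodies inherit, and everything else resets. They are not a statement of the rules the real implementation uses.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    # Hypothetical toy node for illustration, not the real RegexNode type.
    kind: str  # "concat", "alternate", "capture", "atomic", "loop", "char"
    children: List["Node"] = field(default_factory=list)
    atomic_by_ancestor: bool = False


def mark_atomic(node: Node, atomic: bool = False) -> None:
    """Propagate the flag downward so each node is visited exactly once."""
    node.atomic_by_ancestor = atomic
    last = len(node.children) - 1
    for i, child in enumerate(node.children):
        if node.kind == "atomic":
            # Everything directly under an atomic group is atomic.
            child_atomic = True
        elif node.kind == "concat":
            # Only the last child inherits: earlier children can still be
            # backtracked into by the siblings that follow them.
            child_atomic = atomic and i == last
        elif node.kind in ("alternate", "capture"):
            child_atomic = atomic
        else:
            # Loops and anything else: conservatively reset.
            child_atomic = False
        mark_atomic(child, child_atomic)


# (?>a*b*)c
a_loop = Node("loop", [Node("char")])
b_loop = Node("loop", [Node("char")])
inner = Node("concat", [a_loop, b_loop])
tree = Node("concat", [Node("atomic", [inner]), Node("char")])
mark_atomic(tree)
```

After the pass, `b_loop.atomic_by_ancestor` is true and `a_loop.atomic_by_ancestor` is false, and later lookups are a field read rather than a walk up the parent chain.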

ghost · 2021-12-06T18:13:15Z

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

Author:	stephentoub
Assignees:	-
Labels:	`area-System.Text.RegularExpressions`, `tenet-performance`
Milestone:	7.0.0

stephentoub added area-System.Text.RegularExpressions tenet-performance Performance related issue labels Dec 6, 2021

stephentoub added this to the 7.0.0 milestone Dec 6, 2021

dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Dec 6, 2021

joperezr mentioned this issue Dec 14, 2021

System.Text.RegularExpressions work planned for .NET 7 #62758

Closed

48 tasks

jeffschwMSFT removed the untriaged New issue has not been triaged by the area owner label Jan 11, 2022

stephentoub self-assigned this Jan 12, 2022

stephentoub mentioned this issue Feb 22, 2022

Centralize regex tree analysis for atomic/capture/backtracking detection #65734

Merged

ghost added the in-pr There is an active PR which will close this issue when it is merged label Feb 22, 2022

stephentoub closed this as completed in #65734 Feb 25, 2022

ghost removed the in-pr There is an active PR which will close this issue when it is merged label Feb 25, 2022

ghost locked as resolved and limited conversation to collaborators Mar 28, 2022
