Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve Regex's knowledge of captures and atomicity in the tree #62451

Closed
stephentoub opened this issue Dec 6, 2021 · 1 comment · Fixed by #65734
Closed

Improve Regex's knowledge of captures and atomicity in the tree #62451

stephentoub opened this issue Dec 6, 2021 · 1 comment · Fixed by #65734
Assignees
Milestone

Comments

@stephentoub
Copy link
Member

Handling captures involves additional code, e.g. uncapturing after a failed match attempt. For both RegexOptions.Compiled and the source generator, we output code to both track where we need to uncapture to and to then perform that uncapture if it's possible there may be captures involved. But today the analysis for this is relatively limited: we mark each node in the tree as to whether it contains captures, and that allows us to avoid the boilerplate uncapture code after failures in child nodes. But we don't analyze/track whether any nodes after a given node involve captures, and that's relevant for backtracking, e.g. today for the expression (a*)b*[bc] the codegen for that b* loop will involve the uncapture boilerplate because it'll see that the whole expression contains captures, even though if backtracking occurs due to the [bc] not matching, there is provably nothing that will need to be uncaptured. We should change how we annotate the tree and include this idea of "is a node followed by a capture". This could be done, for example, by walking the tree backwards, tracking whether we've seen a capture, and if we have, marking every future node we see.

We have a similar issue for atomic. Code gen can be much better for a variety of constructs if they're atomic. Today we walk up the tree from a node to see if anything in its parent hierarchy makes it atomic, but worst case that could be an O(N^2) algorithm (as part of construction, not matching). We could instead do a single O(N) pass over the tree to compute this for every node.

@stephentoub stephentoub added this to the 7.0.0 milestone Dec 6, 2021
@ghost
Copy link

ghost commented Dec 6, 2021

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

Handling captures involves additional code, e.g. uncapturing after a failed match attempt. For both RegexOptions.Compiled and the source generator, we output code to both track where we need to uncapture to and to then perform that uncapture if it's possible there may be captures involved. But today the analysis for this is relatively limited: we mark each node in the tree as to whether it contains captures, and that allows us to avoid the boilerplate uncapture code after failures in child nodes. But we don't analyze/track whether any nodes after a given node involve captures, and that's relevant for backtracking, e.g. today for the expression (a*)b*[bc] the codegen for that b* loop will involve the uncapture boilerplate because it'll see that the whole expression contains captures, even though if backtracking occurs due to the [bc] not matching, there is provably nothing that will need to be uncaptured. We should change how we annotate the tree and include this idea of "is a node followed by a capture". This could be done, for example, by walking the tree backwards, tracking whether we've seen a capture, and if we have, marking every future node we see.

We have a similar issue for atomic. Code gen can be much better for a variety of constructs if they're atomic. Today we walk up the tree from a node to see if anything in its parent hierarchy makes it atomic, but worst case that could be an O(N^2) algorithm (as part of construction, not matching). We could instead do a single O(N) pass over the tree to compute this for every node.

Author: stephentoub
Assignees: -
Labels:

area-System.Text.RegularExpressions, tenet-performance

Milestone: 7.0.0

@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Dec 6, 2021
@jeffschwMSFT jeffschwMSFT removed the untriaged New issue has not been triaged by the area owner label Jan 11, 2022
@stephentoub stephentoub self-assigned this Jan 12, 2022
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Feb 22, 2022
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Feb 25, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Mar 28, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants