You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Handling captures involves additional code, e.g. uncapturing after a failed match attempt. For both RegexOptions.Compiled and the source generator, we output code to both track where we need to uncapture to and to then perform that uncapture if it's possible there may be captures involved. But today the analysis for this is relatively limited: we mark each node in the tree as to whether it contains captures, and that allows us to avoid the boilerplate uncapture code after failures in child nodes. But we don't analyze/track whether any nodes after a given node involve captures, and that's relevant for backtracking, e.g. today for the expression (a*)b*[bc] the codegen for that b* loop will involve the uncapture boilerplate because it'll see that the whole expression contains captures, even though if backtracking occurs due to the [bc] not matching, there is provably nothing that will need to be uncaptured. We should change how we annotate the tree and include this idea of "is a node followed by a capture". This could be done, for example, by walking the tree backwards, tracking whether we've seen a capture, and if we have, marking every future node we see.
We have a similar issue for atomic. Code gen can be much better for a variety of constructs if they're atomic. Today we walk up the tree from a node to see if anything in its parent hierarchy makes it atomic, but worst case that could be an O(N^2) algorithm (as part of construction, not matching). We could instead do a single O(N) pass over the tree to compute this for every node.
The text was updated successfully, but these errors were encountered:
Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.
Issue Details
Handling captures involves additional code, e.g. uncapturing after a failed match attempt. For both RegexOptions.Compiled and the source generator, we output code to both track where we need to uncapture to and to then perform that uncapture if it's possible there may be captures involved. But today the analysis for this is relatively limited: we mark each node in the tree as to whether it contains captures, and that allows us to avoid the boilerplate uncapture code after failures in child nodes. But we don't analyze/track whether any nodes after a given node involve captures, and that's relevant for backtracking, e.g. today for the expression (a*)b*[bc] the codegen for that b* loop will involve the uncapture boilerplate because it'll see that the whole expression contains captures, even though if backtracking occurs due to the [bc] not matching, there is provably nothing that will need to be uncaptured. We should change how we annotate the tree and include this idea of "is a node followed by a capture". This could be done, for example, by walking the tree backwards, tracking whether we've seen a capture, and if we have, marking every future node we see.
We have a similar issue for atomic. Code gen can be much better for a variety of constructs if they're atomic. Today we walk up the tree from a node to see if anything in its parent hierarchy makes it atomic, but worst case that could be an O(N^2) algorithm (as part of construction, not matching). We could instead do a single O(N) pass over the tree to compute this for every node.
Handling captures involves additional code, e.g. uncapturing after a failed match attempt. For both RegexOptions.Compiled and the source generator, we output code to both track where we need to uncapture to and to then perform that uncapture if it's possible there may be captures involved. But today the analysis for this is relatively limited: we mark each node in the tree as to whether it contains captures, and that allows us to avoid the boilerplate uncapture code after failures in child nodes. But we don't analyze/track whether any nodes after a given node involve captures, and that's relevant for backtracking, e.g. today for the expression
(a*)b*[bc]
the codegen for thatb*
loop will involve the uncapture boilerplate because it'll see that the whole expression contains captures, even though if backtracking occurs due to the[bc]
not matching, there is provably nothing that will need to be uncaptured. We should change how we annotate the tree and include this idea of "is a node followed by a capture". This could be done, for example, by walking the tree backwards, tracking whether we've seen a capture, and if we have, marking every future node we see.We have a similar issue for atomic. Code gen can be much better for a variety of constructs if they're atomic. Today we walk up the tree from a node to see if anything in its parent hierarchy makes it atomic, but worst case that could be an O(N^2) algorithm (as part of construction, not matching). We could instead do a single O(N) pass over the tree to compute this for every node.
The text was updated successfully, but these errors were encountered: