Remove Or, And, Not + related fixes #69839

Merged
merged 7 commits into from
May 28, 2022

Conversation

olsaarik
Contributor

This PR includes cleanup of unused nodes and related fixes. The SymbolicRegexNodeKind elements Or, And and Not as well as any related classes/functions have been removed. This includes SymbolicRegexSet, TransitionRegex and SymbolicNFA.

OrderedOr has been renamed to Alternate, to match the name of the parse tree node. Their semantics are the same.
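A small illustration of what ordered alternation means, assuming (as the old name OrderedOr suggests) that branch priority follows backtracking semantics. It uses Python's backtracking `re` module only as a stand-in; `first_match` is a hypothetical helper, not part of this PR.

```python
import re

def first_match(pattern: str, text: str) -> str:
    """Return what a backtracking engine matches at the start of text."""
    m = re.match(pattern, text)
    return m.group(0) if m else ""

# With ordered alternation the first branch that can match wins,
# even when a later branch would produce a longer match.
print(first_match(r"a|ab", "ab"))  # a
print(first_match(r"ab|a", "ab"))  # ab
```

An unordered Or would treat both patterns as the same language; an ordered Alternate has to remember which branch comes first.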

SaveDGML, which depended on Or, has been reworked by pulling out the exploration logic into SymbolicRegexMatcher.Explore, which does the exploration in the main transition structures of the matcher. This means that the exploration can also be used to pre-calculate all the derivatives instead of doing it on the fly, which might be useful for future optimizations (e.g. combining Compiled and NonBacktracking).
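To make the idea of pre-calculating all derivatives concrete, here is a minimal sketch using textbook Brzozowski derivatives over a toy regex AST. It is not the code in SymbolicRegexMatcher.Explore; the node encoding, `deriv`, `explore`, and the `max_states` budget are all assumptions made for illustration.

```python
from collections import deque

NOTHING, EPS = ("nothing",), ("eps",)

def char(c): return ("chr", c)
def star(a): return ("star", a)

def cat(a, b):
    # Simplify on construction so equal states compare equal.
    if NOTHING in (a, b): return NOTHING
    if a == EPS: return b
    if b == EPS: return a
    return ("cat", a, b)

def alt(a, b):
    if a == NOTHING: return b
    if b == NOTHING or a == b: return a
    return ("alt", a, b)

def nullable(r):
    kind = r[0]
    if kind in ("eps", "star"): return True
    if kind == "cat": return nullable(r[1]) and nullable(r[2])
    if kind == "alt": return nullable(r[1]) or nullable(r[2])
    return False  # "nothing" and "chr"

def deriv(r, c):
    kind = r[0]
    if kind == "chr": return EPS if r[1] == c else NOTHING
    if kind == "cat":
        rest = deriv(r[2], c) if nullable(r[1]) else NOTHING
        return alt(cat(deriv(r[1], c), r[2]), rest)
    if kind == "alt": return alt(deriv(r[1], c), deriv(r[2], c))
    if kind == "star": return cat(deriv(r[1], c), r)
    return NOTHING  # "nothing" and "eps"

def explore(start, alphabet, max_states=1000):
    """Breadth-first exploration: compute every reachable derivative up front."""
    transitions, queue = {start: {}}, deque([start])
    while queue:
        state = queue.popleft()
        for c in alphabet:
            target = deriv(state, c)
            transitions[state][c] = target
            if target not in transitions:
                if len(transitions) >= max_states:
                    raise RuntimeError("state budget exceeded")
                transitions[target] = {}
                queue.append(target)
    return transitions

# (a|b)*ab over the alphabet {a, b}
pattern = cat(star(alt(char("a"), char("b"))), cat(char("a"), char("b")))
table = explore(pattern, "ab")

def accepts(text):
    state = pattern
    for c in text:
        state = table[state][c]  # pure table lookup, no derivative computed here
    return nullable(state)

print(len(table))       # 3
print(accepts("abab"))  # True
```

Once `table` is built, matching only performs lookups, which is the property that could make the exploration useful for ahead-of-time construction. The state budget is there because this toy simplification does not guarantee a finite state set for every pattern.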

GenerateRandomMembers is now SampleMatches and has been rewritten to use the transition logic from SymbolicRegexMatcher. Negative sampling has been removed because Not is no longer supported.
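As a rough picture of sampling matches through transition logic, the sketch below random-walks a hand-written DFA for `(a|b)*ab` and emits the characters it follows. The table, `sample_match`, and the 50% stop probability are hypothetical choices for illustration, not what SampleMatches does internally.

```python
import random
import re

# Hypothetical DFA for (a|b)*ab: state -> {char: next state}; state 2 accepts.
TRANSITIONS = {
    0: {"a": 1, "b": 0},
    1: {"a": 1, "b": 2},
    2: {"a": 1, "b": 0},
}
ACCEPTING = {2}

def sample_match(rng, max_len=20):
    """Random-walk the transitions; may stop whenever the state accepts."""
    state, out = 0, []
    while len(out) < max_len:
        if state in ACCEPTING and rng.random() < 0.5:
            return "".join(out)
        ch = rng.choice(sorted(TRANSITIONS[state]))
        out.append(ch)
        state = TRANSITIONS[state][ch]
    # Give up (None) if the length budget ran out in a non-accepting state.
    return "".join(out) if state in ACCEPTING else None

rng = random.Random(0)
samples = [s for s in (sample_match(rng) for _ in range(50)) if s is not None]
# Every sample is a member of the language by construction.
assert all(re.fullmatch(r"(a|b)*ab", s) for s in samples)
```

Sampling non-members the same way would need transitions for the complement of the pattern, which fits the note that negative sampling went away together with Not.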

SymbolicRegexMatcher is now a partial class and SaveDGML, Explore, and SampleMatches are implemented in separate files as additions to the matcher.

FixedLengthMarker support had several problems that this PR fixes:

  • Logic to extract the fixed length out of an end state only supported the now-removed Or. The main logic is now in SymbolicRegexNode.ResolveFixedLength. The new function finds the fixed length marker on the path along which the backtracking matcher would accept a match, and it also supports conditional nullability. For example, a pattern like abc|$(4)|(3) correctly resolves to either 4 or 3 depending on the context.
  • Logic to add fixed length markers in RegexNodeConverter didn't support the structures actually present now that we have capture start and end markers. The logic is now in a separate transformation function SymbolicRegexNode.AddFixedLengthMarkers, which made it much easier to handle the various cases. It does introduce some additional allocations, as parts of the tree may be rebuilt as markers are added.
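The abc|$(4)|(3) example can be sketched as follows, reading `(n)` as a fixed length marker with value n and reducing "context" to a single boolean `at_end`. This is an interpretation for illustration only: the node encoding and `resolve_fixed_length` are hypothetical and are not the SymbolicRegexNode.ResolveFixedLength implementation.

```python
def nullable(node, at_end):
    """Can this node match the empty string in the given context?"""
    kind = node[0]
    if kind == "marker":
        return True
    if kind == "end":            # the $ anchor: nullable only at end of input
        return at_end
    if kind == "cat":
        return all(nullable(n, at_end) for n in node[1:])
    if kind == "alt":
        return any(nullable(n, at_end) for n in node[1:])
    return False                 # "chr" always consumes a character

def resolve_fixed_length(node, at_end):
    """Return the marker on the highest-priority nullable path, or None."""
    kind = node[0]
    if kind == "marker":
        return node[1]
    if kind == "cat":
        if not nullable(node, at_end):
            return None
        for child in node[1:]:
            length = resolve_fixed_length(child, at_end)
            if length is not None:
                return length
        return None
    if kind == "alt":
        # Ordered alternation: the first nullable branch is the one a
        # backtracking matcher would accept through.
        for branch in node[1:]:
            if nullable(branch, at_end):
                return resolve_fixed_length(branch, at_end)
    return None

# abc|$(4)|(3)
pattern = (
    "alt",
    ("cat", ("chr", "a"), ("chr", "b"), ("chr", "c")),
    ("cat", ("end",), ("marker", 4)),
    ("marker", 3),
)

print(resolve_fixed_length(pattern, at_end=True))   # 4
print(resolve_fixed_length(pattern, at_end=False))  # 3
```

At end of input the `$(4)` branch is nullable and has priority, so 4 is chosen; anywhere else that branch is not nullable and the walk falls through to `(3)`.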

@ghost

ghost commented May 26, 2022

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Author: olsaarik
Assignees: olsaarik
Labels:

area-System.Text.RegularExpressions

Milestone: -

@olsaarik olsaarik marked this pull request as ready for review May 26, 2022 02:57
@stephentoub
Member

This is very pretty:
image

@@ -29,23 +29,11 @@ internal DfaMatchingState(SymbolicRegexNode<TSet> node, uint prevCharKind)
internal bool IsDeadend => Node.IsNothing;

/// <summary>The node must be nullable here</summary>
-internal int FixedLength
+internal int FixedLength(uint nextCharKind)
Member


It's great that you were able to fix the fixed-length markers. Do we know if this was contributing to some of the perf slowdowns that had been measured?

Contributor Author


Likely, but I don't know for sure yet. I'll measure against current main after I get this merged (to unblock Margus).

Member

@stephentoub stephentoub left a comment


Thanks!

@dotnet dotnet deleted a comment from azure-pipelines bot May 26, 2022
@dotnet dotnet deleted a comment from azure-pipelines bot May 26, 2022
olsaarik added 4 commits May 26, 2022 18:08
Reworked DGML support and sampling to work without the old Or.
The exploration step that was in the DGML code has been pulled out now.
With Not gone the sampler has some direct support for negative sampling,
but it's still buggy and complicated to do properly. Will likely remove.
Removed TransitionRegex, which needed And and Not.
Now the fixed length logic supports conditional nullability too.
@ghost ghost locked as resolved and limited conversation to collaborators Jul 12, 2022
@mrsharm
Member

mrsharm commented Aug 10, 2022

@olsaarik - we found the following regressions that seem to line up with this PR, as per the issues referenced by @AndyAyersMS. We detected them in our analysis while creating the perf report for August. Would you consider these regressions to be "by design"?

We did notice an improvement; however, we are still not at the same level as before this change:

image

CC: @dakersnar

EDIT: Removed the benchmarks involving the Compiled parameter such as System.Text.RegularExpressions.Tests.Perf_Regex_Common.CtorInvoke(Options: IgnoreCase, Compiled) as these seem to have regressed on 5/25 and are unrelated to this PR:

Windows 10 x64 - System.Text.RegularExpressions.Tests.Perf_Regex_Common.CtorInvoke(Options: IgnoreCase, Compiled)
image

System.Text.RegularExpressions.Tests.Perf_Regex_Industry_BoostDocs_Simple.IsMatch(Id: 11, Options: NonBacktracking)

| Result | Ratio | Alloc Delta | Operating System | Bit | Processor Name |
| --- | --- | --- | --- | --- | --- |
| Slower | 0.88 | +0 | Windows 11 | Arm64 | Microsoft SQ1 3.0 GHz |
| Same | 0.92 | +0 | Windows 11 | Arm64 | Microsoft SQ1 3.0 GHz |
| Slower | 0.81 | +0 | macOS Monterey 12.3 | Arm64 | Apple M1 Max |
| Same | 0.91 | +0 | Windows 10 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz |
| Slower | 0.87 | +0 | Windows 10 | X64 | Intel Core i7-6700 CPU 3.40GHz (Skylake) |
| Slower | 0.86 | +0 | Windows 10 | X64 | Intel Core i7-6700 CPU 3.40GHz (Skylake) |
| Slower | 0.88 | +0 | Windows 10 | X64 | Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R) |
| Slower | 0.86 | +0 | Windows 10 | X64 | Intel Core i9-10900K CPU 3.70GHz |
| Slower | 0.84 | +0 | Windows 11 | X64 | AMD Ryzen Threadripper PRO 3945WX 12-Cores |
| Slower | 0.85 | +0 | Windows 11 | X64 | AMD Ryzen 9 3950X |
| Same | 0.93 | +0 | Windows 11 | X64 | AMD Ryzen 9 5900X |
| Same | 0.94 | +0 | Windows 11 | X64 | AMD Ryzen 9 5950X |
| Same | 0.89 | +0 | Windows 11 | X64 | Intel Core i7-8700 CPU 3.20GHz (Coffee Lake) |
| Slower | 0.87 | +0 | Windows 11 | X64 | Intel Core i9-10900K CPU 3.70GHz |
| Slower | 0.79 | +0 | Windows 11 | X64 | 11th Gen Intel Core i9-11900H 2.50GHz |
| Slower | 0.74 | +0 | ubuntu 18.04 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz |
| Slower | 0.88 | +0 | ubuntu 18.04 | X64 | Intel Core i7-2720QM CPU 2.20GHz (Sandy Bridge) |
| Slower | 0.62 | +0 | ubuntu 18.04 | X64 | Intel Core i7-8700 CPU 3.20GHz (Coffee Lake) |
| Slower | 0.62 | +0 | ubuntu 20.04 | X64 | AMD Ryzen 9 5900X |
| Slower | 0.78 | +0 | ubuntu 20.04 | X64 | Intel Core i9-10900K CPU 3.70GHz |
| Slower | 0.85 | +0 | Windows 10 | X86 | Intel Xeon CPU E5-1650 v4 3.60GHz |
| Slower | 0.78 | +0 | Windows 10 | X86 | Intel Core i7-6700 CPU 3.40GHz (Skylake) |
| Slower | 0.81 | +0 | Windows 11 | X86 | AMD Ryzen Threadripper PRO 3945WX 12-Cores |
| Slower | 0.78 | +0 | macOS Big Sur 11.6.8 | X64 | Intel Core i5-4278U CPU 2.60GHz (Haswell) |
| Slower | 0.81 | +0 | macOS Monterey 12.3.1 | X64 | Intel Core i7-5557U CPU 3.10GHz (Broadwell) |
| Slower | 0.76 | +0 | macOS Monterey 12.4 | X64 | Intel Core i5-4278U CPU 2.60GHz (Haswell) |

System.Text.RegularExpressions.Tests.Perf_Regex_Industry_BoostDocs_Simple.IsMatch(Id: 0, Options: NonBacktracking)

| Result | Ratio | Alloc Delta | Operating System | Bit | Processor Name |
| --- | --- | --- | --- | --- | --- |
| Same | 0.93 | +0 | Windows 11 | Arm64 | Microsoft SQ1 3.0 GHz |
| Same | 0.93 | +0 | Windows 11 | Arm64 | Microsoft SQ1 3.0 GHz |
| Same | 0.94 | +0 | macOS Monterey 12.3 | Arm64 | Apple M1 Max |
| Same | 0.92 | +0 | Windows 10 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz |
| Slower | 0.90 | +0 | Windows 10 | X64 | Intel Core i7-6700 CPU 3.40GHz (Skylake) |
| Same | 0.96 | +0 | Windows 10 | X64 | Intel Core i7-6700 CPU 3.40GHz (Skylake) |
| Same | 0.94 | +0 | Windows 10 | X64 | Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R) |
| Same | 0.94 | +0 | Windows 10 | X64 | Intel Core i9-10900K CPU 3.70GHz |
| Same | 0.91 | +0 | Windows 11 | X64 | AMD Ryzen Threadripper PRO 3945WX 12-Cores |
| Slower | 0.88 | +0 | Windows 11 | X64 | AMD Ryzen 9 3950X |
| Same | 0.98 | +0 | Windows 11 | X64 | AMD Ryzen 9 5900X |
| Same | 1.02 | +0 | Windows 11 | X64 | AMD Ryzen 9 5950X |
| Same | 0.96 | +0 | Windows 11 | X64 | Intel Core i7-8700 CPU 3.20GHz (Coffee Lake) |
| Same | 0.95 | +0 | Windows 11 | X64 | Intel Core i9-10900K CPU 3.70GHz |
| Same | 0.91 | +0 | Windows 11 | X64 | 11th Gen Intel Core i9-11900H 2.50GHz |
| Same | 0.91 | +0 | ubuntu 18.04 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz |
| Same | 0.97 | +0 | ubuntu 18.04 | X64 | Intel Core i7-2720QM CPU 2.20GHz (Sandy Bridge) |
| Slower | 0.76 | +0 | ubuntu 18.04 | X64 | Intel Core i7-8700 CPU 3.20GHz (Coffee Lake) |
| Slower | 0.82 | +0 | ubuntu 20.04 | X64 | AMD Ryzen 9 5900X |
| Same | 0.90 | +0 | ubuntu 20.04 | X64 | Intel Core i9-10900K CPU 3.70GHz |
| Same | 0.89 | +0 | Windows 10 | X86 | Intel Xeon CPU E5-1650 v4 3.60GHz |
| Same | 0.91 | +0 | Windows 10 | X86 | Intel Core i7-6700 CPU 3.40GHz (Skylake) |
| Slower | 0.85 | +0 | Windows 11 | X86 | AMD Ryzen Threadripper PRO 3945WX 12-Cores |
| Slower | 0.85 | +0 | macOS Big Sur 11.6.8 | X64 | Intel Core i5-4278U CPU 2.60GHz (Haswell) |
| Slower | 0.89 | +0 | macOS Monterey 12.3.1 | X64 | Intel Core i7-5557U CPU 3.10GHz (Broadwell) |
| Same | 0.91 | +0 | macOS Monterey 12.4 | X64 | Intel Core i5-4278U CPU 2.60GHz (Haswell) |

@stephentoub
Member

(While this could have affected the two NonBacktracking tests, this would not have affected the Compiled test.)
