-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Finish "simplified" regex codegen strategy #61902
Comments
Tagging subscribers to this area: @eerhardt, @dotnet/area-system-text-regularexpressions Issue DetailsPrior to .NET 5, both the RegexInterpreter and the RegexCompiler worked on the same strategy: RegexWriter would transform the RegexNode tree into a series of opcodes and operands, which would then be processed. RegexInterpreter would process them at match time, and RegexCompiler would emit equivalent code for them, except with various constructs hardcoded, evaluating at compile-time what would otherwise be evaluated at match-time. This included various optimizations, like unrolling loops. However, it also a) left a non-trivial amount of optimization opportunity on the table, and b) results in difficult to follow code; the crux of the issue is RegexCompiler still following the same plan with many of the same limitations as RegexInterpreter, just slightly optimized by avoiding various kinds of branches at match time. This is all the more impactful with the source generator added in .NET 7, which spits out C# code intended to be human-readable/debuggable/etc. To help with this, in .NET 5 we started incrementally added a new code generation strategy for the matching engine. Rather than being based on the opcodes output by RegexWriter, it instead outputs code based on walking the original RegexNode tree, resulting in baking the structure of the processing into the code rather than relying on the structure being pushed and popped onto the runstack/runtrack arrays used by RegexInterpreter (and RegexCompiler prior). While it was originally added to aid in performance, its impact is an order of magnitude larger for readability for the source generator in .NET 7, as it results in code much closer to that which a dev would actually write if they were manually implementing the same functionality. However, we only provided a subset of support in .NET 5. Based on our corpus of "real-world" regexes, the .NET 5 implementation was able to support ~40% of regexes, which those using the new strategy, and the remainder falling back to the old strategy. We've been picking away at this, and as of today, ~70% of regexes are now covered. There are only a few remaining categories of work remaining to address the remaining gap:
We're on track to have all but the last three (RightToLeft and lookbehinds) implemented for .NET 7. Today, lookbehinds are implemented on top of RegexOptions.RightToLeft support, so it's difficult to properly implement lookbehinds without RightToLeft, and RightToLeft is a large undertaking. However, in our corpus of "real-world" regexes, only 0.6% have RegexOptions.RightToLeft defined at the top-level, and only 3.3% involve RegexOptions.RightToLeft anywhere (meaning they use either RightToLeft or lookbehinds, which are based on RightToLeft), and only 1.4% involve RegexOptions.RightToLeft and RegexOptions.Compiled. So, it would be reasonable in the name of maintainability to say that we only support RightToLeft and lookbehinds in the interpreter, such that if you use either of those, even if you specify RegexOptions.Compiled (i.e. you're in that 1.4% of expressions), you'll get the interpreter, and if you use the source generator, we'd spit out the fallback path that just emits a cached
|
As discussed offline, I support this strategy. Although a quite small subset of patterns will likely see a performance impact, the effect on maintenance costs will be very substantial, and will likely open up other opportunities for optimization that aren't feasible today. |
cc @joperezr for his take. |
@stephentoub briefly mentioned RightToLeft case when we initially chatted, I agree and think it makes sense to have the limitation of only supporting it using Interpreted for some cases when on RightToLeft. |
Prior to .NET 5, both the RegexInterpreter and the RegexCompiler worked on the same strategy: RegexWriter would transform the RegexNode tree into a series of opcodes and operands, which would then be processed. RegexInterpreter would process them at match time, and RegexCompiler would emit equivalent code for them, except with various constructs hardcoded, evaluating at compile-time what would otherwise be evaluated at match-time. This included various optimizations, like unrolling loops. However, it also a) left a non-trivial amount of optimization opportunity on the table, and b) results in difficult to follow code; the crux of the issue is RegexCompiler still following the same plan with many of the same limitations as RegexInterpreter, just slightly optimized by avoiding various kinds of branches at match time. This is all the more impactful with the source generator added in .NET 7, which spits out C# code intended to be human-readable/debuggable/etc.
To help with this, in .NET 5 we started incrementally added a new code generation strategy for the matching engine. Rather than being based on the opcodes output by RegexWriter, it instead outputs code based on walking the original RegexNode tree, resulting in baking the structure of the processing into the code rather than relying on the structure being pushed and popped onto the runstack/runtrack arrays used by RegexInterpreter (and RegexCompiler prior). While it was originally added to aid in performance, its impact is an order of magnitude larger for readability for the source generator in .NET 7, as it results in code much closer to that which a dev would actually write if they were manually implementing the same functionality.
However, we only provided a subset of support in .NET 5. Based on our corpus of "real-world" regexes, the .NET 5 implementation was able to support ~40% of regexes, which those using the new strategy, and the remainder falling back to the old strategy. We've been picking away at this, and as of today, ~70% of regexes are now covered. There are only a few remaining categories of work remaining to address the remaining gap:
We're on track to have all but the last two (RightToLeft and lookbehinds) implemented for .NET 7. Today, lookbehinds are implemented on top of RegexOptions.RightToLeft support, so it's difficult to properly implement lookbehinds without RightToLeft, and RightToLeft is a large undertaking.
However, in our corpus of "real-world" regexes, only 0.6% have RegexOptions.RightToLeft defined at the top-level, and only 3.3% involve RegexOptions.RightToLeft anywhere (meaning they use either RightToLeft or lookbehinds, which are based on RightToLeft), and only 1.4% involve RegexOptions.RightToLeft and RegexOptions.Compiled. So, it would be reasonable in the name of maintainability to say that we only support RightToLeft and lookbehinds in the interpreter, such that if you use either of those, even if you specify RegexOptions.Compiled (i.e. you're in that 1.4% of expressions), you'll get the interpreter, and if you use the source generator, we'd spit out the fallback path that just emits a cached
new Regex(...)
instance. This might represent a perf regression for that very small subset of use cases. However, a big benefit would be not having to fully implement RightToLeft in the new codegen strategy, getting to delete several thousand lines of difficult code from RegexCompiler and several thousand lines of difficult code from the source generator, and we could also stop running RegexWriter at all for RegexCompiler and the source generator. And we could mark various protected members of RegexRunner as obsolete and consider removing them in the future.The text was updated successfully, but these errors were encountered: