Regex optimization opportunities #1349
Comments
cc: @pgovind
We use Boyer-Moore or whatever to find the prefix, then start the engine from that point. If the regex had a literal suffix, which was not in an alternation or anything of that sort, could we (a) verify that literal suffix exists and/or (b) avoid greedily consuming any of the suffix? It appears that Rust searches for literal suffixes in some circumstances.
Yes, though it doesn't need to be just a suffix; it could be any character or characters guaranteed to be in any matching string. However, it's not clear to me that's always going to be a net win. The benefit is that we could avoid checking more complicated parts of the pattern if that check failed, but such a check also isn't free. It's easier to justify if there are no variable-length regions before the substring being searched for, as then you can directly index to look for it at the right offset, but at that point you're also validating the pattern, and you might be better off just reorganizing how the Go method implementation works to check for that first, and also doing the work item(s) I've mentioned to decrease the overhead of switching back and forth between FindFirstChar and Go.
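As a rough illustration of the fixed-offset case described above (an assumed helper, not the actual engine code):

```csharp
// If the pattern has no variable-length constructs before a required literal, the
// literal must sit at a fixed offset from the match start, so candidate positions
// can be rejected with a single span comparison before running the full engine.
static bool HasLiteralAtFixedOffset(ReadOnlySpan<char> input, int matchStart,
                                    int fixedOffset, ReadOnlySpan<char> literal)
{
    int literalStart = matchStart + fixedOffset;
    return literalStart + literal.Length <= input.Length &&
           input.Slice(literalStart, literal.Length).SequenceEqual(literal);
}
```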
I'm not sure what you mean by this.
Here are 3 regexes we have seen be high cost in our own services, with your notes, @stephentoub:
We've implemented most of this, some of these aren't worth pursuing, we have existing issues for most of the rest, and I've just opened a few remaining issues. As such, I'm going to close this.
We've made some very measurable improvements to Regex post-.NET Core 3.1 (e.g. #271, #449, #542, #1348), but there's still a lot of opportunity for incremental improvement (we can also consider a more foundational re-implementation if desired). Having spent some time over the holidays looking at the implementation, here's a brain dump of some ideas that can be explored:
Vectorization / faster searching:
- A pattern like `\bsomeword\b` has a literal we could search for, but the moment RegexFCD.Prefix sees that `\b` boundary, it gives up.
- Given `[abc]def`, we could still use Boyer-Moore with "def" and then back up by the offset for the actual matching engine.
- Given `(hi|hello) there`, we'll do a vectorized search for `h`, but given `[\w]+`, we'll iterate through the characters one-by-one doing a bit vector lookup on each character to see whether it's a word character. We might be able to take advantage of some of the newer SSE/AVX intrinsics to do this faster.
- A string like `hello` is matched with a series of generated comparisons, one for each character. In contrast, for repeaters, we code gen an actual loop. We should explore in what situations it might be beneficial to unroll (partially) a repeater as well, or conversely, whether there are cases where we're better off generating a loop for strings rather than unrolling.
- We could compare multiple characters at a time by combining them into a single value, e.g. `'n' << 16 | 'o'`. We can do this with Int64 or even SIMD vectors. (A minimal sketch of the two-character case follows this list.)
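As a minimal sketch of the multi-character comparison idea (the helper name is illustrative, not generated engine code):

```csharp
// Illustrative only: test the two-character literal "no" with a single integer
// comparison by packing both chars into one uint, mirroring the 'n' << 16 | 'o' idea.
static bool StartsWithNo(ReadOnlySpan<char> s) =>
    s.Length >= 2 && ((uint)s[0] << 16 | s[1]) == ((uint)'n' << 16 | 'o');
```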
Tree rewriting:

- Consider an expression like `(?:this|that) is great`. This could be automatically rewritten as `th(?:is|at) is great`, which has several potential benefits: a) it exposes a prefix that the Boyer-Moore matching will see and be able to use in FindFirstChar, b) it reduces the amount of backtracking that'll be necessary (backtracking to after rather than to before the "th"), and c) the string might be able to partake in other optimizations, such as being used to convert earlier loops into atomic ones. (A sketch of extracting such a shared prefix follows this list.)
- Consider `'[^']*'`, which finds a pair of single quotes and everything between them. We'll detect that the `[^']*` can't possibly benefit from backtracking, because it'll never match the character that's required after this loop. However, if we had that exact same loop as part of an alternation, e.g. `('[^']*|something)'`, we won't detect that. Currently the implementation only does two checks: it looks at successor nodes in a concatenation, and it looks at the last node in the regex that has no successors. Obviously there are plenty of other cases where analysis can quickly prove upgrading to an atomic grouping is correct. (Note that other implementations call this auto-possessification, named after a possessive quantifier. .NET doesn't have a possessive quantifier, but a possessive quantifier is just shorthand for an atomic group, which .NET does have.)
- Consider `(Hello|Hi), .*\. How are you\?`. If we feed this a string like `Hello, steve. How are ya?`, it'll match the `Hello,` but fail to match the rest… at that point, it'll then start over and try with `Hi`, but there's no point in doing so, as if `Hello` matched, `Hi` can't possibly. We can detect this statically from the regex pattern, and then if every branch N has a required prefix that's not a substring of every earlier branch's prefix, make the alternation atomic, e.g. wrap it in `(?> … )`.
- If the RegexNode.FinalOptimize step sees that the regex begins with a `.*` star loop (or `.+`), as long as it's not part of an explicit capture or alternation, it can prepend a `^` anchor. That can avoid a ton of work, as FindFirstChar will then return false on a subsequent call rather than bumping by 1 and going again. It might be easier to do this in the parser.
- If an expression begins with `..*` and we combine that to be `.+`, we can then more easily see that the expression begins with a star loop, which lets us automatically anchor it.
- Given `[ab]*[cd]`, we'll see that there's no overlap between what can be matched by the loop and what comes next, and we'll make the loop atomic. We would similarly do that if the expression were `[ab]*[cd]+`, because we see that what comes after the first loop must occur at least once. However, if the expression were `[ab]*[cd]*`, we won't make the first loop atomic, because we see that the second loop has a min repeat count of 0 and thus isn't guaranteed to appear. We could be more aggressive here and look at all possible subsequent nodes. For example, here we would see that there's nothing after the second loop in the expression, and so if the second pattern does appear, we won't overlap with it, and if the second pattern doesn't appear, we're at the end, so either way we can make the loop atomic. Similarly, if the expression were `[ab]*[cd]*[ef]`, we would make both loops atomic; for the first, we'd see that it could be followed by either `[cd]` or `[ef]`, and it doesn't overlap with either.
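A minimal sketch of the prefix-factoring idea, operating on plain branch strings for illustration (the real rewrite would operate on the RegexNode tree, and the helper here is hypothetical):

```csharp
// Compute the longest literal prefix shared by all branches of an alternation.
// For ("this", "that") this returns "th", suggesting the rewrite
// (?:this|that) => th(?:is|at).
static string CommonPrefix(params string[] branches)
{
    if (branches.Length == 0) return string.Empty;

    string prefix = branches[0];
    for (int i = 1; i < branches.Length && prefix.Length > 0; i++)
    {
        string b = branches[i];
        int len = 0;
        while (len < prefix.Length && len < b.Length && prefix[len] == b[len]) len++;
        prefix = prefix.Substring(0, len);
    }
    return prefix;
}
```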
Better handling of specific subexpressions:

- Consider `.*somestring`. It's pretty common to see a star loop followed by a string. Today's implementation is very naïve: it dutifully searches for the `\n`, tries to compare the string there, then backs off by 1, tries to compare the string there, then backs off by 1, tries to compare the string there, and so on. We can instead use an algorithm like Aho-Corasick to search back through the star loop, or even just a vectorized LastIndexOf for the string. (A rough sketch of the LastIndexOf approach follows this list.)
- An expression like `foo(bar|zoo)`, which uses the parens to separate the alternation, is actually also adding a capture (instead of the dev having written `foo(?:bar|zoo)`). Adding capture support to the non-backtracking codegen shouldn't be hard, and it'll allow more cases to switch over to that implementation.
- We could likely handle the `^` anchor much faster when `RegexOptions.Multiline` is used (though it's not clear how common it is to use it).
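A rough sketch of the vectorized LastIndexOf idea (a hypothetical helper, not the engine's code), assuming a non-Singleline `.*` bounded by the next newline and a literal that doesn't itself contain `\n`:

```csharp
// Find the rightmost occurrence of `literal` that a greedy, non-Singleline ".*"
// starting at `start` could back off to, using a single (vectorized) LastIndexOf
// instead of retrying one character at a time.
static int FindDotStarThenLiteral(ReadOnlySpan<char> input, int start, ReadOnlySpan<char> literal)
{
    int newline = input.Slice(start).IndexOf('\n');
    int end = newline >= 0 ? start + newline : input.Length; // '.' can't cross '\n'
    int idx = input[start..end].LastIndexOf(literal);
    return idx >= 0 ? start + idx : -1;                      // offset of the literal, or -1
}
```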
Character class matching:

- Consider a set like `[^"]`. This matches every character except a quote. Today we generate an ASCII bitmap for each set, but then for inputs >= 128, if it's possible a Unicode character could match, we codegen a fallback that calls RegexCharClass.CharInClass. However, with a negation like this one, we know that not only might some Unicode characters match, every Unicode character is going to match. So, instead of generating a fallback that calls RegexCharClass.CharInClass, we can just generate a fallback that returns true. This will not only significantly speed up the handling of Unicode characters, it'll also result in smaller code, which should help overall performance. For example, today we end up generating code along the lines of `if ((uint)char < 128) result = lookup[char]; else result = CharInClass(char);`; for a set like this, we can instead generate something like `result = (uint)char >= 128 || lookup[char];`.
- Consider `\d`. Today we generate a lookup table for ASCII, and then fall back to calling CharInClass (in actuality, `\d` has special handling). We can instead just generate a call to char.GetUnicodeCategory and compare its result to the desired const. We might not even need to generate the fast-path lookup table for this case, as GetUnicodeCategory already has a fast-path lookup table for ASCII, but we'd need to measure. Then consider a set like `\w`: it maps to multiple Unicode categories. Instead of having the fast-path lookup table and then a fallback that calls CharInClass, we could have the fallback be a call to char.GetUnicodeCategory, and codegen the comparison against the known Unicode categories we care about. (A C# rendering of both of these shapes follows this list.)
- Explore using larger lookup tables than 128. For example, would doubling the size to cover all characters < 256 help in enough additional cases to be worthwhile?
- For some sets, e.g. `\s`, we now call out to a helper method, e.g. char.IsWhiteSpace. We should validate those calls are getting inlined.
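For illustration, here's roughly the shape of those checks written as C# (the real compiler emits IL directly; the method names here are made up):

```csharp
using System.Globalization;

// Negated ASCII set like [^"]: every char >= 128 is known to match,
// so the non-ASCII path is just "true" instead of a CharInClass call.
static bool MatchesNotQuote(char c, bool[] asciiLookup /* 128 entries, all true except '"' */)
    => (uint)c >= 128 || asciiLookup[c];

// \d maps to a single Unicode category, so the fallback can be a direct category compare.
static bool MatchesDigit(char c)
    => char.GetUnicodeCategory(c) == UnicodeCategory.DecimalDigitNumber;
```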
General overhead:

- A pattern like `hello` will first validate the match in FindFirstChar and then redo the match in Go, even though we know it already matched. In such a case, we should be able to effectively make Go a nop.
- The runner allocates `int[]`s for backtracking-related state, but if we end up employing the non-backtracking code generator, these are wasted. We can avoid creating them at all if they won't be used. While the runner gets reused for serial use, if a Regex instance is used concurrently (which happens implicitly when using the static Regex methods), threads that don't get the shared runner will end up creating their own temporary instance, and each of those will end up incurring these unnecessary allocations.
- Better bounds-check elimination. We don't currently generate code in a way that will optimally eliminate bounds checks, e.g. we're not eliminating the bounds checks on the character class strings we output for fast lookups. Since we're doing code generation and spitting out IL directly, we could choose to just use IL-based opcodes to manipulate refs directly and avoid bounds checking altogether. (I tried eliminating all bounds checking and it made very little difference.)
- A pattern like `foo[abc]{2,4}` has a minimum length of 5. FindFirstChar can check upfront to see whether the input string is at least that long, and if it's not, immediately fail. (A sketch of this check follows this list.)
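A minimal sketch of the minimum-length pre-check (the method name is hypothetical; in practice this would just be a prologue emitted into FindFirstChar):

```csharp
// foo[abc]{2,4} has a computed minimum match length of 3 + 2 = 5; FindFirstChar can
// reject any starting position with fewer than that many characters remaining.
static bool CouldPossiblyMatch(string input, int startPos, int minRequiredLength = 5)
    => input.Length - startPos >= minRequiredLength;
```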
Fundamental engine changes / replacements:

- Even when `RegexOptions.Compiled` isn't specified, we could consider tracking usage of the `Regex` instance, and when it hits a certain threshold, asynchronously upgrade it to being compiled. We'd queue the compilation of the expression, and then backpatch the instance with the upgraded runner, such that all future invocations would use the compiled version instead. The upside to this is faster execution when `RegexOptions.Compiled` should have been specified but wasn't. There are a few potential downsides: we'd need to always carry around the compiler in a trimmed app, even if the compiler wouldn't otherwise be used; we'd potentially waste time/space compiling a regex that wasn't actually important; and we'd risk subtle discrepancies that show up over time if we ever have a bug that results in different execution semantics between non-compiled and compiled. (A very rough sketch of this idea appears at the end of this post.)
- PCRE light-up. Just as we analyze the RegexNode tree to see whether we can use our non-backtracking implementation, we could analyze it to see whether we could use PCRE, and do so if it's dynamically found on the machine. The .NET APIs support features PCRE doesn't, so it would only work if a subset of functionality was required.

cc: @danmosemsft, @eerhardt, @ViktorHofer, @davidwrighton
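To make the upgrade-after-N-uses idea concrete, here's a very rough sketch. Everything in it (the wrapper type, the threshold, compiling a second Regex instance) is hypothetical; an actual implementation would live inside Regex itself and swap the runner factory rather than wrap another instance:

```csharp
using System.Text.RegularExpressions;
using System.Threading;
using System.Threading.Tasks;

public sealed class TieredRegex
{
    private const int CompileThreshold = 100; // assumed threshold, purely illustrative
    private readonly string _pattern;
    private volatile Regex _regex;            // interpreted at first, backpatched later
    private int _uses;

    public TieredRegex(string pattern)
    {
        _pattern = pattern;
        _regex = new Regex(pattern);
    }

    public bool IsMatch(string input)
    {
        if (Interlocked.Increment(ref _uses) == CompileThreshold)
        {
            // Queue compilation off-thread and backpatch when it completes;
            // subsequent calls then pick up the compiled version.
            Task.Run(() => _regex = new Regex(_pattern, RegexOptions.Compiled));
        }
        return _regex.IsMatch(input);
    }
}
```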