-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Overhaul Regex's handling of RegexOptions.IgnoreCase #61048
Comments
Tagging subscribers to this area: @eerhardt, @dotnet/area-system-text-regularexpressions Issue DetailsThis is related to and impacts / is impacted by #59492, but they are distinct. Today, Regex implements RegexOptions.IgnoreCase by lowercasing everything in the pattern according to the culture at the time of construction, and then at the time of match it lowercases everything in the input lazily as part of evaluating whether something matches. There are multiple issues with this, including the fact that if the cultures differ between construction and match time, this might lead to incorrect results due to differences in how things case. This leads to problems like #60753 and #36147, and it indirectly leads to complications like #36149. Those should all be addressed by properly fixing #59492. This also runs counter to the general argument that if you need to lower or upper case to determine equality, you should upper case. Beyond the functional issues, however, it also leads to non-trivial performance problems. First, it means we need to call ToLower{Invariant} for every character of the input, and potentially multiple times if there's any backtracking. But worse, any time we come across a case-insensitive character, it knocks us off our fast paths, for a variety of things. For example, we can no longer use vectorized IndexOf{Any} to search for a character, and instead need to walk character-by-character calling ToLower on each to then compare against the lowercase value. If we could instead just have a set of all of the characters that should be treated as equivalent from an ignore-casing perspective, then the rest of our optimizations based on sets should kick in. Let's say the pattern begins with 'a' and is IgnoreCase. Today we end up walking each character, calling ToLower on each, and comparing it to 'a'. If we fix this, we could instead use IndexOfAny('A', 'a'). Or even in situations where we're forced to compare character by character, we could still avoid having to call ToLower, as the set will already contain all valid. We should consider overhauling the scheme employed:
Note that the new NonBacktracking engine does something similar, which a) means we have duplicated logic with distinct approaches, and b) we have differences in behavior between the engines, in particular around handling of Turkish I. This does mean we'll need some mechanism to determine, for any given culture, what characters should be considered equivalent under IgnoreCase. In the extreme, you could imagine that for every character we upper case it, and everything that uppercased to the same value is considered to be part of the same IgnoreCase equivalence class... but that's something we'd ideally not do at run time, given the obvious overheads. We also need to consider how to efficiently handle ranges. Again, the NonBacktracking implementation has some prior art here, but it also has some things we need to address, e.g. #60753 (comment). cc: @tarekgh, @GrabYourPitchforks, @veanes, @olsaarik, @danmoseley
|
@tarekgh, @GrabYourPitchforks, @joperezr, and I spent some time talking about this and #59492. General consensus was:
|
One wrinkle we'll need to sort out here are backreferences. I believe it's the one case where we need to compare input text not against the pattern but against other input text. You can have a pattern like:
which means "match any word character and then match that same character again". And if this were with IgnoreCase, then "that same character again" can be any of the cased variants of that original character, e.g. "Aa" would match successfully. I believe this just means we'll need to consult our case folding table at match time as well, just in the case of IgnoreCase backreferences, which also means for the source generator we'll need to emit the casing table if the regex contains an IgnoreCase backreference. We might be able to be a bit smarter about it, e.g. looking at what the referenced capture contains and seeing whether it could match anything that might possibly vary in case (if the only things the capture could match are categories that participate in case conversion). |
Great catch, yeah we’ll make sure we cover that case |
This is related to and impacts / is impacted by #59492, but they are distinct.
Today, Regex implements RegexOptions.IgnoreCase by lowercasing everything in the pattern according to the culture at the time of construction, and then at the time of match it lowercases everything in the input lazily as part of evaluating whether something matches. There are multiple issues with this, including the fact that if the cultures differ between construction and match time, this might lead to incorrect results due to differences in how things case. This leads to problems like #60753 and #36147, and it indirectly leads to complications like #36149. Those should all be addressed by properly fixing #59492. This also runs counter to the general argument that if you need to lower or upper case to determine equality, you should upper case.
Beyond the functional issues, however, it also leads to non-trivial performance problems. First, it means we need to call ToLower{Invariant} for every character of the input, and potentially multiple times if there's any backtracking. But worse, any time we come across a case-insensitive character, it knocks us off our fast paths, for a variety of things. For example, we can no longer use vectorized IndexOf{Any} to search for a character, and instead need to walk character-by-character calling ToLower on each to then compare against the lowercase value. If we could instead just have a set of all of the characters that should be treated as equivalent from an ignore-casing perspective, then the rest of our optimizations based on sets should kick in. Let's say the pattern begins with 'a' and is IgnoreCase. Today we end up walking each character, calling ToLower on each, and comparing it to 'a'. If we fix this, we could instead use IndexOfAny('A', 'a'). Or even in situations where we're forced to compare character by character, we could still avoid having to call ToLower, as the set will already contain all valid, which also means we can avoid having to query CultureInfo.CurrentCulture.
We should consider overhauling the scheme employed:
Note that the new NonBacktracking engine does something similar, which a) means we have duplicated logic with distinct approaches, and b) we have differences in behavior between the engines, in particular around handling of Turkish I.
This does mean we'll need some mechanism to determine, for any given culture, what characters should be considered equivalent under IgnoreCase. In the extreme, you could imagine that for every character we upper case it, and everything that uppercased to the same value is considered to be part of the same IgnoreCase equivalence class... but that's something we'd ideally not do at run time, given the obvious overheads. We also need to consider how to efficiently handle ranges. Again, the NonBacktracking implementation has some prior art here, but it also has some things we need to address, e.g. #60753 (comment).
cc: @tarekgh, @GrabYourPitchforks, @veanes, @olsaarik, @danmoseley
The text was updated successfully, but these errors were encountered: