-
Notifications
You must be signed in to change notification settings - Fork 242
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Prevent contains-PrefixRange optimization if not preceded by wildcards #10947
Prevent contains-PrefixRange optimization if not preceded by wildcards #10947
Conversation
Prevent '^[0-9]{n}' from being processed as StartsWith optimization Fixes NVIDIA#10928 Signed-off-by: Gera Shegalov <[email protected]>
build |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, i see. ^[0-9]{n}
is not a "prefix range" pattern, but ^.*[0-9]{n}
is. And stripLeadingWildcards(astLs)
will strip the ^
no matter if it is follows a .*
. Patterns like ^.*[0-9]{n}
won't be rewritten after the fix, but I think it's fine.
Thanks for fixing this!
UT failed, can update to: test("regex rewrite contains") {
import RegexOptimizationType._
val patterns = Seq(".*abc.*", ".*(abc).*", "^.*(abc).*$", "^.*(.*)(abc).*.*",
raw".*\w.*\Z", raw".*..*\Z")
- val excepted = Seq(Contains("abc"), Contains("abc"), NoOptimization, Contains("abc"),
+ val excepted = Seq(Contains("abc"), Contains("abc"), NoOptimization, NoOptimization,
NoOptimization, NoOptimization)
verifyRewritePattern(patterns, excepted)
}
test("regex rewrite prefix range") {
import RegexOptimizationType._
val patterns = Seq(
"(.*)abc[0-9]{1,3}(.*)",
"(.*)abc[0-9a-z]{1,3}(.*)",
"(.*)abc[0-9]{2}.*",
"^abc[0-9]{1,3}",
"火花急流[\u4e00-\u9fa5]{1}")
val excepted = Seq(PrefixRange("abc", 1, 48, 57),
NoOptimization,
PrefixRange("abc", 2, 48, 57),
- PrefixRange("abc", 1, 48, 57),
+ NoOptimization,
PrefixRange("火花急流", 1, 19968, 40869))
verifyRewritePattern(patterns, excepted)
} |
Signed-off-by: Gera Shegalov <[email protected]>
Thanks for the pointer, I accommodated this case. |
build |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Added a followup item to support PrefIxRange rewrite startsWith |
@thirtiseven Just a thought because I am not sure whether it is a high priority for our users: it seems like the current kernel is not hard to generalize for multiple ranges in a regex such as |
Yes, it's not hard to do, added to the epic issue, thanks! I see some use cases from users, but the priority does not seem very high. |
// TODO startsWith with PrefIxRange | ||
RegexOptimizationType.NoOptimization | ||
} else { | ||
matchSimplePattern(astTail) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For slightly better efficiency I should probably recurse here with noWildCards
but there is no correctness issue. We can fix it in a follow-up PR.
Prevent '^[0-9]{n}' from being processed as
spark_rapids_jni::literal_range_pattern
that currently only supports "contains", not "starts with"Fixes #10928
Also adding missing tailrec annotations to recursive parser methods.
Signed-off-by: Gera Shegalov [email protected]