Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent contains-PrefixRange optimization if not preceded by wildcards #10947

Merged
merged 2 commits into from
May 31, 2024

Conversation

gerashegalov
Copy link
Collaborator

@gerashegalov gerashegalov commented May 30, 2024

Prevent '^[0-9]{n}' from being processed as spark_rapids_jni::literal_range_pattern that currently only supports "contains", not "starts with"

Fixes #10928

Also adding missing tailrec annotations to recursive parser methods.

Signed-off-by: Gera Shegalov [email protected]

Prevent '^[0-9]{n}' from being processed as StartsWith optimization

Fixes NVIDIA#10928

Signed-off-by: Gera Shegalov <[email protected]>
@gerashegalov gerashegalov self-assigned this May 30, 2024
@gerashegalov
Copy link
Collaborator Author

build

@gerashegalov gerashegalov added bug Something isn't working reliability Features to improve reliability or bugs that severly impact the reliability of the plugin labels May 30, 2024
thirtiseven
thirtiseven previously approved these changes May 30, 2024
Copy link
Collaborator

@thirtiseven thirtiseven left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, i see. ^[0-9]{n} is not a "prefix range" pattern, but ^.*[0-9]{n} is. And stripLeadingWildcards(astLs) will strip the ^ no matter if it is follows a .*. Patterns like ^.*[0-9]{n} won't be rewritten after the fix, but I think it's fine.

Thanks for fixing this!

@thirtiseven
Copy link
Collaborator

UT failed, can update to:

  test("regex rewrite contains") {
    import RegexOptimizationType._
    val patterns = Seq(".*abc.*", ".*(abc).*", "^.*(abc).*$", "^.*(.*)(abc).*.*", 
        raw".*\w.*\Z", raw".*..*\Z")
-    val excepted = Seq(Contains("abc"), Contains("abc"), NoOptimization, Contains("abc"), 
+    val excepted = Seq(Contains("abc"), Contains("abc"), NoOptimization, NoOptimization, 
        NoOptimization, NoOptimization)
    verifyRewritePattern(patterns, excepted)
  }

  test("regex rewrite prefix range") {
    import RegexOptimizationType._
    val patterns = Seq(
      "(.*)abc[0-9]{1,3}(.*)",
      "(.*)abc[0-9a-z]{1,3}(.*)",
      "(.*)abc[0-9]{2}.*",
      "^abc[0-9]{1,3}",
      "火花急流[\u4e00-\u9fa5]{1}")
    val excepted = Seq(PrefixRange("abc", 1, 48, 57),
      NoOptimization,
      PrefixRange("abc", 2, 48, 57),
-      PrefixRange("abc", 1, 48, 57),
+      NoOptimization,
      PrefixRange("火花急流", 1, 19968, 40869))
    verifyRewritePattern(patterns, excepted)
  }

Signed-off-by: Gera Shegalov <[email protected]>
@gerashegalov gerashegalov changed the title Leave non-literal prefix range regex untouched Prevent contains-PrefixRange optimization if not preceded by wildcards May 30, 2024
@gerashegalov
Copy link
Collaborator Author

Patterns like ^.*[0-9]{n} won't be rewritten after the fix, but I think it's fine.

Thanks for the pointer, I accommodated this case.

@gerashegalov
Copy link
Collaborator Author

build

Copy link
Collaborator

@NVnavkumar NVnavkumar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@thirtiseven
Copy link
Collaborator

thirtiseven commented May 31, 2024

Added a followup item to support PrefIxRange rewrite startsWith ^ in #10741

@gerashegalov
Copy link
Collaborator Author

Added a followup item to support PrefIxRange rewrite startsWith ^ in #10741

@thirtiseven Just a thought because I am not sure whether it is a high priority for our users: it seems like the current kernel is not hard to generalize for multiple ranges in a regex such as prefix[a-zA-Z0-9]

@thirtiseven
Copy link
Collaborator

Added a followup item to support PrefIxRange rewrite startsWith ^ in #10741

@thirtiseven Just a thought because I am not sure whether it is a high priority for our users: it seems like the current kernel is not hard to generalize for multiple ranges in a regex such as prefix[a-zA-Z0-9]

Yes, it's not hard to do, added to the epic issue, thanks! I see some use cases from users, but the priority does not seem very high.

// TODO startsWith with PrefIxRange
RegexOptimizationType.NoOptimization
} else {
matchSimplePattern(astTail)
Copy link
Collaborator Author

@gerashegalov gerashegalov May 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For slightly better efficiency I should probably recurse here with noWildCards but there is no correctness issue. We can fix it in a follow-up PR.

@gerashegalov gerashegalov merged commit 022fdd1 into NVIDIA:branch-24.06 May 31, 2024
44 checks passed
@gerashegalov gerashegalov deleted the 10928-statsWithLiteral branch May 31, 2024 18:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working reliability Features to improve reliability or bugs that severly impact the reliability of the plugin
Projects
None yet
3 participants