Estimate and validate regular expression complexities #6006
Signed-off-by: Anthony Chang <[email protected]>
…timate-regex-complexity Signed-off-by: Anthony Chang <[email protected]>
Co-authored-by: Andy Grove <[email protected]>
This is looking good, but I would like to see some tests demonstrating the costs of different regular expressions.
Can we also add integration tests to simulate the fallback condition?
…g when running integration tests Signed-off-by: Anthony Chang <[email protected]>
@NVnavkumar @andygrove I added some Python tests for both the fallback and no-fallback scenarios, but they rely on knowing the default string size in order to set the target batch size so that the row count is 1. I'm not sure if this is a good approach to testing it, but I can't think of another way without exposing the Do you think these tests are sufficient?
LGTM. I think the basic mechanics are at least tested here. The default value for this will be very high at first anyways, so it should not affect most users, as it is just a fallback mechanism in case something cannot be handled on the GPU.
build
LGTM. Thanks @anthony-chang
ef03297
build
Addresses #4061
Relevant cuDF PR: rapidsai/cudf#10808
This PR attempts a worst-case estimate of the memory footprint of a regular expression operation. We can evaluate a regular expression by converting it into a deterministic finite automaton (DFA) and counting the number of states in that state machine. Since each state is a possible position in the evaluation, in the worst case we need to store all of these states.
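To make the state-counting idea concrete, here is a minimal Python sketch. It is not the estimator in this PR: the node classes (`Char`, `Seq`, `Alt`, `Repeat`) and the cost rules in `count_states` are hypothetical, and it counts the states of an unrolled automaton rather than performing a true DFA construction, which can produce far more states.

```python
from dataclasses import dataclass
from typing import Tuple, Union


# Hypothetical regex AST, invented for this illustration only.
@dataclass(frozen=True)
class Char:
    ch: str


@dataclass(frozen=True)
class Seq:
    parts: Tuple["Node", ...]


@dataclass(frozen=True)
class Alt:
    options: Tuple["Node", ...]


@dataclass(frozen=True)
class Repeat:
    inner: "Node"
    max_count: int  # upper bound of {n,m}; use 1 for *, + and ?


Node = Union[Char, Seq, Alt, Repeat]


def count_states(node: Node) -> int:
    """Assumed cost model: rough upper bound on the number of automaton states."""
    if isinstance(node, Char):
        return 1  # one state per literal character
    if isinstance(node, Seq):
        return sum(count_states(p) for p in node.parts)
    if isinstance(node, Alt):
        return sum(count_states(o) for o in node.options)
    if isinstance(node, Repeat):
        # A bounded repetition is unrolled max_count times.
        return node.max_count * count_states(node.inner)
    raise TypeError(f"unknown node type: {type(node).__name__}")


# a{3}(b|c): three unrolled copies of 'a' plus one state per alternative
pattern = Seq((Repeat(Char("a"), 3), Alt((Char("b"), Char("c")))))
print(count_states(pattern))
```

Under these assumed rules, bounded repetitions dominate the cost, since `x{1000}` contributes a thousand states while `x*` contributes one.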
This estimation is done in the planning stage on the driver node, so we have access only to the regex pattern and not to the column data. We can therefore only roughly estimate the row count from the target batch size and the default string type size.
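The planning-time check described above could be sketched as follows. This is an illustration under stated assumptions, not the plugin's code: the function names, the `bytes_per_state` parameter, and the numbers in the usage lines are hypothetical, and the real limit comes from a plugin configuration that is not shown here.

```python
def estimate_row_count(target_batch_size_bytes: int,
                       default_string_size_bytes: int) -> int:
    """Rough row count: how many default-sized strings fit in one batch."""
    return max(1, target_batch_size_bytes // default_string_size_bytes)


def estimate_regex_memory(num_states: int, row_count: int,
                          bytes_per_state: int = 1) -> int:
    """Worst case: every row keeps every state alive (bytes_per_state is assumed)."""
    return num_states * row_count * bytes_per_state


def should_fall_back(num_states: int,
                     target_batch_size_bytes: int,
                     default_string_size_bytes: int,
                     max_bytes: int) -> bool:
    """Return True when the estimate exceeds the limit, i.e. do not run on the GPU."""
    rows = estimate_row_count(target_batch_size_bytes, default_string_size_bytes)
    return estimate_regex_memory(num_states, rows) > max_bytes


# Illustrative numbers only: 1000-byte batches of 20-byte strings give 50 rows.
print(estimate_row_count(1000, 20))
print(should_fall_back(num_states=5, target_batch_size_bytes=1000,
                       default_string_size_bytes=20, max_bytes=100))
```

Setting the target batch size equal to the default string size makes `estimate_row_count` return 1, which is the situation the integration tests in this PR rely on.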