Estimate and validate regular expression complexities #6006

anthony-chang · 2022-07-15T19:11:19Z

Addresses #4061
Relevant cuDF PR: rapidsai/cudf#10808

This PR makes a worst-case estimate of the memory footprint of a regular expression operation. A regular expression can be evaluated by converting it into a deterministic finite automaton (DFA), so we count the number of states in that state machine. Since each state is a possibility during evaluation, in the worst case we need to store all of these states.
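As a rough illustration of the state-counting idea (not the code in this PR), the sketch below walks a tiny, hypothetical regex AST and sums one state per literal character. The node classes `Char`, `Sequence`, `Choice`, and `Repeat` and the function `count_states` are assumptions made up for this example, and the count is only a coarse proxy for the size of the state machine.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical AST node types, invented for this sketch.
@dataclass
class Char:
    value: str


@dataclass
class Sequence:
    parts: list


@dataclass
class Choice:
    left: object
    right: object


@dataclass
class Repeat:
    term: object
    min_count: int
    max_count: Optional[int]  # None means unbounded, e.g. `a{2,}`


def count_states(node) -> int:
    """Return a coarse worst-case state count for a pattern AST."""
    if isinstance(node, Char):
        return 1
    if isinstance(node, Sequence):
        return sum(count_states(part) for part in node.parts)
    if isinstance(node, Choice):
        # Either branch may be taken, so both contribute states.
        return count_states(node.left) + count_states(node.right)
    if isinstance(node, Repeat):
        # Assumption: a bounded repetition is unrolled max_count times;
        # an unbounded one is counted min_count times (at least once).
        if node.max_count is not None:
            copies = node.max_count
        else:
            copies = max(node.min_count, 1)
        return copies * count_states(node.term)
    raise TypeError(f"unsupported node: {node!r}")


# a{3}(b|c): three copies of `a` plus one state per alternative.
pattern = Sequence([Repeat(Char("a"), 3, 3), Choice(Char("b"), Char("c"))])
total = count_states(pattern)
```

Under these assumptions, bounded repetitions dominate the count, which is why a pattern such as `a{1000}` looks far more expensive than `a+`.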

This estimation is done in the planning stage on the driver node, where we only have access to the regex pattern and not the column data. We can therefore only make a rough estimate of the row count, using the target batch size and the default string type size.
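A minimal sketch of how such a planning-time estimate could be put together is shown below. It is not the formula used in this PR: the function names, the `states * rows` product, and the default of 20 bytes per string (Spark's `StringType` default size) are assumptions for illustration.

```python
def estimate_row_count(target_batch_size_bytes: int,
                       default_string_size: int = 20) -> int:
    """Guess how many string rows fit in one batch (at least one)."""
    return max(1, target_batch_size_bytes // default_string_size)


def estimate_state_memory(num_states: int,
                          target_batch_size_bytes: int,
                          default_string_size: int = 20) -> int:
    """Assumed worst case: every row may need to track every state."""
    rows = estimate_row_count(target_batch_size_bytes, default_string_size)
    return num_states * rows


def within_budget(num_states: int,
                  target_batch_size_bytes: int,
                  max_state_memory: int) -> bool:
    """True if the estimate does not exceed the configured limit."""
    estimate = estimate_state_memory(num_states, target_batch_size_bytes)
    return estimate <= max_state_memory


# 200-byte batches hold about 10 default-sized strings, so 5 states cost 50.
example_estimate = estimate_state_memory(5, 200)
```

Because only the pattern and two configuration values are involved, a check like this can run on the driver before any column data is available.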

Signed-off-by: Anthony Chang <[email protected]>


andygrove · 2022-08-02T17:07:06Z

This is looking good, but I would like to see some tests demonstrating the costs of different regular expressions.


NVnavkumar

Can we also add integration tests to simulate the fallback condition?


anthony-chang · 2022-08-03T16:04:05Z

> This is looking good, but I would like to see some tests demonstrating the costs of different regular expressions.

@NVnavkumar @andygrove I added some Python tests for both the fallback and no-fallback scenarios. They rely on knowing the default string size so that the target batch size can be set to give a row count of 1. I'm not sure this is a good approach to testing it, but I can't think of another way without exposing the countStates implementation.

Do you think these tests are sufficient?
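The reasoning behind that test setup can be sketched as follows. This is an illustration only: `runs_on_gpu`, the constant `DEFAULT_STRING_SIZE`, and the `states * rows` estimate are assumptions, not the plugin's actual code or configuration keys.

```python
DEFAULT_STRING_SIZE = 20  # assumed default size of a string value, in bytes


def runs_on_gpu(num_states: int,
                target_batch_size_bytes: int,
                max_state_memory: int) -> bool:
    """Assumed check: estimated state memory must fit under the limit."""
    rows = max(1, target_batch_size_bytes // DEFAULT_STRING_SIZE)
    return num_states * rows <= max_state_memory


# Setting the batch size to exactly one default-sized string pins the
# estimated row count to 1, so the estimate equals the state count.
one_row_batch = DEFAULT_STRING_SIZE

no_fallback = runs_on_gpu(5, one_row_batch, max_state_memory=5)
fallback = not runs_on_gpu(5, one_row_batch, max_state_memory=4)
```

With the row count fixed at 1, a test only has to pick a limit just above or just below the expected state count to exercise each path.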

NVnavkumar · 2022-08-10T18:03:03Z

LGTM. I think the basic mechanics are at least tested here. The default value for this will be very high at first anyway, so it should not affect most users; it is just a fallback mechanism in case something cannot be handled on the GPU.

anthony-chang · 2022-08-10T18:53:47Z

build

integration_tests/src/main/python/regexp_test.py

andygrove

LGTM. Thanks @anthony-chang


anthony-chang · 2022-08-11T15:26:39Z

build

anthony-chang added 5 commits July 12, 2022 19:14

[WIP] initial work on estimating the number of states in a pattern

36d2cc0

Signed-off-by: Anthony Chang <[email protected]>

Change memory estimation formula and add checks

dff377e

Signed-off-by: Anthony Chang <[email protected]>

Merge branch 'branch-22.08' of github.com:NVIDIA/spark-rapids into es…

e9db987

…timate-regex-complexity Signed-off-by: Anthony Chang <[email protected]>

Add check for stringsplit

cc4b9a5

Signed-off-by: Anthony Chang <[email protected]>

Return long for gpu memory estimate

3d39c93

Signed-off-by: Anthony Chang <[email protected]>

anthony-chang self-assigned this Jul 15, 2022

Update configs.md

cd1ea93

Signed-off-by: Anthony Chang <[email protected]>

revans2 reviewed Jul 15, 2022

sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsConf.scala Outdated

sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsConf.scala Outdated

sameerz added the task label (Work required that improves the product but is not user facing) Jul 18, 2022

anthony-chang added 3 commits July 18, 2022 18:49

Merge branch 'branch-22.08' of github.com:NVIDIA/spark-rapids into es…

69279d3

…timate-regex-complexity

Address feedback

d57ce59

Signed-off-by: Anthony Chang <[email protected]>

Update configs.md

258f3e5

Signed-off-by: Anthony Chang <[email protected]>

anthony-chang marked this pull request as ready for review July 19, 2022 15:38

NVnavkumar requested changes Jul 21, 2022


anthony-chang added 2 commits July 21, 2022 15:25

Merge branch 'branch-22.08' of github.com:NVIDIA/spark-rapids into es…

f3a8956

…timate-regex-complexity

Address feedback

26d71bc

Signed-off-by: Anthony Chang <[email protected]>

andygrove reviewed Aug 1, 2022

sql-plugin/src/main/scala/com/nvidia/spark/rapids/RegexComplexityEstimator.scala

andygrove reviewed Aug 1, 2022

sql-plugin/src/main/scala/com/nvidia/spark/rapids/RegexComplexityEstimator.scala Outdated

andygrove reviewed Aug 1, 2022

sql-plugin/src/main/scala/com/nvidia/spark/rapids/RegexComplexityEstimator.scala Outdated

Apply suggestions from code review

e9f1a09

Co-authored-by: Andy Grove <[email protected]>

Add comment

a2ab640

Signed-off-by: Anthony Chang <[email protected]>

NVnavkumar reviewed Aug 2, 2022


anthony-chang added 4 commits August 2, 2022 19:01

Set regex state memory in integration tests to not trigger the warnin…

7324aef

…g when running integration tests Signed-off-by: Anthony Chang <[email protected]>

Add integration tests

4bb29ba

Signed-off-by: Anthony Chang <[email protected]>

Merge branch 'branch-22.08' of github.com:NVIDIA/spark-rapids into es…

d21b491

…timate-regex-complexity Signed-off-by: Anthony Chang <[email protected]>

Update configs.md

0a962d1

Signed-off-by: Anthony Chang <[email protected]>

anthony-chang changed the base branch from branch-22.08 to branch-22.10 August 5, 2022 19:17

NVnavkumar previously approved these changes Aug 10, 2022

andygrove reviewed Aug 11, 2022

integration_tests/src/main/python/regexp_test.py

andygrove previously approved these changes Aug 11, 2022

anthony-chang added 2 commits August 11, 2022 11:25

Change regexp_like to rlike

808fdff

Signed-off-by: Anthony Chang <[email protected]>

Merge branch 'branch-22.10' of github.com:NVIDIA/spark-rapids into es…

ef03297

…timate-regex-complexity

anthony-chang dismissed stale reviews from andygrove and NVnavkumar via ef03297 August 11, 2022 15:26

NVnavkumar approved these changes Aug 13, 2022


anthony-chang merged commit a1f39e4 into NVIDIA:branch-22.10 Aug 15, 2022

andygrove mentioned this pull request Aug 19, 2022

[FEA] Validate the size/complexity of regular expressions #4061

Closed
