Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Estimate and validate regular expression complexities #6006

Merged

Conversation

anthony-chang
Copy link
Contributor

@anthony-chang anthony-chang commented Jul 15, 2022

Addresses #4061
Relevant cuDF PR: rapidsai/cudf#10808

This attempts to do a worst-case estimation for the memory footprint for a regular expression operation. We can evaluate regular expressions by converting them into a deterministic finite automaton (DFA) and counting the number states in this state machine. Since each state is a possibility in the evaluation, we need to store at worst all these states.

This estimation is done in the planning stage on the driver node, so we only have access to the regex pattern and not the column data so we can only do a rough estimation for the row count using the target batch size and the default string type size.

@anthony-chang anthony-chang self-assigned this Jul 15, 2022
Signed-off-by: Anthony Chang <[email protected]>
@sameerz sameerz added the task Work required that improves the product but is not user facing label Jul 18, 2022
@anthony-chang anthony-chang marked this pull request as ready for review July 19, 2022 15:38
@andygrove
Copy link
Contributor

This is looking good, but I would like to see some tests demonstrating the costs of different regular expressions.

Signed-off-by: Anthony Chang <[email protected]>
Copy link
Collaborator

@NVnavkumar NVnavkumar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also add integration tests to simulate the fallback condition?

…g when running integration tests

Signed-off-by: Anthony Chang <[email protected]>
Signed-off-by: Anthony Chang <[email protected]>
Signed-off-by: Anthony Chang <[email protected]>
@anthony-chang
Copy link
Contributor Author

anthony-chang commented Aug 3, 2022

This is looking good, but I would like to see some tests demonstrating the costs of different regular expressions.

@NVnavkumar @andygrove I added some python tests for both the fall back and no fallback scenarios but they rely on knowing the default string size in order to set the target batch size so that the row count is 1. I'm not sure if this is a good approach to testing it but I can't think of another way without exposing the countStates implementation.

Do you think these tests are sufficient?

@anthony-chang anthony-chang changed the base branch from branch-22.08 to branch-22.10 August 5, 2022 19:17
NVnavkumar
NVnavkumar previously approved these changes Aug 10, 2022
@NVnavkumar
Copy link
Collaborator

NVnavkumar commented Aug 10, 2022

LGTM. I think the basic mechanics are at least tested here. The default value for this will be very high at first anyways, so it should not affect most users, as it is just a fallback mechanism in case something cannot be handled on the GPU.

@anthony-chang
Copy link
Contributor Author

build

andygrove
andygrove previously approved these changes Aug 11, 2022
Copy link
Contributor

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @anthony-chang

@anthony-chang anthony-chang dismissed stale reviews from andygrove and NVnavkumar via ef03297 August 11, 2022 15:26
@anthony-chang
Copy link
Contributor Author

build

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
task Work required that improves the product but is not user facing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants