RLike: Fall back to CPU for regex that would produce incorrect results #4044

Merged
andygrove merged 15 commits into NVIDIA:branch-21.12 from rlike-support-more-regex
Nov 9, 2021

Conversation

andygrove
Contributor

Closes #3797

This PR introduces a lightweight regular expression parser that lets us inspect patterns to determine whether they can be supported on the GPU, so that we can fall back to the CPU when they cannot. In most cases, this is necessary to handle edge cases that would cause cuDF to throw an "invalid regex: nothing to match" exception. Examples include:

  • Possessive quantifiers: a*+
  • Empty groups: a()?
  • Choice where one side is potentially empty: ^|a or a*|b
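As a rough illustration of this kind of check (not the code in this PR), the sketch below walks a hypothetical miniature AST and reports why a pattern would fall back to the CPU. The node types, the `nullable` helper, and `unsupportedReason` are all assumptions made for this example; the real `RegexAST` and `validate` may look quite different.

```scala
object RegexValidateSketch {
  // Hypothetical miniature AST, invented for this example only.
  sealed trait Node
  case class Lit(c: Char) extends Node
  case object LineStart extends Node
  case class Sequence(parts: List[Node]) extends Node
  case class Group(inner: Node) extends Node
  case class Repeat(inner: Node, min: Int, possessive: Boolean) extends Node
  case class Choice(left: Node, right: Node) extends Node

  /** True if the node can match without consuming any input. */
  def nullable(n: Node): Boolean = n match {
    case Lit(_) => false
    case LineStart => true // zero-width anchor
    case Sequence(parts) => parts.forall(nullable)
    case Group(inner) => nullable(inner)
    case Repeat(inner, min, _) => min == 0 || nullable(inner)
    case Choice(l, r) => nullable(l) || nullable(r)
  }

  /** Returns a reason to fall back to the CPU, or None if no rule fires. */
  def unsupportedReason(n: Node): Option[String] = n match {
    case Repeat(_, _, true) => Some("possessive quantifier")
    case Group(Sequence(Nil)) => Some("empty group")
    case Choice(l, r) if nullable(l) || nullable(r) =>
      Some("choice with a potentially empty side")
    case Sequence(parts) =>
      // report the first problem found in any child
      parts.map(unsupportedReason).collectFirst { case Some(reason) => reason }
    case Group(inner) => unsupportedReason(inner)
    case Repeat(inner, _, _) => unsupportedReason(inner)
    case Choice(l, r) => unsupportedReason(l).orElse(unsupportedReason(r))
    case _ => None
  }
}
```

With this sketch, `Repeat(Lit('a'), 0, possessive = true)` (for `a*+`) and `Choice(LineStart, Lit('a'))` (for `^|a`) both yield a fallback reason, while `Sequence(List(Lit('a'), Lit('b')))` yields `None`.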

There are other cases where Java has support for advanced regex features that are not available in cuDF:

  • Complex character class usage such as [a-d[m-p]]
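For reference, the patterns listed above are all valid in `java.util.regex`, which is what Spark uses on the CPU. The small helper below is only an illustration of the CPU-side behavior that a GPU implementation would need to reproduce; the object and method names are made up for this example.

```scala
import java.util.regex.Pattern

object JavaRegexBehavior {
  /** Full-string match using java.util.regex. */
  def fullMatch(pattern: String, input: String): Boolean =
    Pattern.compile(pattern).matcher(input).matches()

  /** Substring search, which is how RLIKE evaluates a pattern. */
  def contains(pattern: String, input: String): Boolean =
    Pattern.compile(pattern).matcher(input).find()
}
```

For example, `fullMatch("[a-d[m-p]]", "n")` is true and `fullMatch("[a-d[m-p]]", "e")` is false, because the nested class is a union of the two ranges. `fullMatch("a*+", "aaa")` and `fullMatch("a()?", "a")` are both true, and `contains("^|a", "b")` is true because the left side of the choice matches the empty string at the start of the input.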

There is also the beginning of a transpiler so that we can alter the pattern before passing it to cuDF. So far there is only one trivial example of this: escaping a hyphen (-) when it appears within a character class as a literal character rather than as part of a character range, as in [abc-].
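A minimal sketch of that rewrite, operating directly on the pattern string rather than on an AST, might look like the following. This is an assumption about the general idea, not the PR's transpiler: `escapeLiteralHyphens` is a hypothetical name, and the sketch only handles a hyphen that is the first or last character of a simple, non-nested character class.

```scala
object HyphenEscapeSketch {
  /** Escapes a '-' that sits first or last inside a character class. */
  def escapeLiteralHyphens(pattern: String): String = {
    val out = new StringBuilder
    var i = 0
    var inClass = false
    var classStart = -1 // index of the first character inside the class
    while (i < pattern.length) {
      val c = pattern.charAt(i)
      if (c == '\\' && i + 1 < pattern.length) {
        // copy existing escape sequences through unchanged
        out.append(c).append(pattern.charAt(i + 1))
        i += 2
      } else if (!inClass && c == '[') {
        inClass = true
        out.append(c)
        i += 1
        // a leading '^' negates the class and is not part of its contents
        if (i < pattern.length && pattern.charAt(i) == '^') {
          out.append('^')
          i += 1
        }
        classStart = i
      } else if (inClass && c == ']' && i > classStart) {
        inClass = false
        out.append(c)
        i += 1
      } else if (inClass && c == '-' &&
          (i == classStart ||
            (i + 1 < pattern.length && pattern.charAt(i + 1) == ']'))) {
        // a hyphen at either edge of the class is a literal, so escape it
        out.append("\\-")
        i += 1
      } else {
        out.append(c)
        i += 1
      }
    }
    out.toString
  }
}
```

Here `escapeLiteralHyphens("[abc-]")` produces `[abc\-]`, while a genuine range such as `[a-c]` and a hyphen outside any class, as in `a-b`, are left untouched.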

This is a large PR with a lot of new functionality and I have been leaning heavily on the fuzzing approach to find differences between CPU and GPU. The fuzz tests are included as part of the new unit test suite.
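The general shape of such a differential fuzz test can be sketched as follows. This is not the test suite from this PR: the object, the alphabet, and the `candidate` parameter (standing in for the GPU code path, which is not shown here) are assumptions made for illustration.

```scala
import java.util.regex.Pattern
import scala.util.{Random, Try}

object RegexFuzzSketch {
  // Small alphabet biased toward regex metacharacters.
  private val alphabet = "ab|()[]*+?^$-."

  /** Builds a random pattern of 1 to maxLen characters. */
  def randomPattern(rng: Random, maxLen: Int): String = {
    val len = 1 + rng.nextInt(maxLen)
    Seq.fill(len)(alphabet.charAt(rng.nextInt(alphabet.length))).mkString
  }

  /** CPU reference: Some(found), or None if Java rejects the pattern. */
  def cpuFind(pattern: String, input: String): Option[Boolean] =
    Try(Pattern.compile(pattern).matcher(input).find()).toOption

  /** Returns the generated patterns on which candidate and CPU disagree. */
  def disagreements(
      candidate: (String, String) => Option[Boolean],
      inputs: Seq[String],
      iterations: Int,
      seed: Long): Seq[String] = {
    val rng = new Random(seed) // fixed seed keeps failures reproducible
    Seq.fill(iterations)(randomPattern(rng, 8)).filter { p =>
      inputs.exists(in => candidate(p, in) != cpuFind(p, in))
    }
  }
}
```

Any pattern returned by `disagreements` is a candidate for either a new fallback rule or a transpiler fix. Using a fixed seed means a failing pattern can be regenerated and turned into a regular unit test.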

@andygrove andygrove added this to the Nov 1 - Nov 12 milestone Nov 5, 2021
@andygrove andygrove self-assigned this Nov 5, 2021
@sameerz sameerz added the task Work required that improves the product but is not user facing label Nov 5, 2021
@andygrove
Contributor Author

build

Collaborator

@revans2 revans2 left a comment

This is really great.

regex.toRegexString
}

private def validate(regex: RegexAST): Unit = {
Collaborator

nit: Do we want to look at rules that could rewrite some of these? For example, if we see "()" as the regular expression, could we replace it with ".*"? I honestly don't know if that would even work, because I don't remember what Java does in this case. This should probably be follow-on work if we do want to look into it, because I don't want to hold this PR up from going in.

Contributor Author

I do think we should do this but I wanted to take this one step at a time and start off with simply falling back to CPU and then follow up with optimizations so that this PR doesn't become overwhelming to review. Ideally, I think we should follow up with one PR per specific optimization, so we can make sure that each one has comprehensive tests.

Collaborator

That is fine with me. Being customer driven on what we pull in sounds good. After all, most of these are corner cases, they should be rare, and if we do add a modification step we need a lot of testing to really be sure it is doing the right thing.

// parse the source regular expression
val regex = new RegexParser(pattern).parse()
// validate that the regex is supported by cuDF
validate(regex)
Collaborator

nit: Do we want to try to validate the size/complexity of the regular expression? I don't know exactly what cuDF does to figure out if it needs a small, medium, large, or crazy big stack/memory, but it looks like we could do something similar and fall back to the CPU if it is too large. The main reason for this is that we just dropped the default for spark.rapids.memory.gpu.reserve from 1 GiB to 256 MiB. The reason we set it at 1 GiB was because of hard-coded regular expressions that we used. If we are going to fully support arbitrary regular expressions, it would be nice to tie these two together in some way, so that we either fall back to the CPU if there is not enough reserved memory, or let users opt into larger regular expressions and tell them in the instructions that they need to increase the reserved memory accordingly.

Again, this would probably be better as follow-on work.

Contributor Author

That sounds like a great idea. I will file a follow-on issue for this.

Contributor Author

Filed as #4061

revans2
revans2 previously approved these changes Nov 9, 2021
jlowe
jlowe previously approved these changes Nov 9, 2021
@jlowe
Contributor

jlowe commented Nov 9, 2021

build

@andygrove andygrove dismissed stale reviews from jlowe and revans2 via 62992b1 November 9, 2021 16:49
@revans2
Collaborator

revans2 commented Nov 9, 2021

build

@andygrove
Contributor Author

build failed with

Unable to find image '***' locally
docker: Error response from daemon: unauthorized: access to the requested resource is not authorized.
See 'docker run --help'.
exit status 125
Error: Process completed with exit code 255.

@andygrove
Contributor Author

build

@andygrove andygrove merged commit d951ffa into NVIDIA:branch-21.12 Nov 9, 2021
@andygrove andygrove deleted the rlike-support-more-regex branch November 9, 2021 21:24
@andygrove andygrove linked an issue Nov 10, 2021 that may be closed by this pull request
Labels
task Work required that improves the product but is not user facing
4 participants