[FEA] avoid regexp capture groups automatically in some cases #11663

Closed
revans2 opened this issue Sep 7, 2022 · 1 comment · Fixed by #11695
Labels: feature request, Performance, Spark

revans2 commented Sep 7, 2022

Is your feature request related to a problem? Please describe.
As a part of debugging NVIDIA/spark-rapids#6431 we found that there is a large performance difference between using a capture group in a regular expression vs a non-capture group. See NVIDIA/spark-rapids#6431 (comment) for some details on the performance difference.

I don't understand why a capture group would be so expensive if it is never referenced by any sort of back reference and the operation being done, such as matches_re, does not use the captured text.
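
To make the equivalence concrete, here is a minimal sketch using Python's standard `re` module as a stand-in (it is not libcudf, and it says nothing about GPU performance). The pattern and sample strings are made up for illustration. It only shows that, for a yes/no match with no back references, the capture and non-capture forms give the same answers.

```python
import re

# The same pattern written with a capture group and with a non-capture group.
# Both patterns are illustrative assumptions, not taken from the linked issue.
capturing = re.compile(r"(ab|cd)+e")
non_capturing = re.compile(r"(?:ab|cd)+e")

samples = ["abe", "cdabe", "e", "abx", "ababcde"]

# For a pure match/no-match question, the two patterns agree on every input.
capture_hits = [bool(capturing.fullmatch(s)) for s in samples]
non_capture_hits = [bool(non_capturing.fullmatch(s)) for s in samples]
print(capture_hits == non_capture_hits)

# Only the capturing form records group text, which a matches-style call never reads.
print(capturing.groups, non_capturing.groups)
```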

Describe the solution you'd like
I would like to have CUDF automatically detect when a capture group does not actually need to capture anything.

Describe alternatives you've considered
We do it ourselves manually for any regular expressions that we have hard coded, which is in progress.
Also, when we transpile the user's regular expression we could do the same thing I am asking for here, but why should only Spark users benefit from this?

@revans2 revans2 added feature request New feature or request Needs Triage Need team to review and classify Performance Performance related issue Spark Functionality that helps Spark RAPIDS labels Sep 7, 2022
davidwendt commented
A capture group certainly adds more instructions than a non-capture group.

@davidwendt davidwendt self-assigned this Sep 12, 2022
rapids-bot bot pushed a commit that referenced this issue Sep 20, 2022
…ps (#11695)

Capture groups are used for extracting specific matching substrings, but they are also used for grouping alternation or other sub-pattern matches. If a capture group is not used for extraction, a non-capture group can be specified instead. A non-capture group generates fewer regex instructions, which can help reduce device memory usage.
Since the libcudf strings regex API calls already check where capture groups are required, the API can inform the regex compiler whether capture groups are necessary. The compiler can then automatically convert them to non-capture groups, reducing the number of instructions produced.
Introduces a new `capture_groups` parameter for use in the regex compiler step for this purpose.
This is an improvement in efficiency and no external behavior has changed.
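
The idea of converting groups automatically can be sketched at the pattern-string level. The Python below is an illustration under stated assumptions, not the libcudf implementation: `strip_capture_groups` is a hypothetical helper, it works on Python `re` syntax, and it rewrites the pattern text, whereas the change described above makes the decision inside the regex compiler via the `capture_groups` parameter.

```python
import re


def strip_capture_groups(pattern: str) -> str:
    """Rewrite plain "(...)" groups as "(?:...)" groups.

    Hypothetical helper for illustration only; not part of libcudf.
    The pattern is returned unchanged if it contains a numeric back
    reference such as \\1, because the captured text is then needed.
    """
    out = []
    in_class = False  # True while scanning inside a [...] character class
    i = 0
    n = len(pattern)
    while i < n:
        ch = pattern[i]
        if ch == "\\" and i + 1 < n:
            nxt = pattern[i + 1]
            if not in_class and nxt in "123456789":
                return pattern  # back reference found: keep groups as written
            out.append(ch + nxt)  # copy the escaped character verbatim
            i += 2
            continue
        if in_class:
            if ch == "]":
                in_class = False
        elif ch == "[":
            in_class = True
            out.append(ch)
            i += 1
            # A "]" directly after "[" or "[^" is a literal, not the class end.
            if i < n and pattern[i] == "^":
                out.append("^")
                i += 1
            if i < n and pattern[i] == "]":
                out.append("]")
                i += 1
            continue
        elif ch == "(" and not pattern.startswith("(?", i):
            out.append("(?:")  # plain capture group becomes non-capture
            i += 1
            continue
        out.append(ch)
        i += 1
    return "".join(out)


converted = strip_capture_groups(r"(ab|cd)+e")
kept = strip_capture_groups(r"(a)\1")
print(converted)
print(kept)
```

A caller that only needs a boolean match could apply such a rewrite first, while an extract-style call, which reads the captured text, would skip it. That mirrors the division of responsibility described above, where the API tells the compiler whether capture groups are required.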

Also fixes a bug found when testing where a non-capture group pattern is used with an invalid quantifier sequence.
A test case was added to verify the bug fix.

Closes #11663

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Tobias Ribizel (https://github.com/upsj)

URL: #11695
@bdice bdice removed the Needs Triage Need team to review and classify label Mar 4, 2024