[FEA] Optimize regexp_replace in multi-replace scenarios #7907

Closed
NVnavkumar opened this issue Mar 20, 2023 · 0 comments · Fixed by #7967
Assignees
Labels
performance A performance related task/issue

Comments

@NVnavkumar
Collaborator

Is your feature request related to a problem? Please describe.
When the pattern passed to regexp_replace is a plain alternation of literal strings, such as ab|cd|ef, performance is poor for large strings. We can optimize this case on the GPU by using the native cudf multi-replace API instead of the regular expression engine. That API recently received performance fixes for large strings in rapidsai/cudf#12858. We should also measure how much this optimization helps as a function of input string size and number of target strings.
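As a rough illustration of the idea (not the plugin's actual code), the sketch below checks whether a pattern is a pure alternation of literals and, if so, applies a plain multi-string replace instead of a regex. The function names `literal_alternation_targets` and `multi_replace` are hypothetical, the metacharacter set is a simplifying assumption, and the replacement is assumed to be a literal string with no backreferences. The CPU loop only models the intended result; it says nothing about how cudf implements the operation.

```python
import re

# Characters treated as regex metacharacters in this sketch (an assumption;
# a real implementation would rely on the plugin's own regex parser).
_META = set(r"\.[]{}()*+?^$")


def literal_alternation_targets(pattern):
    """Return the list of literal targets if `pattern` looks like
    lit1|lit2|...; return None if it needs a real regex engine."""
    parts = pattern.split("|")
    # Reject empty alternatives (they match the empty string) and any
    # alternative that contains a metacharacter.
    if any(not p or any(ch in _META for ch in p) for p in parts):
        return None
    return parts


def multi_replace(text, targets, repl):
    """Reference multi-replace: scan left to right and, at each position,
    replace the first target (in list order) that matches there."""
    out = []
    i = 0
    while i < len(text):
        for t in targets:
            if text.startswith(t, i):
                out.append(repl)
                i += len(t)
                break
        else:
            # No target matched at this position; copy one character.
            out.append(text[i])
            i += 1
    return "".join(out)


def replace_fast_or_regex(text, pattern, repl):
    """Use the literal path when possible, otherwise fall back to regex."""
    targets = literal_alternation_targets(pattern)
    if targets is None:
        return re.sub(pattern, repl, text)
    return multi_replace(text, targets, repl)
```

Trying targets in list order at each position mirrors the ordered-alternation behavior of backtracking regex engines, which is why the two paths agree on literal-only patterns in this sketch.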

Describe the solution you'd like
The RAPIDS Accelerator for Apache Spark should use this optimization whenever performance measurements show it is worthwhile.

Describe alternatives you've considered
regexp_replace does not perform well on the GPU when the input strings are very large. In addition, alternation (the | operator) adds complexity to the regular expression, because each additional choice is another option to evaluate. We can instead parallelize this on the GPU using the cudf multi-replace API described above.

@NVnavkumar NVnavkumar added feature request New feature or request ? - Needs Triage Need team to review and classify labels Mar 20, 2023
@NVnavkumar NVnavkumar self-assigned this Mar 20, 2023
@mattahrens mattahrens removed the ? - Needs Triage Need team to review and classify label Mar 21, 2023
rapids-bot bot pushed a commit to rapidsai/cudf that referenced this issue Mar 23, 2023
Adds the JNI API for `stringReplace` using column vector arguments for `targets` and `repls` (to make this consistent with the C++ API). Also adds unit tests for the new API.
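To make the `targets` and `repls` pairing concrete, here is a small CPU-side model of a multi-replace that takes two parallel lists. It is only a sketch: the function name `replace_with_pairs` is hypothetical, it is not the JNI or C++ signature, and the rule that a single-element `repls` is applied to every target is an assumption made for this example.

```python
def replace_with_pairs(text, targets, repls):
    """Replace each target with the replacement at the same index.

    Assumption for this sketch: a one-element `repls` is broadcast to
    all targets; otherwise the two lists must have the same length.
    """
    if len(repls) == 1:
        repls = repls * len(targets)
    if len(repls) != len(targets):
        raise ValueError("repls must have one element or match targets")

    out = []
    i = 0
    while i < len(text):
        for target, repl in zip(targets, repls):
            # Skip empty targets so the scan always makes progress.
            if target and text.startswith(target, i):
                out.append(repl)
                i += len(target)
                break
        else:
            # No target matched here; keep the original character.
            out.append(text[i])
            i += 1
    return "".join(out)
```

For the regexp_replace case in this issue, every alternative maps to the same replacement, so `repls` would hold a single string.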
Part of the work for NVIDIA/spark-rapids#7907.

Authors:
  - Navin Kumar (https://github.com/NVnavkumar)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)

URL: #12979
@NVnavkumar NVnavkumar added the performance A performance related task/issue label Mar 31, 2023
@sameerz sameerz removed the feature request New feature or request label Apr 8, 2023
3 participants