[FEA] avoid regexp capture groups automatically in some cases #11663
Labels
feature request: New feature or request
Performance: Performance related issue
Spark: Functionality that helps Spark RAPIDS
Is your feature request related to a problem? Please describe.
As a part of debugging NVIDIA/spark-rapids#6431 we found that there is a large performance difference between using a capture group in a regular expression vs a non-capture group. See NVIDIA/spark-rapids#6431 (comment) for some details on the performance difference.
I don't understand why a capture group would be so expensive if that capture group is never referenced by any sort of back reference and the operation being done, as with `matches_re`, only reports whether the pattern matches.
Describe the solution you'd like
I would like to have CUDF automatically detect when a capture group does not actually need to capture anything and treat it as a non-capture group.
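To make the request concrete, here is a minimal sketch of the kind of rewrite being asked for, written against Python's `re` syntax purely for illustration. The helper name `drop_unreferenced_captures` is hypothetical and is not part of cuDF or spark-rapids, and cuDF's regex dialect differs from Python's, so a real implementation would presumably work on the compiled program rather than on the pattern text. The sketch assumes the caller only needs a yes/no match (as with `matches_re`), and it conservatively leaves the pattern alone if it sees anything that looks like a back reference.

```python
import re


def drop_unreferenced_captures(pattern: str) -> str:
    """Rewrite plain `(...)` groups as `(?:...)` when nothing refers to them.

    Hypothetical helper for illustration only. It is safe only when the
    caller needs a boolean match result and never reads captured text.
    """
    out = []
    in_class = False  # True while scanning inside a [...] character class
    i, n = 0, len(pattern)
    while i < n:
        ch = pattern[i]
        if ch == "\\" and i + 1 < n:
            nxt = pattern[i + 1]
            # \1..\9 outside a class may be a back reference: bail out
            # and return the pattern unchanged.
            if not in_class and nxt in "123456789":
                return pattern
            out.append(ch + nxt)  # copy the escape pair verbatim
            i += 2
            continue
        if in_class:
            if ch == "]":
                in_class = False
            out.append(ch)
            i += 1
            continue
        if ch == "[":
            out.append(ch)
            i += 1
            # A leading ^ and a leading ] are literal parts of the class.
            if i < n and pattern[i] == "^":
                out.append("^")
                i += 1
            if i < n and pattern[i] == "]":
                out.append("]")
                i += 1
            in_class = True
            continue
        if ch == "(":
            # Named back references and conditionals depend on captures.
            if pattern.startswith(("(?P=", "(?("), i):
                return pattern
            if pattern.startswith("(?", i):
                out.append(ch)  # already a special group: leave it alone
            else:
                out.append("(?:")  # plain capture group: stop capturing
            i += 1
            continue
        out.append(ch)
        i += 1
    return "".join(out)


if __name__ == "__main__":
    original = r"(foo|bar)+\d"
    rewritten = drop_unreferenced_captures(original)
    print(rewritten)
    # Match/no-match results should be identical for both forms.
    for s in ["foobar1", "baz", "bar9"]:
        assert bool(re.search(original, s)) == bool(re.search(rewritten, s))
```

The rewrite is deliberately all-or-nothing: a single possible back reference disables it for the whole pattern, which trades missed optimizations for not having to renumber groups.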
Describe alternatives you've considered
We are manually replacing capture groups with non-capture groups in the regular expressions that we have hard coded; that work is in progress.
Also, when we transpile the user's regular expression we could do the same thing I am asking for here, but why should only Spark users benefit from this?