**Is your feature request related to a problem? Please describe.**
Processing regular expressions on the GPU is really expensive compared to purpose-built kernels. We have already done some work to avoid running a regular expression in some cases with `StringSplit` when the result would just be a regular string (#4854), and similarly for `RegexpReplace` when it would just be a list of choices for replacement (#7967).
I would like to see us do similar things for `RLike`. Spark already optimizes `LIKE` so that if it sees `%FOO` it is translated to an `ends_with`, `FOO%` to a `starts_with`, and `%FOO%` to a `contains`.
I want to do the same kind of thing but using our regexp transpiler.
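For comparison, here is a minimal sketch (hypothetical and simplified, names made up) of the shape checks behind those `LIKE` rewrites; Spark's actual `LikeSimplification` optimizer rule additionally has to deal with escape characters and the `_` wildcard:

```scala
// Rough illustration only: reduce a LIKE pattern whose interior is free of
// wildcards/escapes to a plain string predicate, as described above.
def simplifyLike(pattern: String): Option[String => Boolean] = {
  def plain(s: String): Boolean = !s.exists(c => c == '%' || c == '_' || c == '\\')
  if (pattern.length >= 2 && pattern.startsWith("%") && pattern.endsWith("%")) {
    val lit = pattern.substring(1, pattern.length - 1)
    if (plain(lit)) Some(_.contains(lit)) else None   // %FOO% -> contains
  } else if (pattern.startsWith("%")) {
    val lit = pattern.substring(1)
    if (plain(lit)) Some(_.endsWith(lit)) else None   // %FOO  -> ends_with
  } else if (pattern.endsWith("%")) {
    val lit = pattern.dropRight(1)
    if (plain(lit)) Some(_.startsWith(lit)) else None // FOO%  -> starts_with
  } else None
}
```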
We might also want to have some simplification rules for `RLIKE` in the transpiler to make it easier to find these situations instead of hard-coding a list of them (see the sketch after the list below):
- `FOO`, `.*FOO`, `FOO.*`, `^.*FOO`, `\A.*FOO`, `^.*FOO.*$`, `\A.*FOO.*\Z` can be converted into `contains`
- `^FOO`, `^FOO.*`, `\AFOO`, `\AFOO.*` can be converted into `starts_with`
- `FOO$`, `.*FOO$`, `FOO\Z`, `.*FOO\Z` can be converted into `ends_with`
Note that these are after capture groups have been removed/ignored, and `FOO` represents any pattern that can be converted into a static string, similar to what we do for `StringSplit` and `RegexpReplace`. In fact the multi-match that regexp replace does should work for `contains` because we could use [`find_multiple`](https://github.com/rapidsai/cudf/blob/610c022164b71614fb92366b3694df6d700283c8/cpp/include/cudf/strings/find_multiple.hpp#L57) and then check for any of the values in the array being >= 0.
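A minimal sketch of what such a classification pass could look like, written here against the raw pattern string (all names are made up; a real version would operate on the transpiler's parsed AST and would also accept escaped metacharacters like `\.` in the static part):

```scala
// Hypothetical sketch: classify a capture-group-free RLIKE pattern into a
// plain string operation, per the rules listed above.
sealed trait StringPredicate
case class Contains(lit: String) extends StringPredicate
case class StartsWith(lit: String) extends StringPredicate
case class EndsWith(lit: String) extends StringPredicate

object RLikeSimplifier {
  private val meta = Set('.', '^', '$', '|', '?', '*', '+',
                         '(', ')', '[', ']', '{', '}', '\\')

  def simplify(pattern: String): Option[StringPredicate] = {
    var p = pattern
    // Strip an optional start anchor (^ or \A), then a leading .* after it.
    val startAnchor =
      if (p.startsWith("^")) { p = p.drop(1); true }
      else if (p.startsWith("\\A")) { p = p.drop(2); true }
      else false
    val leadingAny = if (p.startsWith(".*")) { p = p.drop(2); true } else false
    // Strip an optional end anchor ($ or \Z), then a trailing .* before it.
    val endAnchor =
      if (p.endsWith("$")) { p = p.dropRight(1); true }
      else if (p.endsWith("\\Z")) { p = p.dropRight(2); true }
      else false
    val trailingAny = if (p.endsWith(".*")) { p = p.dropRight(2); true } else false

    // Whatever remains (the FOO part) must reduce to a static string.
    if (p.isEmpty || p.exists(meta)) return None

    // RLIKE does a regex *find*, so an unanchored side is unconstrained
    // anyway, and an anchor right next to .* constrains nothing either.
    val startFixed = startAnchor && !leadingAny
    val endFixed = endAnchor && !trailingAny
    (startFixed, endFixed) match {
      case (true, true)   => None // ^FOO$ would be full equality, not covered here
      case (true, false)  => Some(StartsWith(p))
      case (false, true)  => Some(EndsWith(p))
      case (false, false) => Some(Contains(p)) // bare FOO, .*FOO, ^.*FOO.*$, ...
    }
  }
}
```

For example, `simplify("\\AFOO.*")` yields `Some(StartsWith("FOO"))` and `simplify("^.*FOO.*$")` yields `Some(Contains("FOO"))`. For the multi-literal `contains` case, the linked `find_multiple` returns, per input row, the character position of the first occurrence of each target (or -1 when absent), so the row matches if any of the returned positions is >= 0.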
We could also do a custom kernel if we see a need for it and want more speed.
There might be more, but a few of these we have seen in real-world queries and it would be good to try and speed these up. As in all things we should run some benchmarks to see how large the speedup really is, and whether it is worth the added complexity. My guess is that it will always be a win, and for long strings it can be a huge win.