**Is your feature request related to a problem? Please describe.**
Processing regular expressions on the GPU is really expensive compared to purpose-built kernels. We have already done some work to avoid running a regular expression in some cases with `StringSplit` when the result would just be a regular string (#4854), and similarly for `RegexpReplace` when it would just be a list of choices for replacement (#7967).
I would like to see us do similar things for `RLike`. Spark already optimizes `LIKE` so that if it sees `%FOO` it is translated to an `ends_with`, `FOO%` to a `starts_with`, and `%FOO%` to a `contains`.
I want to do the same kind of thing but using our regexp transpiler.
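For comparison, here is a minimal sketch (hypothetical and simplified, names made up) of the shape checks behind those `LIKE` rewrites; Spark's actual `LikeSimplification` optimizer rule additionally has to deal with escape characters and the `_` wildcard:

```scala
// Rough illustration only: reduce a LIKE pattern whose interior is free of
// wildcards/escapes to a plain string predicate, as described above.
def simplifyLike(pattern: String): Option[String => Boolean] = {
  def plain(s: String): Boolean = !s.exists(c => c == '%' || c == '_' || c == '\\')
  if (pattern.length >= 2 && pattern.startsWith("%") && pattern.endsWith("%")) {
    val lit = pattern.substring(1, pattern.length - 1)
    if (plain(lit)) Some(_.contains(lit)) else None   // %FOO% -> contains
  } else if (pattern.startsWith("%")) {
    val lit = pattern.substring(1)
    if (plain(lit)) Some(_.endsWith(lit)) else None   // %FOO  -> ends_with
  } else if (pattern.endsWith("%")) {
    val lit = pattern.dropRight(1)
    if (plain(lit)) Some(_.startsWith(lit)) else None // FOO%  -> starts_with
  } else None
}
```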
We might also want to have some simplification rules for `RLIKE` in the transpiler to make it easier to find these situations instead of hard-coding a list of them (see the sketch after the list below):
- `FOO`, `.*FOO`, `FOO.*`, `^.*FOO`, `\A.*FOO`, `^.*FOO.*$`, `\A.*FOO.*\Z` can be converted into `contains`
- `^FOO`, `^FOO.*`, `\AFOO`, `\AFOO.*` can be converted into `starts_with`
- `FOO$`, `.*FOO$`, `FOO\Z`, `.*FOO\Z` can be converted into `ends_with`
Note that these are after capture groups have been removed/ignored, and `FOO` represents any pattern that can be converted into a static string, similar to what we do for `StringSplit` and `RegexpReplace`. In fact the multi-match that regexp replace does should work for `contains` because we could use [`find_multiple`](https://github.com/rapidsai/cudf/blob/610c022164b71614fb92366b3694df6d700283c8/cpp/include/cudf/strings/find_multiple.hpp#L57) and then check for any of the values in the array being >= 0.
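A minimal sketch of what such a classification pass could look like, written here against the raw pattern string (all names are made up; a real version would operate on the transpiler's parsed AST and would also accept escaped metacharacters like `\.` in the static part):

```scala
// Hypothetical sketch: classify a capture-group-free RLIKE pattern into a
// plain string operation, per the rules listed above.
sealed trait StringPredicate
case class Contains(lit: String) extends StringPredicate
case class StartsWith(lit: String) extends StringPredicate
case class EndsWith(lit: String) extends StringPredicate

object RLikeSimplifier {
  private val meta = Set('.', '^', '$', '|', '?', '*', '+',
                         '(', ')', '[', ']', '{', '}', '\\')

  def simplify(pattern: String): Option[StringPredicate] = {
    var p = pattern
    // Strip an optional start anchor (^ or \A), then a leading .* after it.
    val startAnchor =
      if (p.startsWith("^")) { p = p.drop(1); true }
      else if (p.startsWith("\\A")) { p = p.drop(2); true }
      else false
    val leadingAny = if (p.startsWith(".*")) { p = p.drop(2); true } else false
    // Strip an optional end anchor ($ or \Z), then a trailing .* before it.
    val endAnchor =
      if (p.endsWith("$")) { p = p.dropRight(1); true }
      else if (p.endsWith("\\Z")) { p = p.dropRight(2); true }
      else false
    val trailingAny = if (p.endsWith(".*")) { p = p.dropRight(2); true } else false

    // Whatever remains (the FOO part) must reduce to a static string.
    if (p.isEmpty || p.exists(meta)) return None

    // RLIKE does a regex *find*, so an unanchored side is unconstrained
    // anyway, and an anchor right next to .* constrains nothing either.
    val startFixed = startAnchor && !leadingAny
    val endFixed = endAnchor && !trailingAny
    (startFixed, endFixed) match {
      case (true, true)   => None // ^FOO$ would be full equality, not covered here
      case (true, false)  => Some(StartsWith(p))
      case (false, true)  => Some(EndsWith(p))
      case (false, false) => Some(Contains(p)) // bare FOO, .*FOO, ^.*FOO.*$, ...
    }
  }
}
```

For example, `simplify("\\AFOO.*")` yields `Some(StartsWith("FOO"))` and `simplify("^.*FOO.*$")` yields `Some(Contains("FOO"))`. For the multi-literal `contains` case, the linked `find_multiple` returns, per input row, the character position of the first occurrence of each target (or -1 when absent), so the row matches if any of the returned positions is >= 0.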
We could also do a custom kernel if we see a need for it and want more speed.
There might be more, but a few of these we have seen in real-world queries and it would be good to try and speed these up. As in all things we should run some benchmarks to see how large the speedup really is, and whether it is worth the added complexity. My guess is that it will always be a win, and for long strings it can be a huge win.