[FEA] Rewrite some regexp_replace cases to faster expressions
#11809
Labels: invalid
Is your feature request related to a problem? Please describe.
Similar to #10741, regexp_replace is also expensive, but many use cases are just simple string replacements, like regexp_replace('foo', 'o', 'a') or regexp_replace('foo', 'o|f', ''). We can rewrite these cases to GpuReplace or multiple GpuReplace calls to make them much faster.
Describe the solution you'd like
We have a regex parser in the plugin code that translates a regex pattern to the cudf-supported style and checks for fallback. We can reuse it, if possible, to detect whether a pattern is simple enough to be replaced, and rewrite those cases to the faster expressions.
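To make "simple pattern" a bit more concrete, here is a minimal standalone sketch in Python. It does not use the plugin's regex parser. The helper name simple_alternatives and the list of metacharacters are assumptions made only for illustration. It treats a pattern as simple when it is a non-empty literal, or several non-empty literals joined by '|'.

```python
# Characters with special meaning in a regex (other than '|', handled below).
# Assumption of this sketch: any pattern containing one of these stays on the regex path.
_META = set(r"\.^$*+?()[]{}")

def simple_alternatives(pattern):
    """Return the literal alternatives of `pattern`, or None if it is not simple.

    Hypothetical helper, not part of the plugin.
    """
    if any(ch in _META for ch in pattern):
        return None
    parts = pattern.split("|")
    if any(part == "" for part in parts):
        return None  # e.g. '' or 'a|' can match the empty string
    return parts
```

For example, simple_alternatives('o|f') gives ['o', 'f'], while simple_alternatives('a.*') gives None, so that pattern would keep using the regex path.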
Planned cases to support:
regexp_replace(str, 'a', 'b') -> GpuReplace(str, 'a', 'b')
regexp_replace(str, 'a|b', 'c') -> GpuReplace(GpuReplace(str, 'a', 'c'), 'b', 'c')
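For the multiple-replacement case, note that regexp_replace is not always equivalent to chained replacements. For example:
regexp_replace('abc', 'a|ab', 'a') -> 'abc'
replace(replace('abc', 'a', 'a'), 'ab', 'a') -> 'ac'
The regex tries the alternatives in order at each position, so 'a' wins at position 0 and 'ab' never matches, whereas the second replace call still sees 'ab'.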
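The mismatch can be reproduced outside Spark. The sketch below uses Python's re.sub as a stand-in for regexp_replace (an assumption: both try alternatives in order, which is all this example relies on). The helper names chained_replace and chain_is_safe are hypothetical, and the safety rule is just one conservative option, sufficient but not necessary.

```python
import re

def chained_replace(s, literals, replacement):
    """Apply one plain string replacement per literal, in order."""
    for lit in literals:
        s = s.replace(lit, replacement)
    return s

def chain_is_safe(literals, replacement):
    """Conservative check (assumption of this sketch): every alternative is a
    single character and the replacement contains none of them."""
    return all(len(lit) == 1 for lit in literals) and not any(
        lit in replacement for lit in literals
    )

# The counterexample from above: the two forms disagree.
regex_result = re.sub("a|ab", "a", "abc")                # 'abc'
chain_result = chained_replace("abc", ["a", "ab"], "a")  # 'ac'
```

With this rule, chain_is_safe(['a', 'ab'], 'a') is False, so that pattern would stay on the regex path, while 'o|f' with an empty replacement passes and both forms return the same result for 'foo'.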
Describe alternatives you've considered
We now have multi_contains (#11413), which is super fast. We can first check whether the value contains any of these characters, and if not, return the original value as is. Not sure if this is faster than replacing directly. Related to #11729.
We can also try to write a custom kernel to support multiple string replacement, like we did in multi_contains.
For single-character replacements, GpuTranslate may be faster than GpuReplace.
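A rough CPU-side sketch of two of these alternatives, in plain Python. It is only meant to show the logic, not the GPU implementation. The function names are hypothetical, the any(...) check stands in for a multi_contains-style test, and str.translate stands in for GpuTranslate.

```python
def replace_with_precheck(s, literals, replacement):
    """Skip the replacement work entirely when no target occurs in the value."""
    if not any(lit in s for lit in literals):  # stand-in for a multi_contains check
        return s
    for lit in literals:
        s = s.replace(lit, replacement)
    return s

def translate_chars(s, chars, replacement_char):
    """Map every character in `chars` to `replacement_char` in a single pass.

    An empty `replacement_char` deletes the character.
    """
    table = {ord(c): (replacement_char or None) for c in chars}
    return s.translate(table)
```

Because the translate-style version handles all target characters in one pass, it does not depend on the order of the alternatives the way chained replacements do. Whether either variant is actually faster on the GPU would still need to be measured.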