Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Rewrite some regexp_replace cases to faster expressions #11809

Closed
2 tasks
thirtiseven opened this issue Dec 3, 2024 · 2 comments
Closed
2 tasks

[FEA] Rewrite some regexp_replace cases to faster expressions #11809

thirtiseven opened this issue Dec 3, 2024 · 2 comments
Labels
invalid This doesn't seem right

Comments

@thirtiseven
Copy link
Collaborator

thirtiseven commented Dec 3, 2024

Is your feature request related to a problem? Please describe.
Similar to #10741, regexp_replace is also expensive, but many use cases are just simple string replacements, like regexp_replace('foo', 'o', 'a') or regexp_replace('foo', 'o|f', '').

We can rewrite these case to GpuReplace or multiple GpuReplace calls to make them much faster.

Describe the solution you'd like
We have a regex parser in plugin code here to translate a regex pattern to cudf supported style and check fallback. We can reuse it if possible to match if it is a simple pattern that can be replaced, and replace that case to the faster expressions.

Planned cases to support:

  • regexp_replace(str, 'a', 'b') -> GpuReplace(str, 'a', 'b')

  • regexp_replace(str, 'a|b', 'c') -> GpuReplace(GpuReplace(str, 'a', 'c'), 'b', 'c')

For multiple replacements cases, we can see that the regex_replace is not equivalent to multiple replacements. For example:

regexp_replace('abc', 'a|ab', 'a') -> 'abc'
replace(replace('abc', 'a', 'a'), 'ab', 'a') -> 'ac'

Describe alternatives you've considered

Now we have multi_contains #11413 which is super fast. We can try to check if the value contains these characters first, if not, we can return the original value as is. Not sure if it is faster than replacing directly. Related to #11729

We can also try to write a custom kernel to support multiple string replacement, like we did in multi_contains.

For single character replacements, maybe GpuTranslate can be faster than GpuReplace.

@thirtiseven thirtiseven added ? - Needs Triage Need team to review and classify feature request New feature or request performance A performance related task/issue labels Dec 3, 2024
@thirtiseven
Copy link
Collaborator Author

thirtiseven commented Dec 3, 2024

Seems we already have this improvement, but we seen a case that didn't be rewrite in some case, maybe a bug. Will update this issue soon.

update: Oh ok the problem is we don't support escape character like \n for this improvement.

@thirtiseven
Copy link
Collaborator Author

ok, close this, will file a new one for escape cases.

@thirtiseven thirtiseven added invalid This doesn't seem right and removed feature request New feature or request performance A performance related task/issue labels Dec 3, 2024
@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Dec 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
invalid This doesn't seem right
Projects
None yet
Development

No branches or pull requests

2 participants