-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve performance of regex_replace
#3518
Comments
regex_replace
I definitely agree that there is a lot of room for improvement. Was playing with a couple of ideas, and especially for the second point you've mentioned I think it might be a decent-ish gain by itself. Took the liberty of creating a separate issue describing the approach, #3613 FYI. Maybe we can treat this one as the meta issue. |
I was looking into this one; but even on cases where the array is very sparse (in testings, I've used %20 data / %80 nulls), there doesn't seem to be an observable difference between building the underlying data buffers by ourselves and reusing the existing null buffer vs rebuilding everything. The experiment is in this branch, and the results are (on release mode):
So a ~5% speed-up, but I highly suspect it might be just noise. (There is also a chance that I completely misunderstood the concept 😄 so am very open to input on the experiment code on what else I can try). |
It might be that regexes themselves are so expensive, that the "null buffer" reuse has minimal benefit. |
Initial profiling indicates even with a very simple regex, for the query in that example, we spent around ~%25 of the whole execution time in So perhaps we could extend the initial adapter with a system that can also receive hints (regarding whether the arrays needs to be padded or not). |
Now benchmarking null buffer re-use code on top of #3762, I can see a ~%20 improvement on a very sparse array. It might be worth it, but I also have to benchmark without unsafe (one question would be whether it is fine to have an |
@Dandandan should we close this issue as well since all the sub-issues are now resolved, #3613, #3762, #3803? |
Thanks @isidentical that makes a lot of sense! |
Thank you for the amazing ideas / pointers and your reviews ❤️ |
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Currently by far the worst performing query in ClickBench for datafusion is query 28:
It seems this is due too
REGEXP_REPLACE
in this query.Describe the solution you'd like
Introduce a benchmark and improve the performance of
REGEXP_REPLACE
(with a scalar) in DataFusion.Describe alternatives you've considered
n/a
Additional context
There are some more efficient patterns in arrow-rs for writing (string/regex) kernels. This could be used here as well.
Optimizations that might be beneficial:
The text was updated successfully, but these errors were encountered: