
Improve performance of regex_replace #3518

Closed
Dandandan opened this issue Sep 16, 2022 · 8 comments
Labels
enhancement New feature or request performance Make DataFusion faster

Comments

@Dandandan
Contributor

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Currently, by far the worst-performing query in ClickBench for DataFusion is query 28:

SELECT REGEXP_REPLACE("Referer", '^https?://(?:www\.)?([^/]+)/.*$', '\1') AS k, AVG(length("Referer")) AS l, COUNT(*) AS c, MIN("Referer") FROM hits WHERE "Referer" <> '' GROUP BY k HAVING COUNT(*) > 100000 ORDER BY l DESC LIMIT 25;

It seems this is due to REGEXP_REPLACE in this query.

Describe the solution you'd like
Introduce a benchmark and improve the performance of REGEXP_REPLACE (with a scalar) in DataFusion.

Describe alternatives you've considered
n/a
Additional context

There are some more efficient patterns in arrow-rs for writing (string/regex) kernels. These could be used here as well.
Optimizations that might be beneficial:

  • reusing the null buffer from the input array (instead of rebuilding in the iterator)
  • special casing scalar type, so no hashmap has to be used for storing / retrieving compiled regexes
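To illustrate the second point, here is a minimal Rust sketch of the scalar fast path. The names (`CompiledPattern`, `regexp_replace_scalar`) are hypothetical, and an expensive regex compile is only simulated so the example stays self-contained; DataFusion's real implementation uses the `regex` crate and arrow-rs arrays.

```rust
// Hypothetical stand-in for a compiled regex; in the real code this would
// be regex::Regex, whose construction is the expensive step.
struct CompiledPattern {
    pattern: String,
}

impl CompiledPattern {
    fn compile(pattern: &str) -> Self {
        // Stands in for Regex::new(pattern), which is costly per call.
        CompiledPattern { pattern: pattern.to_string() }
    }

    fn replace(&self, input: &str, replacement: &str) -> String {
        // Stands in for a real regex replace; literal match for illustration.
        input.replace(&self.pattern, replacement)
    }
}

// Scalar fast path: when the pattern is the same for every row, compile it
// exactly once up front -- no per-row hashmap lookup of compiled regexes.
fn regexp_replace_scalar(
    values: &[Option<&str>],
    pattern: &str,
    replacement: &str,
) -> Vec<Option<String>> {
    let re = CompiledPattern::compile(pattern); // compiled once, reused per row
    values
        .iter()
        .map(|v| v.map(|s| re.replace(s, replacement)))
        .collect()
}

fn main() {
    let out = regexp_replace_scalar(&[Some("https://example.com"), None], "https", "http");
    assert_eq!(out, vec![Some("http://example.com".to_string()), None]);
}
```

The contrast is with the array-pattern path, where each row may carry a different pattern and a hashmap keyed by pattern string is used to cache compiled regexes.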
@Dandandan Dandandan added enhancement New feature or request performance Make DataFusion faster labels Sep 16, 2022
@Dandandan Dandandan changed the title from "Improve performance of `regex_replace" to "Improve performance of regex_replace" Sep 16, 2022
@isidentical
Contributor

I definitely agree that there is a lot of room for improvement. Was playing with a couple of ideas, and especially for the second point you've mentioned I think it might be a decent-ish gain by itself. Took the liberty of creating a separate issue describing the approach, #3613 FYI. Maybe we can treat this one as the meta issue.

@isidentical
Contributor

reusing the null buffer from the input array (instead of rebuilding in the iterator)

I was looking into this one, but even in cases where the array is very sparse (in testing, I used 20% data / 80% nulls), there doesn't seem to be an observable difference between building the underlying data buffers ourselves while reusing the existing null buffer vs. rebuilding everything. The experiment is in this branch, and the results (in release mode) are:

  • baseline is ~0.721s ish
  • that branch is ~0.690s ish

So a ~5% speed-up, but I highly suspect it might be just noise. (There is also a chance that I completely misunderstood the concept 😄 so am very open to input on the experiment code on what else I can try).
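For readers unfamiliar with the idea being benchmarked, here is a minimal stdlib-only Rust sketch. The function names are illustrative and validity is modeled as a plain `Vec<bool>`; in arrow-rs the null buffer is a packed bitmap that can be cloned cheaply rather than rebuilt element by element.

```rust
// Rebuilds the validity mask element by element inside the iterator,
// alongside the data buffer (the "baseline" approach).
fn replace_rebuilding_nulls(values: &[Option<String>]) -> (Vec<String>, Vec<bool>) {
    let mut data = Vec::with_capacity(values.len());
    let mut validity = Vec::with_capacity(values.len());
    for v in values {
        match v {
            Some(s) => {
                data.push(s.replace("x", "y")); // stands in for a regex replace
                validity.push(true);
            }
            None => {
                data.push(String::new());
                validity.push(false);
            }
        }
    }
    (data, validity)
}

// Reuses the input's validity mask wholesale; only the data buffer is
// rebuilt. With arrow-rs this would be a cheap null-buffer clone.
fn replace_reusing_nulls(
    values: &[Option<String>],
    input_validity: &[bool],
) -> (Vec<String>, Vec<bool>) {
    let data = values
        .iter()
        .map(|v| v.as_deref().unwrap_or("").replace("x", "y"))
        .collect();
    (data, input_validity.to_vec())
}

fn main() {
    let vals = vec![Some("axb".to_string()), None, Some("xyz".to_string())];
    let validity: Vec<bool> = vals.iter().map(|v| v.is_some()).collect();
    // Both approaches must produce identical results; only the cost differs.
    assert_eq!(replace_rebuilding_nulls(&vals), replace_reusing_nulls(&vals, &validity));
}
```

The experiment above suggests the win is small here because the regex work dominates; the saving is only the per-element validity bookkeeping.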

@Dandandan
Contributor Author

It might be that regexes themselves are so expensive, that the "null buffer" reuse has minimal benefit.

@isidentical
Contributor

It might be that regexes themselves are so expensive, that the "null buffer" reuse has minimal benefit.

Initial profiling indicates that even with a very simple regex, for the query in that example we spend around ~25% of the whole execution time in into_array, which is due to our usage of the adapter even in the specialized mode.

https://cs.github.com/apache/arrow-datafusion/blob/15289610318e4acad48e40f5adabe0c5a9e8f9b9/datafusion/physical-expr/src/regex_expressions.rs#L307

So perhaps we could extend the initial adapter with a system that can also receive hints (regarding whether the arrays need to be padded or not).

@isidentical
Contributor

Now benchmarking the null buffer re-use code on top of #3762, I can see a ~20% improvement on a very sparse array. It might be worth it, but I also have to benchmark without unsafe (one question would be whether it is fine to have unsafe in the implementation for a relatively low speed-up).

@isidentical
Contributor

@Dandandan should we close this issue as well since all the sub-issues are now resolved, #3613, #3762, #3803?

@Dandandan
Contributor Author

Thanks @isidentical that makes a lot of sense!
Thank you for all the improvements 🙏

@isidentical
Contributor

Thank you for the amazing ideas / pointers and your reviews ❤️
