
Improve performance of regex_replace #3518

Closed
Dandandan opened this issue Sep 16, 2022 · 8 comments
Labels
enhancement New feature or request performance Make DataFusion faster

Comments

@Dandandan
Contributor

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Currently, by far the worst-performing query in ClickBench for DataFusion is query 28:

SELECT REGEXP_REPLACE("Referer", '^https?://(?:www\.)?([^/]+)/.*$', '\1') AS k, AVG(length("Referer")) AS l, COUNT(*) AS c, MIN("Referer") FROM hits WHERE "Referer" <> '' GROUP BY k HAVING COUNT(*) > 100000 ORDER BY l DESC LIMIT 25;

It seems this is due to REGEXP_REPLACE in this query.

Describe the solution you'd like
Introduce a benchmark and improve the performance of REGEXP_REPLACE (with a scalar) in DataFusion.

Describe alternatives you've considered
n/a
Additional context

There are some more efficient patterns in arrow-rs for writing (string/regex) kernels. These could be used here as well.
Optimizations that might be beneficial:

  • reusing the null buffer from the input array (instead of rebuilding in the iterator)
  • special casing scalar type, so no hashmap has to be used for storing / retrieving compiled regexes
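To illustrate the second point, here is a minimal Rust sketch of the scalar fast path. The names (`CompiledPattern`, `regexp_replace_scalar`) are hypothetical, and an expensive regex compile is only simulated so the example stays self-contained; DataFusion's real implementation uses the `regex` crate and arrow-rs arrays.

```rust
// Hypothetical stand-in for a compiled regex; in the real code this would
// be regex::Regex, whose construction is the expensive step.
struct CompiledPattern {
    pattern: String,
}

impl CompiledPattern {
    fn compile(pattern: &str) -> Self {
        // Stands in for Regex::new(pattern), which is costly per call.
        CompiledPattern { pattern: pattern.to_string() }
    }

    fn replace(&self, input: &str, replacement: &str) -> String {
        // Stands in for a real regex replace; literal match for illustration.
        input.replace(&self.pattern, replacement)
    }
}

// Scalar fast path: when the pattern is the same for every row, compile it
// exactly once up front -- no per-row hashmap lookup of compiled regexes.
fn regexp_replace_scalar(
    values: &[Option<&str>],
    pattern: &str,
    replacement: &str,
) -> Vec<Option<String>> {
    let re = CompiledPattern::compile(pattern); // compiled once, reused per row
    values
        .iter()
        .map(|v| v.map(|s| re.replace(s, replacement)))
        .collect()
}

fn main() {
    let out = regexp_replace_scalar(&[Some("https://example.com"), None], "https", "http");
    assert_eq!(out, vec![Some("http://example.com".to_string()), None]);
}
```

The contrast is with the array-pattern path, where each row may carry a different pattern and a hashmap keyed by pattern string is used to cache compiled regexes.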
@Dandandan Dandandan added enhancement New feature or request performance Make DataFusion faster labels Sep 16, 2022
@Dandandan Dandandan changed the title from "Improve performance of `regex_replace" to "Improve performance of regex_replace" Sep 16, 2022
@isidentical
Contributor

I definitely agree that there is a lot of room for improvement. Was playing with a couple of ideas, and especially for the second point you've mentioned I think it might be a decent-ish gain by itself. Took the liberty of creating a separate issue describing the approach, #3613 FYI. Maybe we can treat this one as the meta issue.

@isidentical
Contributor

reusing the null buffer from the input array (instead of rebuilding in the iterator)

I was looking into this one, but even in cases where the array is very sparse (in testing, I used 20% data / 80% nulls), there doesn't seem to be an observable difference between building the underlying data buffers ourselves while reusing the existing null buffer vs. rebuilding everything. The experiment is in this branch, and the results (in release mode) are:

  • baseline is ~0.721s ish
  • that branch is ~0.690s ish

So a ~5% speed-up, but I highly suspect it might be just noise. (There is also a chance that I completely misunderstood the concept 😄 so am very open to input on the experiment code on what else I can try).
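For readers unfamiliar with the idea being benchmarked, here is a minimal stdlib-only Rust sketch. The function names are illustrative and validity is modeled as a plain `Vec<bool>`; in arrow-rs the null buffer is a packed bitmap that can be cloned cheaply rather than rebuilt element by element.

```rust
// Rebuilds the validity mask element by element inside the iterator,
// alongside the data buffer (the "baseline" approach).
fn replace_rebuilding_nulls(values: &[Option<String>]) -> (Vec<String>, Vec<bool>) {
    let mut data = Vec::with_capacity(values.len());
    let mut validity = Vec::with_capacity(values.len());
    for v in values {
        match v {
            Some(s) => {
                data.push(s.replace("x", "y")); // stands in for a regex replace
                validity.push(true);
            }
            None => {
                data.push(String::new());
                validity.push(false);
            }
        }
    }
    (data, validity)
}

// Reuses the input's validity mask wholesale; only the data buffer is
// rebuilt. With arrow-rs this would be a cheap null-buffer clone.
fn replace_reusing_nulls(
    values: &[Option<String>],
    input_validity: &[bool],
) -> (Vec<String>, Vec<bool>) {
    let data = values
        .iter()
        .map(|v| v.as_deref().unwrap_or("").replace("x", "y"))
        .collect();
    (data, input_validity.to_vec())
}

fn main() {
    let vals = vec![Some("axb".to_string()), None, Some("xyz".to_string())];
    let validity: Vec<bool> = vals.iter().map(|v| v.is_some()).collect();
    // Both approaches must produce identical results; only the cost differs.
    assert_eq!(replace_rebuilding_nulls(&vals), replace_reusing_nulls(&vals, &validity));
}
```

The experiment above suggests the win is small here because the regex work dominates; the saving is only the per-element validity bookkeeping.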

@Dandandan
Contributor Author

It might be that regexes themselves are so expensive, that the "null buffer" reuse has minimal benefit.

@isidentical
Contributor

It might be that regexes themselves are so expensive, that the "null buffer" reuse has minimal benefit.

Initial profiling indicates that even with a very simple regex, for the query in that example we spend around ~25% of the whole execution time in into_array, which is due to our usage of the adapter even in the specialized mode.

https://cs.github.com/apache/arrow-datafusion/blob/15289610318e4acad48e40f5adabe0c5a9e8f9b9/datafusion/physical-expr/src/regex_expressions.rs#L307

So perhaps we could extend the initial adapter with a system that can also receive hints (regarding whether the arrays need to be padded or not).

@isidentical
Contributor

Now benchmarking the null buffer re-use code on top of #3762, I can see a ~20% improvement on a very sparse array. It might be worth it, but I also have to benchmark without unsafe (one question would be whether it is fine to have unsafe in the implementation for a relatively low speed-up).

@isidentical
Contributor

@Dandandan should we close this issue as well since all the sub-issues are now resolved, #3613, #3762, #3803?

@Dandandan
Contributor Author

Thanks @isidentical that makes a lot of sense!
Thank you for all the improvements 🙏

@isidentical
Contributor

Thank you for the amazing ideas / pointers and your reviews ❤️
