Reported very slow performance compared to DuckDB in ibis-project #8492
Comments
I wonder if the slowness in regexp / time spent compiling stuff is related to not being able to pre-compile the argument and instead re-creating the regular expression for each batch. @thinkharderdev mentioned something similar for #8051 |
Thanks for creating the issue! The example can be simplified a bit. It should be sufficient to see the performance difference with DuckDB by:
|
I found a similar problem (the regex being recompiled again and again) some time ago in the ClickBench benchmark as well (query 28): |
test with simple regexp_match query with
|
Maybe related to this #8524 |
Seems unlikely to be related to #8524. The issue is present for a single file. |
I had some ideas on how to speed up regular expression evaluation here: #8051 (comment) |
The blog post is now published: https://ibis-project.org/posts/pydata-performance-part2/ |
I looked into the regular expression matching code in DataFusion, and there is a lot of room for improvement. It translates each argument into an array, even when the argument is a constant. DataFusion is thus effectively compiling the regular expression for each row (not even each batch), which is unsurprisingly quite expensive. This is very fixable, but the way the functions are wired in will take some finagling, I think. |
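To illustrate the point above, here is a minimal sketch in Python (not DataFusion's actual Rust code) that contrasts requesting a compiled regex for every row with compiling a constant pattern once and reusing it. The names `match_per_row`, `match_precompiled`, and `CountingCompiler` are hypothetical and exist only for this illustration. Note that Python's `re` module caches compiled patterns internally, so the counter below shows how often compilation is requested rather than the real cost in DataFusion.

```python
import re
from typing import Callable, List, Pattern


class CountingCompiler:
    """Hypothetical helper: wraps re.compile and counts how often it is called."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, pattern: str) -> Pattern[str]:
        self.calls += 1
        return re.compile(pattern)


def match_per_row(
    pattern: str, values: List[str], compile_fn: Callable[[str], Pattern[str]]
) -> List[bool]:
    """Treats the constant pattern like a per-row value: one compile per row."""
    return [compile_fn(pattern).search(v) is not None for v in values]


def match_precompiled(
    pattern: str, values: List[str], compile_fn: Callable[[str], Pattern[str]]
) -> List[bool]:
    """Compiles the constant pattern once, then reuses it for every row."""
    regex = compile_fn(pattern)
    return [regex.search(v) is not None for v in values]


values = ["https://google.com/a", "http://example.org", "nothing"]
per_row, once = CountingCompiler(), CountingCompiler()

slow = match_per_row(r"^https?://", values, per_row)
fast = match_precompiled(r"^https?://", values, once)
```

Both functions return the same matches; only the number of compile requests differs (one per row versus one in total), which is the overhead the comment above describes.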
Happy to try the query again once the next release is out! |
It is #8631 actually. 😄 |
Hi, could you share the method to draw the flamegraph? There is a large "unknown" in my flame graph, which prevents me from tracking the function call chain. Here is my shell script to generate the flamegraph:
#!/bin/bash
PERF_DATA_FILE="$1"
../../datafusion-cli/target/debug/datafusion-cli -f repro-range-query.sql &
# datafusion-cli -f repro-range-query.sql &
PID=$!
# use perf to collect data
sudo /usr/bin/perf_5.15 record --call-graph=dwarf -e cpu-clock -F 100 -p $PID -- sleep 30
sudo /usr/bin/perf_5.15 script -i perf.data >"perf.unfold"
FLAMEGRAPH_DIR="/home/deepin/sdk/FlameGraph"
"$FLAMEGRAPH_DIR/stackcollapse-perf.pl" "perf.unfold" >"perf.folded"
"$FLAMEGRAPH_DIR/flamegraph.pl" "perf.folded" >"$1.svg"
rm "perf.unfold" "perf.folded"
echo "Flame graph SVG created: $1.svg"
Thanks very much. @comphead |
@comphead recently added this to the contributor guide: https://arrow.apache.org/datafusion/library-user-guide/profiling.html#building-a-flamegraph |
Describe the bug
As reported by @cpcloud in ibis-project/ibis#7703
The most relevant portions:
DataFusion
DataFusion never ran out of memory and had a memory profile similar to DuckDB:
single digit GBs peak memory.
However, it was still extremely slow compared to DuckDB, about 9-10 minutes to
run the whole workload.
Similarly to Polars I compared both the Ibis implementation and a hand-written
SQL version (built from the generated Ibis code). Both had the same performance.
I also looked at perf top while the DataFusion workload was running and saw this:
To Reproduce
TBD (first thing would be to get a datafusion only reproducer)
Looks like the query, from ibis-project/ibis#7703 is
Expected behavior
No response
Additional context
No response