Reported very slow performance compared to DuckDB in ibis-project #8492

alamb · 2023-12-11T12:26:43Z

Describe the bug

As reported by @cpcloud in ibis-project/ibis#7703

The most relevant portions:

DataFusion
DataFusion never ran out of memory and had a memory profile similar to DuckDB:
single digit GBs peak memory.

However, it was still extremely slow compared to DuckDB, about 9-10 minutes to
run the whole workload.

Similarly to Polars I compared both the Ibis implementation and a hand-written
SQL version (built from the generated Ibis code). Both had the same performance

I also looked at perf top while the DataFusion workload was running and saw this:

To Reproduce

TBD (first thing would be to get a datafusion only reproducer)

Looks like the query, from ibis-project/ibis#7703 is

SELECT
  month,
  ext,
  COUNT(DISTINCT project_name) AS project_count
FROM (
  SELECT
    project_name,
    DATE_TRUNC('month', uploaded_on) AS month,
    NULLIF(
      REPLACE(
        REPLACE(
          REPLACE(
            REGEXP_REPLACE(
              REGEXP_REPLACE(
                REGEXP_MATCH(path, CONCAT('(', '\.([a-z0-9]+)$', ')'))[2],
                'cxx|cpp|cc|c|hpp|h',
                'C/C++',
                'g'
              ),
              '^f.*$',
              'Fortran',
              'g'
            ),
            'rs',
            'Rust'
          ),
          'go',
          'Go'
        ),
        'asm',
        'Assembly'
      ),
      ''
    ) AS ext
  FROM pypi
  WHERE COALESCE(
      ARRAY_LENGTH(
        REGEXP_MATCH(path, '\.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$')
      ) > 0,
      FALSE
    )
    AND NOT COALESCE(ARRAY_LENGTH(REGEXP_MATCH(path, '(^|/)test(|s|ing)')) > 0, FALSE)
    AND NOT STRPOS(path, '/site-packages/') > 0
)
WHERE ext IS NOT NULL
GROUP BY month, ext
ORDER BY month DESC, project_count DESC

Expected behavior

No response

Additional context

No response

alamb · 2023-12-11T12:28:50Z

I wonder if the slowness in regexp / time spent compiling stuff is related to not being able to pre-compile the argument and instead re-creating the regular expression for each batch.

@thinkharderdev mentioned something similar for #8051

cpcloud · 2023-12-11T12:33:29Z

Thanks for creating the issue!

The example can be simplified a bit.

It should be sufficient to see the performance difference with DuckDB by:

Using any single file from the dataset (they're all big enough to take some noticeable amount of time)
Removing the aggregation and using only the inner SELECT.

Dandandan · 2023-12-12T10:15:26Z

A similar problem (recompiling the regex again and again) I found some time ago in the clickbench benchmark as well (query 28):
https://github.com/JayjeetAtGithub/datafusion-duckdb-benchmark/blob/main/clickbench/queries-datafusion.sql#L29

zhangxffff · 2023-12-14T10:33:21Z

test with simple regexp_match query with index-0.parquet
array-datafusion took 47.847 seconds.

SELECT COUNT(*) FROM '*.parquet' WHERE
    ARRAY_LENGTH(
      REGEXP_MATCH(path, '\\.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$')
    ) > 0;


DataFusion CLI v33.0.0
+----------+
| COUNT(*) |
+----------+
| 5834398  |
+----------+
1 row in set. Query took 47.847 seconds.

../datafusion-cli/target/release/datafusion-cli -f d0.sql  225.03s user 5.77s system 477% cpu 48.353 total

duckdb took 3.029 seconds

SELECT COUNT(*) FROM '*.parquet' WHERE
    ARRAY_LENGTH(
      REGEXP_EXTRACT_ALL(path, '\.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$')
    ) > 0;


┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│      5834398 │
└──────────────┘
../../duckdb/build/release/duckdb -s "`cat d1.sql`"  15.70s user 1.78s system 576% cpu 3.029 total

Weijun-H · 2023-12-14T13:30:53Z

test with simple regexp_match query with index-0.parquet array-datafusion took 47.847 seconds.

SELECT COUNT(*) FROM '*.parquet' WHERE
    ARRAY_LENGTH(
      REGEXP_MATCH(path, '\\.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$')
    ) > 0;


DataFusion CLI v33.0.0
+----------+
| COUNT(*) |
+----------+
| 5834398  |
+----------+
1 row in set. Query took 47.847 seconds.

../datafusion-cli/target/release/datafusion-cli -f d0.sql  225.03s user 5.77s system 477% cpu 48.353 total

duckdb took 3.029 seconds

SELECT COUNT(*) FROM '*.parquet' WHERE
    ARRAY_LENGTH(
      REGEXP_EXTRACT_ALL(path, '\.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$')
    ) > 0;


┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│      5834398 │
└──────────────┘
../../duckdb/build/release/duckdb -s "`cat d1.sql`"  15.70s user 1.78s system 576% cpu 3.029 total

Maybe related to this #8524

cpcloud · 2023-12-14T13:33:34Z

Seems unlikely to be related to #8524. The issue is present for a single file.

alamb · 2023-12-14T14:29:05Z

I had some ideas on how to speed up regular expression evaluation here: #8051 (comment)

comphead · 2023-12-15T04:05:16Z

Attaching Flamegraph for the query

SELECT COUNT(*) FROM '*.parquet' WHERE
    ARRAY_LENGTH(
      REGEXP_EXTRACT_ALL(path, '\.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$')
    ) > 0;

alamb · 2023-12-15T12:21:53Z

The blog post is now published: https://ibis-project.org/posts/pydata-performance-part2/

alamb · 2023-12-15T12:34:03Z

I looked into the regular expression matching code in DataFusion -- there is a lot of room for improvement:

It translates each argument into an array (even when the argument is a constant). Thus DataFusion is effectively compiling the regular expression for each row (not even each batch) which is unsurprisingly quite expensive

This is very fixable but the way the functions are wired in will take some finagling I think

alamb · 2023-12-23T12:02:02Z

@viirya found some very low hanging fruit: ~~#8538~~ #8631

cpcloud · 2023-12-23T14:00:31Z

Happy to try the query again once the next release is out!

viirya · 2023-12-23T19:23:19Z

It is #8631 actually. 😄

my-vegetable-has-exploded · 2024-03-23T05:15:08Z

Attaching Flamegraph for the query

SELECT COUNT(*) FROM '*.parquet' WHERE
    ARRAY_LENGTH(
      REGEXP_EXTRACT_ALL(path, '\.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$')
    ) > 0;

Hi, could you share the method to draw the flamegraph?

There is a large "unknown" in my flame graph, which prevents me from tracking the function call chain.

Here is my shell to generate the flamegraph using datafusion-cli.

#!/bin/bash
PERF_DATA_FILE="$1"
../../datafusion-cli/target/debug/datafusion-cli -f repro-range-query.sql &
# datafusion-cli -f repro-range-query.sql &
PID=$!
# use perf to collect data
sudo /usr/bin/perf_5.15 record --call-graph=dwarf -e cpu-clock -F 100 -p $PID -- sleep 30
sudo /usr/bin/perf_5.15 script -i perf.data >"perf.unfold"
FLAMEGRAPH_DIR="/home/deepin/sdk/FlameGraph"
"$FLAMEGRAPH_DIR/stackcollapse-perf.pl" "perf.unfold" >"perf.folded"

"$FLAMEGRAPH_DIR/flamegraph.pl" "perf.folded" >"$1.svg"

rm "perf.unfold" "perf.folded"
echo "Flame graph SVG created: $1.svg"

Thanks very much. @comphead

alamb · 2024-03-23T10:48:52Z

Hi, could you share the method to draw the flamegraph?

@comphead recently added this to the contributor guide: https://arrow.apache.org/datafusion/library-user-guide/profiling.html#building-a-flamegraph

alamb added the bug Something isn't working label Dec 11, 2023

alamb mentioned this issue Dec 11, 2023

docs(blog-post): pydata performance part 2; polars and datafusion ibis-project/ibis#7703

Merged

alamb added the help wanted Extra attention is needed label Dec 11, 2023

viirya mentioned this issue Dec 22, 2023

Improve regexp_match performance by avoiding cloning Regex #8631

Merged

viirya closed this as completed in #8631 Dec 24, 2023

alamb mentioned this issue Mar 5, 2024

Specialized / Pre-compiled / Prepared ScalarUDFs #8051

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reported very slow performance compared to DuckDB in ibis-project #8492

Reported very slow performance compared to DuckDB in ibis-project #8492

alamb commented Dec 11, 2023

alamb commented Dec 11, 2023

cpcloud commented Dec 11, 2023 •

edited

Loading

Dandandan commented Dec 12, 2023

zhangxffff commented Dec 14, 2023

Weijun-H commented Dec 14, 2023

cpcloud commented Dec 14, 2023

alamb commented Dec 14, 2023

comphead commented Dec 15, 2023

alamb commented Dec 15, 2023

alamb commented Dec 15, 2023

alamb commented Dec 23, 2023 •

edited

Loading

cpcloud commented Dec 23, 2023 •

edited

Loading

viirya commented Dec 23, 2023

my-vegetable-has-exploded commented Mar 23, 2024 •

edited

Loading

alamb commented Mar 23, 2024

Reported very slow performance compared to DuckDB in ibis-project #8492

Reported very slow performance compared to DuckDB in ibis-project #8492

Comments

alamb commented Dec 11, 2023

Describe the bug

To Reproduce

Expected behavior

Additional context

alamb commented Dec 11, 2023

cpcloud commented Dec 11, 2023 • edited Loading

Dandandan commented Dec 12, 2023

zhangxffff commented Dec 14, 2023

Weijun-H commented Dec 14, 2023

cpcloud commented Dec 14, 2023

alamb commented Dec 14, 2023

comphead commented Dec 15, 2023

alamb commented Dec 15, 2023

alamb commented Dec 15, 2023

alamb commented Dec 23, 2023 • edited Loading

cpcloud commented Dec 23, 2023 • edited Loading

viirya commented Dec 23, 2023

my-vegetable-has-exploded commented Mar 23, 2024 • edited Loading

alamb commented Mar 23, 2024

cpcloud commented Dec 11, 2023 •

edited

Loading

alamb commented Dec 23, 2023 •

edited

Loading

cpcloud commented Dec 23, 2023 •

edited

Loading

my-vegetable-has-exploded commented Mar 23, 2024 •

edited

Loading