Add planning benchmarks with parquet and sortedness #13098

Closed
alamb opened this issue Oct 24, 2024 · 1 comment · Fixed by #13103
Labels: enhancement (New feature or request)

Comments

@alamb
Contributor

alamb commented Oct 24, 2024

Is your feature request related to a problem or challenge?

@mnorfolk03 added a planning benchmark for more sophisticated queries in #13085 ❤️

The benchmarks are in https://github.com/apache/datafusion/blob/main/datafusion/core/benches/sql_planner.rs

However, the planning benchmarks we have now don't reflect querying an actual data source such as parquet (they query an empty in-memory table).

One helpful improvement would be to add benchmarks that plan against a ParquetExec, as well as queries over data with a declared sort order, to reflect more real-world cases.

Describe the solution you'd like

I would like some planning benchmarks that are equivalent to planning against tables like these (docs here): https://datafusion.apache.org/user-guide/sql/ddl.html#create-external-table

CREATE EXTERNAL TABLE foo STORED AS PARQUET LOCATION '..';

CREATE EXTERNAL TABLE test (
    c1  VARCHAR NOT NULL,
    c2  INT NOT NULL,
    c3  SMALLINT NOT NULL,
    c4  SMALLINT NOT NULL,
    c5  INT NOT NULL,
    c6  BIGINT NOT NULL,
    c7  SMALLINT NOT NULL,
    c8  INT NOT NULL,
    c9  BIGINT NOT NULL,
    c10 VARCHAR NOT NULL,
    c11 FLOAT NOT NULL,
    c12 DOUBLE NOT NULL,
    c13 VARCHAR NOT NULL
)
STORED AS CSV
WITH ORDER (c2 ASC, c5 + c8 DESC NULLS FIRST)
LOCATION '/path/to/aggregate_test_100.csv'
OPTIONS ('has_header' 'true');
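
As a rough illustration of the kind of query "sortedness" refers to here (a sketch only, not one of the existing benchmark queries): assuming the `test` table above has been created, a query whose ORDER BY matches the leading key of the declared WITH ORDER forces planning to take the declared ordering into account. EXPLAIN shows the resulting plan without executing the query.

```sql
-- Hypothetical example query; assumes the `test` table defined above exists.
-- The ORDER BY matches the leading sort key declared in WITH ORDER (c2 ASC, ...),
-- so the planner has to reason about the table's declared ordering.
EXPLAIN
SELECT c1, c2
FROM test
ORDER BY c2 ASC
LIMIT 10;
```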

Describe alternatives you've considered

One possibility could be to add a benchmark for planning the clickbench queries: https://github.com/apache/datafusion/tree/main/benchmarks/queries/clickbench

We could use the smaller hits.parquet file here: https://github.com/apache/datafusion/blob/main/datafusion/core/tests/data/clickbench_hits_10.parquet

Additional context

No response

@Omega359
Contributor

take
