Delta Lake vs Parquet benchmarks for different query engines #2012

Open
MrPowers opened this issue Jan 2, 2024 · 5 comments
Labels
documentation Improvements or additions to documentation enhancement New feature or request

Comments

@MrPowers
Collaborator

MrPowers commented Jan 2, 2024

As mentioned by @djouallah in this PR, there are some queries where Parquet outperforms Delta Lake for DataFusion.

As I mentioned in that thread, the data a given query touches can be well laid out in a Parquet file but poorly laid out in a Delta table, which might explain these differences.
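
To make the layout point concrete, here is a toy sketch of one plausible mechanism: skipping based on min/max statistics. It assumes the reader keeps a (min, max) pair per chunk (a row group or a file) and skips chunks that cannot match the filter. The names `make_chunk_stats` and `chunks_to_scan` are made up for illustration and are not part of any library, and this is not a claim about what DataFusion actually does for the queries in question.

```python
import random


def make_chunk_stats(values, chunk_size):
    """Split values into fixed-size chunks and record (min, max) per chunk."""
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    return [(min(chunk), max(chunk)) for chunk in chunks]


def chunks_to_scan(stats, lo, hi):
    """Count chunks whose [min, max] range overlaps the filter lo <= x <= hi."""
    return sum(1 for cmin, cmax in stats if cmax >= lo and cmin <= hi)


random.seed(0)
values = list(range(10_000))

# Layout A: data sorted on the filter column, so each chunk covers a narrow range.
clustered = make_chunk_stats(values, 1_000)

# Layout B: same data, but spread randomly across chunks (hypothetical bad layout).
shuffled_values = values[:]
random.shuffle(shuffled_values)
scattered = make_chunk_stats(shuffled_values, 1_000)

clustered_scans = chunks_to_scan(clustered, 2_000, 2_999)
scattered_scans = chunks_to_scan(scattered, 2_000, 2_999)
print("clustered layout scans:", clustered_scans, "chunk(s)")
print("scattered layout scans:", scattered_scans, "chunk(s)")
```

With the sorted layout only one chunk overlaps the filter, while the shuffled layout forces the reader to open many more chunks for the same rows. The same data can therefore look fast or slow depending only on how it was written.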

In any case, I think it would be useful to have benchmarks that show how the same queries perform on a Parquet file vs. a Delta table. The TPC-H queries in this notebook seem like a reasonable starting point.

Some benchmarks showing some realistic end-to-end query patterns would be cool too, for example:

  • Convert a CSV file to Parquet / Delta Lake
  • Delete some rows
  • Upsert some data
  • Run a query
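
A rough sketch of how such an end-to-end run could be timed is below. The harness (`run_benchmark`) is hypothetical, and the four steps are in-memory stand-ins that only mimic the shape of the workflow. They do not touch Parquet or Delta Lake; in a real benchmark each callable would invoke the engine under test.

```python
import time


def run_benchmark(steps):
    """Run (name, callable) steps in order and return [(name, seconds)]."""
    timings = []
    for name, step in steps:
        start = time.perf_counter()
        step()
        timings.append((name, time.perf_counter() - start))
    return timings


# Stand-in "table": a dict keyed by id. This is an assumption for illustration only.
rows = {}


def convert():
    # Stand-in for converting a CSV file to Parquet / Delta Lake.
    rows.update({i: i * 2 for i in range(1000)})


def delete_rows():
    # Stand-in for deleting some rows (here: ids divisible by 10).
    for key in [k for k in rows if k % 10 == 0]:
        del rows[key]


def upsert():
    # Stand-in for an upsert: overwrite existing ids and add new ones.
    rows.update({i: -1 for i in range(995, 1005)})


def query():
    # Stand-in for running a query (a simple aggregate).
    return sum(rows.values())


steps = [
    ("convert", convert),
    ("delete", delete_rows),
    ("upsert", upsert),
    ("query", query),
]
timings = run_benchmark(steps)
for name, seconds in timings:
    print(f"{name}: {seconds:.6f}s")
```

Timing each step separately, rather than only the final query, would show whether a difference comes from the write path (conversion, deletes, upserts) or from the read path.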
MrPowers added the enhancement label on Jan 2, 2024
rtyler added the documentation label on Jan 5, 2024
@djouallah

Now I have maybe a more useful use case: if you compare GlareDB, which uses DataFusion and delta-rs, against DataFusion on a dataset generated by delta-rs Python, there is a non-trivial difference.

[image]

@ion-elgreco
Collaborator

> Now I have maybe a more useful use case: if you compare GlareDB, which uses DataFusion and delta-rs, against DataFusion on a dataset generated by delta-rs Python, there is a non-trivial difference.
>
> [image]

@djouallah can you try out polars-deltalake and share if you see improvements there?

@djouallah

@ion-elgreco Last time I checked, Polars did not support the full TPC-H SQL.

@ion-elgreco
Collaborator

@djouallah I mean using my Polars extension which does native reads with Polars engine: https://pypi.org/project/polars-deltalake/

@djouallah

> @djouallah I mean using my Polars extension which does native reads with Polars engine: https://pypi.org/project/polars-deltalake/

I understand, but the benchmark uses SQL and Polars has limited SQL support, so unfortunately I can't run the test yet :(


4 participants