
docs: datafusion integration #1993

Merged · 3 commits · Dec 29, 2023

Conversation

MrPowers
Contributor

Purpose: document the DataFusion integration page.

Need to figure out why Delta Lake depends on DataFusion and put a little info in this guide before merging.

@MrPowers MrPowers marked this pull request as ready for review December 26, 2023 22:02
@djouallah

I think it may be useful to clarify that using an Arrow dataset to read a Delta table is just a workaround with some serious limitations. Currently, as far as I can tell, stats are not passed; in a simple benchmark, reading Parquet directly is substantially faster.

https://colab.research.google.com/drive/1sJD7w6l7RUjRHoPKoM4EQGcfwRqfKCqk#scrollTo=KMX-DymJKIh4

@ion-elgreco
Collaborator

@djouallah which pyarrow and deltalake version did you use?

@djouallah

> @djouallah which pyarrow and deltalake version did you use?

pyarrow 14.0.2, deltalake 0.14

@djouallah

Just as a reference: #1838

@ion-elgreco
Collaborator

ion-elgreco commented Dec 26, 2023

@djouallah could you add polars and duckdb in the mix so we can compare across engines?

In the end, having native readers would be better. I raised an issue with Polars to get a Parquet dataset abstraction so we can get better read performance instead of going through pyarrow.

@djouallah

> @djouallah could you add polars and duckdb in the mix so we can compare across engines?

Polars does not support arbitrary SQL, so I can't use it. I will add DuckDB Parquet vs Delta, but it has the same problem: total row counts are not passed, so DuckDB ends up with weird query plans.

@ion-elgreco
Collaborator

> @djouallah could you add polars and duckdb in the mix so we can compare across engines?
>
> Polars does not support arbitrary SQL, so I can't use it. I will add DuckDB Parquet vs Delta, but it has the same problem: total row counts are not passed, so DuckDB ends up with weird query plans.

Polars has a SQL context: https://pola-rs.github.io/polars/user-guide/sql/intro/#execute-queries-from-multiple-sources

@djouallah

> @djouallah could you add polars and duckdb in the mix so we can compare across engines?
>
> Polars does not support arbitrary SQL, so I can't use it. I will add DuckDB Parquet vs Delta, but it has the same problem: total row counts are not passed, so DuckDB ends up with weird query plans.
>
> Polars has a SQL context: https://pola-rs.github.io/polars/user-guide/sql/intro/#execute-queries-from-multiple-sources

I know, but last time I tried, it did not support all 22 queries.

@MrPowers
Contributor Author

@djouallah - stats are passed. The runtime for Delta Lake vs Parquet on a small dataset is quite volatile and really depends on how the data is distributed. For example, suppose you need to query 1% of the data and the entire dataset is 50 GB.

The Delta table could have 99% data skipping or 0% data skipping.

The Parquet table could also have 99% data skipping (only the relevant data is in one of the row groups) or 0% data skipping.

In order to make an apples-to-apples Parquet vs Delta Lake comparison, the file distribution should be similar to the row group distribution.

I am running these queries on a 50 GB dataset locally (I have a MacBook M1 with 64 GB of RAM). The query runs in ~5 seconds - pretty fast!
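The file-level skipping argument above can be sketched with a toy min/max pruning pass. This is purely illustrative (the file names, stat ranges, and query range below are made up, not delta-rs internals): a clustered layout skips almost everything, while a shuffled layout of the same data skips nothing.

```python
# Toy illustration of file-level data skipping (not the delta-rs implementation).
# Each "file" carries hypothetical min/max stats for one column; a range
# predicate can skip any file whose stat range cannot contain matching rows.

def files_to_scan(file_stats, lo, hi):
    """Return the files whose [min, max] stat range overlaps the query range [lo, hi]."""
    return [f for f, (fmin, fmax) in file_stats.items() if fmax >= lo and fmin <= hi]

# Clustered layout: each file covers a narrow, disjoint range -> great skipping.
clustered = {"f0": (0, 9), "f1": (10, 19), "f2": (20, 29), "f3": (30, 39)}
# Shuffled layout: every file spans the whole domain -> zero skipping.
shuffled = {"f0": (0, 39), "f1": (0, 39), "f2": (0, 39), "f3": (0, 39)}

print(files_to_scan(clustered, 12, 15))  # only f1 is read
print(files_to_scan(shuffled, 12, 15))   # all four files are read
```

The same logic applies whether the pruning unit is a Delta file (via log stats) or a Parquet row group (via footer stats), which is why the layouts have to match for a fair benchmark.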

@djouallah

> @djouallah could you add polars and duckdb in the mix so we can compare across engines?
>
> Polars does not support arbitrary SQL, so I can't use it. I will add DuckDB Parquet vs Delta, but it has the same problem: total row counts are not passed, so DuckDB ends up with weird query plans.
>
> Polars has a SQL context: https://pola-rs.github.io/polars/user-guide/sql/intro/#execute-queries-from-multiple-sources
https://colab.research.google.com/drive/1cfWgQW4LoP9RN9rUkfVclku3qfSP3P2w


@djouallah

> @djouallah - stats are passed. The runtime for Delta Lake vs Parquet on a small dataset is quite volatile and really depends on how the data is distributed. For example, suppose you need to query 1% of the data and the entire dataset is 50 GB.
>
> The Delta table could have 99% data skipping or 0% data skipping.
>
> The Parquet table could also have 99% data skipping (only the relevant data is in one of the row groups) or 0% data skipping.
>
> In order to make an apples-to-apples Parquet vs Delta Lake comparison, the file distribution should be similar to the row group distribution.
>
> I am running these queries on a 50 GB dataset locally (I have a MacBook M1 with 64 GB of RAM). The query runs in ~5 seconds - pretty fast!

@MrPowers, feel free to check the notebook; it is totally reproducible. If you increase sf to 5, the performance difference will be substantial. Running a query on a single table may not show the issue; it is more about join reordering, I think.
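The join-reordering point can be illustrated with a toy cost model: given real row counts, a planner can pick the cheapest left-deep join order, but with uniform guesses every order looks identical, so the choice is arbitrary. All the table names, sizes, and the flat 10% join selectivity below are made-up illustrative values, not DuckDB's actual optimizer:

```python
# Toy cost model showing why missing row-count stats can hurt join ordering.
# Cost of a left-deep plan here = sum of intermediate result sizes, where each
# join is assumed to keep 10% of the cross product (illustrative numbers only).

from itertools import permutations

def plan_cost(order, rows, selectivity=0.1):
    size = rows[order[0]]
    cost = 0
    for t in order[1:]:
        size = size * rows[t] * selectivity  # estimated rows after this join
        cost += size
    return cost

# With real cardinalities, the planner joins the small tables first.
true_rows = {"lineitem": 6_000_000, "orders": 1_500_000, "customer": 150_000}
best = min(permutations(true_rows), key=lambda o: plan_cost(o, true_rows))
print(best[-1])  # the big table ends up joined last

# Without stats the planner must guess; equal guesses make all orders look alike.
guessed = {t: 1_000 for t in true_rows}
costs = {o: plan_cost(o, guessed) for o in permutations(guessed)}
print(len(set(costs.values())))  # every order gets the same estimated cost
```

When the Arrow dataset interface does not surface row counts, the engine is effectively in the second situation, which matches the "weird query plans" observed above.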

@rtyler rtyler enabled auto-merge (rebase) December 29, 2023 19:20
@rtyler rtyler merged commit 6da3b3b into delta-io:main Dec 29, 2023
24 checks passed