docs: datafusion integration #1993

MrPowers · 2023-12-26T20:44:50Z

Purpose: document the DataFusion integration page.

Need to figure out why Delta Lake depends on DataFusion and put a little info in this guide before merging.

djouallah · 2023-12-26T22:55:02Z

I think it may be useful to clarify that using an Arrow dataset to read a Delta table is just a workaround with some serious limitations. As far as I can tell, stats are currently not passed; in a simple benchmark, reading Parquet directly is substantially faster.

https://colab.research.google.com/drive/1sJD7w6l7RUjRHoPKoM4EQGcfwRqfKCqk#scrollTo=KMX-DymJKIh4

ion-elgreco · 2023-12-26T22:58:08Z

@djouallah which pyarrow and deltalake version did you use?

djouallah · 2023-12-26T23:02:43Z

> @djouallah which pyarrow and deltalake version did you use?

pyarrow 14.0.2, deltalake 0.14

djouallah · 2023-12-26T23:05:18Z

Just as a reference: #1838

ion-elgreco · 2023-12-26T23:07:35Z

@djouallah could you add Polars and DuckDB to the mix so we can compare across engines?

In the end, native readers would be better. I raised an issue with Polars asking for a Parquet dataset abstraction, so that we can get better read performance instead of going through pyarrow.

djouallah · 2023-12-26T23:12:19Z

> @djouallah could you add Polars and DuckDB to the mix so we can compare across engines?

Polars does not support arbitrary SQL, so I can't use it. I will add DuckDB Parquet vs. Delta, but it has the same problem: total row counts are not passed, so DuckDB ends up with weird query plans.

ion-elgreco · 2023-12-26T23:14:03Z

> > @djouallah could you add Polars and DuckDB to the mix so we can compare across engines?

> Polars does not support arbitrary SQL, so I can't use it. I will add DuckDB Parquet vs. Delta, but it has the same problem: total row counts are not passed, so DuckDB ends up with weird query plans.

Polars has a SQL context: https://pola-rs.github.io/polars/user-guide/sql/intro/#execute-queries-from-multiple-sources

djouallah · 2023-12-26T23:15:53Z

> > > @djouallah could you add Polars and DuckDB to the mix so we can compare across engines?

> > Polars does not support arbitrary SQL, so I can't use it. I will add DuckDB Parquet vs. Delta, but it has the same problem: total row counts are not passed, so DuckDB ends up with weird query plans.

> Polars has a SQL context: https://pola-rs.github.io/polars/user-guide/sql/intro/#execute-queries-from-multiple-sources

I know, but the last time I tried, it did not support all 22 queries.

MrPowers · 2023-12-26T23:15:56Z

@djouallah - stats are passed. The runtime on Delta Lake vs. Parquet for a small dataset is quite volatile and really depends on how the data is distributed. For example, suppose you need to query 1% of the data and the entire dataset is 50 GB.

The Delta Table could have 99% data skipping or 0% data skipping.

The Parquet table could also have 99% data skipping (only the relevant data is in one of the row groups) or 0% data skipping.

To make an apples-to-apples comparison between Parquet and Delta Lake, the distribution of data across files should be similar to the distribution across row groups.

I am running these queries on a 50 GB dataset locally (I have a MacBook M1 with 64 GB of RAM). The query runs in ~5 seconds, which is pretty fast!
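To make the data-skipping point concrete, here is a minimal sketch in plain Python. It is not how delta-rs or any Parquet reader is implemented: the `FileStats` class, the file names, and the value ranges are all hypothetical. It only illustrates that the same 1% predicate can skip 99% or 0% of the files depending on how the values are laid out, given per-file min/max statistics.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    """Min/max statistics for one column of one data file (hypothetical layout)."""
    name: str
    min_value: int
    max_value: int


def files_to_scan(files, lo, hi):
    """Return the files whose [min, max] range overlaps the predicate lo <= x <= hi."""
    return [f for f in files if f.max_value >= lo and f.min_value <= hi]


def skipped_fraction(files, lo, hi):
    """Fraction of files that can be skipped without being opened."""
    return 1 - len(files_to_scan(files, lo, hi)) / len(files)


# Clustered layout: each of 100 files covers a disjoint range of 100 values.
clustered = [FileStats(f"part-{i}", i * 100, i * 100 + 99) for i in range(100)]

# Scattered layout: every file spans the whole value range.
scattered = [FileStats(f"part-{i}", 0, 9_999) for i in range(100)]

# A predicate that selects 1% of the value range.
lo, hi = 4_200, 4_299

print(f"clustered: {skipped_fraction(clustered, lo, hi):.0%} of files skipped")
print(f"scattered: {skipped_fraction(scattered, lo, hi):.0%} of files skipped")
```

The same reasoning applies to row groups inside a single Parquet file, which is why the two layouts need to be comparable before the timings say anything about the format itself.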

djouallah · 2023-12-26T23:22:30Z

> > > @djouallah could you add Polars and DuckDB to the mix so we can compare across engines?

> > Polars does not support arbitrary SQL, so I can't use it. I will add DuckDB Parquet vs. Delta, but it has the same problem: total row counts are not passed, so DuckDB ends up with weird query plans.

> Polars has a SQL context: https://pola-rs.github.io/polars/user-guide/sql/intro/#execute-queries-from-multiple-sources

https://colab.research.google.com/drive/1cfWgQW4LoP9RN9rUkfVclku3qfSP3P2w

djouallah · 2023-12-26T23:27:36Z

> @djouallah - stats are passed. The runtime on Delta Lake vs. Parquet for a small dataset is quite volatile and really depends on how the data is distributed. For example, suppose you need to query 1% of the data and the entire dataset is 50 GB.

> The Delta Table could have 99% data skipping or 0% data skipping.

> The Parquet table could also have 99% data skipping (only the relevant data is in one of the row groups) or 0% data skipping.

> To make an apples-to-apples comparison between Parquet and Delta Lake, the distribution of data across files should be similar to the distribution across row groups.

> I am running these queries on a 50 GB dataset locally (I have a MacBook M1 with 64 GB of RAM). The query runs in ~5 seconds, which is pretty fast!

@MrPowers, feel free to check the notebook; it is totally reproducible. If you increase sf to 5, the performance difference will be substantial. Running a query on a single table may not show the issue; I think it is more about join reordering.
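As a rough illustration of the join-reordering concern, the toy planner below picks a hash-join build side from row-count estimates. Everything here is an assumption made for the sketch: the table names, the row counts, the `DEFAULT_ROW_ESTIMATE` fallback, and the tie-breaking rule are hypothetical and do not describe the actual optimizer in DuckDB or any other engine. It only shows how a missing row count can lead a planner to a worse choice.

```python
# Hypothetical fallback used when a data source reports no row count.
DEFAULT_ROW_ESTIMATE = 1_000


def estimated_rows(table):
    """Return the reported row count, or a fixed guess when none is available."""
    reported = table["reported_rows"]
    return reported if reported is not None else DEFAULT_ROW_ESTIMATE


def choose_build_side(left, right):
    """Pick the table believed to be smaller as the hash-join build side."""
    return left if estimated_rows(left) <= estimated_rows(right) else right


def plan_join(left, right):
    """Return (build table name, actual rows loaded into the hash table)."""
    build = choose_build_side(left, right)
    return build["name"], build["actual_rows"]


# The same two tables, with and without row counts exposed to the planner.
with_counts = (
    {"name": "big_fact", "actual_rows": 6_000_000, "reported_rows": 6_000_000},
    {"name": "small_dim", "actual_rows": 25, "reported_rows": 25},
)
without_counts = (
    {"name": "big_fact", "actual_rows": 6_000_000, "reported_rows": None},
    {"name": "small_dim", "actual_rows": 25, "reported_rows": None},
)

print(plan_join(*with_counts))     # builds the hash table on small_dim
print(plan_join(*without_counts))  # estimates tie, so it builds on big_fact
```

With several joins in one query, a bad estimate early in the plan can compound, which would be consistent with the difference growing at larger scale factors.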

MrPowers added 2 commits December 26, 2023 15:43

docs: datafusion integration

7670733

docs: explain why delta-rs depends on datafusion

9bf5bf3

MrPowers marked this pull request as ready for review December 26, 2023 22:02

ion-elgreco approved these changes Dec 29, 2023

Merge branch 'main' into docs-datafusion-integration

b9b4821

rtyler enabled auto-merge (rebase) December 29, 2023 19:20

rtyler merged commit 6da3b3b into delta-io:main Dec 29, 2023
24 checks passed

MrPowers mentioned this pull request Jan 2, 2024

Delta Lake vs Parquet benchmarks for different query engines #2012 (Open)
