-
Notifications
You must be signed in to change notification settings - Fork 416
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
docs: datafusion integration #1993
Conversation
I think it may be useful to clarify that using arrow dataset to read delta table is just a workaround with some serious limitations, currently as far as i can tell stats are not passed, using a simple benchmarks, reading Parquet directly is substantially Faster https://colab.research.google.com/drive/1sJD7w6l7RUjRHoPKoM4EQGcfwRqfKCqk#scrollTo=KMX-DymJKIh4 |
@djouallah which pyarrow and deltalake version did you use? |
14.0.2, 0.14 |
just as a reference |
@djouallah could you add polars and duckdb in the mix so we can compare across engines? In the end having native readers would be better, I raised an issue at Polars for this to get parquet dataset abstraction so we can get better read performance instead of going through pyarrow. |
Polars does not support arbitrary SQL, so I can't use it, I will add Duckdb parquet vs delta, but same problem, total rows are not passed, so duckdb end up with weird Query plans |
Polars has a sql context: https://pola-rs.github.io/polars/user-guide/sql/intro/#execute-queries-from-multiple-sources |
I know but last time, i tried, it did not support the whole 22 Queries |
@djouallah - stats are passed. The runtime on Delta Lake vs Parquet for a small dataset is quite volatile and really depends how the data is distributed. For example, suppose you need to query 1% of the data and the entire dataset is 50GB. The Delta Table could have 99% data skipping or 0% data skipping. The Parquet table could also have 99% data skipping (only the relevant data is in one of the row groups) or 0% data skipping. In order to make an apples:apples Parquet:Delta Lake comparison, the file distribution should be similar to the row group distribution. I am running these queries on a 50 GB dataset locally (I have a Macbook M1 with 64 GB of RAM). The query runs in ~5 seconds - pretty fast! |
|
@MrPowers , feel free to check the notebook, it is totally reproducible, if you increase sf to 5, the performance difference will be substantial , running a Query on a single table may not show the issue, it is more about joins reordering I think. |
Purpose: document the DataFusion integration page.
Need to figure out why Delta Lake depends on DataFusion and put a little info in this guide before merging.