[FEA] Efficient partition_cols for parquet writer #5059
Comments
From what I remember of the discussion a while back, the partitioning would happen mostly upstream from the writer. For the writer, there was talk about the multi-file partitioned sink support where the writer could be provided with a vector of |
Thanks for providing some clarity, Olivier! @kkraus14, any ideas here?
Note that the biggest perf problem right now is that the python code will fall back to the pyarrow CPU writer if |
Yeah, we need to spend a bit of time looking at this code, but the overhead of doing a memory copy per buffer serially is likely what's killing us, and there's a lot of room for improvement.
This issue has been labeled |
Contributes to #5059

Adds libcudf support for writing partitioned datasets in the parquet writer. With the new API, one can specify a vector of `{start_row, num_rows}` structs along with a table, such that each slice of the input table gets written to the corresponding sink.

Adds multi-sink support in `sink_info`.

Authors:
- Devavret Makkar (https://github.com/devavret)

Approvers:
- Vukasin Milovanovic (https://github.com/vuule)
- GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #9810
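To make the slice-per-sink idea concrete, here is a minimal pure-Python sketch. It is not the libcudf API: `PartitionInfo`, `write_partitions`, and the use of plain lists as "sinks" are hypothetical names chosen for illustration, assuming only that each `{start_row, num_rows}` entry selects a contiguous row range that goes to the sink at the same index.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class PartitionInfo:
    """Hypothetical stand-in for a {start_row, num_rows} struct."""
    start_row: int
    num_rows: int


def write_partitions(rows: Sequence, partitions: Sequence[PartitionInfo],
                     sinks: List[list]) -> None:
    """Append each contiguous slice of `rows` to the sink at the same index."""
    if len(partitions) != len(sinks):
        raise ValueError("need exactly one sink per partition")
    for part, sink in zip(partitions, sinks):
        end = part.start_row + part.num_rows
        if part.start_row < 0 or part.num_rows < 0 or end > len(rows):
            raise ValueError(f"partition {part} is out of bounds")
        # One slice per partition; no per-row dispatch is needed.
        sink.extend(rows[part.start_row:end])


# Rows are assumed to be pre-grouped so that each partition is contiguous.
rows = ["a0", "a1", "b0", "c0", "c1", "c2"]
partitions = [PartitionInfo(0, 2), PartitionInfo(2, 1), PartitionInfo(3, 3)]
sinks = [[], [], []]
write_partitions(rows, partitions, sinks)
```

The point of the design is that the writer only needs row ranges, so the grouping work can happen upstream and the table is passed to the writer once.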
Makes use of the efficient partitioned writing support added in #9810 to improve performance of partitioned parquet dataset writing.

Closes #5059

Authors:
- Devavret Makkar (https://github.com/devavret)

Approvers:
- Vukasin Milovanovic (https://github.com/vuule)
- GALI PREM SAGAR (https://github.com/galipremsagar)
- Richard (Rick) Zamora (https://github.com/rjzamora)

URL: #9971
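As a rough illustration of what a caller has to compute before a partitioned write, the sketch below groups rows by the partition columns and derives one `(subdirectory, start_row, num_rows)` entry per distinct key, using the Hive-style `key=value` directory naming that partitioned parquet datasets conventionally use. This is an assumption-laden pure-Python sketch, not the cuDF implementation; `partition_offsets` and the dict-based rows are hypothetical.

```python
from itertools import groupby


def partition_offsets(rows, partition_cols):
    """Sort rows by the partition columns and return (ordered_rows, offsets).

    Each offsets entry is (subdir, start_row, num_rows), where subdir is a
    Hive-style path such as "year=2020".
    """
    def key(row):
        return tuple(row[c] for c in partition_cols)

    ordered = sorted(rows, key=key)  # stable sort keeps original order within a key
    offsets = []
    start = 0
    for values, group in groupby(ordered, key=key):
        num_rows = sum(1 for _ in group)
        subdir = "/".join(f"{c}={v}" for c, v in zip(partition_cols, values))
        offsets.append((subdir, start, num_rows))
        start += num_rows
    return ordered, offsets


rows = [
    {"year": 2021, "x": 1},
    {"year": 2020, "x": 2},
    {"year": 2021, "x": 3},
]
ordered, offsets = partition_offsets(rows, ["year"])
```

After this step every partition is a contiguous run of rows, so a single pass over the sorted table is enough to write all of the output files.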
I remember us discussing that the `partition_cols` mechanism in the current GPU-accelerated parquet writer can be improved considerably. Was this relative to the CPU parquet writer, or just a boost in performance in how the GPU handles things? Has anyone started looking into this?