
[FEA] Efficient partition_cols for parquet writer #5059

Closed
chinmaychandak opened this issue Apr 30, 2020 · 5 comments · Fixed by #9971
Assignees
devavret

Labels
cuIO (cuIO issue), feature request (New feature or request), libcudf (Affects libcudf (C++/CUDA) code), Python (Affects Python cuDF API)

Comments

@chinmaychandak
Contributor

I remember us discussing that the partition_cols mechanism in the current GPU-accelerated parquet writer can be improved considerably. Was this relative to the CPU parquet writer, or just a boost in how the GPU side handles things? Has anyone started looking into this?

@chinmaychandak added the Needs Triage and feature request labels on Apr 30, 2020
@OlivierNV
Contributor

From what I remember of the discussion a while back, the partitioning would happen mostly upstream of the writer. For the writer itself, there was talk of multi-file partitioned sink support, where the writer could be given a vector of {row, numrows} pairs and write pieces of a table to separate files. This speeds up encoding for many small partitions compared with making a separate write_parquet call per sliced table view.
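
For context, a minimal sketch of the baseline being compared against here: one `write_parquet` call per sliced `table_view`. This is illustrative, not code from the thread; the function name, the `slice_indices` layout, and the one-file-per-partition mapping are assumptions.

```cpp
#include <cudf/copying.hpp>
#include <cudf/io/parquet.hpp>
#include <cudf/table/table_view.hpp>

#include <string>
#include <vector>

// Hypothetical baseline: slice the table on host-provided boundaries and
// make a separate write_parquet call per piece. Each call pays full writer
// setup and encode overhead, which hurts with many small partitions.
void write_partitions_naive(cudf::table_view const& input,
                            std::vector<cudf::size_type> const& slice_indices,
                            std::vector<std::string> const& paths)
{
  // slice_indices holds {begin_0, end_0, begin_1, end_1, ...}
  auto const pieces = cudf::slice(input, slice_indices);
  for (std::size_t i = 0; i < pieces.size(); ++i) {
    auto const opts = cudf::io::parquet_writer_options::builder(
                        cudf::io::sink_info{paths[i]}, pieces[i])
                        .build();
    cudf::io::write_parquet(opts);  // one full encode per partition
  }
}
```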

@chinmaychandak
Contributor Author

Thanks for providing some clarity, Olivier! @kkraus14, any ideas here?

@kkraus14 added the cuIO, libcudf, and Python labels and removed the Needs Triage label on May 1, 2020
@OlivierNV
Contributor

Note that the biggest perf problem right now is that the Python code falls back to the pyarrow CPU writer if partition_cols is anything other than None.
It was recently pointed out that df.to_parquet() is faster than df.to_arrow(), i.e. the GPU writer can finish writing Parquet before the CPU path has even finished converting the data to Arrow (any benefits from multi-file parquet writer support are likely minor in comparison).

@kkraus14
Collaborator

kkraus14 commented May 2, 2020

> It was recently pointed out that df.to_parquet() is faster than df.to_arrow(), i.e. the GPU writer can finish writing Parquet before the CPU path has even finished converting the data to Arrow (any benefits from multi-file parquet writer support are likely minor in comparison).

Yeah, we need to spend a bit of time looking at this code, but the overhead of doing a memory copy per buffer, serially, is likely what's killing us; there's a lot of room for improvement.

@github-actions

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@devavret devavret self-assigned this Sep 8, 2021
rapids-bot bot pushed a commit that referenced this issue Dec 14, 2021
Contributes to #5059

Adds libcudf support for writing partitioned datasets in the parquet writer. With the new API, one can specify a vector of `{start_row, num_rows}` structs along with a table such that each slice of the input table gets written to the corresponding sink.
Adds multi-sink support in `sink_info`.
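
For illustration, a minimal sketch of using the API described above, based on the `partition_info` struct and multi-sink `sink_info` this PR adds; the wrapper function itself is hypothetical. A single `write_parquet` call encodes each `{start_row, num_rows}` slice to its own sink.

```cpp
#include <cudf/io/parquet.hpp>
#include <cudf/table/table_view.hpp>

#include <string>
#include <vector>

// Partitioned write: one sink per {start_row, num_rows} entry, a single
// write_parquet call, and one encode pass over all partitions.
void write_partitions(cudf::table_view const& input,
                      std::vector<cudf::io::partition_info> const& partitions,
                      std::vector<std::string> const& paths)
{
  auto const opts = cudf::io::parquet_writer_options::builder(
                      cudf::io::sink_info{paths}, input)  // multi-file sink
                      .partitions(partitions)
                      .build();
  cudf::io::write_parquet(opts);
}
```

Compared with the slice-and-write loop sketched earlier in the thread, this amortizes the per-call writer setup across all partitions, which is where the speedup for many small partitions comes from.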

Authors:
  - Devavret Makkar (https://github.com/devavret)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #9810
rapids-bot bot pushed a commit that referenced this issue Jan 10, 2022
Makes use of the efficient partitioned writing support added in #9810 to improve performance of partitioned parquet dataset writing.

Closes #5059

Authors:
  - Devavret Makkar (https://github.com/devavret)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Richard (Rick) Zamora (https://github.com/rjzamora)

URL: #9971