
[FEA] Efficient partition_cols for parquet writer #5059

Closed
chinmaychandak opened this issue Apr 30, 2020 · 5 comments · Fixed by #9971
Assignees
devavret

Labels
cuIO (cuIO issue), feature request (New feature or request), libcudf (Affects libcudf (C++/CUDA) code), Python (Affects Python cuDF API)

Comments

@chinmaychandak
Contributor

I remember us discussing that the partition_cols mechanism in the current GPU-accelerated parquet writer can be improved considerably. Was this relative to the CPU parquet writer, or just a boost in how the GPU side handles things? Has anyone started looking into this?

@chinmaychandak added the Needs Triage and feature request labels on Apr 30, 2020
@OlivierNV
Contributor

From what I remember of the discussion a while back, the partitioning would happen mostly upstream of the writer. For the writer itself, there was talk of multi-file partitioned sink support, where the writer could be given a vector of {row, numrows} pairs and write pieces of a table to separate files. This speeds up encoding for many small partitions compared with making a separate write_parquet call per sliced table view.
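
For context, a minimal sketch of the baseline being compared against here: one `write_parquet` call per sliced `table_view`. This is illustrative, not code from the thread; the function name, the `slice_indices` layout, and the one-file-per-partition mapping are assumptions.

```cpp
#include <cudf/copying.hpp>
#include <cudf/io/parquet.hpp>
#include <cudf/table/table_view.hpp>

#include <string>
#include <vector>

// Hypothetical baseline: slice the table on host-provided boundaries and
// make a separate write_parquet call per piece. Each call pays full writer
// setup and encode overhead, which hurts with many small partitions.
void write_partitions_naive(cudf::table_view const& input,
                            std::vector<cudf::size_type> const& slice_indices,
                            std::vector<std::string> const& paths)
{
  // slice_indices holds {begin_0, end_0, begin_1, end_1, ...}
  auto const pieces = cudf::slice(input, slice_indices);
  for (std::size_t i = 0; i < pieces.size(); ++i) {
    auto const opts = cudf::io::parquet_writer_options::builder(
                        cudf::io::sink_info{paths[i]}, pieces[i])
                        .build();
    cudf::io::write_parquet(opts);  // one full encode per partition
  }
}
```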

@chinmaychandak
Contributor Author

Thanks for providing some clarity, Olivier! @kkraus14, any ideas here?

@kkraus14 added the cuIO, libcudf, and Python labels and removed the Needs Triage label on May 1, 2020
@OlivierNV
Contributor

Note that the biggest perf problem right now is that the Python code falls back to the pyarrow CPU writer if partition_cols is anything other than None.
It was recently pointed out that df.to_parquet() is faster than df.to_arrow(), i.e. the GPU writer can finish writing Parquet before the CPU path has even finished converting the data to Arrow (any benefits from multi-file parquet writer support are likely minor in comparison).

@kkraus14
Collaborator

kkraus14 commented May 2, 2020

> It was recently pointed out that df.to_parquet() is faster than df.to_arrow(), i.e. the GPU writer can finish writing Parquet before the CPU path has even finished converting the data to Arrow (any benefits from multi-file parquet writer support are likely minor in comparison).

Yeah, we need to spend a bit of time looking at this code, but the overhead of doing a memory copy per buffer, serially, is likely what's killing us; there's a lot of room for improvement.

@github-actions

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@devavret devavret self-assigned this Sep 8, 2021
rapids-bot bot pushed a commit that referenced this issue Dec 14, 2021
Contributes to #5059

Adds libcudf support for writing partitioned datasets in the parquet writer. With the new API, one can specify a vector of `{start_row, num_rows}` structs along with a table such that each slice of the input table gets written to the corresponding sink.
Adds multi-sink support in `sink_info`.
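
For illustration, a minimal sketch of using the API described above, based on the `partition_info` struct and multi-sink `sink_info` this PR adds; the wrapper function itself is hypothetical. A single `write_parquet` call encodes each `{start_row, num_rows}` slice to its own sink.

```cpp
#include <cudf/io/parquet.hpp>
#include <cudf/table/table_view.hpp>

#include <string>
#include <vector>

// Partitioned write: one sink per {start_row, num_rows} entry, a single
// write_parquet call, and one encode pass over all partitions.
void write_partitions(cudf::table_view const& input,
                      std::vector<cudf::io::partition_info> const& partitions,
                      std::vector<std::string> const& paths)
{
  auto const opts = cudf::io::parquet_writer_options::builder(
                      cudf::io::sink_info{paths}, input)  // multi-file sink
                      .partitions(partitions)
                      .build();
  cudf::io::write_parquet(opts);
}
```

Compared with the slice-and-write loop sketched earlier in the thread, this amortizes the per-call writer setup across all partitions, which is where the speedup for many small partitions comes from.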

Authors:
  - Devavret Makkar (https://github.com/devavret)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #9810
rapids-bot bot pushed a commit that referenced this issue Jan 10, 2022
Makes use of the efficient partitioned writing support added in #9810 to improve performance of partitioned parquet dataset writing.

Closes #5059

Authors:
  - Devavret Makkar (https://github.com/devavret)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Richard (Rick) Zamora (https://github.com/rjzamora)

URL: #9971