
[FEA] Remove all batch processing code in parquet writer #13440

Closed
ttnghia opened this issue May 25, 2023 · 2 comments · Fixed by #15528
Labels
0 - Backlog In queue waiting for assignment cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.

Comments

@ttnghia
Contributor

ttnghia commented May 25, 2023

As batch processing is no longer supported (#13438), we should remove all the relevant code from parquet writer.

Note that we can also optimize memory usage to some extent, such as keeping only the memory blocks that are needed for writing to the sink (either compressed or uncompressed), instead of keeping all of them.

@ttnghia ttnghia added feature request New feature or request Needs Triage Need team to review and classify labels May 25, 2023
@ttnghia ttnghia self-assigned this May 25, 2023
@GregoryKimball
Contributor

Thank you @ttnghia for raising this issue, and for your focused work on #13438.

Let's please revisit this during 23.08. For now, I'll summarize the Slack discussion:

We may want to maintain the batching code as a useful step towards pipelining the parquet writer. In this context, "pipelining" refers to processing batches of data across multiple CUDA streams and overlapping compute with IO. The goal of pipelining is increased utilization and reduced end-to-end runtime.
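To illustrate the overlap idea, here is a minimal sketch using plain C++ threads as a stand-in for CUDA streams (this is not libcudf code; `compress`, `write_to_sink`, and `pipelined_write` are hypothetical names): while batch *k* is being written to the sink, batch *k+1* is compressed concurrently.

```cpp
#include <cstddef>
#include <future>
#include <utility>
#include <vector>

// Stand-in for GPU-side compression of one batch (here: add 1 to each value).
std::vector<int> compress(std::vector<int> batch) {
  for (auto& v : batch) v += 1;
  return batch;
}

// Stand-in for IO: append the compressed batch to the sink.
void write_to_sink(std::vector<int> const& batch, std::vector<int>& sink) {
  sink.insert(sink.end(), batch.begin(), batch.end());
}

// Two-stage pipeline: compression of batch k+1 runs on another thread while
// batch k is written, overlapping compute with IO.
void pipelined_write(std::vector<std::vector<int>> batches,
                     std::vector<int>& sink) {
  if (batches.empty()) return;
  auto current = compress(std::move(batches[0]));
  for (std::size_t i = 1; i <= batches.size(); ++i) {
    std::future<std::vector<int>> next;
    if (i < batches.size()) {
      // Kick off compression of the next batch before doing IO for this one.
      next = std::async(std::launch::async, compress, std::move(batches[i]));
    }
    write_to_sink(current, sink);  // IO for batch i-1 overlaps compute for batch i
    if (next.valid()) current = next.get();
  }
}
```

In the real writer the two stages would be CUDA streams rather than host threads, but the dependency structure (issue next batch's compute, then drain the current batch's IO) is the same.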

Removing batching for the chunked writer makes sense, because in this case we expect the user application to catch OOM errors and retry with smaller table segments. For the non-chunked/standard writer the batching code could be useful for providing lower peak memory usage.
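For reference, the catch-and-retry behavior expected from user applications can be sketched as follows. This is a hypothetical illustration, not the libcudf API: `Writer` stands in for a chunked parquet writer that throws on out-of-memory, and `write_with_retry` is an assumed caller-side helper that retries with smaller table segments.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <new>
#include <vector>

// Hypothetical writer: accepts a chunk only if it fits the memory budget,
// otherwise simulates an OOM by throwing std::bad_alloc.
struct Writer {
  std::size_t memory_budget;        // max rows writable in one chunk
  std::vector<std::size_t> chunks;  // rows written per accepted chunk
  void write(std::size_t num_rows) {
    if (num_rows > memory_budget) throw std::bad_alloc{};  // simulated OOM
    chunks.push_back(num_rows);
  }
};

// Caller-side retry loop: on OOM, halve the segment size and try again,
// until all rows are written or the segment cannot shrink further.
void write_with_retry(Writer& w, std::size_t total_rows) {
  std::size_t segment = total_rows;
  std::size_t written = 0;
  while (written < total_rows) {
    std::size_t n = std::min(segment, total_rows - written);
    try {
      w.write(n);
      written += n;
    } catch (std::bad_alloc const&) {
      if (segment == 1) throw;  // cannot shrink further; propagate the OOM
      segment /= 2;             // retry with smaller segments
    }
  }
}
```

The point of the discussion above is that this loop only works if the writer's internal state stays valid after a failed `write`, which internal batching complicates.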

@GregoryKimball GregoryKimball added 0 - Backlog In queue waiting for assignment libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue and removed Needs Triage Need team to review and classify labels Jun 26, 2023
@GregoryKimball GregoryKimball removed this from libcudf Oct 26, 2023
@GregoryKimball
Contributor

After further discussion, we have committed to removing the batching code because it complicates applications that rely on catch-and-retry behavior in the writer APIs. We will maintain the chunked writer API while removing the internal batching.
