
[FEA] Remove all batch processing code in parquet writer #13440

Closed
ttnghia opened this issue May 25, 2023 · 2 comments · Fixed by #15528
Labels
0 - Backlog In queue waiting for assignment cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.

Comments

@ttnghia
Contributor

ttnghia commented May 25, 2023

As batch processing is no longer supported (#13438), we should remove all the relevant code from parquet writer.

Note that we can also optimize memory usage to some extent, such as keeping only the memory blocks that are needed for writing to the sink (either compressed or uncompressed), instead of keeping all of them.

@ttnghia ttnghia added feature request New feature or request Needs Triage Need team to review and classify labels May 25, 2023
@ttnghia ttnghia self-assigned this May 25, 2023
@GregoryKimball
Contributor

Thank you @ttnghia for raising this issue, and for your focused work on #13438.

Let's please revisit this during 23.08. For now, I'll summarize the Slack discussion:

We may want to maintain the batching code as a useful step towards pipelining the parquet writer. In this context, "pipelining" refers to processing batches of data across multiple CUDA streams and overlapping compute with IO. The goal of pipelining is increased utilization and reduced end-to-end runtime.
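To illustrate the overlap idea, here is a minimal sketch using plain C++ threads as a stand-in for CUDA streams (this is not libcudf code; `compress`, `write_to_sink`, and `pipelined_write` are hypothetical names): while batch *k* is being written to the sink, batch *k+1* is compressed concurrently.

```cpp
#include <cstddef>
#include <future>
#include <utility>
#include <vector>

// Stand-in for GPU-side compression of one batch (here: add 1 to each value).
std::vector<int> compress(std::vector<int> batch) {
  for (auto& v : batch) v += 1;
  return batch;
}

// Stand-in for IO: append the compressed batch to the sink.
void write_to_sink(std::vector<int> const& batch, std::vector<int>& sink) {
  sink.insert(sink.end(), batch.begin(), batch.end());
}

// Two-stage pipeline: compression of batch k+1 runs on another thread while
// batch k is written, overlapping compute with IO.
void pipelined_write(std::vector<std::vector<int>> batches,
                     std::vector<int>& sink) {
  if (batches.empty()) return;
  auto current = compress(std::move(batches[0]));
  for (std::size_t i = 1; i <= batches.size(); ++i) {
    std::future<std::vector<int>> next;
    if (i < batches.size()) {
      // Kick off compression of the next batch before doing IO for this one.
      next = std::async(std::launch::async, compress, std::move(batches[i]));
    }
    write_to_sink(current, sink);  // IO for batch i-1 overlaps compute for batch i
    if (next.valid()) current = next.get();
  }
}
```

In the real writer the two stages would be CUDA streams rather than host threads, but the dependency structure (issue next batch's compute, then drain the current batch's IO) is the same.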

Removing batching for the chunked writer makes sense, because in this case we expect the user application to catch OOM errors and retry with smaller table segments. For the non-chunked/standard writer the batching code could be useful for providing lower peak memory usage.
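For reference, the catch-and-retry behavior expected from user applications can be sketched as follows. This is a hypothetical illustration, not the libcudf API: `Writer` stands in for a chunked parquet writer that throws on out-of-memory, and `write_with_retry` is an assumed caller-side helper that retries with smaller table segments.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <new>
#include <vector>

// Hypothetical writer: accepts a chunk only if it fits the memory budget,
// otherwise simulates an OOM by throwing std::bad_alloc.
struct Writer {
  std::size_t memory_budget;        // max rows writable in one chunk
  std::vector<std::size_t> chunks;  // rows written per accepted chunk
  void write(std::size_t num_rows) {
    if (num_rows > memory_budget) throw std::bad_alloc{};  // simulated OOM
    chunks.push_back(num_rows);
  }
};

// Caller-side retry loop: on OOM, halve the segment size and try again,
// until all rows are written or the segment cannot shrink further.
void write_with_retry(Writer& w, std::size_t total_rows) {
  std::size_t segment = total_rows;
  std::size_t written = 0;
  while (written < total_rows) {
    std::size_t n = std::min(segment, total_rows - written);
    try {
      w.write(n);
      written += n;
    } catch (std::bad_alloc const&) {
      if (segment == 1) throw;  // cannot shrink further; propagate the OOM
      segment /= 2;             // retry with smaller segments
    }
  }
}
```

The point of the discussion above is that this loop only works if the writer's internal state stays valid after a failed `write`, which internal batching complicates.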

@GregoryKimball GregoryKimball added 0 - Backlog In queue waiting for assignment libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue and removed Needs Triage Need team to review and classify labels Jun 26, 2023
@GregoryKimball GregoryKimball removed this from libcudf Oct 26, 2023
@GregoryKimball
Contributor

After further discussion, we have committed to removing the batching code because it complicates applications that rely on catch-and-retry behavior in the writer APIs. We will maintain the chunked writer API while removing the internal batching.
