
[FEA] Retry support for chunked ORC writer #12792

Closed
abellina opened this issue Feb 16, 2023 · 5 comments
Labels
0 - Backlog In queue waiting for assignment cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS

Comments

@abellina
Contributor

abellina commented Feb 16, 2023

The spark-rapids team has been working on a framework to retry GPU operations on failure, such as an Out Of Memory (OOM) error (NVIDIA/spark-rapids#7252). We have seen OOM issues while writing ORC in particular, but we also want to handle Parquet writes in a retriable fashion.

At a very high level, we are planning to handle failures like OOM exceptions from RMM, but we may also use this generic framework to handle other types of failures in the future, and attempt to handle them in different ways.

  • Some retry attempts will throttle our tasks down: we would attempt the cuDF call with the same table we had previously, while reducing the number of threads that are allowed to call into cuDF concurrently. We believe this throttling will help us serialize memory usage, letting one of the Spark tasks take as much memory as it needs.

  • In other cases we have a "split and retry" approach, for when we think a smaller chunk may be beneficial (e.g. when handling an OOM, or after hitting a cuDF column limit). In this case, we split the data smartly and re-attempt the cuDF call with smaller chunks.
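The "retry, then split and retry" strategy above can be sketched roughly as follows. This is an illustrative sketch, not spark-rapids or cuDF code: `write_chunk`, `split`, and `RetryOOM` are hypothetical stand-ins for the real call into cuDF, the splitting policy, and an RMM OOM failure surfaced to the caller.

```python
class RetryOOM(Exception):
    """Stand-in for an RMM out-of-memory failure surfaced to the caller."""

def write_with_retry(write_chunk, chunk, split, max_retries=2):
    """Retry the same chunk first; on repeated failure, split and recurse."""
    for _ in range(max_retries):
        try:
            return write_chunk(chunk)
        except RetryOOM:
            continue  # e.g. other tasks may have released memory by now
    # Retries with the full chunk failed: split and attempt smaller pieces.
    pieces = split(chunk)
    if len(pieces) <= 1:
        raise RetryOOM("cannot split further")
    return [write_with_retry(write_chunk, p, split, max_retries) for p in pieces]
```

This only works if `write_chunk` is idempotent on failure, which is exactly the guarantee being requested from the cuDF writers below.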

When calling into the cuDF writers, we need to make sure that they are idempotent. What we are looking for from the cuDF team is either a guarantee that this is possible today (in which case the API behavior needs to be documented), or a discussion of what may need to change in order to support this retry mechanism.

@abellina abellina added feature request New feature or request Needs Triage Need team to review and classify cuIO cuIO issue Spark Functionality that helps Spark RAPIDS Reliability labels Feb 16, 2023
@mythrocks
Contributor

Watching this also.

> When calling into the cuDF writers, we need to make sure that they are idempotent.

I suspect we might need more detail, specifically on what level the retries are expected at.

The chunked writers are stateful, so I suspect there are currently no guarantees that the output buffers aren't modified at all when the nth chunk fails its write.

We might well be ok if the retries are not per-chunk, but per output-buffer. (i.e. The task retries all chunks/column-vectors for a given output-buffer.)

That said, I'm likely wildly off base. I'd be keen to see the IO experts' analysis.

@abellina
Contributor Author

The approach we've been discussing is at the chunk level, the assumption being that the caller has already passed N-1 chunks and the Nth one is failing. The plan would be to first retry this chunk (since the failure may have been caused by a temporary issue, like memory fragmentation), and if the retries are unsuccessful, to break the chunk into smaller chunks, each of which is in turn attempted and retried.

As far as I understand, a sink is created which receives the ORC-encoded chunks as host memory. In my opinion, the caller can hold on to the host-encoded buffers and decide what to do with them, as long as the writer can be given the failing chunk again and can continue to produce ORC-encoded chunks without corrupting metadata that is later written in the footer.

The way I imagine this: once the chunk write function is at the stage where it is guaranteed to succeed, then by all means persist the metadata (we are good with this particular chunk).
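The "persist metadata only past the point where the chunk is guaranteed to succeed" idea can be sketched like this. This is an illustrative sketch, not the libcudf implementation; `ChunkedWriterSketch` and `encode` are hypothetical names.

```python
class ChunkedWriterSketch:
    """Stages fallible work first; commits footer metadata only afterwards."""

    def __init__(self, sink):
        self.sink = sink        # receives the encoded chunk bytes
        self.footer_meta = []   # persistent metadata, written at close()

    def write(self, chunk, encode):
        # Fallible work happens first, against local state only.
        # If encode() raises (e.g. OOM), nothing has been persisted and the
        # caller can retry this same chunk safely.
        encoded, stats = encode(chunk)
        self.sink.append(encoded)
        # Commit point: from here on, the chunk is recorded in the footer.
        self.footer_meta.append(stats)
```

The key property is that a failure before the commit point leaves both the sink and the footer metadata exactly as they were, so re-submitting the failing chunk (or its split pieces) cannot corrupt the footer.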

@GregoryKimball
Contributor

Thank you for raising this discussion. I don't believe that the failure behavior of libcudf chunked writers is well-documented or well-tested. We'll need to start with a reproducible example and investigate from there.

@GregoryKimball GregoryKimball added 0 - Backlog In queue waiting for assignment libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Feb 20, 2023
@revans2
Contributor

revans2 commented Feb 21, 2023

@GregoryKimball I am not 100% sure what you mean by a reproducible example. Could you clarify whether you want us to write test cases where the current chunked writer runs out of memory, so you can see it happening and check whether it works?

In offline discussions we talked about exposing a way to make a copy of the internal state of the writer (really just the metrics that are cached for the footer) and then restoring it in the case of a failure. The testing would then verify that we can roll back in a number of different situations.
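The snapshot-and-restore idea mentioned above can be sketched as follows. This is a hypothetical illustration (`WriterState`, `guarded_write`, and the metric names are not real cuDF APIs): copy the cached footer metrics before a chunk write, and restore them if the write fails partway through.

```python
import copy

class WriterState:
    """Stand-in for the writer's cached footer metrics."""

    def __init__(self):
        self.footer_metrics = []

    def snapshot(self):
        # Deep copy so later mutations cannot leak into the snapshot.
        return copy.deepcopy(self.footer_metrics)

    def restore(self, snap):
        self.footer_metrics = snap

def guarded_write(state, do_write):
    """Run a fallible chunk write; roll back partial metric updates on failure."""
    snap = state.snapshot()
    try:
        do_write(state)
    except Exception:
        state.restore(snap)
        raise  # let the caller decide whether to retry or split
```

Testing would then exercise `do_write` failing at various points and assert that `footer_metrics` always matches the pre-write snapshot.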

rapids-bot bot pushed a commit that referenced this issue Mar 14, 2023
This refactors the class `cudf::io::orc::ProtobufWriter`, making it independent from the ORC writer's buffer. From now on, each instance of `ProtobufWriter` works on its own buffer, which avoids touching the ORC writer's internal state.

The PR is part of the solution for #12792.

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - https://github.com/nvdbaranec

URL: #12877
rapids-bot bot pushed a commit that referenced this issue Mar 21, 2023
The current ORC chunked writer performs compressing/encoding and writes data into the output data sink without any safeguard. This PR modifies the internal `writer::impl::write()` function, separating it into multiple pieces:
 * A free function that compresses/encodes the input table into intermediate results. These intermediate results are completely independent of the writer, so the writer is isolated from failures of this free function, allowing retries upon failure.
 * A step that applies the intermediate results from the previous step to the output data sink, performing the actual data write.

Some cleanup is also performed on the existing code, including moving some member functions into free functions, which helps reduce potential dependencies between translation units.

No new implementation is added in this work; only existing code is moved around.
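The two-phase structure this PR describes can be sketched as follows. The names here (`encode_table`, `OrcWriterSketch`) are illustrative stand-ins, not the real libcudf functions: a free encode step whose failure leaves the writer untouched, followed by a separate step that applies the results to the sink.

```python
def encode_table(table):
    """Free function: compress/encode a table into intermediate results.
    Failures here leave the writer untouched, so the caller can retry."""
    if table is None:
        raise MemoryError("simulated encode failure")
    return [str(v).encode() for v in table]  # stand-in for encoded stripes

class OrcWriterSketch:
    """Holds only the state needed to apply already-encoded results."""

    def __init__(self):
        self.sink = []

    def write_encoded(self, intermediate):
        """Apply already-encoded results to the sink (the non-retried part)."""
        self.sink.extend(intermediate)
```

Because `encode_table` has no access to the writer, the caller can wrap only the encode step in retry/split logic and invoke `write_encoded` once it has a successful result.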

Partially contributes to #12792.

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #12949
@ttnghia ttnghia changed the title [FEA] Retry guarantees for chunked orc and parquet writes [FEA] Retry support for chunked orc writer Mar 30, 2023
@ttnghia
Contributor

ttnghia commented Mar 30, 2023

Closed as completed by #12949.

@ttnghia ttnghia closed this as completed Mar 30, 2023
@ttnghia ttnghia changed the title [FEA] Retry support for chunked orc writer [FEA] Retry support for chunked ORC writer Mar 30, 2023