
[FEA] Retry support for chunked ORC writer #12792

Closed
abellina opened this issue Feb 16, 2023 · 5 comments
Labels
0 - Backlog In queue waiting for assignment cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS

Comments

@abellina
Contributor

abellina commented Feb 16, 2023

The spark-rapids team has been working on a framework to retry GPU operations on failure, such as an Out Of Memory (OOM) error (NVIDIA/spark-rapids#7252). We have seen OOM issues while writing ORC in particular, but we also want to handle Parquet writes in a retriable fashion.

At a very high level, we are planning to handle failures like OOM exceptions from RMM, but we may also use this generic framework to handle other types of failures in the future, and attempt to handle them in different ways.

  • Some retry attempts will throttle our tasks down: we would attempt the cuDF call with the same table we had previously, while reducing the number of threads that are allowed to call into cuDF concurrently. We believe this throttling will help us serialize memory usage, letting one of the Spark tasks take as much memory as it needs.

  • In other cases we have a "split and retry" approach, for when we think a smaller chunk may be beneficial (e.g. when handling an OOM, or after hitting a cuDF column limit). In this case, we split the data smartly and re-attempt the cuDF call with smaller chunks.
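The "retry, then split and retry" strategy above can be sketched roughly as follows. This is an illustrative sketch, not spark-rapids or cuDF code: `write_chunk`, `split`, and `RetryOOM` are hypothetical stand-ins for the real call into cuDF, the splitting policy, and an RMM OOM failure surfaced to the caller.

```python
class RetryOOM(Exception):
    """Stand-in for an RMM out-of-memory failure surfaced to the caller."""

def write_with_retry(write_chunk, chunk, split, max_retries=2):
    """Retry the same chunk first; on repeated failure, split and recurse."""
    for _ in range(max_retries):
        try:
            return write_chunk(chunk)
        except RetryOOM:
            continue  # e.g. other tasks may have released memory by now
    # Retries with the full chunk failed: split and attempt smaller pieces.
    pieces = split(chunk)
    if len(pieces) <= 1:
        raise RetryOOM("cannot split further")
    return [write_with_retry(write_chunk, p, split, max_retries) for p in pieces]
```

This only works if `write_chunk` is idempotent on failure, which is exactly the guarantee being requested from the cuDF writers below.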

When calling into the cuDF writers, we need to make sure that they are idempotent. What we are looking for from the cuDF team is either a guarantee that this is possible today (in which case the API behavior needs to be documented), or a discussion of what may need to change in order to support this retry mechanism.

@abellina abellina added feature request New feature or request Needs Triage Need team to review and classify cuIO cuIO issue Spark Functionality that helps Spark RAPIDS Reliability labels Feb 16, 2023
@mythrocks
Contributor

Watching this also.

> When calling into the cuDF writers, we need to make sure that they are idempotent.

I suspect we might need more detail, specifically on what level the retries are expected at.

The chunked writers are stateful, so I suspect there are currently no guarantees that the output buffers aren't modified at all when the nth chunk fails its write.

We might well be ok if the retries are not per-chunk, but per output-buffer. (i.e. The task retries all chunks/column-vectors for a given output-buffer.)

That said, I'm likely wildly off base. I'd be keen to see the IO experts' analysis.

@abellina
Contributor Author

The approach we've been discussing is at the chunk level, the assumption being that the caller has already passed N-1 chunks and the Nth one is failing. The plan would be to first retry this chunk (since the failure may have been caused by a temporary issue, like memory fragmentation), and if the retries are unsuccessful, to break the chunk into smaller chunks, each of which is in turn attempted and retried.

As far as I understand, a sink is created which receives the ORC-encoded chunks as host memory. In my opinion, the caller can hold on to the host-encoded buffers and decide what to do with them, as long as the writer can be given the failing chunk again and can continue to produce ORC-encoded chunks without corrupting metadata that is later written in the footer.

The way I imagine this: once the chunk write function is at the stage where it is guaranteed to succeed, then by all means persist the metadata (we are good with this particular chunk).
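The "persist metadata only past the point where the chunk is guaranteed to succeed" idea can be sketched like this. This is an illustrative sketch, not the libcudf implementation; `ChunkedWriterSketch` and `encode` are hypothetical names.

```python
class ChunkedWriterSketch:
    """Stages fallible work first; commits footer metadata only afterwards."""

    def __init__(self, sink):
        self.sink = sink        # receives the encoded chunk bytes
        self.footer_meta = []   # persistent metadata, written at close()

    def write(self, chunk, encode):
        # Fallible work happens first, against local state only.
        # If encode() raises (e.g. OOM), nothing has been persisted and the
        # caller can retry this same chunk safely.
        encoded, stats = encode(chunk)
        self.sink.append(encoded)
        # Commit point: from here on, the chunk is recorded in the footer.
        self.footer_meta.append(stats)
```

The key property is that a failure before the commit point leaves both the sink and the footer metadata exactly as they were, so re-submitting the failing chunk (or its split pieces) cannot corrupt the footer.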

@GregoryKimball
Contributor

Thank you for raising this discussion. I don't believe that the failure behavior of libcudf chunked writers is well-documented or well-tested. We'll need to start with a reproducible example and investigate from there.

@GregoryKimball GregoryKimball added 0 - Backlog In queue waiting for assignment libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Feb 20, 2023
@revans2
Contributor

revans2 commented Feb 21, 2023

@GregoryKimball I am not 100% sure what you mean by a reproducible example. Could you clarify whether you want us to write test cases where the current chunked writer runs out of memory, so you can see it happening and check whether it works?

In offline discussions we talked about exposing a way to make a copy of the internal state of the writer (really just the metrics that are cached for the footer) and then restoring it in the case of a failure. The testing would then verify that we can roll back in a number of different situations.
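The snapshot-and-restore idea mentioned above can be sketched as follows. This is a hypothetical illustration (`WriterState`, `guarded_write`, and the metric names are not real cuDF APIs): copy the cached footer metrics before a chunk write, and restore them if the write fails partway through.

```python
import copy

class WriterState:
    """Stand-in for the writer's cached footer metrics."""

    def __init__(self):
        self.footer_metrics = []

    def snapshot(self):
        # Deep copy so later mutations cannot leak into the snapshot.
        return copy.deepcopy(self.footer_metrics)

    def restore(self, snap):
        self.footer_metrics = snap

def guarded_write(state, do_write):
    """Run a fallible chunk write; roll back partial metric updates on failure."""
    snap = state.snapshot()
    try:
        do_write(state)
    except Exception:
        state.restore(snap)
        raise  # let the caller decide whether to retry or split
```

Testing would then exercise `do_write` failing at various points and assert that `footer_metrics` always matches the pre-write snapshot.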

rapids-bot bot pushed a commit that referenced this issue Mar 14, 2023
This refactors the class `cudf::io::orc::ProtobufWriter`, making it independent from the ORC writer's buffer. From now on, each instance of `ProtobufWriter` works on its own buffer, which avoids touching the ORC writer's internal state.

The PR is part of the solution for #12792.

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - https://github.com/nvdbaranec

URL: #12877
rapids-bot bot pushed a commit that referenced this issue Mar 21, 2023
The current ORC chunked writer performs compressing/encoding and writes data into the output data sink without any safeguard. This PR modifies the internal `writer::impl::write()` function, separating it into multiple pieces:
 * A free function that compresses/encodes the input table into intermediate results. These intermediate results are completely independent of the writer, so the writer is isolated from failures of this free function, allowing retries upon failure.
 * A step that applies the intermediate results from the previous step to the output data sink, performing the actual data write.

Some cleanup is also performed on the existing code, including moving some member functions into free functions, which helps reduce potential dependencies between translation units.

No new implementation is added in this work; only existing code is moved around.
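The two-phase structure this PR describes can be sketched as follows. The names here (`encode_table`, `OrcWriterSketch`) are illustrative stand-ins, not the real libcudf functions: a free encode step whose failure leaves the writer untouched, followed by a separate step that applies the results to the sink.

```python
def encode_table(table):
    """Free function: compress/encode a table into intermediate results.
    Failures here leave the writer untouched, so the caller can retry."""
    if table is None:
        raise MemoryError("simulated encode failure")
    return [str(v).encode() for v in table]  # stand-in for encoded stripes

class OrcWriterSketch:
    """Holds only the state needed to apply already-encoded results."""

    def __init__(self):
        self.sink = []

    def write_encoded(self, intermediate):
        """Apply already-encoded results to the sink (the non-retried part)."""
        self.sink.extend(intermediate)
```

Because `encode_table` has no access to the writer, the caller can wrap only the encode step in retry/split logic and invoke `write_encoded` once it has a successful result.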

Partially contributes to #12792.

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #12949
@ttnghia ttnghia changed the title [FEA] Retry guarantees for chunked orc and parquet writes [FEA] Retry support for chunked orc writer Mar 30, 2023
@ttnghia
Contributor

ttnghia commented Mar 30, 2023

Closed as completed by #12949.

@ttnghia ttnghia closed this as completed Mar 30, 2023
@ttnghia ttnghia changed the title [FEA] Retry support for chunked orc writer [FEA] Retry support for chunked ORC writer Mar 30, 2023