[BUG] Partitioned writes release GPU semaphore with unspillable GPU memory #6980

Closed · jlowe opened this issue Nov 2, 2022 · 10 comments · Fixed by #8385 or #8667
Labels: bug (Something isn't working), reliability (Features to improve reliability or bugs that severely impact the reliability of the plugin)

Comments

jlowe (Contributor) commented Nov 2, 2022

Currently partitioned writes work with the following algorithm:

  • Compute the split indices for the different partitions in the batch
  • Contiguous-split the batch
  • For each contiguous table:
    • Write out the table

The problem with this algorithm is that currently the "write out the table" step consists of the following steps internally:

  • encode the data on the GPU
  • transfer the encoded data to the host
  • release the GPU semaphore (to avoid holding it while doing a potentially very slow, non-GPU operation in the next step)
  • write the encoded data to the distributed filesystem

This means that if there are at least two partitions, we end up releasing the GPU semaphore with unspillable GPU memory still allocated. This can allow other tasks to start allocating on the GPU and potentially run out of GPU memory due to the inability to free up GPU memory from a task that is not holding the semaphore.
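For illustration, here is a rough Scala sketch of that per-partition flow; encodeOnGpu, copyToHost, and writeToDfs are hypothetical placeholders for the real writer internals, and table/splitIndices come from the surrounding context. This is not the actual plugin code:

// Sketch only: shows where the semaphore release lands relative to the splits
// that are still sitting unspillable in GPU memory.
val splits = table.contiguousSplit(splitIndices: _*)
splits.foreach { split =>
  val encoded = encodeOnGpu(split)                 // hypothetical: GPU encode
  val hostData = copyToHost(encoded)               // hypothetical: device-to-host copy
  // Released once per split, while the remaining splits are still unspillable on the GPU
  GpuSemaphore.releaseIfNecessary(TaskContext.get)
  writeToDfs(hostData)                             // hypothetical: slow filesystem write
}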

jlowe added the bug and ? - Needs Triage labels on Nov 2, 2022
jlowe (Contributor, Author) commented Nov 2, 2022

We need to split up the write logic into discrete methods for encoding the data vs. flushing the data to the filesystem so the writers can be more intelligent about handling GPU memory and the semaphore. For example, if we're doing a simple (non-dynamic) partitioned write, it would be better to use an algorithm like the following (a rough sketch follows the list):

  • Split the table into partitions on the GPU
  • For each table split:
    • encode the split on the GPU
    • transfer the encoded data to the host
    • free the encoded data on the GPU
  • Release the GPU semaphore
  • For each encoded split in host memory:
    • write the encoded split to the distributed filesystem
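A minimal Scala sketch of that shape, again using hypothetical encodeOnGpu, copyToHost, and writeToDfs helpers; the point is only the ordering of the GPU frees relative to the semaphore release:

// Sketch only: keep GPU memory only while encoding, then release the semaphore
// before any filesystem I/O happens.
val splits = table.contiguousSplit(splitIndices: _*)
val hostEncoded = splits.map { split =>
  try {
    val encoded = encodeOnGpu(split)               // hypothetical: GPU encode
    try {
      copyToHost(encoded)                          // hypothetical: device-to-host copy
    } finally {
      encoded.close()                              // free the encoded GPU data right away
    }
  } finally {
    split.close()                                  // free the contiguous split on the GPU
  }
}
// Nothing unspillable is left on the GPU for this task, so releasing is safe now.
GpuSemaphore.releaseIfNecessary(TaskContext.get)
hostEncoded.foreach(writeToDfs)                    // hypothetical: slow filesystem writes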

HaoYang670 (Collaborator) commented

cc @res-life, could you please take a look?

HaoYang670 (Collaborator) commented Nov 3, 2022

Hi @jlowe, is it possible to use 2 or more threads, with one encoding the data from GPU to host while others concurrently write the data from host to storage?

For example, could we build a 3-stage pipeline, releasing the GPU semaphore after thread 1 has copied all the data to host memory?
[attached image: diagram of the proposed 3-stage pipeline]
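For illustration only, here is a simplified two-thread variant of that idea: the task thread does the GPU encode and the device-to-host copy, hands host chunks to a writer thread over a bounded queue, and releases the semaphore once everything is on the host. The HostChunk type, encodedSplits, and the copyToHost/writeToDfs helpers are hypothetical, and this is not how the plugin is structured today:

import java.util.concurrent.ArrayBlockingQueue

// Hypothetical host-side payload produced by the encode + device-to-host stage.
case class HostChunk(bytes: Array[Byte])

val queue = new ArrayBlockingQueue[Option[HostChunk]](8)

// Writer thread: drains host chunks and writes them to the distributed filesystem.
val writerThread = new Thread(() => {
  var next = queue.take()
  while (next.isDefined) {
    writeToDfs(next.get)                           // hypothetical: slow filesystem write
    next = queue.take()
  }
})
writerThread.start()

// Task thread: encode on the GPU, copy to host, hand off, then release the semaphore.
encodedSplits.foreach(split => queue.put(Some(copyToHost(split)))) // hypothetical helpers
GpuSemaphore.releaseIfNecessary(TaskContext.get)
queue.put(None)                                    // end-of-stream marker for the writer
writerThread.join()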

res-life (Collaborator) commented Nov 3, 2022

Related code:
https://github.com/NVIDIA/spark-rapids/blob/branch-22.12/sql-plugin/src/main/scala/com/nvidia/spark/rapids/ColumnarOutputWriter.scala#L144

GpuSemaphore.releaseIfNecessary(TaskContext.get)

Patterns like the following can be rewritten to match what Jason described:

splits = table.contiguousSplit()
for each split in splits:
  writer.write(split)

res-life (Collaborator) commented Nov 3, 2022

@jlowe GpuParquetWriter is responsible for writing a Table to disk via the JNI ParquetTableWriter.
IIUC, both encoding the data and writing the encoded data happen on the cuDF side.
So does this issue require a cuDF PR to support it?

res-life (Collaborator) commented Nov 3, 2022

Sorry, I did not notice the consumer; no cuDF PR is needed.

private ParquetTableWriter(ParquetWriterOptions options, HostBufferConsumer consumer) {

We can just collect all the HostBuffers before writing to the distributed filesystem.
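For example, a consumer roughly like the following could buffer the encoded chunks in host memory until the write is triggered. This is a sketch only: it assumes the cuDF Java HostBufferConsumer callback handleBuffer(HostMemoryBuffer, long), and the class and method names here are illustrative rather than the plugin's actual code.

import java.io.OutputStream
import scala.collection.mutable.ArrayBuffer
import ai.rapids.cudf.{HostBufferConsumer, HostMemoryBuffer}

// Illustrative consumer: holds the encoded chunks handed over by the cuDF table
// writer so the GPU semaphore can be released before any filesystem I/O starts.
class BufferingConsumer extends HostBufferConsumer {
  private val chunks = ArrayBuffer[(HostMemoryBuffer, Long)]()

  // Called by the table writer each time a chunk of encoded data is ready.
  override def handleBuffer(buffer: HostMemoryBuffer, len: Long): Unit =
    chunks += ((buffer, len))

  // Later, after the semaphore has been released: drain the buffered chunks to
  // the distributed filesystem stream and close each host buffer.
  def writeTo(out: OutputStream): Unit = {
    chunks.foreach { case (buf, len) =>
      try {
        val bytes = new Array[Byte](len.toInt)     // sketch assumes chunks fit in an Int-sized array
        buf.getBytes(bytes, 0, 0, len)
        out.write(bytes)
      } finally {
        buf.close()
      }
    }
    chunks.clear()
  }
}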

jlowe (Contributor, Author) commented Nov 3, 2022

is it possible to use 2 or more threads, with one encoding the data from GPU to host while others concurrently write the data from host to storage?

Yes, although this is not required to fix the problem reported here and is more complex. IMO we should focus on fixing the problem first, and then worry about performance optimizations in followup PRs.

abellina (Collaborator) commented

I have started looking into this and will provide more updates later after syncing with @jlowe. For starters, we are holding onto the original pre-contiguous-split tables even while retrying (similar to some of the window bugs that Andy G has fixed), so we are doubling up on memory at least in GpuSingleDirectoryDataWriter and in GpuDynamicPartitionDataConcurrentWriter. I should have a PR for these two before tackling the semaphore issue.

At a high level, I am seeing various instances of cuDF table -> ColumnarBatch -> cuDF table -> ColumnarBatch conversions that I know can be fixed, especially with #8262.

Other than that, copying to host after encoding on the GPU and then releasing the semaphore once the batch is done seems reasonable.

abellina (Collaborator) commented

I am blocked a bit until #8243 gets resolved. I need to refactor the same code that @andygrove is touching in his PR in order to implement memory-reducing changes and changes where every batch is spillable at a higher level.

abellina (Collaborator) commented

Part of this issue was improved in #8385, by closing tables/batches before going to the writers and handing them off to be closed at write time.

The core of this issue, where the semaphore is released while partitioned chunks are still held in memory, is still present and will be addressed in 23.08, to which this issue is being moved as a P1.
