[BUG] Partitioned writes release GPU semaphore with unspillable GPU memory #6980

Closed · jlowe opened this issue Nov 2, 2022 · 10 comments · Fixed by #8385 or #8667
Labels: bug (Something isn't working), reliability (Features to improve reliability or bugs that severely impact the reliability of the plugin)

Comments

jlowe (Contributor) commented Nov 2, 2022

Currently partitioned writes work with the following algorithm:

  • Compute the split indices for the different partitions in the batch
  • Contiguous-split the batch
  • For each contiguous table:
    • Write out the table

The problem with this algorithm is that currently the "write out the table" step consists of the following steps internally:

  • encode the data on the GPU
  • transfer the encoded data to the host
  • release the GPU semaphore (to avoid holding it while doing a potentially very slow, non-GPU operation in the next step)
  • write the encoded data to the distributed filesystem

This means that if there are at least two partitions, we end up releasing the GPU semaphore with unspillable GPU memory still allocated. This can allow other tasks to start allocating on the GPU and potentially run out of GPU memory due to the inability to free up GPU memory from a task that is not holding the semaphore.
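For illustration, here is a rough Scala sketch of that per-partition flow; encodeOnGpu, copyToHost, and writeToDfs are hypothetical placeholders for the real writer internals, and table/splitIndices come from the surrounding context. This is not the actual plugin code:

// Sketch only: shows where the semaphore release lands relative to the splits
// that are still sitting unspillable in GPU memory.
val splits = table.contiguousSplit(splitIndices: _*)
splits.foreach { split =>
  val encoded = encodeOnGpu(split)                 // hypothetical: GPU encode
  val hostData = copyToHost(encoded)               // hypothetical: device-to-host copy
  // Released once per split, while the remaining splits are still unspillable on the GPU
  GpuSemaphore.releaseIfNecessary(TaskContext.get)
  writeToDfs(hostData)                             // hypothetical: slow filesystem write
}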

jlowe added the bug and ? - Needs Triage labels on Nov 2, 2022
jlowe (Contributor, Author) commented Nov 2, 2022

We need to split up the write logic into discrete methods for encoding the data vs. flushing the data to the filesystem so the writers can be more intelligent about handling GPU memory and the semaphore. For example, if we're doing a simple (non-dynamic) partitioned write, it would be better to use an algorithm like the following (a rough sketch follows the list):

  • Split the table into partitions on the GPU
  • For each table split:
    • encode the split on the GPU
    • transfer the encoded data to the host
    • free the encoded data on the GPU
  • Release the GPU semaphore
  • For each encoded split in host memory:
    • write the encoded split to the distributed filesystem
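A minimal Scala sketch of that shape, again using hypothetical encodeOnGpu, copyToHost, and writeToDfs helpers; the point is only the ordering of the GPU frees relative to the semaphore release:

// Sketch only: keep GPU memory only while encoding, then release the semaphore
// before any filesystem I/O happens.
val splits = table.contiguousSplit(splitIndices: _*)
val hostEncoded = splits.map { split =>
  try {
    val encoded = encodeOnGpu(split)               // hypothetical: GPU encode
    try {
      copyToHost(encoded)                          // hypothetical: device-to-host copy
    } finally {
      encoded.close()                              // free the encoded GPU data right away
    }
  } finally {
    split.close()                                  // free the contiguous split on the GPU
  }
}
// Nothing unspillable is left on the GPU for this task, so releasing is safe now.
GpuSemaphore.releaseIfNecessary(TaskContext.get)
hostEncoded.foreach(writeToDfs)                    // hypothetical: slow filesystem writes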

HaoYang670 (Collaborator) commented

cc @res-life, could you please take a look?

HaoYang670 (Collaborator) commented Nov 3, 2022

Hi @jlowe, is it possible to use 2 or more threads, with one encoding the data from GPU to host while others concurrently write the data from host to storage?

For example, could we build a 3-stage pipeline, releasing the GPU semaphore after thread 1 has copied all the data to host memory?
[attached image: diagram of the proposed 3-stage pipeline]
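For illustration only, here is a simplified two-thread variant of that idea: the task thread does the GPU encode and the device-to-host copy, hands host chunks to a writer thread over a bounded queue, and releases the semaphore once everything is on the host. The HostChunk type, encodedSplits, and the copyToHost/writeToDfs helpers are hypothetical, and this is not how the plugin is structured today:

import java.util.concurrent.ArrayBlockingQueue

// Hypothetical host-side payload produced by the encode + device-to-host stage.
case class HostChunk(bytes: Array[Byte])

val queue = new ArrayBlockingQueue[Option[HostChunk]](8)

// Writer thread: drains host chunks and writes them to the distributed filesystem.
val writerThread = new Thread(() => {
  var next = queue.take()
  while (next.isDefined) {
    writeToDfs(next.get)                           // hypothetical: slow filesystem write
    next = queue.take()
  }
})
writerThread.start()

// Task thread: encode on the GPU, copy to host, hand off, then release the semaphore.
encodedSplits.foreach(split => queue.put(Some(copyToHost(split)))) // hypothetical helpers
GpuSemaphore.releaseIfNecessary(TaskContext.get)
queue.put(None)                                    // end-of-stream marker for the writer
writerThread.join()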

res-life (Collaborator) commented Nov 3, 2022

Related code:
https://github.com/NVIDIA/spark-rapids/blob/branch-22.12/sql-plugin/src/main/scala/com/nvidia/spark/rapids/ColumnarOutputWriter.scala#L144

GpuSemaphore.releaseIfNecessary(TaskContext.get)

Patterns like the following can be rewritten to match what Jason described:

splits = table.contiguousSplit()
for each split in splits:
  writer.write(split)

res-life (Collaborator) commented Nov 3, 2022

@jlowe GpuParquetWriter is responsible for writing a Table to disk via the JNI ParquetTableWriter.
IIUC, both encoding the data and writing the encoded data happen on the cuDF side.
So does this issue require a cuDF PR to support it?

res-life (Collaborator) commented Nov 3, 2022

Sorry, I did not notice the consumer; no cuDF PR is needed.

private ParquetTableWriter(ParquetWriterOptions options, HostBufferConsumer consumer) {

We can just collect all the HostBuffers before writing to the distributed filesystem.
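For example, a consumer roughly like the following could buffer the encoded chunks in host memory until the write is triggered. This is a sketch only: it assumes the cuDF Java HostBufferConsumer callback handleBuffer(HostMemoryBuffer, long), and the class and method names here are illustrative rather than the plugin's actual code.

import java.io.OutputStream
import scala.collection.mutable.ArrayBuffer
import ai.rapids.cudf.{HostBufferConsumer, HostMemoryBuffer}

// Illustrative consumer: holds the encoded chunks handed over by the cuDF table
// writer so the GPU semaphore can be released before any filesystem I/O starts.
class BufferingConsumer extends HostBufferConsumer {
  private val chunks = ArrayBuffer[(HostMemoryBuffer, Long)]()

  // Called by the table writer each time a chunk of encoded data is ready.
  override def handleBuffer(buffer: HostMemoryBuffer, len: Long): Unit =
    chunks += ((buffer, len))

  // Later, after the semaphore has been released: drain the buffered chunks to
  // the distributed filesystem stream and close each host buffer.
  def writeTo(out: OutputStream): Unit = {
    chunks.foreach { case (buf, len) =>
      try {
        val bytes = new Array[Byte](len.toInt)     // sketch assumes chunks fit in an Int-sized array
        buf.getBytes(bytes, 0, 0, len)
        out.write(bytes)
      } finally {
        buf.close()
      }
    }
    chunks.clear()
  }
}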

jlowe (Contributor, Author) commented Nov 3, 2022

is it possible to use 2 or more threads, with one encoding the data from GPU to host while others concurrently write the data from host to storage?

Yes, although this is not required to fix the problem reported here and is more complex. IMO we should focus on fixing the problem first, and then worry about performance optimizations in followup PRs.

abellina (Collaborator) commented

I have started looking into this and will provide more updates later after syncing with @jlowe. For starters, we are holding onto the original pre-contiguous-split tables even while retrying (similar to some of the window bugs that Andy G has fixed), so we are doubling up on memory at least in GpuSingleDirectoryDataWriter and in GpuDynamicPartitionDataConcurrentWriter. I should have a PR for these two before tackling the semaphore issue.

At a high level, I am seeing various instances of cuDF table -> ColumnarBatch -> cuDF table -> ColumnarBatch conversions that I know can be fixed, especially with #8262.

Other than that, copying to host after encoding on the GPU and then releasing the semaphore once the batch is done seems reasonable.

abellina (Collaborator) commented

I am blocked a bit until #8243 gets resolved. I need to refactor the same code that @andygrove is touching in his PR in order to implement memory-reducing changes and changes where every batch is spillable at a higher level.

abellina (Collaborator) commented

Part of this issue was improved in #8385, by closing tables/batches before going to the writers and handing them off to be closed at write time.

The core of this issue, where the semaphore is released while partitioned chunks are still held in memory, is still present and will be addressed in 23.08, to which this issue is being moved as a P1.
