[BUG] Partitioned writes release GPU semaphore with unspillable GPU memory #6980
We need to split up the write logic into discrete methods for encoding the data vs. flushing the data to the filesystem, so the writers can be more intelligent about handling GPU memory and the semaphore. For example, if we're doing a simple (non-dynamic) partitioned write, it would be better to use an algorithm like the one sketched below.
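As a rough illustration, here is a minimal sketch of that encode/flush split for a simple non-dynamic partitioned write. Everything here (`DeviceBatch`, `HostBuffer`, `GpuSem`, `encodeToHost`, `flush`) is a hypothetical stand-in, not the actual spark-rapids API:

```scala
// Hypothetical sketch: separate the encode phase (GPU work, semaphore held)
// from the flush phase (host-only I/O, semaphore released). Placeholder types
// stand in for cuDF tables, ColumnarBatch, and GpuSemaphore.
import java.io.OutputStream

final case class DeviceBatch(rows: Array[Byte])  // stand-in for a GPU-resident batch
final case class HostBuffer(bytes: Array[Byte])  // encoded data copied to host

object GpuSem { def release(): Unit = println("GPU semaphore released") }

object PartitionedWriteSketch {
  // Encode one partition on the GPU and copy the result to host memory.
  def encodeToHost(batch: DeviceBatch): HostBuffer = HostBuffer(batch.rows)

  // Flush previously encoded host bytes to storage; no GPU work happens here.
  def flush(buf: HostBuffer, out: OutputStream): Unit = out.write(buf.bytes)

  def writePartitions(parts: Seq[DeviceBatch],
                      openStream: Int => OutputStream): Unit = {
    // Phase 1: do all GPU work up front, landing everything in host memory.
    val encoded = parts.map(encodeToHost)
    // No unspillable GPU memory remains, so releasing the semaphore is safe.
    GpuSem.release()
    // Phase 2: write host buffers to the filesystem with the semaphore free.
    encoded.zipWithIndex.foreach { case (buf, i) =>
      val out = openStream(i)
      try flush(buf, out) finally out.close()
    }
  }
}
```

The key property is that by the time the semaphore is released, everything the task still needs lives in host memory, so the GPU is free for other tasks to allocate from or spill into.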
cc @res-life, could you please take a look?
Hi @jlowe, is it possible to use two or more threads, where one encodes the data from GPU to host and others concurrently write the data from host to storage? For example, could we build a 3-stage pipeline and release the GPU semaphore after the encode stage completes?
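For illustration, a rough sketch of that pipeline shape using plain JDK threads and a bounded queue; the names (`HostChunk`, `PipelineSketch`) are hypothetical, and this is not how spark-rapids structures its writers:

```scala
// Rough sketch of a 3-stage pipeline (encode on GPU -> copy to host -> write
// to storage) with a bounded queue so the writer stage applies backpressure.
import java.util.concurrent.{ArrayBlockingQueue, Executors, TimeUnit}

final case class HostChunk(bytes: Array[Byte])  // encoded data in host memory

object PipelineSketch {
  def run(batches: Seq[Array[Byte]]): Unit = {
    val toWrite = new ArrayBlockingQueue[Option[HostChunk]](4)
    val pool = Executors.newFixedThreadPool(2)

    // Stages 1+2 on one thread: encode each batch on the GPU, copy the
    // result to host memory, and hand it off to the writer stage.
    pool.submit(new Runnable {
      def run(): Unit = {
        batches.foreach(b => toWrite.put(Some(HostChunk(b))))
        // All GPU work is finished here; the semaphore could be released now.
        toWrite.put(None) // end-of-stream marker
      }
    })

    // Stage 3 on another thread: drain host chunks and write them to storage.
    pool.submit(new Runnable {
      def run(): Unit =
        Iterator.continually(toWrite.take()).takeWhile(_.isDefined)
          .foreach(c => println(s"writing ${c.get.bytes.length} bytes"))
    })

    pool.shutdown()
    pool.awaitTermination(1, TimeUnit.MINUTES)
  }
}
```

The bounded queue keeps the encode stage from running arbitrarily far ahead of the storage writes, which bounds how much host memory the pipeline holds at once.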
Patterns like the following can be rewritten to match what Jason described.
@jlowe
Sorry, I did not notice the
We can just collect all the
Yes, although this is not required to fix the problem reported here and is more complex. IMO we should focus on fixing the problem first, and then worry about performance optimizations in follow-up PRs.
I have started looking into this and will provide more updates later after syncing with @jlowe. For starters, we are holding onto the original pre-contig-split tables even while retrying (similar to some of the window bugs that Andy G has fixed), so we are doubling up on memory at least in that case. At a high level, I am seeing various instances of cuDF table -> ColumnarBatch -> cuDF table -> ColumnarBatch that I know can be fixed, especially with #8262. Other than that, copying to host after encoding on the GPU, then releasing the semaphore once the batch is done, seems reasonable.
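For reference, the shape of that round trip in hypothetical form (the placeholder types below stand in for `ai.rapids.cudf.Table` and Spark's `ColumnarBatch`; this is not the actual code):

```scala
// Illustrative only: the redundant representation round trip mentioned above,
// cuDF table -> ColumnarBatch -> cuDF table -> ColumnarBatch.
object RoundTripSketch {
  final case class FakeTable(cols: Vector[String])  // stand-in for cuDF Table
  final case class FakeBatch(cols: Vector[String])  // stand-in for ColumnarBatch

  def tableToBatch(t: FakeTable): FakeBatch = FakeBatch(t.cols)
  def batchToTable(b: FakeBatch): FakeTable = FakeTable(b.cols)

  // Anti-pattern: convert back and forth at every hand-off; in the real code
  // each hop costs work and juggles GPU column references.
  def roundTrip(t: FakeTable): FakeBatch =
    tableToBatch(batchToTable(tableToBatch(t)))

  // Preferred: stay in one representation and convert once at the boundary.
  def singleHop(t: FakeTable): FakeBatch = tableToBatch(t)
}
```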
I am blocked a bit until #8243 gets resolved. I need to refactor the same code that @andygrove is touching in his PR in order to implement memory-reducing changes, and changes where every batch is spillable at a higher level.
Part of this issue was improved in #8385 by closing tables/batches before going to the writers and handing them off to be closed at write time. The core of this issue, where the semaphore is released while partitioned chunks are still held in GPU memory, remains, and will be addressed in 23.08, to which this issue is being moved as a P1.
Currently partitioned writes work with the following algorithm:
The problem with this algorithm is that currently the "write out the table" step consists of the following steps internally:
This means that if there are at least two partitions, we end up releasing the GPU semaphore with unspillable GPU memory still allocated. That allows other tasks to start allocating on the GPU and potentially run out of GPU memory, because GPU memory held by a task that is not holding the semaphore cannot be freed.
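A hypothetical sketch of that ordering (placeholder names only, with the internal steps inferred from the description above):

```scala
// Sketch of the bug: "write out the table" releases the semaphore per table,
// while the remaining partitions' tables are still alive on the GPU.
object SemaphoreBugSketch {
  final case class GpuTable(id: Int)  // stand-in for one partition's GPU data

  object Sem {
    def acquire(): Unit = println("GPU semaphore acquired")
    def release(): Unit = println("GPU semaphore released")
  }

  // Hypothetical internal ordering of the "write out the table" step.
  def writeTable(t: GpuTable): Unit = {
    // ... encode t on the GPU into a device buffer ...
    Sem.release() // semaphore dropped while sibling partitions stay on the GPU
    // ... write the encoded data to the distributed filesystem ...
  }

  def partitionedWrite(parts: Seq[GpuTable]): Unit = {
    Sem.acquire()
    // BUG: once the first writeTable releases the semaphore, the remaining
    // partitions' tables are still allocated (and unspillable) on the GPU,
    // yet other tasks are now free to start allocating device memory.
    parts.foreach(writeTable)
  }
}
```

Splitting encode from flush, as proposed at the top of this thread, avoids this: all device data is encoded and copied to host before the semaphore is dropped.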