This issue was moved to a discussion.
You can continue the conversation there. Go to discussion →
[QST] Is the GPU semaphore acquired and released at task-lifetime level or batch-lifetime level?
#835
Comments
Broadcasted tables are reused across tasks and left in GPU memory until ultimately garbage collected (so in that sense "intentionally leaked"). Currently broadcasted tables are not marked as spillable batches, but this is planned. I couldn't find a tracking ticket for it, so I filed #836. We haven't prioritized it since in most cases broadcast tables are quite small. No task "owns" a broadcast table, so there are no GPU semaphore semantics around them.
The semaphore tends to be oriented around batch-lifetime, but that can cause problems when tasks are holding onto potentially large amounts of GPU memory, like a hash join table. We'd like to address this by adding any batches held beyond the current batch iteration to the spillable buffer framework, so that they can be released if memory pressure is high.
In general an issue with task-lifetime is that it can severely hurt performance relative to a CPU query when parts of a stage are on the GPU and parts are on the CPU. If we have the entire task lifetime covered by the semaphore, even the parts not on the GPU, then the parallelism of the CPU portions is limited by the parallelism allowed on the GPU, which is often less than the parallelism achieved in a CPU-only query run. That's why we want to release the semaphore when a batch transitions off of the GPU, but in order to do so relatively safely we need to make sure all other buffers associated with the task that are still held on the GPU during the CPU portion of the query are spillable, in case that's needed to free GPU memory.
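As a rough illustration of the parallelism argument above, the following Python sketch compares how many tasks could run their CPU-only portion at once under the two policies. The function name and the example numbers (16 task slots, 2 concurrent GPU tasks) are hypothetical and are not taken from the plugin; this is a simplified upper bound, not a model of the real scheduler.

```python
def cpu_portion_parallelism(task_slots: int, concurrent_gpu_tasks: int,
                            semaphore_scope: str) -> int:
    """Upper bound on tasks that can run their CPU-only portion concurrently.

    semaphore_scope is "task" (semaphore held for the whole task) or
    "batch" (semaphore released when a batch leaves GPU memory).
    """
    if semaphore_scope == "task":
        # A task in its CPU-only portion still holds the semaphore, so no
        # more than `concurrent_gpu_tasks` tasks can be making progress.
        return min(task_slots, concurrent_gpu_tasks)
    if semaphore_scope == "batch":
        # The semaphore was released when the data left the GPU, so only the
        # executor's task slots limit the CPU-only portion.
        return task_slots
    raise ValueError(f"unknown scope: {semaphore_scope}")


# Hypothetical executor: 16 task slots, 2 concurrent GPU tasks.
task_scope = cpu_portion_parallelism(16, 2, "task")
batch_scope = cpu_portion_parallelism(16, 2, "batch")
print(task_scope, batch_scope)
```

Under these assumed numbers the CPU-only portion would be capped at 2 tasks with a task-lifetime semaphore, versus all 16 slots with a batch-lifetime one.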
So, do you mean case1 (BroadcastHashJoin) is batch-lifetime and case2 (ShuffledHashJoin) is task-lifetime? @jlowe
They are both effectively batch lifetime. The semaphore is released when a batch percolating through the iterators hits a point where it leaves GPU memory, and it is re-acquired when a new batch is created in GPU memory. Note that batch lifetime becomes task lifetime if the batch originates in GPU memory at the start of the task and remains on the GPU through the end of the task.
When a task acquires the semaphore but does not release it explicitly (i.e. it is released implicitly when the task completes), batch-lifetime becomes task-lifetime. In what scenario will a task not release the semaphore explicitly? @jlowe
I think that in a hash join, whether shuffled or broadcast, the streamed table is processed batch by batch, and the joined results finally leave GPU memory batch by batch. So how does the task know it should not release the semaphore in the shuffled hash join? (This is what I think; please correct me if I am wrong.)
There is a task completion action that ensures a task will always release the semaphore when it completes, so in that sense it will always be released. An example of where a task will not release the semaphore until the very end when it completes is when using the RAPIDS shuffle manager and everything the task does is on the GPU. In that scenario, task outputs are left on the GPU, so the semaphore should be left held until the task completes.
The general rule regarding the semaphore is as I stated before:
That means only points in the plan where we will take data that wasn't already on the GPU and instantiate a new batch of data in GPU memory need to acquire the semaphore. For example, data loaders (Parquet readers, etc.), legacy shuffle readers, and RowToColumnar transforms are some points where data is entering the GPU memory, so that's where we need to acquire the semaphore.
Similarly, we only need to release the semaphore when data was on the GPU for an operation but leaves the GPU during/after the operation. For example, data writers (Parquet writer, etc.), legacy shuffle writer, and ColumnarToRow transforms are some points where data is leaving GPU memory, so that's where we need to release the semaphore.
Operators that take an input batch on the GPU and produce an output batch on the GPU do not need to acquire the semaphore because it is assumed some operator involved in producing the input batch already acquired it as part of creating that batch. This is discussed in the developer guide. Therefore the join code, regardless of what kind of join it is, should never need to acquire or release the semaphore. It is not a point where batch data is entering or leaving GPU memory.
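The rule above can be sketched with a toy model in Python. Everything here is hypothetical: `ToyGpuSemaphore`, `run_task`, and the `leaves_gpu` flag are illustrative names, and the class only loosely mirrors the idea of an "acquire if necessary / release if necessary" semaphore plus a task completion action. It is not the plugin's implementation.

```python
import threading


class ToyGpuSemaphore:
    """Toy model: at most `permits` tasks may hold the GPU at once."""

    def __init__(self, permits):
        self._sem = threading.Semaphore(permits)
        self._lock = threading.Lock()
        self._holders = set()  # ids of tasks currently holding a permit

    def acquire_if_necessary(self, task_id):
        """Acquire a permit unless this task already holds one."""
        with self._lock:
            if task_id in self._holders:
                return False   # already held: nothing to do
        self._sem.acquire()    # may block until another task releases
        with self._lock:
            self._holders.add(task_id)
        return True

    def release_if_necessary(self, task_id):
        """Release the permit if this task holds one."""
        with self._lock:
            if task_id not in self._holders:
                return False
            self._holders.discard(task_id)
        self._sem.release()
        return True


def run_task(sem, task_id, host_batches, leaves_gpu):
    """Process batches; record when the semaphore actually changes hands."""
    events = []
    try:
        for rows in host_batches:
            # Data enters GPU memory here (like a reader or RowToColumnar).
            if sem.acquire_if_necessary(task_id):
                events.append("acquire")
            # GPU-to-GPU operators (filter, join, ...) make no semaphore calls.
            gpu_batch = [r + 1 for r in rows]
            # Data leaves GPU memory here (like a writer or ColumnarToRow).
            if leaves_gpu and sem.release_if_necessary(task_id):
                events.append("release")
    finally:
        # Stand-in for the task completion action: always release at the end.
        if sem.release_if_necessary(task_id):
            events.append("release_on_completion")
    return events


sem = ToyGpuSemaphore(permits=1)
batch_scoped = run_task(sem, 1, [[1, 2], [3]], leaves_gpu=True)
task_scoped = run_task(sem, 2, [[1, 2], [3]], leaves_gpu=False)
print(batch_scoped)
print(task_scoped)
```

In this toy, a task whose batches leave the GPU acquires and releases once per batch, while a task whose data never leaves the GPU acquires once and releases only at completion, which is the case where batch lifetime coincides with task lifetime.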
In this image, the task will take the GPU semaphore in […]. According to the rule, the task will release the semaphore in […]. Is this really OK? @jlowe
I have read the guide, and it is very clear. I also know which lines of code acquire or release the semaphore in RAPIDS. UPDATES: spark-rapids/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuSemaphore.scala Line 111 in 1a2b17e
spark-rapids/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuSemaphore.scala Line 134 in 1a2b17e
In this thread, by acquire and release I mean actually acquiring and releasing the semaphore, not just decrementing the ref-count. But this gap should not be a problem!
Yes, that is what will happen, and no, it's not completely OK. The task will have released the semaphore but the build-side hash table is still sitting in GPU memory. Our plan to fix this is to have the build hash-map table registered as a spillable buffer, which ties into what I mentioned before:
With that change, when the semaphore is released and some other task adds memory pressure to the GPU, we could spill the batches held by tasks that aren't actively holding the GPU semaphore, if spilling that memory is needed. This would allow the additional parallelism without the risk of OOM (which we are risking today). This is one reason why the number of concurrent GPU tasks currently defaults to 1.
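A minimal sketch of that idea, assuming a toy store in which every registered buffer is spillable: when a new allocation would not fit, buffers owned by tasks that do not currently hold the semaphore are moved off the GPU first. `ToySpillStore`, its fields, and the sizes are all hypothetical, and this does not reflect the plugin's actual spill framework.

```python
class ToySpillStore:
    """Toy GPU memory pool where every registered buffer is spillable."""

    def __init__(self, gpu_capacity):
        self.gpu_capacity = gpu_capacity
        self.on_gpu = {}    # buffer name -> (owner task id, size)
        self.spilled = {}   # buffers moved off the GPU (e.g. to host memory)

    def gpu_used(self):
        return sum(size for _, size in self.on_gpu.values())

    def allocate(self, name, owner, size, semaphore_holders):
        """Add a buffer, spilling idle tasks' buffers if the GPU is full."""
        for victim in list(self.on_gpu):
            if self.gpu_used() + size <= self.gpu_capacity:
                break
            victim_owner, _ = self.on_gpu[victim]
            # Only spill buffers whose owner is not holding the semaphore.
            if victim_owner not in semaphore_holders:
                self.spilled[victim] = self.on_gpu.pop(victim)
        if self.gpu_used() + size > self.gpu_capacity:
            raise MemoryError("toy GPU out of memory")
        self.on_gpu[name] = (owner, size)


store = ToySpillStore(gpu_capacity=10)
# Task 1 builds its hash table while holding the semaphore.
store.allocate("t1_build_table", owner=1, size=6, semaphore_holders={1})
# Task 1 has released the semaphore; task 2 now holds it and needs 7 units.
store.allocate("t2_batch", owner=2, size=7, semaphore_holders={2})
print(sorted(store.on_gpu), sorted(store.spilled))
```

Here task 1's build table is spilled rather than causing an out-of-memory error when task 2 allocates, which is the behavior the comment describes as the goal.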
So, the number of hash maps built in GPU memory may be greater than the GPU concurrency specified by the --conf option (in the worst case, every task's build-side hash map will be in GPU memory)?
1? Isn't it 2? @jlowe
I am more confused. When the […] because, after we moved the deserialization code out of the scope of the GPU semaphore, the disk/network I/O and decompression in the shuffle read will be just like the […]
Yes, there are many ways an OOM can happen with concurrent tasks. All the tasks may hit their maximum GPU memory pressure at the same time, also causing an OOM. Sometimes they don't line up and it doesn't OOM. That's the nature of running concurrently. We don't have a good plan for solving the simultaneous maximum pressure problem, but we do have a plan to solve the issue of leaving the build batch table in memory without the semaphore, and that's by making it automatically spillable. Build-side tables, like broadcast tables, are often small, so we don't normally want to avoid releasing the semaphore just because they're there. We'd rather let them spill if that's necessary to avoid an OOM. In the short term, we're recommending people run with less GPU concurrency if they are hitting an OOM condition with it. Even if we held onto the GPU semaphore while the build batch remains on the GPU as a "fix", it would not prevent the coinciding maximum memory pressure problem that can lead to sporadic OOMs, and it would actively harm queries where there is CPU-only processing after the join in the same stage (e.g. some part of the plan that we can't place on the GPU, like an expensive project with a UDF).
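The "sometimes they line up, sometimes they don't" point can be illustrated with a small calculation. This is a hypothetical sketch: the per-task memory timelines, the capacity of 10 units, and the function name are made up for illustration.

```python
def peak_combined_usage(timelines):
    """Highest total GPU memory in use at any time step across all tasks.

    `timelines` holds one list per task, giving that task's GPU memory use
    at each time step; all lists are assumed to have the same length.
    """
    return max(sum(step) for step in zip(*timelines))


capacity = 10
# Both tasks hit their peak (8 units) at the same step.
aligned = [[2, 8, 2, 2], [2, 8, 2, 2]]
# Same tasks, but the second one peaks one step later.
staggered = [[2, 8, 2, 2], [2, 2, 8, 2]]

aligned_peak = peak_combined_usage(aligned)
staggered_peak = peak_combined_usage(staggered)
print(aligned_peak > capacity, staggered_peak > capacity)
```

With these assumed numbers the aligned case needs 16 units and would run out of memory, while the staggered case peaks at exactly 10 and fits, even though each task does the same work.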
Yes, the default for concurrent GPU tasks is 1 as seen in the config definition.
This was covered by Bobby's comment where he said it would only benefit the very first batch, not subsequent batches. This should be visible in an nsys profile trace. I highly recommend using those traces to see when tasks are blocked on the GPU semaphore and when they acquire/release it. There are NVTX ranges used when acquiring the semaphore and when releasing it (the latter being very small, since releasing doesn't block). See the NVTX profiling doc in the developer docs section for more details.
Two cases
Case1
Scan (a big table)
-> [GpuFilter, GpuCoalesceBatches, GpuProject, ...]
-> GpuBroadcastHashJoin (with another small table)
-> [GpuFilter, GpuHashAggregate, GpuColumnarExchange, ...]
In case1, the small table is broadcast to each executor, and the whole table should be in GPU memory. I want to know:
Case2
GpuColumnarExchange
-> GpuCoalesceBatches
-> GpuShuffledHashJoin (with another big table)
-> [GpuFilter, GpuHashAggregate, GpuColumnarExchange, ...]
I think, in case2, the GpuShuffledHashJoin will build a task-level hash map, and this hash map should stay in GPU memory until all the batches from the streamed table have been processed. So once a task has acquired the semaphore, it will hold it until the task has finished, which makes this a task-lifetime level acquire-release (this is only my thought; please correct me if I am wrong). @revans2 @jlowe thanks.