
[QST] Is the GPU semaphore acquired and released at task-lifetime level or batch-lifetime level? #835

Closed
JustPlay opened this issue Sep 23, 2020 · 11 comments
Labels
question Further information is requested

Comments

@JustPlay

JustPlay commented Sep 23, 2020

Two cases

Case 1
Scan (a big table) -> [GpuFilter, GpuCoalesceBatches, GpuProject, ...] -> GpuBroadcastHashJoin (with another small table) -> [GpuFilter, GpuHashAggregate, GpuColumnarExchange, ...]

In case 1, the small table is broadcast to each executor, and the whole table should be in GPU memory. I want to know:

  1. Will the broadcast table be shared by all tasks (in the same executor) running on the GPU, or will each task have its own copy in GPU memory?
  2. Does a task release its GPU semaphore each time a batch from the bigger table has been processed and the result has left GPU memory (batch-lifetime acquire/release), or does it hold the semaphore until all batches from the bigger table have been processed and have left GPU memory (task-lifetime acquire/release)?

Case 2
GpuColumnarExchange -> GpuCoalesceBatches -> GpuShuffledHashJoin (with another big table) -> [GpuFilter, GpuHashAggregate, GpuColumnarExchange, ...]

I think that in case 2, GpuShuffledHashJoin will build a task-level hash map, and this hash map should stay in GPU memory until all batches from the streamed table have been processed. So once a task has acquired the semaphore, it will hold it until the task finishes, which is task-lifetime acquire/release. (This is only my understanding; please correct me if I am wrong.)
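
To make the two terms concrete, here is a rough sketch of what I mean (toy code only, not the plugin's real classes; `gpuSemaphore`, `processOnGpu`, and `copyBackToHost` are made-up placeholders):

```scala
// Toy illustration only: every name here is a hypothetical placeholder, not a plugin API.
import java.util.concurrent.Semaphore

object SemaphoreLifetimes {
  type HostBatch = Array[Byte]            // stand-in for a batch in host memory
  val gpuSemaphore = new Semaphore(1)     // stand-in for the concurrent-GPU-tasks limit

  def processOnGpu(b: HostBatch): HostBatch = b   // pretend GPU work
  def copyBackToHost(b: HostBatch): Unit = ()     // pretend device-to-host copy

  // Batch-lifetime: acquire when a batch enters GPU memory, release when it leaves.
  def batchLifetime(batches: Iterator[HostBatch]): Unit =
    batches.foreach { hostBatch =>
      gpuSemaphore.acquire()
      try copyBackToHost(processOnGpu(hostBatch))
      finally gpuSemaphore.release()
    }

  // Task-lifetime: acquire once and hold it until every batch of the task is done.
  def taskLifetime(batches: Iterator[HostBatch]): Unit = {
    gpuSemaphore.acquire()
    try batches.foreach(b => copyBackToHost(processOnGpu(b)))
    finally gpuSemaphore.release()
  }
}
```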

@revans2 @jlowe thanks.

@JustPlay JustPlay added ? - Needs Triage Need team to review and classify question Further information is requested labels Sep 23, 2020
@jlowe
Contributor

jlowe commented Sep 23, 2020

Will the broadcast table be shared by all tasks (in the same executor) running on the GPU, or will each task have its own copy in GPU memory?

Broadcast tables are reused across tasks and left in GPU memory until they are ultimately garbage collected (so in that sense they are "intentionally leaked"). Currently broadcast tables are not marked as spillable batches, but this is planned. I couldn't find a tracking ticket for it, so I filed #836. We haven't prioritized it since in most cases broadcast tables are quite small. No task "owns" a broadcast table, so there are no GPU semaphore semantics around them.

The semaphore tends to be oriented around batch-lifetime, but that can cause problems when tasks are holding onto potentially large amounts of GPU memory, like a hash join table. We'd like to address this by adding any batches held beyond the current batch iteration to the spillable buffer framework, so they can be released if memory pressure is high.

In general, an issue with task-lifetime is that it can severely hurt performance relative to a CPU query when parts of a stage are on the GPU and parts are on the CPU. If the entire task lifetime is covered by the semaphore, even the parts not on the GPU, then the parallelism of the CPU portions is limited by the parallelism allowed on the GPU, which is often less than what a CPU-only run of the query achieves. That's why we want to release the semaphore when a batch transitions off of the GPU, but to do that relatively safely we need to make sure that any other buffers the task is still holding on the GPU during the CPU portion of the query are spillable, in case that is needed to free GPU memory.

@JustPlay
Author

The semaphore tends to be oriented around batch-lifetime, but that can cause problems when tasks are holding onto potentially large amounts of GPU memory, like a hash join table. [...]

So, do you mean case 1 (BroadcastHashJoin) is batch-lifetime and case 2 (ShuffledHashJoin) is task-lifetime? @jlowe

@jlowe
Contributor

jlowe commented Sep 23, 2020

They are both batch lifetime effectively. The semaphore is released when a batch percolating through the iterators hits a point where it leaves GPU memory, and re-acquired when a new batch is created in GPU memory. Note that batch lifetime becomes task lifetime if the batch originates in GPU memory at the start of the task and remains on the GPU through the end of the task.

@JustPlay
Author

JustPlay commented Sep 23, 2020

Note that batch lifetime becomes task lifetime if the batch originates in GPU memory at the start of the task and remains on the GPU through the end of the task

So when a task acquires the semaphore but does not release it explicitly (i.e. it is only released implicitly when the task completes), batch-lifetime becomes task-lifetime. In what scenario will a task not release the semaphore explicitly? @jlowe

They are both batch lifetime effectively. The semaphore is released when a batch percolating through the iterators hits a point where it leaves GPU memory and re-acquired when a new batch is created in GPU memory

I think that in a hash join, whether shuffled or broadcast, the streamed table is processed batch by batch, and the join results ultimately leave GPU memory batch by batch. So how does the task know it should not release the semaphore in the shuffled hash join case? (This is what I think; please correct me if I am wrong.)

@jlowe
Contributor

jlowe commented Sep 23, 2020

In what scenario will a task not release the semaphore explicitly?

There is a task completion action that ensures a task will always release the semaphore when it completes, so in that sense it will always be released. An example of a task not releasing the semaphore until the very end, when it completes, is when the RAPIDS shuffle manager is used and everything the task does is on the GPU. In that scenario the task's output is left on the GPU, so the semaphore should be held until the task completes.
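
As a rough illustration, the completion hook looks something like this (a simplified sketch, not the plugin's actual code; `releaseGpuSemaphoreForTask` is a hypothetical placeholder for whatever hands back this task's permit):

```scala
import org.apache.spark.TaskContext

object SemaphoreCompletionHook {
  // Sketch only: registers a hook so the permit cannot leak even if no operator
  // explicitly released it during the task (runs on success or failure).
  def ensureReleasedOnCompletion(releaseGpuSemaphoreForTask: () => Unit): Unit = {
    val tc = TaskContext.get()
    if (tc != null) {
      tc.addTaskCompletionListener[Unit](_ => releaseGpuSemaphoreForTask())
    }
  }
}
```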

How does the task know it should not release the semaphore in the shuffled hash join case?

The general rule regarding the semaphore is as I stated before:

The semaphore is released when a batch percolating through the iterators hits a point where it leaves GPU memory and re-acquired when a new batch is created in GPU memory.

That means only points in the plan where we will take data that wasn't already on the GPU and instantiate a new batch of data in GPU memory need to acquire the semaphore. For example, data loaders (Parquet readers, etc.), legacy shuffle readers, and RowToColumnar transforms are some points where data is entering the GPU memory, so that's where we need to acquire the semaphore. Similarly, we only need to release the semaphore when data was on the GPU for an operation but leaves the GPU during/after the operation. For example, data writers (Parquet writer, etc.), legacy shuffle writer, and ColumnarToRow transforms are some points where data is leaving GPU memory, so that's where we need to release the semaphore. Operators that take an input batch on the GPU and produce an output batch on the GPU do not need to acquire the semaphore because it is assumed some operator involved in producing the input batch already acquired it as part of creating that batch. This is discussed in the developer guide.

Therefore the join code, regardless of what kind of join it is, should never need to acquire or release the semaphore. It is not a point where batch data is entering or leaving GPU memory.
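
To illustrate the rule with a toy sketch (none of these names are the plugin's real classes): an operator that brings data onto the GPU acquires, an operator that moves data off the GPU releases, and a GPU-in/GPU-out operator such as the join does neither.

```scala
// Toy sketch of the acquire/release rule; every name here is a hypothetical placeholder.
import java.util.concurrent.Semaphore

object SemaphoreRule {
  type DeviceBatch = Array[Long]           // stand-in for a batch in GPU memory
  type HostRows    = Array[String]         // stand-in for rows back in host memory
  val gpuSemaphore = new Semaphore(2)      // stand-in for "concurrent GPU tasks"

  // Data enters GPU memory here (e.g. a Parquet read), so this is an acquire point.
  def gpuScan(files: Iterator[String]): Iterator[DeviceBatch] =
    files.map { f =>
      gpuSemaphore.acquire()
      Array.fill(4)(f.length.toLong)       // pretend decode straight into GPU memory
    }

  // GPU in, GPU out: no acquire or release; whoever produced the input batch holds it.
  def gpuJoin(stream: Iterator[DeviceBatch], built: DeviceBatch): Iterator[DeviceBatch] =
    stream.map(batch => batch ++ built)    // pretend join against the build side

  // Data leaves GPU memory here (e.g. ColumnarToRow or a writer), so this releases.
  def columnarToRow(batches: Iterator[DeviceBatch]): Iterator[HostRows] =
    batches.map { batch =>
      val rows = batch.map(_.toString)     // pretend device-to-host copy
      gpuSemaphore.release()
      rows
    }
}
```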

@JustPlay
Author

JustPlay commented Sep 23, 2020

[image: the query plan for case 2, with the two join inputs labeled (1) and (2), the GpuShuffledHashJoin labeled (3), and the GpuColumnarExchange labeled (4)]

In this plan, the task will take the GPU semaphore in (1) or (2), depending on which is the build side, then do the (3) GpuShuffledHashJoin, and finally do the legacy shuffle write in (4) GpuColumnarExchange.

According to the rule, the task will release the semaphore in (4) once a streamed batch has been shuffle-written, but the built hash map may still be in GPU memory.

Is this really OK? @jlowe

This is discussed in the developer guide.

I have read the guide, and it is very clear. I also know which lines of code acquire or release the semaphore in the plugin. But I just cannot fully understand what will happen at runtime.

UPDATE: I think one gap between us is that the GpuSemaphore has a reference count.

In this topic, by "acquire" and "release" I mean really taking and giving back the semaphore, not just incrementing or decrementing the ref count.

But this gap should not be a problem!

@jlowe
Contributor

jlowe commented Sep 23, 2020

According to the rule, the task will release the semaphore in (4) once a streamed batch has been shuffle-written, but the built hash map may still be in GPU memory. Is this really OK?

Yes, that is what will happen, and no, it's not completely OK. The task will have released the semaphore but the build-side hash table is still sitting in GPU memory.

Our plan to fix this is to register the build-side hash table as a spillable buffer, which ties into what I mentioned before:

We'd like to address this by adding any batches held beyond the current batch iteration to the spillable buffer framework, so they can be released if memory pressure is high.

With that change, when the semaphore has been released and some other task then puts memory pressure on the GPU, we could spill the batches held by tasks that aren't actively holding the GPU semaphore, if spilling that memory is needed. This would allow the additional parallelism without the risk of OOM (a risk we are taking today). This is one reason why the number of concurrent GPU tasks currently defaults to 1.
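
Conceptually the pattern would look something like this (a simplified sketch under the assumption of a spillable buffer that keeps a host copy; none of these names are real spark-rapids APIs):

```scala
// Simplified sketch only; all names here are hypothetical, not spark-rapids APIs.
import java.util.concurrent.Semaphore

object SpillableBuildSide {
  type DeviceBatch = Array[Long]                  // stand-in for a GPU buffer
  val gpuSemaphore = new Semaphore(1)

  // Keeps a host copy so the device copy can be dropped under memory pressure
  // (another task would call spill() when it needs GPU memory).
  final class SpillableBuffer(private var device: Option[DeviceBatch],
                              private val host: DeviceBatch) {
    def spill(): Unit = synchronized { device = None }   // drop the GPU copy
    def getOnDevice(): DeviceBatch = synchronized {
      device.getOrElse { val d = host.clone(); device = Some(d); d } // re-upload if spilled
    }
  }

  def joinStream(buildSide: DeviceBatch, stream: Iterator[DeviceBatch]): Unit = {
    // Register the build table so it can be spilled while this task is off the GPU.
    val spillableBuild = new SpillableBuffer(Some(buildSide), buildSide.clone())
    stream.foreach { batch =>
      gpuSemaphore.acquire()                      // back on the GPU for this batch
      val build = spillableBuild.getOnDevice()    // unspill if it was evicted meanwhile
      val joined = batch ++ build                 // pretend join
      // ... hand `joined` off the GPU (e.g. the shuffle write) ...
      gpuSemaphore.release()                      // off the GPU until the next batch
    }
  }
}
```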

@JustPlay
Author

JustPlay commented Sep 23, 2020

The task will have released the semaphore but the build-side hash table is still sitting in GPU memory.

So the number of built hash maps in GPU memory may be greater than the GPU concurrency specified by the --conf option (in the worst case, every task's build-side map will be in GPU memory)?
If so, this would be one of the causes of OOM. @jlowe

@JustPlay
Author

This is one reason why the number of concurrent GPU tasks currently defaults to 1.

1? Isn't it 2? @jlowe

@JustPlay
Author

JustPlay commented Sep 23, 2020

@jlowe

I am more confused.

If the GpuShuffledHashJoin exec path acquires and releases the semaphore at batch-lifetime level, why does #679 (moving some deserialization code out of the scope of the GPU semaphore to increase CPU concurrency) give nearly no performance gain?

Because, after the deserialization code has been moved out of the scope of the GPU semaphore, the disk/network I/O and decompression in the shuffle read should behave just like GpuScan (first read the file from disk into pinned memory, then acquire the semaphore), so the CPU part should be overlapped with, and thus hidden behind, the GPU part.
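
What I have in mind is roughly this pattern (a toy sketch of my understanding, not the actual #679 change; all names are placeholders):

```scala
// Toy sketch of my understanding only; names are placeholders, not the real #679 code.
import java.util.concurrent.Semaphore

object ShuffleReadOutsideSemaphore {
  type HostBatch   = Array[Byte]
  type DeviceBatch = Array[Byte]
  val gpuSemaphore = new Semaphore(1)

  def readAndDecompress(block: String): HostBatch =   // CPU only: I/O + decompression
    block.getBytes("UTF-8")

  def copyToGpu(h: HostBatch): DeviceBatch = h        // pretend host-to-device copy

  def shuffleRead(blocks: Iterator[String]): Iterator[DeviceBatch] =
    blocks.map { block =>
      val host = readAndDecompress(block)  // done before taking the semaphore, so other
                                           // tasks can use the GPU during this CPU work
      gpuSemaphore.acquire()               // only now take a GPU slot
      copyToGpu(host)
    }
}
```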

@jlowe
Contributor

jlowe commented Sep 23, 2020

If so, this would be one of the causes of OOM.

Yes, there are many ways an OOM can happen with concurrent tasks. All the tasks may hit their maximum GPU memory pressure at the same time, which can also cause an OOM. Sometimes they don't line up and it doesn't OOM. That's the nature of running concurrently. We don't have a good plan for solving the simultaneous-maximum-pressure problem, but we do have a plan to solve the issue of leaving the build-side table in memory without the semaphore, and that is to make it automatically spillable. Build-side tables, like broadcast tables, are often small, so we don't normally want to avoid releasing the semaphore just because they're there. We'd rather let them spill if that's necessary to avoid an OOM.

In the short term, we're recommending people run with less GPU concurrency if they are hitting an OOM condition with it. Even if we held onto the GPU semaphore while the build batch is left on the GPU as a "fix", it would not prevent the coinciding-maximum-memory-pressure problem that can lead to sporadic OOMs, and it would actively harm queries where there is CPU-only processing after the join in the same stage (e.g. some part of the plan that we can't place on the GPU, like an expensive project with a UDF).
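
For example, lowering the concurrency comes down to setting the concurrent GPU tasks config when building the session. A sketch (spark-shell style; I'm assuming the key is spark.rapids.sql.concurrentGpuTasks, so double-check it against the configs doc for your plugin version):

```scala
// Sketch: reducing GPU concurrency to lower OOM risk. Assumes the config key is
// spark.rapids.sql.concurrentGpuTasks; verify against the plugin's configuration docs.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("rapids-concurrency-example")
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
  .config("spark.rapids.sql.concurrentGpuTasks", "1")  // 1 = one task on the GPU at a time
  .getOrCreate()
```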

1? Isn't it 2?

Yes, the default for concurrent GPU tasks is 1 as seen in the config definition.

why does #679 (moving some deserialization code out of the scope of the GPU semaphore to increase CPU concurrency) give nearly no performance gain?

This was covered by Bobby's comment, where he said it would only benefit the very first batch, not subsequent batches. This should be visible in an nsys profile trace; I highly recommend using those traces to see when tasks are blocked on the GPU semaphore and when they acquire/release it. There are NVTX ranges used when acquiring the semaphore and when releasing it (the latter being very small since it doesn't block to release). See the NVTX profiling doc in the developer docs section for more details.
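
If you want to mark up your own suspect sections the same way, here is a minimal sketch, assuming cudf's Java NvtxRange API (an AutoCloseable range that shows up as a named span on the nsys timeline; verify the class names against the cudf version bundled with your build):

```scala
// Minimal sketch, assuming ai.rapids.cudf.NvtxRange / NvtxColor are available.
import ai.rapids.cudf.{NvtxColor, NvtxRange}

object NvtxExample {
  // Wraps a block of work in a named NVTX range so it appears in the nsys timeline.
  def withNvtxRange[T](label: String)(body: => T): T = {
    val range = new NvtxRange(label, NvtxColor.GREEN)
    try {
      body
    } finally {
      range.close()
    }
  }
}

// Usage: NvtxExample.withNvtxRange("my suspect section") { /* work to profile */ }
```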

@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Sep 29, 2020
@NVIDIA NVIDIA locked and limited conversation to collaborators Apr 28, 2022
@sameerz sameerz converted this issue into discussion #5388 Apr 28, 2022

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →
