
[FEA] Multiple threads sharing the same GPU #15

Closed
revans2 opened this issue May 28, 2020 · 4 comments
Labels
feature request, performance, SQL

Comments

revans2 (Collaborator) commented May 28, 2020

Is your feature request related to a problem? Please describe.
Currently all cuDF processing uses a single stream, which blocks all other streams from sharing the GPU. The plugin uses a semaphore to prevent too many tasks from running on the GPU at once. This works to overlap some I/O with processing, but it does not work if you want to overlap processing with processing.

Describe the solution you'd like
It would be great to be able to support multiple streams, possibly using per-thread default streams.

@revans2 added the feature request, ? - Needs Triage, performance, and SQL labels May 28, 2020
rongou (Collaborator) commented May 28, 2020

The consensus from RMM and cuDF is to rely on per-thread default stream (PTDS).

@sameerz sameerz added this to the Release 0.2 milestone Jun 12, 2020
@sameerz sameerz removed this from the Jun 22 - Jul 2 milestone Jun 12, 2020
@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Jul 2, 2020
rongou (Collaborator) commented Jul 2, 2020

Here is the design proposal.

Motivation

Spark is multi-threaded on the host (CPU) side. Input is divided into partitions, and multiple tasks are launched on each CPU to work on these partitions. Each task typically runs in a single thread, obtained from a thread pool. This task level parallelism ensures all the CPU cores are utilized to increase overall throughput.

With the spark-rapids plugin, some or all of the work in each task is offloaded to the GPU. To efficiently use the GPU, the CUDA operations in these tasks should execute concurrently. CUDA applications manage concurrency by executing asynchronous commands in streams, sequences of commands that execute in order. Currently all host threads share the single default CUDA stream. Also known as the legacy default stream or the NULL stream, it implicitly synchronizes with itself and all other streams on the device, causing CUDA operations to be serialized.

To increase GPU throughput, it is desirable to execute each task or host thread in a separate, asynchronous stream. Ideally the legacy default stream is never used so as not to trigger implicit synchronization.
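The implicit synchronization described above can be made concrete with a small sketch (hypothetical demo code, not from the plugin or cuDF): any work issued to the legacy default stream serializes against work already issued to other streams, so the two launches below cannot overlap.

```cuda
// Build: nvcc legacy_sync_demo.cu -o legacy_sync_demo
#include <cuda_runtime.h>

__global__ void work(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;  // stand-in for real processing
}

int main() {
    const int n = 1 << 20;
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));

    cudaStream_t s;
    cudaStreamCreate(&s);  // a regular (blocking) non-default stream

    work<<<(n + 255) / 256, 256, 0, s>>>(a, n);  // issued on stream s
    work<<<(n + 255) / 256, 256>>>(b, n);        // issued on legacy stream 0:
    // the legacy-stream launch waits for everything already queued on s,
    // and later work on s waits for it in turn -- so nothing overlaps.

    cudaDeviceSynchronize();
    cudaStreamDestroy(s);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```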

Goals

  • Each Spark task or host thread runs in a separate CUDA stream.
  • The legacy default stream is not used.
  • Test and fix the current code base so that concurrent execution on the GPU works correctly, e.g. no race conditions, deadlocks etc.

Non-Goals

  • Once tasks execute concurrently on GPUs, there may be further opportunities to tune current approaches to get better performance. This is outside the scope of this issue.
  • Runtime compilation using NVRTC/Jitify is only for device code, which is oblivious to streams (unless using CUDA dynamic parallelism).

Assumptions

  • There shouldn’t be any big structural changes to the code base, either to the underlying C++ libraries or to the Java/Scala code in the spark-rapids plugin.
  • A Spark task shouldn’t need to manipulate CUDA streams directly, or deal with multiple streams.
  • Some background tasks such as asynchronous spill may operate with their own streams.
  • GPU memory allocated in one task/stream may be deallocated in another task/stream.

Risks

  • Concurrent execution of GPU code may be hard to reason about or debug. We need good test coverage to make sure the code works correctly.
  • Some third party libraries may never be compiled with per-thread default stream enabled. In the case of XGBoost or other machine learning libraries, since model training does not overlap with ETL, using the legacy default stream during training should not have an adverse effect. However, libraries used during ETL may cause excessive synchronization when PTDS is not enabled.
  • There are upcoming CUDA features that may impact some key components in the stack.

Design

CUDA 7 introduced the per-thread default stream (PTDS), which has two effects. First, it gives each host thread its own default stream. This means that commands issued to the default stream by different host threads can run concurrently. Second, these default streams are regular streams. This means that commands in the default stream may run concurrently with commands in non-default streams.

To enable PTDS, we can compile with the nvcc command-line option --default-stream per-thread. Once this is done on all CUDA code, each Spark thread would use its own default stream, and operations from different threads would execute concurrently on the GPU.
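The effect of the compile flag can be sketched as follows (a hypothetical demo, not plugin code): the launch syntax is unchanged, but when built with `--default-stream per-thread`, each host thread's "default" launches land on that thread's own stream and can overlap on the GPU.

```cuda
// Build with PTDS:    nvcc --default-stream per-thread ptds_demo.cu -o ptds_demo
// (equivalently, define CUDA_API_PER_THREAD_DEFAULT_STREAM before including
// cuda_runtime.h). Built without the flag, both threads share legacy stream 0
// and the two kernels serialize.
#include <cuda_runtime.h>
#include <thread>

__global__ void etl_step(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;  // stand-in for real per-task work
}

void task(float *d, int n) {
    // Under PTDS this "default" launch goes to the calling thread's own
    // default stream; no explicit stream handling in task code.
    etl_step<<<(n + 255) / 256, 256>>>(d, n);
    cudaStreamSynchronize(cudaStreamPerThread);  // sync only this thread's stream
}

int main() {
    const int n = 1 << 20;
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    std::thread t1(task, a, n), t2(task, b, n);  // may overlap under PTDS
    t1.join();
    t2.join();
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```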

[Figure: ptds]

The work involved can be roughly divided into two parts, basic mechanism and correctness.

Basic Mechanism

Correctness

Finally, we should run some benchmarks with PTDS enabled. These can be standard TPC-H or TPC-DS queries, or real customer queries if possible.

  • Benchmarks

After verifying that everything works correctly, the spark-rapids distribution of RMM/cuDF can enable PTDS by default.

  • Enable PTDS in spark-rapids distribution

Alternatives Considered

Most (if not all) cuDF functions support passing in an extra parameter for the stream to execute on, although currently these stream parameters are not exposed in cuDF’s public API. One alternative is to make these parameters public and for the spark-rapids plugin to manage CUDA streams explicitly. Each task can grab its own stream, preferably from a stream pool, and pass it to all subsequent cuDF calls. The benefit of this approach is that the intent is more explicit, and the code doesn’t need to be compiled with the per-thread default stream flag. The drawback is that the client code needs to manage streams, leading to more complexity.
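A sketch of this explicit-stream alternative (all names here are hypothetical, since cuDF's stream parameters are not public): each task checks a stream out of a small pool and threads it through every call, which is exactly the extra client-side complexity noted above.

```cuda
#include <cuda_runtime.h>

constexpr int kPoolSize = 4;
cudaStream_t g_pool[kPoolSize];

void init_pool() {
    for (int i = 0; i < kPoolSize; ++i)
        // Non-blocking streams never synchronize with legacy stream 0.
        cudaStreamCreateWithFlags(&g_pool[i], cudaStreamNonBlocking);
}

// Hypothetical stream-accepting cuDF-style operation.
void gpu_op(const float *in, float *out, int n, cudaStream_t stream);

void run_task(int task_id, const float *in, float *out, int n) {
    cudaStream_t s = g_pool[task_id % kPoolSize];  // task grabs "its" stream
    gpu_op(in, out, n, s);  // the stream must be passed to every call
    cudaStreamSynchronize(s);  // synchronize only this task's stream
}
```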

@jrhemstad

  • Some third party libraries may never be compiled with per-thread default stream enabled.

fyi, you can explicitly pass cudaStreamPerThread into external libraries to get PTDS behavior even if the external library wasn't compiled with PTDS enabled.
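A minimal sketch of that suggestion (the external library's entry point is hypothetical):

```cuda
#include <cuda_runtime.h>

// Hypothetical entry point of a third-party library built WITHOUT PTDS,
// but which accepts a stream argument.
void external_op(float *data, int n, cudaStream_t stream);

void call_from_task(float *d_data, int n) {
    // cudaStreamPerThread is a special handle that resolves to the calling
    // thread's per-thread default stream, so the library's work lands on
    // this thread's stream rather than on legacy stream 0.
    external_op(d_data, n, cudaStreamPerThread);
    cudaStreamSynchronize(cudaStreamPerThread);
}
```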

rongou (Collaborator) commented Aug 8, 2020

Closing this as the list of items is done. Will open another one for performance tuning.

@rongou rongou closed this as completed Aug 8, 2020
@sameerz sameerz changed the title [FEA] Multiple threads shareing the same GPU [FEA] Multiple threads sharing the same GPU Sep 21, 2020
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this issue Nov 30, 2023
sperlingxx pushed a commit to sperlingxx/spark-rapids that referenced this issue Jan 16, 2024
binmahone added a commit to binmahone/spark-rapids that referenced this issue Jun 17, 2024

4 participants