
[FEA] Multiple threads sharing the same GPU #15

Closed
revans2 opened this issue May 28, 2020 · 4 comments
Labels
feature request, performance, SQL

Comments

revans2 (Collaborator) commented May 28, 2020

Is your feature request related to a problem? Please describe.
Currently all cuDF processing uses a single stream, which blocks all other streams from sharing the GPU. The plugin uses a semaphore to prevent too many tasks from running on the GPU at once. This works to overlap some I/O with processing, but it does not work if you want to overlap processing with processing.

Describe the solution you'd like
It would be great to be able to support multiple streams, possibly using per-thread default streams.

@revans2 added the feature request, ? - Needs Triage, performance, and SQL labels May 28, 2020
rongou (Collaborator) commented May 28, 2020

The consensus from RMM and cuDF is to rely on per-thread default stream (PTDS).

@sameerz sameerz added this to the Release 0.2 milestone Jun 12, 2020
@sameerz sameerz removed this from the Jun 22 - Jul 2 milestone Jun 12, 2020
@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Jul 2, 2020
rongou (Collaborator) commented Jul 2, 2020

Here is the design proposal.

Motivation

Spark is multi-threaded on the host (CPU) side. Input is divided into partitions, and multiple tasks are launched on each CPU to work on these partitions. Each task typically runs in a single thread, obtained from a thread pool. This task level parallelism ensures all the CPU cores are utilized to increase overall throughput.

With the spark-rapids plugin, some or all of the work in each task is offloaded to the GPU. To efficiently use the GPU, the CUDA operations in these tasks should execute concurrently. CUDA applications manage concurrency by executing asynchronous commands in streams, sequences of commands that execute in order. Currently all host threads share the single default CUDA stream. Also known as the legacy default stream or the NULL stream, it implicitly synchronizes with itself and all other streams on the device, causing CUDA operations to be serialized.

To increase GPU throughput, it is desirable to execute each task or host thread in a separate, asynchronous stream. Ideally the legacy default stream is never used so as not to trigger implicit synchronization.
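The implicit synchronization described above can be made concrete with a small sketch (hypothetical demo code, not from the plugin or cuDF): any work issued to the legacy default stream serializes against work already issued to other streams, so the two launches below cannot overlap.

```cuda
// Build: nvcc legacy_sync_demo.cu -o legacy_sync_demo
#include <cuda_runtime.h>

__global__ void work(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;  // stand-in for real processing
}

int main() {
    const int n = 1 << 20;
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));

    cudaStream_t s;
    cudaStreamCreate(&s);  // a regular (blocking) non-default stream

    work<<<(n + 255) / 256, 256, 0, s>>>(a, n);  // issued on stream s
    work<<<(n + 255) / 256, 256>>>(b, n);        // issued on legacy stream 0:
    // the legacy-stream launch waits for everything already queued on s,
    // and later work on s waits for it in turn -- so nothing overlaps.

    cudaDeviceSynchronize();
    cudaStreamDestroy(s);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```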

Goals

  • Each Spark task or host thread runs in a separate CUDA stream.
  • The legacy default stream is not used.
  • Test and fix the current code base so that concurrent execution on the GPU works correctly, e.g. no race conditions, deadlocks etc.

Non-Goals

  • Once tasks execute concurrently on GPUs, there may be further opportunities to tune current approaches to get better performance. This is outside the scope of this issue.
  • Runtime compilation using NVRTC/Jitify is only for device code, which is oblivious to streams (unless using CUDA dynamic parallelism).

Assumptions

  • There shouldn’t be any big structural changes to the code base, either to the underlying C++ libraries or to the Java/Scala code in the spark-rapids plugin.
  • A Spark task shouldn’t need to manipulate CUDA streams directly, or deal with multiple streams.
  • Some background tasks such as asynchronous spill may operate with their own streams.
  • GPU memory allocated in one task/stream may be deallocated in another task/stream.

Risks

  • Concurrent execution of GPU code may be hard to reason about or debug. We need good test coverage to make sure the code works correctly.
  • Some third party libraries may never be compiled with per-thread default stream enabled. In the case of XGBoost or other machine learning libraries, since model training does not overlap with ETL, using the legacy default stream during training should not have an adverse effect. However, libraries used during ETL may cause excessive synchronization when PTDS is not enabled.
  • There are upcoming CUDA features that may impact some key components in the stack.

Design

CUDA 7 introduced the per-thread default stream (PTDS), which has two effects. First, it gives each host thread its own default stream. This means that commands issued to the default stream by different host threads can run concurrently. Second, these default streams are regular streams. This means that commands in the default stream may run concurrently with commands in non-default streams.

To enable PTDS, we can compile with the nvcc command-line option --default-stream per-thread. Once this is done on all CUDA code, each Spark thread would use its own default stream, and operations from different threads would execute concurrently on the GPU.
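The effect of the compile flag can be sketched as follows (a hypothetical demo, not plugin code): the launch syntax is unchanged, but when built with `--default-stream per-thread`, each host thread's "default" launches land on that thread's own stream and can overlap on the GPU.

```cuda
// Build with PTDS:    nvcc --default-stream per-thread ptds_demo.cu -o ptds_demo
// (equivalently, define CUDA_API_PER_THREAD_DEFAULT_STREAM before including
// cuda_runtime.h). Built without the flag, both threads share legacy stream 0
// and the two kernels serialize.
#include <cuda_runtime.h>
#include <thread>

__global__ void etl_step(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;  // stand-in for real per-task work
}

void task(float *d, int n) {
    // Under PTDS this "default" launch goes to the calling thread's own
    // default stream; no explicit stream handling in task code.
    etl_step<<<(n + 255) / 256, 256>>>(d, n);
    cudaStreamSynchronize(cudaStreamPerThread);  // sync only this thread's stream
}

int main() {
    const int n = 1 << 20;
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    std::thread t1(task, a, n), t2(task, b, n);  // may overlap under PTDS
    t1.join();
    t2.join();
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```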

[Figure: ptds]

The work involved can be roughly divided into two parts, basic mechanism and correctness.

Basic Mechanism

Correctness

Finally, we should run some benchmarks with PTDS enabled. These can be standard TPC-H or TPC-DS queries, or real customer queries if possible.

  • Benchmarks

After verifying that everything works correctly, the spark-rapids distribution of RMM/cuDF can enable PTDS by default.

  • Enable PTDS in spark-rapids distribution

Alternatives Considered

Most (if not all) cuDF functions support passing in an extra parameter for the stream to execute on, although currently these stream parameters are not exposed in cuDF’s public API. One alternative is to make these parameters public and for the spark-rapids plugin to manage CUDA streams explicitly. Each task can grab its own stream, preferably from a stream pool, and pass it to all subsequent cuDF calls. The benefit of this approach is that the intent is more explicit, and the code doesn’t need to be compiled with the per-thread default stream flag. The drawback is that the client code needs to manage streams, leading to more complexity.
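A sketch of this explicit-stream alternative (all names here are hypothetical, since cuDF's stream parameters are not public): each task checks a stream out of a small pool and threads it through every call, which is exactly the extra client-side complexity noted above.

```cuda
#include <cuda_runtime.h>

constexpr int kPoolSize = 4;
cudaStream_t g_pool[kPoolSize];

void init_pool() {
    for (int i = 0; i < kPoolSize; ++i)
        // Non-blocking streams never synchronize with legacy stream 0.
        cudaStreamCreateWithFlags(&g_pool[i], cudaStreamNonBlocking);
}

// Hypothetical stream-accepting cuDF-style operation.
void gpu_op(const float *in, float *out, int n, cudaStream_t stream);

void run_task(int task_id, const float *in, float *out, int n) {
    cudaStream_t s = g_pool[task_id % kPoolSize];  // task grabs "its" stream
    gpu_op(in, out, n, s);  // the stream must be passed to every call
    cudaStreamSynchronize(s);  // synchronize only this task's stream
}
```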

@jrhemstad

  • Some third party libraries may never be compiled with per-thread default stream enabled.

fyi, you can explicitly pass cudaStreamPerThread into external libraries to get PTDS behavior even if the external library wasn't compiled with PTDS enabled.
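A minimal sketch of that suggestion (the external library's entry point is hypothetical):

```cuda
#include <cuda_runtime.h>

// Hypothetical entry point of a third-party library built WITHOUT PTDS,
// but which accepts a stream argument.
void external_op(float *data, int n, cudaStream_t stream);

void call_from_task(float *d_data, int n) {
    // cudaStreamPerThread is a special handle that resolves to the calling
    // thread's per-thread default stream, so the library's work lands on
    // this thread's stream rather than on legacy stream 0.
    external_op(d_data, n, cudaStreamPerThread);
    cudaStreamSynchronize(cudaStreamPerThread);
}
```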

rongou (Collaborator) commented Aug 8, 2020

Closing this as the list of items is done. Will open another one for performance tuning.

@rongou rongou closed this as completed Aug 8, 2020
@sameerz sameerz changed the title [FEA] Multiple threads shareing the same GPU [FEA] Multiple threads sharing the same GPU Sep 21, 2020
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this issue Nov 30, 2023
sperlingxx pushed a commit to sperlingxx/spark-rapids that referenced this issue Jan 16, 2024
binmahone added a commit to binmahone/spark-rapids that referenced this issue Jun 17, 2024

4 participants