[FEA] Multiple threads sharing the same GPU #15
The consensus from RMM and cuDF is to rely on the per-thread default stream (PTDS).
Here is the design proposal.

Motivation

Spark is multi-threaded on the host (CPU) side. Input is divided into partitions, and multiple tasks are launched on each CPU to work on these partitions. Each task typically runs in a single thread, obtained from a thread pool. This task-level parallelism ensures all the CPU cores are utilized, increasing overall throughput.

To increase GPU throughput, it is desirable to execute each task, or host thread, on a separate asynchronous stream. Ideally the legacy default stream is never used, so as not to trigger implicit synchronization.

Goals
Non-Goals
Assumptions
Risks
Design

CUDA 7 introduced the per-thread default stream (PTDS), which has two effects. First, it gives each host thread its own default stream, so commands issued to the default stream by different host threads can run concurrently. Second, these default streams behave as regular streams, so commands in the default stream may run concurrently with commands in non-default streams. To enable PTDS, we can compile with the per-thread default stream flag.

The work involved can be roughly divided into two parts: the basic mechanism and correctness.

Basic Mechanism
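As a rough illustration of the first effect, thread-local state can stand in for the per-thread default stream: each host thread lazily receives its own "stream", so work issued by different threads need not serialize onto one stream. This is a conceptual sketch only; no real CUDA calls are involved, and all names here are made up.

```python
# Conceptual sketch only -- no real CUDA. threading.local plays the role of
# the per-thread default stream: each host thread lazily gets its own
# "stream" id, so different threads' default-stream work need not serialize.
import itertools
import threading

_next_stream_id = itertools.count()
_tls = threading.local()

def default_stream() -> int:
    """Return this host thread's 'default stream' id (assigned on first use)."""
    if not hasattr(_tls, "stream"):
        _tls.stream = next(_next_stream_id)
    return _tls.stream
```

Calling `default_stream()` repeatedly on one thread returns the same id, while a different host thread sees a different id, mirroring how PTDS gives concurrent threads independent default streams.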
Correctness
Finally, we should run some benchmarks with PTDS enabled. These can be standard TPC-H or TPC-DS queries, or real customer queries if possible.
After verifying that everything works correctly, the spark-rapids distribution of RMM/cuDF can enable PTDS by default.
Alternatives Considered

Most (if not all) cuDF functions support passing in an extra parameter for the stream to execute on, although currently these stream parameters are not exposed in cuDF's public API. One alternative is to make these parameters public and have the spark-rapids plugin manage CUDA streams explicitly. Each task can grab its own stream, preferably from a stream pool, and pass it to all subsequent cuDF calls. The benefit of this approach is that the intent is more explicit, and the code doesn't need to be compiled with the per-thread default stream flag. The drawback is that the client code needs to manage streams, leading to more complexity.
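A minimal sketch of this alternative, with hypothetical names (`StreamPool` and `run_task` are illustrative, not cuDF or spark-rapids APIs): each task checks a stream out of a fixed pool, threads it through its calls, and returns it when done.

```python
# Hypothetical sketch of explicit stream management; integer ids stand in
# for CUDA stream handles. StreamPool/run_task are illustrative names only.
import queue

class StreamPool:
    """A fixed-size pool of 'streams' that tasks borrow and return."""
    def __init__(self, size: int):
        self._streams = queue.Queue()
        for stream_id in range(size):
            self._streams.put(stream_id)

    def acquire(self) -> int:
        return self._streams.get()  # blocks if all streams are in use

    def release(self, stream: int) -> None:
        self._streams.put(stream)

def run_task(pool: StreamPool, work):
    """Run one task: borrow a stream, pass it to every call, then return it."""
    stream = pool.acquire()
    try:
        return work(stream)  # a real task would pass `stream` to each cuDF call
    finally:
        pool.release(stream)
```

The try/finally mirrors the management burden noted above: every code path must return the stream to the pool, which is exactly the complexity PTDS avoids.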
fyi, you can explicitly pass
Closing this as the list of items is done. Will open another one for performance tuning.
Is your feature request related to a problem? Please describe.
Currently all cuDF processing uses a single stream, which blocks all other streams from sharing the GPU. The plugin uses a semaphore to prevent too many tasks from being on the GPU at once. This works to overlap some I/O with processing, but it does not work if you want to overlap processing with other processing.
Describe the solution you'd like
It would be great to be able to support multiple streams, possibly using per-thread default streams.