[BUG] `cudf::binary_operation` ignores cuda context when registering JIT compiled PTX #5133

magnatelee · 2020-05-07T21:35:07Z

Describe the bug

cudf::binary_operation currently ignores the CUDA context of the calling thread, which causes the JIT-compiled PTX to be loaded on the wrong device. Even worse is that cudf::binary_operation does not check the CUresult from the kernel launch, so the error is being silently ignored, and noticed only with cuda-memcheck.


jlowe · 2020-05-08T15:37:50Z

Any idea how the calling thread's context is being ignored here? Is this a case where a thread is being created without an explicit context and CUDA is auto-selecting a (potentially incorrect) device when it implicitly initializes the context? If so that will cause problems in Spark with the RAPIDS plugins in a multi-GPU setup where the GPU device is assigned at the application level (not through CUDA_VISIBLE_DEVICES).

devavret · 2020-05-08T15:45:12Z

This is happening in Jit where the compiled kernel is being registered with only one context. On a subsequent call from a different context, this fails. It would affect cases where the different threads are assigned different devices, as is the case with @magnatelee's usage.

If spark uses one libcudf process per GPU then this won't affect it. If it uses one thread per GPU then it will.

devavret · 2020-05-08T15:46:06Z

I'm investigating a fix such that the in-memory cache is stored per context.
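To illustrate the idea of a per-context cache, here is a minimal sketch in plain C++. It is not the libcudf or jitify implementation: `context_id`, `loaded_kernel`, and `per_context_kernel_cache` are hypothetical names, and the "load" step is simulated so the example runs without a GPU. In real code the key would presumably include the current `CUcontext` and the value would hold the module loaded into that context.

```cpp
#include <cstdint>
#include <map>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical stand-in for a CUDA context handle (assumption, not a real API type).
using context_id = std::uintptr_t;

// Hypothetical record of a kernel that was loaded into one specific context.
struct loaded_kernel {
  context_id owner;  // context the (simulated) PTX was loaded into
  std::string name;  // kernel name
};

class per_context_kernel_cache {
 public:
  // Returns the kernel registered for (ctx, name), loading it on first use.
  // Keying on the context means a second context never reuses a kernel
  // that was registered with a different one.
  loaded_kernel const& get(context_id ctx, std::string const& name) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto const key = std::make_pair(ctx, name);
    auto it = cache_.find(key);
    if (it == cache_.end()) {
      ++loads_;  // simulated module load; real code would load the PTX here
      it = cache_.emplace(key, loaded_kernel{ctx, name}).first;
    }
    return it->second;
  }

  // Number of simulated loads performed so far.
  int loads() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return loads_;
  }

 private:
  mutable std::mutex mutex_;
  std::map<std::pair<context_id, std::string>, loaded_kernel> cache_;
  int loads_ = 0;
};
```

With this shape, two threads bound to different devices each trigger their own load for the same kernel name, while repeated calls from the same context hit the cache.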

jlowe · 2020-05-08T15:48:59Z

> If spark uses one libcudf process per GPU then this won't affect it.

Ah, great to hear. The Spark RAPIDS plugin currently only uses one GPU per process.

harrism · 2020-05-20T04:33:11Z

So what is left in this bug, @devavret ?

devavret · 2020-05-20T05:27:49Z

The issue also asks for a check

> Even worse is that cudf::binary_operation does not check the CUresult from the kernel launch, so the error is being silently ignored, and noticed only with cuda-memcheck.

I implemented this in NVIDIA/jitify#67 and after that the launch call in Launcher.h needs to be replaced with safe_launch

jrhemstad · 2021-02-03T03:09:33Z

@devavret @magnatelee is this still an issue?

devavret · 2021-02-03T03:25:44Z

Not a lot is left. Just needs to replace the launch in

cudf/cpp/src/jit/launcher.h

Line 97 in 2780a8c

    
           get_kernel().configure_1d_max_occupancy(0, 0, 0, stream.value()).launch(args...);

with safe_launch. I'll make a quick PR tomorrow.
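For readers unfamiliar with the pattern, the difference between an unchecked and a checked launch can be sketched as follows. This is not jitify's `safe_launch`; `launch_result` and `checked_launch` are hypothetical names, and `launch_result` only stands in for `CUresult` so that the example runs without CUDA.

```cpp
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-in for CUresult (assumption; real code would use the driver API type).
enum class launch_result { success, invalid_context };

// Calls `launch` and throws if it reports anything other than success,
// so a failed launch can no longer be silently ignored by the caller.
template <typename Launch>
void checked_launch(Launch&& launch, char const* what) {
  launch_result const status = std::forward<Launch>(launch)();
  if (status != launch_result::success) {
    throw std::runtime_error(std::string{"kernel launch failed: "} + what);
  }
}
```

An unchecked call would simply drop the returned status, which matches the behaviour described in the issue: the failure only shows up under a tool such as cuda-memcheck.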

ttnghia · 2021-03-03T15:12:23Z

Hi! How is this going?

devavret · 2021-03-03T19:40:06Z

I had a branch for it but I can't find it anymore. Must be lost in my corrupted git. Here's a new one #7510

@devavret

Final step, closes #5133

Authors:
- Devavret Makkar (@devavret)

Approvers:
- Nghia Truong (@ttnghia)
- Vukasin Milovanovic (@vuule)

URL: #7510


magnatelee added Needs Triage Need team to review and classify bug Something isn't working labels May 7, 2020

devavret self-assigned this May 7, 2020

kkraus14 added libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels May 12, 2020

devavret mentioned this issue May 18, 2020

[REVIEW] Jit context cache #5219

Merged

devavret mentioned this issue Mar 3, 2021

Change jit launch to safe_launch #7510

Merged

rapids-bot bot closed this as completed in #7510 Mar 4, 2021

rapids-bot bot pushed a commit that referenced this issue Mar 4, 2021

Change jit launch to safe_launch (#7510)

b20f19d

