Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] cudf::binary_operation ignores cuda context when registering JIT compiled PTX #5133

Closed
magnatelee opened this issue May 7, 2020 · 10 comments · Fixed by #7510
Closed
Assignees
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code.

Comments

@magnatelee
Copy link
Contributor

Describe the bug

cudf::binary_operation currently ignores the CUDA context of the caller thread, which makes the JIT compiled PTX loaded on a wrong device. Even worse is that cudf::binary_operation does not check the CUresult from the kernel launch, so the error is being silently ignored, and noticed only with cuda-memcheck.

@magnatelee magnatelee added Needs Triage Need team to review and classify bug Something isn't working labels May 7, 2020
@devavret devavret self-assigned this May 7, 2020
@jlowe
Copy link
Member

jlowe commented May 8, 2020

Any idea how the calling thread's context is being ignored here? Is this a case where a thread is being created without an explicit context and CUDA is auto-selecting a (potentially incorrect) device when it implicitly initializes the context? If so that will cause problems in Spark with the RAPIDS plugins in a multi-GPU setup where the GPU device is assigned at the application level (not through CUDA_VISIBLE_DEVICES).

@devavret
Copy link
Contributor

devavret commented May 8, 2020

This is happening in Jit where the compiled kernel is being registered with only one context. On a subsequent call from a different context, this fails. It would affect cases where the different threads are assigned different devices, as is the case with @magnatelee's usage.

If spark uses one libcudf process per GPU then this won't affect it. It if uses one thread per GPU then it will.

@devavret
Copy link
Contributor

devavret commented May 8, 2020

I'm investigating a fix such that the in-memory cache is stored per context.

@jlowe
Copy link
Member

jlowe commented May 8, 2020

If spark uses one libcudf process per GPU then this won't affect it.

Ah, great to hear. The Spark RAPIDS plugin currently only uses one GPU per process.

@kkraus14 kkraus14 added libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels May 12, 2020
@harrism
Copy link
Member

harrism commented May 20, 2020

So what is left in this bug, @devavret ?

@devavret
Copy link
Contributor

The issue also asks for a check

Even worse is that cudf::binary_operation does not check the CUresult from the kernel launch, so the error is being silently ignored, and noticed only with cuda-memcheck.

I implemented this in NVIDIA/jitify#67 and after that the launch call in Launcher.h needs to be replaced withsafe_launch

@jrhemstad
Copy link
Contributor

@devavret @magnatelee is this still an issue?

@devavret
Copy link
Contributor

devavret commented Feb 3, 2021

Not a lot is left. Just needs to replace the launch in

get_kernel().configure_1d_max_occupancy(0, 0, 0, stream.value()).launch(args...);
with safe_launch. I'll make a quick PR tomorrow.

@ttnghia
Copy link
Contributor

ttnghia commented Mar 3, 2021

Hi! How is this going?

@devavret
Copy link
Contributor

devavret commented Mar 3, 2021

I had a branch for it but I can't find it anymore. Must be lost in my corrupted git. Here's a new one #7510

@rapids-bot rapids-bot bot closed this as completed in #7510 Mar 4, 2021
rapids-bot bot pushed a commit that referenced this issue Mar 4, 2021
Final step, closes #5133

Authors:
  - Devavret Makkar (@devavret)

Approvers:
  - Nghia Truong (@ttnghia)
  - Vukasin Milovanovic (@vuule)

URL: #7510
hyperbolic2346 pushed a commit to hyperbolic2346/cudf that referenced this issue Mar 25, 2021
Final step, closes rapidsai#5133

Authors:
  - Devavret Makkar (@devavret)

Approvers:
  - Nghia Truong (@ttnghia)
  - Vukasin Milovanovic (@vuule)

URL: rapidsai#7510
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants