[🐛 bug report] Mitsuba 3 Crashes in a Multi-GPU Environment with Device Set to Non-Zero #808

Microno95 · 2023-07-17T11:51:57Z

Summary

Running Mitsuba 3 with one process per GPU on a multi-GPU machine does not work: selecting any device other than device 0 makes Dr.Jit fail with a CUDA_ERROR_ILLEGAL_ADDRESS error.

System configuration

System Information:

OS: Rocky Linux release 8.7 (Green Obsidian)
CPU: AMD EPYC 7763 64-Core Processor
GPU: NVIDIA A100-SXM4-80GB
     NVIDIA A100-SXM4-80GB
     NVIDIA A100-SXM4-80GB
     NVIDIA A100-SXM4-80GB
Python: 3.11.4 | packaged by conda-forge | (main, Jun 10 2023, 18:08:17) [GCC 12.2.0]
NVidia driver: 525.105.17
CUDA: 11.8.89
LLVM: 14.0.6
Dr.Jit: 0.4.2
Mitsuba: 3.3.0
    Is custom build? True
    Compiled with: GNU 9.3.0
    Variants:
        scalar_rgb
        scalar_spectral
        cuda_ad_rgb
        llvm_ad_rgb

Description

Setting the device to anything other than device 0 for CUDA devices leads to a critical Dr.Jit compiler failure with CUDA API error 0700 (CUDA_ERROR_ILLEGAL_ADDRESS) in drjit-core/src/util.cpp:203.

I am trying to use Mitsuba in a multi-GPU, multi-node environment to generate a dataset of renders. To do so, I use a PyTorch environment where one process per node spawns one process per GPU. To have each process use its own GPU, I call torch.cuda.set_device(rank), where rank is the local rank of the process on its node. Similarly, I call mi.util.dr.set_device(rank) to set the GPU for Mitsuba.

Doing this before loading the scene leads to the above error and crashes the Python instance. If I instead set the Dr.Jit device after loading a scene, the scene stays loaded on the first GPU.

Using the environment variable "CUDA_VISIBLE_DEVICES" works as expected, but it prevents using Mitsuba 3 in an environment where a single process may want to use multiple GPUs, and it prevents managing CUDA processes programmatically.
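For the one-process-per-GPU case, the CUDA_VISIBLE_DEVICES workaround mentioned above could look roughly like the sketch below. The helper names `pin_process_to_gpu` and `render_on_gpu` are hypothetical, and the sketch assumes the variable is set before anything in the process initializes CUDA. Once it is set, the selected GPU appears as device 0 inside the process, so no further set_device call should be needed.

```python
import os


def pin_process_to_gpu(local_rank: int) -> str:
    """Expose a single physical GPU to this process.

    Assumption: this runs before any library initializes CUDA,
    otherwise the setting has no effect on that library.
    """
    value = str(local_rank)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


def render_on_gpu(local_rank: int):
    """Hypothetical per-process entry point (not called here)."""
    pin_process_to_gpu(local_rank)

    # Import only after pinning, so CUDA sees just the one GPU.
    import mitsuba as mi

    mi.set_variant("cuda_ad_rgb")
    # Inside this process the pinned GPU is device 0.
    return mi.load_dict(mi.cornell_box())
```

Each spawned process would call `render_on_gpu(rank)` with its local rank. This does not help when a single process needs several GPUs, which is the limitation described above.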

Steps to reproduce

Run import mitsuba as mi;mi.set_variant("cuda_ad_rgb");mi.util.dr.set_device(1)
Load the Cornell Box scene with mi.load_dict(mi.cornell_box())


njroussel · 2023-07-18T08:11:28Z

Hi @Microno95

I thought we had two open issues regarding this, but I can only find this one which is now closed: mitsuba-renderer/drjit#119.

I still believe that something is broken in Dr.Jit with regard to changing devices. Unfortunately, we don't have a multi-GPU machine at our disposal to debug this ourselves.
If anyone wants to look into this, here are three good starting points to look at:

Initialization of the CUDA backend (jitc_cuda_init)
Initialization of a thread's state (jitc_init_thread_state)
Changing CUDA devices (jitc_cuda_set_device)

Microno95 · 2023-07-19T12:41:23Z

Hi @njroussel

I see, I can look into the matter if Dr.Jit supports GTX 1080 GPUs. Hopefully I can provide some insight even if not a solution.

Are there any existing tests in Dr.Jit that test device setting/switching? I'll use that as a starting point to debug the issue.

njroussel · 2023-07-19T13:20:01Z

Yeah, that architecture should still be supported.

I don't think there are any tests for this. To be quite honest, I've always wondered how this was initially implemented 😅 I can't even guarantee you that this worked properly at any point in time.

Microno95 · 2023-07-19T18:54:36Z

Hey @njroussel,

I got a GTX 1080 installed and started debugging. First, there is a bug in drjit-core in how contexts are used when setting attributes, here:
https://github.com/mitsuba-renderer/drjit-core/blob/25dd7a5cb96ee58d65cc1499f47de76f6140ff36/src/registry.cpp#L309
and here:
https://github.com/mitsuba-renderer/drjit-core/blob/25dd7a5cb96ee58d65cc1499f47de76f6140ff36/src/registry.cpp#L340
In both places the correct statement should be scoped_set_context guard(ts->context);

It's quite a small issue so I didn't necessarily want to create a PR.

The broader problem actually lies in how the device is set on a per-thread basis. The fundamental issue is that calling jit_cuda_set_device only sets the device for the main thread and not any of the other threads. This leads to a CUDA problem where the main thread is loading part of the scene into one device and another thread is loading it onto device 0.
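The effect can be illustrated without any GPU. The sketch below is only an analogy in plain Python, not Dr.Jit code: `_state`, `set_device`, and `current_device` are hypothetical stand-ins for per-thread JIT state that defaults to device 0. Setting the device on the main thread leaves the worker threads on the default.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Stand-in for per-thread JIT state (hypothetical, not the Dr.Jit API).
_state = threading.local()


def set_device(index: int) -> None:
    """Record the device for the calling thread only."""
    _state.device = index


def current_device() -> int:
    """Threads that never called set_device fall back to device 0."""
    return getattr(_state, "device", 0)


def load_part(_part_id: int) -> int:
    """Pretend to load part of a scene; report the device this thread sees."""
    return current_device()


set_device(1)                    # main thread selects device 1
main_device = current_device()   # the main thread sees its own choice

with ThreadPoolExecutor(max_workers=2) as pool:
    # Worker threads have their own thread-local state, still at the default.
    worker_devices = list(pool.map(load_part, range(4)))

print(main_device, worker_devices)
```

The main thread ends up on device 1 while every worker reports device 0, which mirrors the mixed-device loading described above.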

In fact, using dr.set_num_threads(0) alleviates the problem of setting the device and the test suite runs as expected. I'll look more into it, but that's what I've got thus far.
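As a sketch, the workaround could be wrapped like this. The function name `load_scene_single_threaded` is hypothetical, the call names `set_num_threads` and `set_device` are taken from this thread as written and have not been checked against other Dr.Jit versions, and the order of the calls is an assumption.

```python
def load_scene_single_threaded(device_index: int):
    """Load the Cornell Box on one GPU with worker threads disabled."""
    # Imports are local so the module can be imported on machines without
    # Mitsuba installed.
    import mitsuba as mi

    dr = mi.util.dr  # the Dr.Jit module, accessed as in the report above

    mi.set_variant("cuda_ad_rgb")
    # Disable worker threads so only the main thread touches the device
    # (call name as quoted in this thread).
    dr.set_num_threads(0)
    # Select the device before any scene data is uploaded.
    dr.set_device(device_index)
    return mi.load_dict(mi.cornell_box())
```

Calling `load_scene_single_threaded(1)` would then correspond to the reproduction steps with threading turned off.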

Microno95 · 2023-07-20T13:00:31Z

Upon further investigation, it looks like setting the device once at script startup can be made to work. Switching devices at runtime, however, will cause major issues, because each variable relies on the device (i.e. the CUDA stream and the CUDA context) staying constant across operations.

I have a patch that gets the startup case working: the device is stored in global state in Dr.Jit, and each thread then has its device set accordingly, either upon construction or whenever jit_cuda_set_device is called. It is not stable, though, in that changing the device at runtime leads to horrific errors (i.e. accessing one GPU's memory while in another GPU's context).

Broadly, the solution is most likely to track the context of each device-side pointer and use that rather than a per-thread context. Not sure how that interacts with different compute streams...
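To make the idea concrete, here is a minimal bookkeeping sketch in plain Python. It is not how drjit-core works; the `DeviceRegistry` class and its methods are hypothetical. Each pointer is recorded together with the device it was allocated on, and an operation asks the registry which device to use instead of trusting the calling thread's current context.

```python
class DeviceRegistry:
    """Maps device-side pointers (as integer addresses) to their owning device."""

    def __init__(self) -> None:
        self._owner: dict[int, int] = {}

    def register(self, ptr: int, device: int) -> None:
        """Record which device a freshly allocated pointer lives on."""
        self._owner[ptr] = device

    def release(self, ptr: int) -> None:
        """Forget a pointer once it has been freed."""
        self._owner.pop(ptr, None)

    def common_device(self, *ptrs: int) -> int:
        """Return the single device shared by all operands.

        Raises ValueError if the operands live on different devices,
        instead of silently running in the wrong context.
        """
        devices = {self._owner[p] for p in ptrs}
        if len(devices) != 1:
            raise ValueError(f"operands span multiple devices: {sorted(devices)}")
        return devices.pop()


registry = DeviceRegistry()
registry.register(0x1000, device=1)
registry.register(0x2000, device=1)
registry.register(0x3000, device=0)

# Both operands live on device 1, so the operation would run in that context.
target = registry.common_device(0x1000, 0x2000)
```

A mixed call such as `registry.common_device(0x1000, 0x3000)` raises an error up front rather than touching one GPU's memory from another GPU's context. How such a lookup would map onto streams is left open here, as in the comment above.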

Microno95 mentioned this issue Jul 20, 2023

Crash using any GPU other than GPU 0 mitsuba-renderer/drjit-core#64

Closed
