[🐛 bug report] Mitsuba 3 Crashes in a Multi-GPU Environment with Device Set to Non-Zero #808
Comments
Hi @Microno95 I thought we had two open issues regarding this, but I can only find this one which is now closed: mitsuba-renderer/drjit#119. I still believe that something is broken in Dr.Jit with regard to changing devices. Unfortunately, we don't have a multi-GPU machine at our disposal to debug this ourselves.
Hi @njroussel I see, I can look into the matter if Dr.Jit supports GTX 1080 GPUs. Hopefully I can provide some insight even if not a solution. Are there any existing tests in Dr.Jit that test device setting/switching? I'll use that as a starting point to debug the issue.
Yeah, that architecture should still be supported. I don't think there are any tests for this. To be quite honest, I've always wondered how this was initially implemented 😅 I can't even guarantee you that this worked properly at any point in time.
Hey @njroussel, I got a GTX 1080 installed and started debugging. First, there is a bug in drjit-core specifically on how contexts are used for setting attributes here. It's quite a small issue so I didn't necessarily want to create a PR. The broader problem actually lies in how the device is set on a per-thread basis. The fundamental issue is that calling In fact, using
Upon further investigation, it looks like setting the device once on startup of a script can be made to work, but switching devices during runtime will cause major issues because each variable relies on the device (i.e. the CUDA stream and the CUDA context) to be constant across operations. I have a patch to get that working whereby the device is set via global state in Dr.Jit and then each thread has the device set appropriately either upon construction or whenever Broadly, the solution is most likely to track the context of each device-side pointer and use that rather than a per-thread context. Not sure how that interacts with different compute streams...
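To make the last idea concrete, here is a toy sketch of per-pointer context tracking. It is plain Python and does not reflect how drjit-core is actually structured: `DeviceContext`, `PointerRegistry`, and `context_for_op` are hypothetical names invented for illustration, and integer handles stand in for real device pointers. The point is only that an operation derives its context from its operands instead of from whichever device the calling thread last selected.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceContext:
    """Hypothetical stand-in for a (CUDA context, stream) pair on one device."""
    device_id: int


class PointerRegistry:
    """Hypothetical registry that remembers which context owns each allocation."""

    def __init__(self):
        self._lock = threading.Lock()
        self._owner = {}   # pointer handle -> DeviceContext
        self._next = 1     # fake pointer values, for illustration only

    def allocate(self, ctx):
        # Record the owning context at allocation time.
        with self._lock:
            ptr = self._next
            self._next += 1
            self._owner[ptr] = ctx
            return ptr

    def context_of(self, ptr):
        with self._lock:
            return self._owner[ptr]

    def free(self, ptr):
        with self._lock:
            del self._owner[ptr]


def context_for_op(registry, *ptrs):
    """Pick the context for an operation from its operands, not from thread state."""
    contexts = {registry.context_of(p) for p in ptrs}
    if len(contexts) != 1:
        raise ValueError("operands live on different devices (or none were given)")
    return contexts.pop()
```

With this shape, a thread that switches devices mid-run cannot silently launch work on memory owned by another context; mixing operands from two devices fails loudly instead. How such a scheme would interact with multiple compute streams per device is left open here, as it is in the comment above.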
Summary
Running Mitsuba 3 on a multi-GPU machine with each process using a different GPU does not work: Dr.Jit fails with a CUDA_ERROR_ILLEGAL_ADDRESS error.
System configuration
System Information:
Description
Setting the device to anything other than device 0 for CUDA devices leads to a critical Dr.Jit compiler failure with CUDA API error 0700 (CUDA_ERROR_ILLEGAL_ADDRESS) in drjit-core/src/util.cpp:203.
I am trying to use Mitsuba in a multi-GPU, multi-node environment to generate a dataset of renders. To do so, I use a PyTorch environment where one process per node spawns one process per GPU. To use each GPU separately, I call torch.cuda.set_device(rank), where rank is the local rank of a process on a given node; similarly, I call mi.util.dr.set_device(rank) to set the GPU for Mitsuba.
Doing this before loading the scene leads to the above error and crashes the Python instance. If I instead set the Dr.Jit device after loading a given scene, the scene remains loaded on the first GPU.
While setting the environment variable CUDA_VISIBLE_DEVICES works as expected, it prevents using Mitsuba 3 in an environment where a single process may want to use multiple GPUs, and it prevents programmatically managing CUDA processes.
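For readers who only need one GPU per process, the CUDA_VISIBLE_DEVICES workaround can be sketched roughly as follows. This is a minimal launcher using only the Python standard library; `env_for_gpu`, `launch_workers`, and the `render_worker.py` script named in the comment are hypothetical and not part of Mitsuba or Dr.Jit. The variable has to be set before the worker process initializes CUDA, which is why it is passed through the child's environment rather than set after import.

```python
import os
import subprocess
import sys


def env_for_gpu(rank, base_env=None):
    """Return a copy of the environment that exposes only GPU `rank` to CUDA."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(rank)
    return env


def launch_workers(num_gpus, worker_cmd):
    """Start one worker process per GPU and return their exit codes."""
    procs = [
        subprocess.Popen(worker_cmd, env=env_for_gpu(rank))
        for rank in range(num_gpus)
    ]
    return [p.wait() for p in procs]


# Example call (render_worker.py is a hypothetical script that imports
# mitsuba and renders its share of the dataset):
#   launch_workers(2, [sys.executable, "render_worker.py"])
```

Inside each worker the single visible GPU appears as device 0, so no set_device call is needed there. As noted above, this does not help when one process needs to drive several GPUs.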
Steps to reproduce
import mitsuba as mi
mi.set_variant("cuda_ad_rgb")
mi.util.dr.set_device(1)
mi.load_dict(mi.cornell_box())