Crash with drjit.set_device() in some settings #119
Hi @yutoe05
This does make it sound more like there is an issue with your environment than in Dr.Jit, especially in examples 1 and 3, where all GPUs are identical. You could try some other software/library to see if you get similar issues. Please report back if you still believe that this is a bug in our implementation.
Thanks for your answer, @njroussel. I tried cupy, numba, and pytorch on multiple GPUs. Environments:
I ran these code snippets:
Using breakpoint(), I confirmed via nvidia-smi that the specified device was actually being used. The behavior of mitsuba3 certainly seems to be environment-dependent, but since I don't get similar issues with the other libraries, it may be a bug.
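As a rough sketch of this kind of per-library check (assuming torch is installed and sees at least one GPU; the probe simply reports unavailability otherwise, and the `cuda:<idx>` device strings are the standard torch form):

```python
import importlib.util

def gpu_probe_available() -> bool:
    """Return True iff torch is importable and sees at least one CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return False
    import torch
    return torch.cuda.is_available()

if gpu_probe_available():
    import torch
    # Allocate a small tensor on every visible device and confirm placement;
    # nvidia-smi should show a short-lived allocation on each GPU in turn.
    for i in range(torch.cuda.device_count()):
        t = torch.zeros(1, device=f"cuda:{i}")
        assert t.device.index == i
```

On a machine without torch or without GPUs, the guarded branch is skipped, so the snippet is safe to run anywhere.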
Thank you for this; it does seem to point to an issue on our end. We don't have any multi-GPU setups, so it's a bit hard to debug this further on our end. I understand that it's not the most elegant, but is there anything stopping you from always using the …
Thanks for your response. I would like to use mitsuba3 when training a PyTorch network with data parallelism on multiple GPUs. I'll look for a workaround.
Thank you for the update. Indeed I hadn't thought of that 😅. Technically, I think you could just do something like … I might have an idea. Could you try re-ordering your imports and setup as follows:
My best guess is that some global device memory is allocated on the default device (device 0) when importing …
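If that guess is right, one workaround (a sketch, assuming the target GPU index is known up front; the index 1 here is arbitrary) is to restrict device visibility before any CUDA-touching import:

```python
import os

# Restrict visibility to the desired physical GPU *before* importing
# drjit/mitsuba, so any allocation on the default device (logical 0)
# lands on the intended card. CUDA reads this variable at context
# creation, so it must be set before the first CUDA call.
device_idx = 1  # arbitrary physical GPU index, for illustration only
os.environ["CUDA_VISIBLE_DEVICES"] = str(device_idx)

# Only now import the CUDA-backed libraries:
# import drjit as dr
# import mitsuba as mi
# dr.set_device(0)  # logical 0 now maps to physical device_idx
```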
Thank you for the great ideas! First, I tried the following:
The result is
results in num_gpus > 0 and drjit using the device with … Second, I tried setting the device with drjit before importing mitsuba3, but the same error occurred. Third, when adding
or … Therefore, I think this issue may be related to OptiX, which is not used in the other libraries I tried. I also found that some memory is allocated on all visible GPUs when importing drjit. I'm sorry, but I'm going to be busy for a few weeks and may not be able to respond immediately.
I apologize for the delayed response. I still don't know how to switch between multiple GPU devices while using torch and drjit together. I have found, however, that by splitting a single host into multiple nodes with torch's DistributedDataParallel, along with the CUDA_VISIBLE_DEVICES environment variable, I can achieve the initial goal. Perhaps due to bandwidth constraints, the desired processing speed could not be reached. Thank you very much, @njroussel, for taking the time to help me.
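The per-process setup described above can be sketched as follows (a hypothetical helper of my own; `env_for_rank` and the rank-to-device mapping are assumptions for illustration, not part of torch's API):

```python
import os
from typing import Optional

def env_for_rank(rank: int, base: Optional[dict] = None) -> dict:
    """Build the environment for a per-GPU worker process: each rank
    sees exactly one physical device, so torch and drjit both address
    it as logical device 0 inside that process."""
    env = dict(os.environ if base is None else base)
    env["CUDA_VISIBLE_DEVICES"] = str(rank)
    return env

# One worker could then be launched per GPU, e.g.:
#   subprocess.Popen([sys.executable, "worker.py"], env=env_for_rank(r))
# and worker.py would call drjit.set_device(0) / torch.device("cuda:0").
```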
Thanks for this great tool!
I would like to use mitsuba3 with pytorch on multiple GPUs, but switching devices with
drjit.set_device(device_idx)
(device_idx > 0) causes a crash in some settings/systems and shows one of the errors below.
These are examples of settings where a crash occurs.
common information:
example 1: CUDA_ERROR_INVALID_VALUE
example 2: CUDA_ERROR_ILLEGAL_ADDRESS or CUDA_ERROR_INVALID_VALUE
example 3: CUDA_ERROR_INVALID_VALUE
I ran this simple code:
The error doesn't occur when I use
CUDA_VISIBLE_DEVICES=device_idx
together with drjit.set_device(0).
I also found the following two cases.
… CUDA_VISIBLE_DEVICES …
… CUDA_VISIBLE_DEVICES …
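The CUDA_VISIBLE_DEVICES workaround behaves this way because the variable renumbers devices: logical index 0 maps to the first entry in the list. A small illustrative helper (my own sketch, not part of any library):

```python
def logical_to_physical(logical_idx: int, visible: str) -> int:
    """Map a logical CUDA device index to a physical GPU id, given a
    CUDA_VISIBLE_DEVICES string such as "2,0,3"."""
    physical = [int(tok) for tok in visible.split(",") if tok.strip()]
    return physical[logical_idx]

# With CUDA_VISIBLE_DEVICES="3", drjit.set_device(0) targets physical GPU 3:
assert logical_to_physical(0, "3") == 3
assert logical_to_physical(1, "2,0,3") == 0
```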
.So, these errors may be due to my environment.
Do you have any advice?
Thanks in advance!