Nomad version
Nomad v1.1.3 (8c0c8140997329136971e66e4c2337dfcf932692)

Operating system and Environment details
Ubuntu 20.04.2 LTS, single cluster, Nomad running as a regular user (for testing)

Issue
nomad agent -config=full.nomad
with:

Reproduction steps
rmmod nvidia - works
modprobe nvidia - works
nomad agent &
rmmod nvidia - no longer works

According to nvidia-smi no processes are using the nvidia module, but it is definitely the Nomad agent process that is blocking it - even with nvidia-gpu disabled.

Expected Result
With nvidia-gpu disabled I should be able to unload the nvidia module.

Actual Result
Nomad blocks rmmod.

Extra info:
I've tried running with ignored_gpu_ids instead, and then I get this message:

2021-08-01T11:39:59.532+0200 [INFO] agent: detected plugin: name=nvidia-gpu type=device plugin_version=0.1.0
2021-08-01T11:40:05.698+0200 [INFO] client.device_mgr: fingerprinting failed: plugin is not enabled: plugin=nvidia-gpu

So presumably disabling the plugin does work?

The nvidia module is generally in use by qemu, but I haven't moved any qemu workloads into Nomad yet. This blocks me from testing Nomad, as I now can't run my other containers. I've tried turning off both the qemu driver and the nvidia-gpu plugin, but the agent still keeps nvidia.ko locked.
Sorry that looking into this got delayed, @andaag.
As of Nomad 1.2.0 the Nvidia device plugin has been externalized. But either way it shouldn't have been blocking rmmod if the plugin was disabled. I took a quick look at the code and my suspicion is that when we instantiate the NVML client in NewNvidiaDevice, the client has some side effect that keeps the module in use.
We need to call NewNvidiaDevice before we can check the enabled flag, but I don't see any reason why we couldn't construct the NVML client lazily in SetConfig(). I don't have a good setup to actually test this theory, but it's a small change to make.
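For illustration, here is a minimal Go sketch of that lazy-construction idea. It is not the plugin's actual code: the nvmlClient type, the newNvmlClient constructor, and the simplified SetConfig(enabled bool) signature are placeholders for the real NVML wrapper and the plugin's config handling.

```go
package device

import "log"

// nvmlClient stands in for the plugin's real NVML wrapper type; constructing
// it is what initializes NVML and, per the theory above, what keeps
// nvidia.ko pinned even when the plugin is disabled.
type nvmlClient struct{}

// newNvmlClient is a placeholder for the real constructor that calls into NVML.
func newNvmlClient() (*nvmlClient, error) {
	return &nvmlClient{}, nil
}

// NvidiaDevice is a simplified stand-in for the plugin's device type.
type NvidiaDevice struct {
	enabled bool
	client  *nvmlClient
	logger  *log.Logger
}

// NewNvidiaDevice no longer touches NVML, so merely loading the plugin
// cannot lock the kernel module.
func NewNvidiaDevice(logger *log.Logger) *NvidiaDevice {
	return &NvidiaDevice{logger: logger}
}

// SetConfig decodes the plugin configuration (simplified here to a single
// enabled flag) and only constructs the NVML client when the plugin is
// actually enabled.
func (d *NvidiaDevice) SetConfig(enabled bool) error {
	d.enabled = enabled
	if !d.enabled {
		d.logger.Println("nvidia-gpu plugin disabled; skipping NVML initialization")
		return nil
	}
	client, err := newNvmlClient()
	if err != nil {
		return err
	}
	d.client = client
	return nil
}
```

With this shape, a disabled plugin never initializes NVML at all, so rmmod nvidia should keep working after the agent starts.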
I'm going to self-assign this issue but move it to the device plugin repo.