prmon spawning nvidia-smi? #177
Comments
Hi Vince, thanks for the report. It is true that prmon will run We never saw an issue with the So... I doubt that it's a prmon-spawned monitor process, but maybe you could check by doing something like:
At least that could give us a clue as to whether there is something fishy going on from prmon. Graeme |
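The command suggested in the comment above did not survive in this copy of the thread. As a rough illustration only (not the maintainers' original suggestion), the following Python sketch shows one way to check whether running nvidia-smi processes have a prmon process among their ancestors. It assumes a Linux /proc filesystem; the function names `read_proc_table`, `ancestors` and `spawned_by` are hypothetical helpers made up for this example.

```python
from pathlib import Path


def read_proc_table(proc_root="/proc"):
    """Return {pid: (ppid, command_name)} for processes under proc_root (Linux only)."""
    table = {}
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue
        try:
            stat = (entry / "stat").read_text()
        except OSError:
            continue  # the process exited while we were scanning
        # Layout is "pid (comm) state ppid ..."; comm may itself contain
        # spaces or parentheses, so split around the outermost parentheses.
        open_paren = stat.index("(")
        close_paren = stat.rindex(")")
        comm = stat[open_paren + 1:close_paren]
        fields = stat[close_paren + 2:].split()  # fields[0]=state, fields[1]=ppid
        table[int(entry.name)] = (int(fields[1]), comm)
    return table


def ancestors(pid, table):
    """Return the command names of pid's ancestors, nearest parent first."""
    chain = []
    seen = set()
    current = pid
    while current in table and current not in seen:
        seen.add(current)  # guard against malformed, cyclic input
        ppid, _ = table[current]
        if ppid not in table:
            break
        chain.append(table[ppid][1])
        current = ppid
    return chain


def spawned_by(table, child_name="nvidia-smi", parent_name="prmon"):
    """Map each child_name pid to True if parent_name appears among its ancestors."""
    return {
        pid: parent_name in ancestors(pid, table)
        for pid, (_, name) in table.items()
        if name == child_name
    }
```

Calling `spawned_by(read_proc_table())` on the affected node would give one entry per live nvidia-smi process. A caveat: a process whose parent has already exited is re-parented (typically to init), so a `False` result does not rule out that prmon launched it originally.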
Thanks, @graeme-a-stewart. I will follow-up when our machine is back online. |
Hi @vrpascuzzi - ah, I'm sorry, that's not available at the moment. @amete and I were having a long discussion about it a while ago (#107) but we didn't converge on what the syntax should be. But given what you're asking for, I think it makes it clearer that something like
would do exactly what you want, right? We'll try to reinvigorate that and conclude for the next release. |
That would work. |
Hi @vrpascuzzi, just to say we did implement a way to disable particular monitors in master now (#178). Did you make any progress re. seeing if it was prmon that was launching the nvidia-smi processes that looked stuck? |
I would suggest having the default be no GPU monitoring, with it enabled on request, as I think having a GPU is much less common than not having one. The issue that Vince and I were having is likely due to the fact that there are 3 GPUs on the server, 2 NVIDIA and one AMD. The kernel crash logs suggest that the problem lies with the kernel trying to switch between GPUs, and something bad happening with the AMD amdgpu kernel module, which somehow ends up corrupting the process table. The server has 72 cores, so when fully loaded, there were a lot of nvidia-smi processes running. |
BTW, I removed nvidia-smi from the default path on the server, and since then we haven't had any issues. So while not a smoking gun, it is rather suggestive. |
@cgleggett - sorry to reply on this so late. Yes, that is a bit suggestive, but it still doesn't quite square with the way that prmon works. Very curious. We were asked by ADC to have this monitoring enabled by default, but it would be nicer if it could be signalled to prmon that there will be a job that uses the GPU. |
One practical option might be to leave things as is on the |
Hi again @vrpascuzzi @cgleggett Further considering this issue we realised that the fix wouldn't help when it's the job transform that's launching Unfortunately that will only work from the next release (but we are about to cut that). I think that's as much as we can do, so we'll close this issue here, but if you have any further insight into why this was behaving in such an odd way on your node please let us know. Cheers, g. |
Thanks Graeme! I think we are definitely exploring an unusual corner of phase space, where we have many concurrent jobs, each spawning an nvidia-smi process, and a multi-gpu machine which has an AMD card in it. There are a few articles online that mention issues with the AMD amdgpu kernel module relating to the kernel switching between physical gpus, but it doesn't look common. Between this env var, and the fact that I now only put the nvidia executables in the PATH when I want to do something explicitly with the GPU, I think our issue is addressed. It will be interesting to see if this ever affects others. cheers, Charles. |
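The PATH workaround described in the comment above can be sketched in Python as follows. This is only an illustration of the idea, not a prmon feature: `path_without` is a hypothetical helper that builds a PATH value with every directory containing a given executable removed, so that a job launched with that environment cannot find nvidia-smi.

```python
import os


def path_without(executable, path=None):
    """Return a PATH string with every directory that contains `executable` removed."""
    if path is None:
        path = os.environ.get("PATH", "")
    kept = [
        directory
        for directory in path.split(os.pathsep)
        # Skip empty entries and any directory that provides the executable.
        if directory and not os.path.isfile(os.path.join(directory, executable))
    ]
    return os.pathsep.join(kept)
```

A job could then be started with, for example, `subprocess.run(cmd, env=dict(os.environ, PATH=path_without("nvidia-smi")))`, where `cmd` is whatever command launches the job. Note that this drops whole directories, so any other executables living alongside nvidia-smi become unavailable to the job as well.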
When launching a relatively large number of parallel athena jobs (which uses prmon), an equally large number of nvidia-smi processes are spawned. Note that I am not using a GPU or any CUDA code in these jobs, but a CUDA installation is found when configuring cmake. Also, these spawned processes continue to run even after logging out of the machine.

While I admit our machine hasn't been too stable these days, this behaviour -- many nvidia-smi processes being launched "behind the scenes" -- is causing a major overload of the system, ultimately requiring a reboot.

Apologies in advance if this is unrelated to prmon.