
prmon spawning nvidia-smi? #177

Closed
vrpascuzzi opened this issue Nov 17, 2020 · 11 comments
Labels
question Further information is requested

Comments

@vrpascuzzi

When launching a relatively large number of parallel athena jobs (which use prmon), an equally large number of nvidia-smi processes are spawned. Note that I am not using a GPU or any CUDA code in these jobs, but a CUDA installation is found when configuring cmake. Also, these spawned processes continue to run even after logging out of the machine.

While I admit our machine hasn't been too stable these days, this behaviour -- many nvidia-smi processes being launched "behind the scenes" -- is causing a major overload of the system, ultimately requiring a reboot.

Apologies in advance if this is unrelated to prmon.

@graeme-a-stewart graeme-a-stewart added the question Further information is requested label Nov 17, 2020
@graeme-a-stewart
Member

Hi Vince

Thanks for the report. It is true that prmon will run nvidia-smi if it finds it. If a GPU is found, then on each monitoring cycle nvidia-smi will be invoked to see whether the monitored job has started any processes on the GPU. (In contrast, if you don't have a GPU then prmon will forget about GPU monitoring.)

We never saw an issue with the nvidia-smi processes hanging, and in fact prmon waits until nvidia-smi has exited so that it can read its output (there's a waitpid() call).
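For context, the pattern is roughly the following. This is an illustrative sketch rather than the actual prmon source (the pipe/fork/exec details are my assumption; the waitpid() call is the point): the parent reads what the child printed and then blocks in waitpid() until it has exited.

```cpp
// Illustrative sketch (not the actual prmon source): run one nvidia-smi
// sample, capture its stdout through a pipe, then reap it with waitpid().
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <iostream>
#include <string>

int main() {
  int fd[2];
  if (pipe(fd) != 0) return 1;

  const pid_t pid = fork();
  if (pid < 0) return 1;

  if (pid == 0) {
    // Child: send stdout to the pipe and become nvidia-smi
    dup2(fd[1], STDOUT_FILENO);
    close(fd[0]);
    close(fd[1]);
    execlp("nvidia-smi", "nvidia-smi", "pmon", "-s", "um", "-c", "1",
           (char*)nullptr);
    _exit(127);  // exec failed (e.g. nvidia-smi not in PATH)
  }

  // Parent: read everything the child prints...
  close(fd[1]);
  std::string output;
  char buf[4096];
  ssize_t n;
  while ((n = read(fd[0], buf, sizeof(buf))) > 0) output.append(buf, n);
  close(fd[0]);

  // ...then block until it has exited, so it is reaped and not left behind.
  int status = 0;
  waitpid(pid, &status, 0);

  std::cout << output;
  return WIFEXITED(status) ? WEXITSTATUS(status) : 1;
}
```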

So... I doubt that it's a prmon-spawned monitor process, but maybe you could check by doing something like:

  • Running nvidia-smi pmon -s um -c 1 and checking that it exits.
    • BTW, can you see if the orphaned nvidia-smis have those command line arguments?
  • Running prmon -- sleep 300 and checking that there's no accumulation of nvidia-smi processes (one way to count them is sketched below).
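For the second check, something like pgrep -c nvidia-smi before and after should do, or, if it helps, here is a small standalone counter that scans /proc (an illustrative helper I am adding here, not part of prmon):

```cpp
// Illustrative helper (not part of prmon): count processes whose name is
// "nvidia-smi" by scanning /proc. Build with e.g. g++ -std=c++17.
#include <filesystem>
#include <fstream>
#include <iostream>
#include <string>

int main() {
  int count = 0;
  for (const auto& entry : std::filesystem::directory_iterator("/proc")) {
    const std::string pid = entry.path().filename();
    // Only the purely numeric entries are processes
    if (pid.find_first_not_of("0123456789") != std::string::npos) continue;
    std::ifstream comm(entry.path() / "comm");
    std::string name;
    if (std::getline(comm, name) && name == "nvidia-smi") ++count;
  }
  std::cout << "nvidia-smi processes: " << count << "\n";
  return 0;
}
```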

At least that could give us a clue as to whether there is something fishy going on from prmon.

Graeme

@vrpascuzzi
Author

Thanks, @graeme-a-stewart. I will follow up when our machine is back online.
In the meantime, is there an option to disable GPU monitoring?

@graeme-a-stewart
Member

Hi @vrpascuzzi - ah, I'm sorry, that's not available at the moment. @amete and I had a long discussion about it a while ago (#107) but we didn't converge on what the syntax should be. But given what you're asking for, I think it's now clearer that something like

--disable nvidiamon

would do exactly what you want, right? We'll try to reinvigorate that and conclude it for the next release.

@vrpascuzzi
Author

That would work.

@graeme-a-stewart
Member

Hi @vrpascuzzi, just to say that we did implement a way to disable particular monitors in master now (#178). Did you make any progress on seeing whether it was prmon that was launching the nvidia-smi processes that looked stuck?

@cgleggett

I would suggest having the default be no GPU monitoring, with it enabled on request, as I think having a GPU is much less common than not having one.

The issue that Vince and I were having is likely due to the fact that there are 3 GPUs on the server, 2 NVIDIA and one AMD. The kernel crash logs suggest that the problem lies with the kernel trying to switch between GPUs, and something bad happening with the AMD amdgpu kernel module, which somehow ends up corrupting the process table. The server has 72 cores, so when fully loaded, there were a lot of nvidia-smi processes running.

@cgleggett

BTW, I removed nvidia-smi from the default path on the server, and since then we haven't had any issues. So while not a smoking gun, it is rather suggestive.

@graeme-a-stewart
Member

@cgleggett - sorry to reply to this so late. Yes, that is a bit suggestive, but it still doesn't quite square with the way that prmon works. Very curious. We were asked by ADC to have this monitoring enabled by default, but it would be nicer if it could be signalled to prmon that the job will use the GPU.

@amete
Collaborator

amete commented Jan 21, 2021

One practical option might be to leave things as they are on the prmon side (i.e. everything is enabled by default, so ADC doesn't need to do anything special) but disable GPU metric collection in the instance of prmon that the job transform spawns. At this point we don't really make use of any GPU resources from athena anyway, at least for now, so we wouldn't be losing anything.

@graeme-a-stewart
Member

Hi again @vrpascuzzi @cgleggett

Further considering this issue, we realised that the fix wouldn't help when it's the job transform that launches prmon, as you don't have access to the arguments it's invoked with. So we just added a new feature where you can disable monitors via the PRMON_DISABLE_MONITOR environment variable (see #182, #183).
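The rough idea is something like the sketch below. It is illustrative only, not the actual prmon implementation: the comma-separated value format and the monitor names other than nvidiamon are assumptions for the example.

```cpp
// Illustrative sketch (not the actual prmon code): skip registering any
// monitor named in a PRMON_DISABLE_MONITOR-style environment variable.
// The comma-separated format and the monitor names other than "nvidiamon"
// are assumptions for this example.
#include <cstdlib>
#include <iostream>
#include <set>
#include <sstream>
#include <string>
#include <vector>

std::set<std::string> disabled_monitors() {
  std::set<std::string> disabled;
  if (const char* env = std::getenv("PRMON_DISABLE_MONITOR")) {
    std::stringstream ss(env);
    std::string name;
    while (std::getline(ss, name, ','))
      if (!name.empty()) disabled.insert(name);
  }
  return disabled;
}

int main() {
  const std::vector<std::string> monitors = {"cpumon", "memmon", "iomon",
                                             "nvidiamon"};
  const auto disabled = disabled_monitors();
  for (const auto& m : monitors) {
    if (disabled.count(m)) {
      std::cout << "skipping monitor: " << m << "\n";
      continue;
    }
    std::cout << "registering monitor: " << m << "\n";
  }
  return 0;
}
```

So the transform side would just need something like PRMON_DISABLE_MONITOR=nvidiamon in the environment before prmon is launched.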

Unfortunately that will only work from the next release (but we are about to cut that).

I think that's as much as we can do, so we'll close this issue here, but if you have any further insight into why this was behaving in such an odd way on your node please let us know.

Cheers, g.

@cgleggett

Thanks Graeme!

I think we are definitely exploring an unusual corner of phase space, where we have many concurrent jobs, each spawning an nvidia-smi process, and a multi-GPU machine which has an AMD card in it. There are a few articles online that mention issues with the AMD amdgpu kernel module relating to the kernel switching between physical GPUs, but it doesn't look common. Between this env var and the fact that I now only put the nvidia executables in the PATH when I want to do something explicitly with the GPU, I think our issue is addressed. It will be interesting to see if this ever affects others.

cheers, Charles.
