CPU at 100%+ when Nvidia drivers are installed #6057
Comments
Thanks for reporting the issue. It looks like we collect device stats every second, and it seems looking up the GPU temperature is slow, e.g. https://github.com/hashicorp/nomad/blob/v0.9.4/plugins/shared/cmd/launcher/command/device.go#L345-L351. Given that stats collectors usually sample every 10 seconds or so, collecting stats every second is excessive. We should downsample it to 5 or 10 seconds, and see if there are some optimizations we can apply.
I should probably add that the NVIDIA card in question is a GTX 650, so rather old but good enough for transcoding. nvidia-smi reports the right temperature.
I'm also having this issue on a p2.xlarge instance in AWS with a Tesla K80 GPU running Ubuntu 18.04.
Any plans to fix this issue?
@ionosphere80 @numkem The short-term solution would be to configure the Nomad telemetry stanza to have a collection interval of ~10s. We'll attempt to adjust this in a future release, but we may have to rethink some of our NVIDIA stats collection.
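For reference, a minimal sketch of that workaround in the client agent configuration (assuming a typical HCL config file; the 10s value is just the suggestion above, so verify `collection_interval` against the telemetry docs for your Nomad version):

```hcl
telemetry {
  # Sample client/device stats every 10s instead of the 1s default.
  collection_interval = "10s"
}
```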
@endocrimes I've tried what you said and even removing the whole
Fixes a bug where the CPU is pegged at 100% due to collecting device statistics. The passed stats interval was ignored, and the default zero value caused a very tight loop of stats collection. FWIW, in my testing, it took 2.5-3ms to collect NVIDIA GPU stats on a `g2.2xlarge` EC2 instance. The stats interval defaults to 1 second and is user-configurable. I believe this is too frequent as a default, and I may advocate for increasing it to a value closer to 5s or 10s, but I'm keeping it as is for now. Fixes #6057.
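To illustrate the failure mode described above, here is a minimal Go sketch (not Nomad's actual collection code; `collectLoop` and the 1-second fallback are illustrative) of how an interval left at its zero value turns stats collection into a busy loop, and the kind of guard that avoids it:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// collectLoop is an illustrative stand-in for a device stats collection loop.
// If interval is accidentally left at its zero value, an unguarded sleep or
// ticker equivalent fires immediately and the loop spins as fast as the CPU
// allows, which matches the 100%+ usage reported in this issue.
func collectLoop(ctx context.Context, interval time.Duration, collect func()) {
	// Guard against the zero value so a dropped or ignored configuration
	// setting cannot create a busy loop.
	if interval <= 0 {
		interval = 1 * time.Second // hypothetical fallback default
	}

	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			collect()
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()

	// Passing 0 here mimics the ignored stats interval; the guard above
	// keeps the loop at a sane cadence instead of pegging a core.
	collectLoop(ctx, 0, func() { fmt.Println("collecting device stats...") })
}
```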
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
Nomad 0.9.4
Operating system and Environment details
Debian 9 and 10
Issue
When the NVIDIA drivers are installed and Nomad detects that an NVIDIA card can be used, Nomad uses 100%+ CPU, usually hovering around 116% (on a dual-core machine).
Reproduction steps
Install Debian 10, install the nvidia-driver package from the backports repo, and start Nomad.
Job file (if appropriate)
N/A
Nomad Client logs (if appropriate)
I've sent the pprof file generated by Nomad to the oss email.