CPU at 100%+ when Nvidia drivers are installed #6057
Comments
Thanks for reporting the issue. It looks like we collect device stats every second, and it seems looking up the GPU temperature is slow, e.g. https://github.com/hashicorp/nomad/blob/v0.9.4/plugins/shared/cmd/launcher/command/device.go#L345-L351. Given that stats collectors usually sample every 10 seconds or so, collecting stats every second is excessive. We should downsample it to 5 or 10 seconds, and see if there are some optimizations we can apply.
I should probably add that the NVIDIA card in question is a GTX 650, so rather old but good enough for transcoding. nvidia-smi reports the right temperature.
I'm also having this issue on a p2.xlarge instance in AWS with a Tesla K80 GPU running Ubuntu 18.04.
Any plans to fix this issue?
@ionosphere80 @numkem The short-term solution would be to configure the Nomad telemetry stanza to have a collection interval of ~10s. We'll attempt to adjust this in a future release, but we may have to rethink some of our NVIDIA stats collection.
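For reference, a minimal sketch of that workaround in the client agent configuration (assuming a typical HCL config file; the 10s value is just the suggestion above, so verify `collection_interval` against the telemetry docs for your Nomad version):

```hcl
telemetry {
  # Sample client/device stats every 10s instead of the 1s default.
  collection_interval = "10s"
}
```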
@endocrimes I've tried what you said and even removing the whole
Fixes a bug where the CPU is pegged at 100% due to collecting device statistics. The passed stats interval was ignored, and the default zero value caused a very tight loop of stats collection. FWIW, in my testing, it took 2.5-3ms to collect NVIDIA GPU stats on a `g2.2xlarge` EC2 instance. The stats interval defaults to 1 second and is user-configurable. I believe this is too frequent as a default, and I may advocate for increasing it to a value closer to 5s or 10s, but I'm keeping it as is for now. Fixes #6057.
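To illustrate the failure mode described above, here is a minimal Go sketch (not Nomad's actual collection code; `collectLoop` and the 1-second fallback are illustrative) of how an interval left at its zero value turns stats collection into a busy loop, and the kind of guard that avoids it:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// collectLoop is an illustrative stand-in for a device stats collection loop.
// If interval is accidentally left at its zero value, an unguarded sleep or
// ticker equivalent fires immediately and the loop spins as fast as the CPU
// allows, which matches the 100%+ usage reported in this issue.
func collectLoop(ctx context.Context, interval time.Duration, collect func()) {
	// Guard against the zero value so a dropped or ignored configuration
	// setting cannot create a busy loop.
	if interval <= 0 {
		interval = 1 * time.Second // hypothetical fallback default
	}

	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			collect()
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()

	// Passing 0 here mimics the ignored stats interval; the guard above
	// keeps the loop at a sane cadence instead of pegging a core.
	collectLoop(ctx, 0, func() { fmt.Println("collecting device stats...") })
}
```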
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
Nomad 0.9.4
Operating system and Environment details
Debian 9 and 10
Issue
When the NVIDIA drivers are installed and Nomad detects that an NVIDIA card can be used, Nomad uses 100%+ CPU, usually hovering around 116% (on a dual-core machine).
Reproduction steps
Install Debian 10, install the nvidia-driver package from the backports repo, and start Nomad.
Job file (if appropriate)
N/A
Nomad Client logs (if appropriate)
I've sent the pprof file generated by Nomad to the oss email.