UI: Display GPU resource usage #10779
Comments
Just as a point of clarification, Docker has its own nvidia runtime support which we can allow via
I like this idea... but the challenge with presenting that in a reasonable way is that Nomad doesn't actually know that a device is a GPU at all! Nomad's GPU support here is via our device driver protocol. The device driver protocol has a

We're also likely to externalize the Nvidia GPU device driver so that it doesn't ship with Nomad by default. See #8330. It doesn't get a lot of use and makes building Nomad on some architectures (ex. ARM) and some Linux distros (ex. Alpine) really painful.
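For illustration, here is a rough sketch of what the device plugin protocol looks like from Nomad's side. The type and method names below are simplified stand-ins, not the real API (which lives in `github.com/hashicorp/nomad/plugins/device`); the point is that plugins report devices as generic vendor/type/name groups, so "gpu" is just an opaque string rather than something Nomad understands natively.

```go
// Illustrative sketch only: these names are simplified stand-ins for the
// real types in github.com/hashicorp/nomad/plugins/device.
package device

import (
	"context"
	"time"
)

// DeviceGroup mirrors the vendor/type/name grouping seen in the node
// status output below; "gpu" is just an opaque string here.
type DeviceGroup struct {
	Vendor string // e.g. "nvidia"
	Type   string // e.g. "gpu" -- Nomad attaches no special meaning to it
	Name   string // e.g. "NVIDIA GeForce GTX 1060 with Max-Q Design"
}

// FingerprintResponse and StatsResponse stand in for the plugin's
// streamed fingerprint and stats payloads.
type FingerprintResponse struct{ Devices []*DeviceGroup }
type StatsResponse struct{ Groups []*DeviceGroup }

// DevicePlugin paraphrases the plugin protocol: plugins stream
// fingerprints and stats, and reserve device instances for tasks.
type DevicePlugin interface {
	Fingerprint(ctx context.Context) (<-chan *FingerprintResponse, error)
	Stats(ctx context.Context, interval time.Duration) (<-chan *StatsResponse, error)
	Reserve(deviceIDs []string) (*ContainerReservation, error)
}

// ContainerReservation is whatever the runtime needs in order to expose
// the reserved devices to a task (mounts, env vars, cgroup rules, ...).
type ContainerReservation struct{}
```

Because the UI only ever sees these generic groups, any GPU-specific presentation would have to key off the `Type`/`Vendor` strings that plugins happen to report.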
Interestingly enough (at least in the case of an NVIDIA card), the device plugin exposes the following information on the node status:
"NodeResources": {
"Devices": [
{
"Name": "NVIDIA GeForce GTX 1060 with Max-Q Design",
"Type": "gpu",
"Vendor": "nvidia",
...
}
],
...
}
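For anyone who wants to poke at that data directly, here is a minimal sketch that reads the fingerprinted devices back out of the node API. It assumes a reachable Nomad agent (via `NOMAD_ADDR`, falling back to `http://127.0.0.1:4646`) and a node ID passed on the command line, and it only decodes the fields shown in the snippet above.

```go
// Minimal sketch: fetch a node's fingerprinted devices from the Nomad HTTP
// API (GET /v1/node/:node_id) and print them.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

// nodeStatus decodes only the parts of the node payload shown above.
type nodeStatus struct {
	NodeResources struct {
		Devices []struct {
			Name   string
			Type   string
			Vendor string
		}
	}
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: nodedevices <node-id>")
		os.Exit(1)
	}
	addr := os.Getenv("NOMAD_ADDR")
	if addr == "" {
		addr = "http://127.0.0.1:4646"
	}

	resp, err := http.Get(addr + "/v1/node/" + os.Args[1])
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var node nodeStatus
	if err := json.NewDecoder(resp.Body).Decode(&node); err != nil {
		panic(err)
	}
	for _, d := range node.NodeResources.Devices {
		fmt.Printf("%s/%s: %s\n", d.Vendor, d.Type, d.Name)
	}
}
```

Run it with something like `NOMAD_ADDR=http://nomad.example.com:4646 go run main.go <node-id>` (the address is just a placeholder).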
Notice that in both cases the type is specified as `gpu`.
Oh interesting. So would it be better to add this functionality per external device plugin? Or is it not possible for a device to expose UI metrics in a consistent way?
That But the stats themselves are coming from the
Ah, I see how this gets more complicated. How difficult would it be to attempt this? It seems like the issue linked as a dependency for #10088 (#606) has been closed. I'm finding that I need to be able to see these stats, so I wouldn't mind taking a stab at implementing this myself.
Well, given the discussion above, it looks like all the required APIs are already available, so it'd be mostly web UI work. I have to admit I don't have a good intuition for the difficulty level here, so I'm going to ping my colleague @JBhagat841 for his thoughts.
Any update on this?
We'll update the issue when there's an update to be had. Unfortunately this never got roadmapped, so I'll mark it for roadmapping now.
Does Nomad report NVIDIA GPU details to the dashboard?
This would be very useful for checking which GPU types a node offers and whether they are being picked up at all (sometimes the nvidia driver fails).
Proposal
According to #9917 it seems as though Nomad is already aware of certain resource usage information from GPUs, so I think that a simple solution would be to query for this data and display it using the existing graph components.
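As a rough sketch of the data-fetching half of this proposal (the display half would live in the web UI), the loop below polls a client's stats endpoint and prints whatever device stats it finds. It assumes the GPU stats referenced in #9917 are surfaced under a `DeviceStats` key on `/v1/client/stats`; that shape is an assumption, so the payload is left as raw JSON for inspection rather than decoded into concrete fields.

```go
// Rough sketch: poll a client's stats endpoint and pull out any device/GPU
// stats it reports, which the UI could then feed into its existing graph
// components. The "DeviceStats" key is assumed, not verified.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"
	"time"
)

func main() {
	addr := os.Getenv("NOMAD_ADDR")
	if addr == "" {
		addr = "http://127.0.0.1:4646"
	}

	for range time.Tick(5 * time.Second) {
		resp, err := http.Get(addr + "/v1/client/stats")
		if err != nil {
			fmt.Fprintln(os.Stderr, "poll failed:", err)
			continue
		}

		var stats map[string]json.RawMessage
		err = json.NewDecoder(resp.Body).Decode(&stats)
		resp.Body.Close()
		if err != nil {
			fmt.Fprintln(os.Stderr, "decode failed:", err)
			continue
		}

		// Print the raw device stats blob (if present) so its shape can be
		// inspected before wiring it into the UI's chart components.
		if raw, ok := stats["DeviceStats"]; ok {
			fmt.Printf("%s %s\n", time.Now().Format(time.RFC3339), raw)
		} else {
			fmt.Println("no DeviceStats key in client stats response")
		}
	}
}
```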
Use-cases
As it stands, it is impossible to tell if a job is using GPU resources without SSH-ing into the machine and running a tool such as
nvidia-smi
. Even then, tools likenvidia-smi
do not directly tell you which job is using the resources, only the process. It would be nice if the status page for the job itself could let you know of the GPU resource usage which, in turn, could help in visualizing when those resources get completely consumed (at least until a solution for #9917 becomes available).Attempted Solutions
None of the resource metrics available in the UI expose the GPUs at all. The closest I've come to even confirming that GPUs are present is the docker driver info on the client status page, and even that only shows that the nvidia runtime is available.