
UI: Display GPU resource usage #10779

Open

nicholascioli opened this issue Jun 18, 2021 · 9 comments

@nicholascioli
Contributor

Proposal

According to #9917 it seems as though Nomad is already aware of certain resource usage information from GPUs, so I think that a simple solution would be to query for this data and display it using the existing graph components.

Use-cases

As it stands, it is impossible to tell if a job is using GPU resources without SSH-ing into the machine and running a tool such as nvidia-smi. Even then, tools like nvidia-smi do not directly tell you which job is using the resources, only the process. It would be nice if the status page for the job itself could show the GPU resource usage, which, in turn, could help in visualizing when those resources are completely consumed (at least until a solution for #9917 becomes available).

Attempted Solutions

None of the resource metrics available in the UI expose the GPUs at all. The closest I've seen to knowing that GPUs are available at all is the docker driver info on the client status page. Even then, that only shows that the nvidia runtime is available.

[screenshot: 2021-06-17-235123_867x442_scrot (docker driver attributes on the client status page)]

@tgross
Member

tgross commented Jun 18, 2021

The closest I've seen to knowing that GPUs are available at all is the docker driver info on the client status page. Even then, that only shows that the nvidia runtime is available.

Just as a point of clarification, Docker has its own nvidia runtime support, which we can allow via allow_runtimes, but that's not the same as Nomad's Nvidia device driver support used as an example in the device block documentation.

According to #9917 it seems as though Nomad is already aware of certain resource usage information from GPUs, so I think that a simple solution would be to query for this data and display it using the existing graph components

I like this idea... but the challenge with presenting that in a reasonable way is that Nomad doesn't actually know that a device is a GPU at all! Nomad's GPU support here is via our device driver protocol. The device driver protocol has a Stats method that returns a generic collection of stats objects, but Nomad doesn't have any semantic information about them. The stats could just as easily be from a USB driver or an e-Paper display as from a GPU.
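
To make that concrete, here is a rough Go sketch of the kind of opaque payload a device plugin hands back. The field names mirror what shows up under /v1/client/stats; this is an illustration, not the exact structs in stats.go.

package main

import "fmt"

// StatValue approximates one opaque stat a device plugin reports: a
// description, a unit, and numbers. Nothing in it tells Nomad whether the
// value is GPU memory, USB bandwidth, or anything else.
type StatValue struct {
    Desc              string // e.g. "UsedMemory / TotalMemory"
    Unit              string // e.g. "MiB"
    IntNumeratorVal   int64  // e.g. 6
    IntDenominatorVal int64  // e.g. 6078
}

// InstanceStats groups the values for one device instance under a summary.
type InstanceStats struct {
    Summary StatValue
    Stats   map[string]StatValue
}

func main() {
    s := InstanceStats{
        Summary: StatValue{
            Desc:              "UsedMemory / TotalMemory",
            Unit:              "MiB",
            IntNumeratorVal:   6,
            IntDenominatorVal: 6078,
        },
    }
    // The UI could render this string, but it has to guess which value,
    // if any, is the one worth graphing.
    fmt.Printf("%s: %d/%d %s\n", s.Summary.Desc,
        s.Summary.IntNumeratorVal, s.Summary.IntDenominatorVal, s.Summary.Unit)
}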

We're also likely to externalize the Nvidia GPU device driver so that it doesn't ship with Nomad by default. See #8330. It doesn't get a lot of use and makes building Nomad on some architectures (ex. ARM) and some Linux distros (ex. Alpine) really painful.

@nicholascioli
Contributor Author

I like this idea... but the challenge with presenting that in a reasonable way is that Nomad doesn't actually know that a device is a GPU at all!

Interestingly enough (at least in the case of an NVIDIA card), the device plugin exposes the following information on the node status:

/v1/node/ID

"NodeResources": {
    "Devices": [
      {
        "Name": "NVIDIA GeForce GTX 1060 with Max-Q Design",
        "Type": "gpu",
        "Vendor": "nvidia",
        ...
      }
    ],
    ...
}

/v1/client/stats

{
  "DeviceStats": [
    {
      "InstanceStats": {
        "GPU-XXX-YYY-ZZZ": {
          "Summary": {
            "Desc": "UsedMemory / TotalMemory",
            "IntDenominatorVal": 6078,
            "IntNumeratorVal": 6,
            "Unit": "MiB"
          },
          ...
        }
      },
      "Name": "NVIDIA GeForce GTX 1060 with Max-Q Design",
      "Type": "gpu",
      "Vendor": "nvidia",
      ...
    }
  ]
}

Notice that in both cases the type is specified as gpu. Could we not use this information to at least have a nice total overview? Or is this just a coincidence that NVIDIA exposes the type while others may not?
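
For illustration, here is a minimal Go sketch of that kind of overview on the consumer side: it filters the /v1/node/:id payload above on Type == "gpu". Only the quoted fields are modeled, and the Nomad address and node ID are placeholders.

package main

import (
    "encoding/json"
    "fmt"
    "log"
    "net/http"
)

// Only the fields quoted above are modeled.
type nodeDevice struct {
    Name   string
    Type   string
    Vendor string
}

type nodeResources struct {
    Devices []nodeDevice
}

type node struct {
    NodeResources nodeResources
}

func main() {
    // Placeholder address (Nomad's default HTTP port) and node ID.
    resp, err := http.Get("http://127.0.0.1:4646/v1/node/NODE_ID")
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    var n node
    if err := json.NewDecoder(resp.Body).Decode(&n); err != nil {
        log.Fatal(err)
    }

    for _, d := range n.NodeResources.Devices {
        if d.Type == "gpu" { // the only hint that this device is a GPU
            fmt.Printf("%s %s\n", d.Vendor, d.Name)
        }
    }
}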

We're also likely to externalize the Nvidia GPU device driver so that it doesn't ship with Nomad by default.

Oh interesting. So would it be better to add this functionality per external device plugin? Or is it not possible for a device to expose UI metrics in a consistent way?

@tgross
Member

tgross commented Jun 21, 2021

Notice that in both cases the type is specified as gpu. Could we not use this information to at least have a nice total overview? Or is this just a coincidence that NVIDIA exposes the type while others may not?

That NodeDeviceResource data comes from the fingerprint. The values are just opaque strings to Nomad, but apparently we do special-case the string "gpu" so we could at least detect if the device was a GPU.

But the stats themselves are coming from the Stats RPC, and that is just returning the objects from stats.go. They're defined by the individual device plugin. So we'd need to add a way for device plugins to tell the Nomad server how to display their graphs (ex. "which one of these values is the thing you should graph for memory"), which runs into the same problem we have for task drivers in #10088, where each client can have a different version of the plugin.
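
Purely as a sketch of that last point (nothing like this exists in the plugin protocol today, and every name below is invented for illustration), a display hint shipped by a plugin might look something like this:

package main

import "fmt"

// StatDisplayHint is a hypothetical way for a device plugin to tell the
// server and UI how one of its stat keys should be presented.
type StatDisplayHint struct {
    StatKey string // key in the plugin's stats map, e.g. "UsedMemory"
    GraphAs string // semantic role for the UI, e.g. "memory" or "utilization"
    Unit    string // e.g. "MiB", "%"
    Primary bool   // graph this one by default on the allocation page
}

func main() {
    // Hypothetical hints an Nvidia-style plugin might advertise.
    hints := []StatDisplayHint{
        {StatKey: "UsedMemory", GraphAs: "memory", Unit: "MiB", Primary: true},
        {StatKey: "UtilizationGPU", GraphAs: "utilization", Unit: "%"},
    }
    for _, h := range hints {
        fmt.Printf("%s -> graph as %s (%s)\n", h.StatKey, h.GraphAs, h.Unit)
    }
}

Even with something like that, the versioning problem above still applies: two clients running different plugin versions could advertise different hints for the same device type.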

@nicholascioli
Contributor Author

Ah, I see how this gets more complicated. How difficult would it be to attempt this? It seems like the issue linked as a dependency for #10088 (#606) has been closed, and I would not mind taking a stab at it.

I'm finding that I need to be able to see these stats, so I'd be happy to try implementing this.

@tgross
Member

tgross commented Jun 22, 2021

Well, given the discussion above, it looks like all the required APIs are already available, so it'd be mostly web UI work. Here I have to admit I don't have a good intuition for the difficulty level, and I'm going to ping my colleague @JBhagat841 for his thoughts.

@shivamwaghela

Any update on this?

@tgross added this to Nomad UI on Sep 18, 2024
@github-project-automation (bot) moved this to Backlog in Nomad UI on Sep 18, 2024
@tgross
Member

tgross commented Sep 18, 2024

We'll update the issue when there's an update to be had. Unfortunately this never got roadmapped, so I'll mark it for roadmapping now.

@tgross moved this from Needs Triage to Needs Roadmapping in Nomad - Community Issues Triage on Sep 18, 2024
@Berl-cloud

Does Nomad report NVIDIA GPU details to the dashboard?

@chamini2

This would be very useful for checking what GPU types a node offers and whether the node is picking them up at all (sometimes the nvidia driver fails).
