extend quotas to include device plugins such as GPUs #9917

Closed
henrikjohansen opened this issue Jan 29, 2021 · 9 comments · Fixed by #23894

Comments

@henrikjohansen

The existing resource quotas cover resources such as CPU (MHz) and memory (MB), but devices exposed by device plugins, such as GPUs, are currently not supported.

This is a real problem, as a single job can essentially consume all available GPU resources, leading to cluster-wide starvation of a critical resource.

GPU resources are already being fingerprinted:

$ nomad node status -verbose 429c30c7 
...
Device Group Attributes
Device Group     = nvidia/gpu/Tesla T4
bar1             = 256 MiB
cores_clock      = 1590 MHz
display_state    = Enabled
driver_version   = 460.32.03
memory_clock     = 5001 MHz
memory           = 15109 MiB
pci_bandwidth    = 15760 MB/s
persistence_mode = Disabled
power            = 70 W
...

The ideal solution would be to extend fingerprinting to include the number of GPU cores and expose that value for use in resource quotas. This would bring GPU quotas on par with system CPU and memory quotas. (GPU resources could also be exposed in MHz instead of as a core count, to match how CPU resources are expressed.)

For example:

limit {
  region = "global"
  region_limit {
    cpu = 2500
    memory = 1000

    nvidia {
      cores = 2560   # number of CUDA cores
      memory = 15000 # MB GPU Memory
    }
  }
}

But simply being able to restrict the number of devices using a resource quota would also be acceptable for now. For example:

limit {
  region = "global"
  region_limit {
    cpu = 2500
    memory = 1000

    nvidia {
      devices = 2 # number of GPU devices
    }
  }
}

@tgross changed the title from "[feature] extend resource quotas to include device plugins such as GPUs" to "extend quotas to include device plugins such as GPUs" on Jan 29, 2021

@tgross
Member

tgross commented Jan 29, 2021

Hi @henrikjohansen, and thanks for this suggestion! I've slightly edited the title just so that we don't confuse the ENT quotas feature with the resource block. There's some interesting trickiness to this because devices are plugins, so the resources they expose are fairly arbitrary from the perspective of the scheduler. Should be interesting to figure out!

@henrikjohansen
Author

henrikjohansen commented Jan 29, 2021

@tgross 👍 Well, your comment is precisely why I included the last example in ☝️. A simple count of the number of device instances is probably more realistic than exposing fine grained resources for all relevant device types? 🤔
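
To make the count-based idea concrete, here is a minimal Python sketch of how a per-namespace limit on device instances could be checked before placing a job. It is an illustration only, not Nomad's implementation: the `NamespaceDeviceQuota` class, its method names, and the rule that devices without a configured limit are unrestricted are all assumptions. Only the device name `nvidia/gpu/Tesla T4` comes from the fingerprint output above. Because devices are keyed by an opaque name string, the check needs no knowledge of what a given plugin exposes.

```python
from dataclasses import dataclass, field


@dataclass
class NamespaceDeviceQuota:
    """Hypothetical per-namespace limit on device instances (not Nomad's API)."""

    limits: dict[str, int]  # device name -> maximum instances allowed
    in_use: dict[str, int] = field(default_factory=dict)  # device name -> instances held

    def can_place(self, requested: dict[str, int]) -> bool:
        """Return True if every requested device count fits under its limit."""
        for device, count in requested.items():
            limit = self.limits.get(device)
            if limit is None:
                # Assumption: a device with no configured limit is unrestricted.
                continue
            if self.in_use.get(device, 0) + count > limit:
                return False
        return True

    def place(self, requested: dict[str, int]) -> None:
        """Record the requested devices as in use, or raise if the quota is exceeded."""
        if not self.can_place(requested):
            raise ValueError("device quota exceeded")
        for device, count in requested.items():
            self.in_use[device] = self.in_use.get(device, 0) + count


# Usage: a namespace limited to two Tesla T4 devices.
quota = NamespaceDeviceQuota(limits={"nvidia/gpu/Tesla T4": 2})
quota.place({"nvidia/gpu/Tesla T4": 1})
quota.place({"nvidia/gpu/Tesla T4": 1})
print(quota.can_place({"nvidia/gpu/Tesla T4": 1}))  # False
```

Counting whole instances sidesteps the question of which fine-grained attributes (cores, memory, clock speed) each device type exposes, which is the trade-off the comment above is pointing at.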

@henrikjohansen
Author

@tgross This is becoming an increasingly prevalent issue for us, since we have no way to control the utilization of GPU resources among our tenants, which leads to all sorts of problems.

I would ❤️ to see the resource quotas feature of Nomad Enterprise enhanced so that the number of available GPUs could be limited per namespace 👇

limit {
  region = "global"
  region_limit {
    cpu = 2500
    memory = 1000

    nvidia {
      devices = 2 # number of GPU devices
    }
  }
}

@henrikjohansen
Author

@tgross Just a friendly reminder that this has now grown into a major problem for us. As a Nomad Enterprise customer, I am somewhat disappointed that we cannot guard our rarest and most expensive resource using quotas.

@tgross
Member

tgross commented Jul 6, 2023

Hey @henrikjohansen it's great that you have this issue open for us engineers to track and discuss feasibility. You may want to escalate with your account rep if you want to put some fire under it in terms of prioritization.

@illyakaynov

This issue is very relevant for us. Is there a way we could contribute, since it's an Enterprise feature?

@henrikjohansen
Author

... just my yearly reminder that we are still patiently waiting for this.

@tgross
Member

tgross commented Jul 10, 2024

@henrikjohansen again, as an Enterprise customer, the best way to nudge this along is through your account manager, so that there's a formal internal Feature Request.

@pkazmierczak
Contributor

Hi @henrikjohansen, I marked the issue as resolved, since the changes have been merged into main. The feature will land in Nomad Enterprise 1.9.0, due to be released Oct 14th.
