extend quotas to include device plugins such as GPUs #9917
Comments
Hi @henrikjohansen, and thanks for this suggestion! I've slightly edited the title just so that we don't confuse the ENT
@tgross 👍 Well, your comment is precisely why I included the last example in ☝️. A simple count of the number of device instances is probably more realistic than exposing fine-grained resources for all relevant device types? 🤔
@tgross This is becoming an increasingly prevalent issue for us: we have no way to control the utilization of GPU resources among our tenants, which leads to all sorts of problems. I would ❤️ to see the resource quotas feature of Nomad Enterprise enhanced so that the number of available GPUs could be limited per namespace 👇
@tgross Just a friendly reminder that this has now grown into a major problem for us. As a Nomad Enterprise customer I am somewhat disappointed that we cannot guard our rarest and most expensive resource using quotas.
Hey @henrikjohansen it's great that you have this issue open for us engineers to track and discuss feasibility. You may want to escalate with your account rep if you want to put some fire under it in terms of prioritization.
This issue is very relevant for us. Is there a way we could contribute, since it's an Enterprise feature?
... just my yearly reminder that we are still patiently waiting for this.
@henrikjohansen again, as an Enterprise customer you can best nudge on this from your account manager, so that there's a formal internal Feature Request.
hi @henrikjohansen, I marked the issue as resolved, since the changes have been merged into
The existing resource quotas cover resources such as MHz of compute and MB of memory, but device plugins such as GPUs are currently not supported.
This is a real problem, as a single job can essentially consume all available GPU resources, leading to cluster-wide starvation of critical resources.
GPU resources are already being fingerprinted:
The ideal solution would be to extend the fingerprinting to include the number of GPU cores and expose those for use in resource quotas. This would bring GPU quotas on par with system CPU and memory quotas (GPU resources could also be exposed as MHz instead of core count to make them identical to CPU resources).
For example:
But simply being able to restrict the number of devices using a resource quota would also be acceptable for now, for example:
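The device-count idea can be sketched as a simple admission check. This is a hypothetical Python model for illustration only, not Nomad's quota syntax or implementation: `NamespaceQuota`, `device_limits`, `admit`, and the device name `nvidia/gpu` are all assumed names, and the limit of 2 is made up.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class NamespaceQuota:
    """Hypothetical per-namespace cap on device instances, keyed by device name."""

    device_limits: dict[str, int]
    in_use: dict[str, int] = field(default_factory=dict)

    def admit(self, requested: dict[str, int]) -> bool:
        """Reserve the requested devices if every limit holds; otherwise reserve nothing."""
        # First pass: check every requested device against its limit.
        for device, count in requested.items():
            limit = self.device_limits.get(device)  # None means "no limit configured"
            if limit is not None and self.in_use.get(device, 0) + count > limit:
                return False
        # Second pass: all checks passed, so record the reservation.
        for device, count in requested.items():
            self.in_use[device] = self.in_use.get(device, 0) + count
        return True


# Assumed example: a namespace limited to two GPU instances.
quota = NamespaceQuota(device_limits={"nvidia/gpu": 2})
first = quota.admit({"nvidia/gpu": 1})   # fits within the limit
second = quota.admit({"nvidia/gpu": 2})  # would bring the total to 3, over the limit
```

Counting whole device instances keeps the check independent of device type, which is the trade-off raised in the comments: it is coarser than per-core or MHz accounting, but it does not require any extra fingerprinting.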