Requesting "None" GPU Accelerator on NERC OpenShift AI (RHOAI) allocates all 4 GPUs #685
Comments
I looked at the pod definition, and there were no requests/limits for a GPU device at all. This means that it will not be restricted by the ResourceQuota either, nor can it be billed for. After some looking around I came across the following from the NVIDIA k8s device plugin README, and according to that this may be the expected behavior.
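To make that behavior concrete, here is a minimal sketch (not the actual NERC workbench pod spec; the name and image are placeholders) of a container that declares no `nvidia.com/gpu` request or limit, which is the situation described above:

```yaml
# Hypothetical notebook pod with no nvidia.com/gpu request or limit.
# Per the NVIDIA k8s device plugin README behavior discussed above, such a
# container is not handled by the device plugin at all, so it is neither
# counted against the ResourceQuota nor billed, yet the container runtime
# can still expose every GPU on the node to it.
apiVersion: v1
kind: Pod
metadata:
  name: workbench-no-gpu-request   # placeholder name
spec:
  containers:
    - name: notebook
      image: example.com/notebook:latest   # placeholder image
      resources:
        requests:
          cpu: "1"
          memory: 4Gi
        limits:
          cpu: "2"
          memory: 8Gi
        # Note: no nvidia.com/gpu entry anywhere in this spec.
```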
A workaround would be to not schedule pods that don't request GPUs on nodes with GPUs. I am not sure how far we got with that.
If all notebooks for all users continue to be put into the rhods-notebooks namespace, there's no way to set limits based only on the user's allocation quota.
@joachimweyl ah, yes.
Is there an issue to check all existing projects to make sure they are set to 0? |
@gagansk is checking with RHOAI dev team. |
@gagansk any update from RHOAI? |
@dystewart raised the issue with the RHOAI team on 08/28/2024. I am following up with the RHOAI IDE PM (Kezia Cook).
@Milstein Some folks running an internal AI cluster at Red Hat have run into this problem. I am relaying the messages I received here for your information.
Here's the fix: NVIDIA/gpu-operator#421 (comment). It can be made in the ClusterPolicy object.
@dystewart would you be the one to implement the fix Gagan found?
@joachimweyl Yes! I'll test this fix.
Addresses: nerc-project/operations#685. In OCP-on-NERC#566 we tested this ClusterPolicy config change with success. This PR will bring the change to the prod RHOAI installation.
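For context, the ClusterPolicy change referenced above is roughly the sketch below. The field names and environment variables are my understanding of the NVIDIA device plugin and container toolkit options; treat them as assumptions and verify against NVIDIA/gpu-operator#421 and the gpu-operator documentation rather than copying this verbatim:

```yaml
# Sketch of the gpu-operator ClusterPolicy change (assumed field names;
# verify against NVIDIA/gpu-operator#421 and the operator docs).
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  devicePlugin:
    env:
      # Hand GPU devices to containers via volume mounts instead of the
      # NVIDIA_VISIBLE_DEVICES env var, so only containers that actually
      # request nvidia.com/gpu resources get access to devices.
      - name: DEVICE_LIST_STRATEGY
        value: volume-mounts
  toolkit:
    env:
      # Tell the container toolkit to ignore NVIDIA_VISIBLE_DEVICES coming
      # from unprivileged containers and to honor the volume-mount list.
      - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
        value: "false"
      - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS
        value: "true"
```

The intent of this change is to stop relying on the NVIDIA_VISIBLE_DEVICES environment variable for unprivileged containers, so that only pods explicitly requesting nvidia.com/gpu resources can see GPUs.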
Motivation
Using RHOAI, users can use GPU resources in their workbench Jupyter Lab setup. There is an option to request either a GPU, based on the allocation quota specified by the "OpenShift Request on GPU Quota" OpenShift allocation quota attribute on their ColdFront allocation, or None for their workload, as shown here:
But when None is selected, the user can use all available GPUs without being billed for that usage:
Completion Criteria
When the GPU option is set to "None", the workbench should have the following set: with a resource request for
nvidia.com/gpu: 0
this environment variable (NVIDIA_VISIBLE_DEVICES) should be set automatically. Also, when setting the GPU Accelerator to a specific count (Number of accelerators), ensure it aligns with the currently available GPU quota for the user namespace. It's possible that some GPUs are already in use by other workloads.
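As an illustration of the completion criteria, the workbench container's resources might look like the following sketch when "None" is selected (an assumed shape, not an actual RHOAI-generated manifest):

```yaml
# Sketch: expected resources for a workbench container with GPU set to "None".
resources:
  requests:
    cpu: "1"
    memory: 4Gi
    nvidia.com/gpu: "0"   # explicit zero request
  limits:
    cpu: "2"
    memory: 8Gi
    nvidia.com/gpu: "0"   # explicit zero limit
# With the explicit zero above, the device plugin / container toolkit should
# ensure the container sees no GPUs (i.e. NVIDIA_VISIBLE_DEVICES is not left
# as "all"), which is the behavior this issue asks for.
```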
Description
Workaround Ref: NVIDIA/k8s-device-plugin#61
Completion dates
Desired - 2024-08-21
Required - 2024-09-11