
Requesting "None" GPU Accelerator on NERC OpenShift AI (RHOAI) allocates all 4 GPUs #685

Closed

Milstein opened this issue Aug 16, 2024 · 11 comments · Fixed by OCP-on-NERC/nerc-ocp-config#566

@Milstein

Milstein commented Aug 16, 2024

Motivation

Using RHOAI, users can attach GPU resources to their workbench Jupyter Lab setup. There is an option to request either a GPU, based on the allocation quota specified by the "OpenShift Request on GPU Quota" attribute of their ColdFront allocation, or None for their workload, as shown here:

[screenshot: GPU accelerator selection in the workbench settings]

But when None is selected, the user can still use all available GPUs without being billed for that usage:

[screenshot: all GPUs exposed to the workbench]

Completion Criteria

When the GPU accelerator option is set to "None", the workbench pod should have:

[screenshot]

  limits:
    ...
    nvidia.com/gpu: "0"
  requests:
    ...
    nvidia.com/gpu: "0"

With a resource request of nvidia.com/gpu: 0, this environment variable should be set automatically.

Also, when setting the GPU Accelerator to a specific count (Number of accelerators), ensure it aligns with the currently available GPU quota for the user namespace. It's possible that some GPUs are already in use by other workloads.

[screenshot: "Number of accelerators" setting]
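For reference, a per-namespace GPU cap of this kind is normally expressed in Kubernetes as a ResourceQuota on the extended GPU resource. A minimal sketch, where the namespace name and quota value are illustrative rather than the actual NERC configuration:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota               # illustrative name
  namespace: user-namespace     # illustrative namespace
spec:
  hard:
    requests.nvidia.com/gpu: "4"  # example cap; would mirror the ColdFront allocation

Note that a quota like this only takes effect for pods that actually carry a nvidia.com/gpu request; as discussed below, pods with no GPU request at all are not counted against it.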

Description

  • First step to resolve the issue

Workaround Ref: NVIDIA/k8s-device-plugin#61
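For context, the workaround discussed in that NVIDIA issue amounts to explicitly hiding GPUs from containers that do not request them by overriding NVIDIA_VISIBLE_DEVICES on the container. A rough sketch of the relevant part of a notebook pod spec (the container name is illustrative; see the linked issue for the authoritative details):

spec:
  containers:
    - name: notebook                  # illustrative container name
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: "void"               # "void" (or an empty value) hides all GPUs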

Completion dates

Desired - 2024-08-21
Required - 2024-09-11

@naved001

But when None is selected, the user can still use all available GPUs without being billed for that usage:

I looked at the pod definition, and there were no requests/limits for a GPU device at all. This means it will not be restricted by the ResourceQuota, nor can it be billed for.

After some looking around, I came across the following in the NVIDIA k8s-device-plugin README, and according to that this may be the expected behavior.

WARNING: if you don't request GPUs when using the device plugin with NVIDIA images all the GPUs on the machine will be exposed inside your container.

A workaround would be to not schedule pods that don't request GPUs on nodes with GPUs. I am not sure how far we got with that.
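For reference, one common way to achieve that is to taint the GPU nodes so that only pods carrying a matching toleration, typically those that actually request GPUs, are scheduled onto them. A rough sketch with an illustrative taint key:

# Taint on each GPU node (illustrative key/value):
spec:
  taints:
    - key: nvidia.com/gpu
      value: "true"
      effect: NoSchedule

# Pods that do request GPUs would then need a matching toleration:
tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule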

@knikolla

Also, when setting the GPU Accelerator to a specific count (Number of accelerators), ensure it aligns with the currently available GPU quota for the user namespace. It's possible that some GPUs are already in use by other workloads.

If all notebooks for all users continue to be put into the rhods-notebooks namespace, there's no way to set limits based only on the user's allocation quota.

@joachimweyl
Contributor

@naved001 to confirm, are you referring to this issue?

@naved001

@joachimweyl ah, yes.

@msdisme

msdisme commented Aug 19, 2024

Is there an issue to check all existing projects to make sure they are set to 0?

Milstein assigned Milstein and unassigned Milstein Aug 21, 2024
@gagansk

gagansk commented Aug 21, 2024

@gagansk is checking with the RHOAI dev team.

@joachimweyl
Contributor

@gagansk any update from RHOAI?

@gagansk

gagansk commented Sep 13, 2024

@dystewart raised the issue with the RHOAI team on 08/28/2024. I am following up with the RHOAI IDE PM (Kezia Cook).

msdisme closed this as completed Sep 13, 2024
msdisme reopened this Sep 13, 2024
@gagansk

gagansk commented Sep 30, 2024

@Milstein Some folks running an internal AI cluster at Red Hat have run into this problem. I am relaying the messages I received here for your information.

accelerator profiles don't do anything WRT admission. If your GPU nodes are untainted, the scheduler can deem them as "free real estate" and then workloads can consume against them even without requesting quota.

Here's the fix: NVIDIA/gpu-operator#421 (comment)

The change can be made in the following object, under .spec.toolkit.env:

apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy

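Based on the linked gpu-operator comment and the NVIDIA GPU Operator documentation on restricting GPU visibility, the change is roughly of the following shape; treat this as a sketch and take the exact environment variables and values from the linked comment:

apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  toolkit:
    env:
      # Ignore NVIDIA_VISIBLE_DEVICES coming from unprivileged containers,
      # so images that bake in NVIDIA_VISIBLE_DEVICES=all no longer see every GPU.
      - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
        value: "false"
      # Only expose devices that the device plugin actually allocated to the pod.
      - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS
        value: "true"
  devicePlugin:
    env:
      - name: DEVICE_LIST_STRATEGY
        value: volume-mounts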

@joachimweyl
Contributor

@dystewart would you be the one to implement the fix Gagan found?

@dystewart

@joachimweyl Yes! I'll test this fix.
