
Requesting "None" GPU Accelerator on NERC OpenShift AI (RHOAI) allocates all 4 GPUs #685

Closed

Milstein opened this issue Aug 16, 2024 · 11 comments · Fixed by OCP-on-NERC/nerc-ocp-config#566

@Milstein

Milstein commented Aug 16, 2024

Motivation

Using RHOAI, users can attach GPU resources to their workbench Jupyter Lab setup. There is an option to request either a GPU, based on the allocation quota specified by the "OpenShift Request on GPU Quota" attribute of their ColdFront allocation, or None for their workload, as shown here:

[screenshot: GPU accelerator selection in the workbench settings]

But when None is selected, the user can still use all available GPUs without being billed for that usage:

[screenshot: all GPUs exposed to the workbench]

Completion Criteria

When the GPU accelerator option is set to "None", the workbench pod should have:

[screenshot]

  limits:
    ...
    nvidia.com/gpu: "0"
  requests:
    ...
    nvidia.com/gpu: "0"

With a resource request of nvidia.com/gpu: 0, this environment variable should be set automatically.

Also, when setting the GPU Accelerator to a specific count (Number of accelerators), ensure it aligns with the currently available GPU quota for the user namespace. It's possible that some GPUs are already in use by other workloads.

[screenshot: "Number of accelerators" setting]
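For reference, a per-namespace GPU cap of this kind is normally expressed in Kubernetes as a ResourceQuota on the extended GPU resource. A minimal sketch, where the namespace name and quota value are illustrative rather than the actual NERC configuration:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota               # illustrative name
  namespace: user-namespace     # illustrative namespace
spec:
  hard:
    requests.nvidia.com/gpu: "4"  # example cap; would mirror the ColdFront allocation

Note that a quota like this only takes effect for pods that actually carry a nvidia.com/gpu request; as discussed below, pods with no GPU request at all are not counted against it.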

Description

  • First step to resolve the issue

Workaround Ref: NVIDIA/k8s-device-plugin#61
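For context, the workaround discussed in that NVIDIA issue amounts to explicitly hiding GPUs from containers that do not request them by overriding NVIDIA_VISIBLE_DEVICES on the container. A rough sketch of the relevant part of a notebook pod spec (the container name is illustrative; see the linked issue for the authoritative details):

spec:
  containers:
    - name: notebook                  # illustrative container name
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: "void"               # "void" (or an empty value) hides all GPUs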

Completion dates

Desired - 2024-08-21
Required - 2024-09-11

@naved001

But when None is selected, the user can still use all available GPUs without being billed for that usage:

I looked at the pod definition, and there were no requests/limits for a GPU device at all. This means it will not be restricted by the ResourceQuota, nor can it be billed for.

After some looking around, I came across the following in the NVIDIA k8s-device-plugin README, and according to that this may be the expected behavior.

WARNING: if you don't request GPUs when using the device plugin with NVIDIA images all the GPUs on the machine will be exposed inside your container.

A workaround would be to not schedule pods that don't request GPUs on nodes with GPUs. I am not sure how far we got with that.
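For reference, one common way to achieve that is to taint the GPU nodes so that only pods carrying a matching toleration, typically those that actually request GPUs, are scheduled onto them. A rough sketch with an illustrative taint key:

# Taint on each GPU node (illustrative key/value):
spec:
  taints:
    - key: nvidia.com/gpu
      value: "true"
      effect: NoSchedule

# Pods that do request GPUs would then need a matching toleration:
tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule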

@knikolla

Also, when setting the GPU Accelerator to a specific count (Number of accelerators), ensure it aligns with the currently available GPU quota for the user namespace. It's possible that some GPUs are already in use by other workloads.

If all notebooks for all users continue to be put into the rhods-notebooks namespace, there's no way to set limits based only on the user's allocation quota.

@joachimweyl
Contributor

@naved001 to confirm, are you referring to this issue?

@naved001

@joachimweyl ah, yes.

@msdisme

msdisme commented Aug 19, 2024

Is there an issue to check all existing projects to make sure they are set to 0?

Milstein assigned Milstein and unassigned Milstein Aug 21, 2024
@gagansk

gagansk commented Aug 21, 2024

@gagansk is checking with the RHOAI dev team.

@joachimweyl
Contributor

@gagansk any update from RHOAI?

@gagansk

gagansk commented Sep 13, 2024

@dystewart raised the issue with the RHOAI team on 08/28/2024. I am following up with the RHOAI IDE PM (Kezia Cook).

msdisme closed this as completed Sep 13, 2024
msdisme reopened this Sep 13, 2024
@gagansk

gagansk commented Sep 30, 2024

@Milstein Some folks running an internal AI cluster at Red Hat have run into this problem. I am relaying the messages I received here for your information.

accelerator profiles don't do anything WRT admission. If your GPU nodes are untainted, the scheduler can deem them as "free real estate" and then workloads can consume against them even without requesting quota.

Here's the fix: NVIDIA/gpu-operator#421 (comment)

The change can be made in the following object, under .spec.toolkit.env:

apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy

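Based on the linked gpu-operator comment and the NVIDIA GPU Operator documentation on restricting GPU visibility, the change is roughly of the following shape; treat this as a sketch and take the exact environment variables and values from the linked comment:

apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  toolkit:
    env:
      # Ignore NVIDIA_VISIBLE_DEVICES coming from unprivileged containers,
      # so images that bake in NVIDIA_VISIBLE_DEVICES=all no longer see every GPU.
      - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
        value: "false"
      # Only expose devices that the device plugin actually allocated to the pod.
      - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS
        value: "true"
  devicePlugin:
    env:
      - name: DEVICE_LIST_STRATEGY
        value: volume-mounts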

@joachimweyl
Contributor

@dystewart would you be the one to implement the fix Gagan found?

@dystewart

@joachimweyl Yes! I'll test this fix.
