Support Kubeflow Jobs type for resource quota reclaim #1503

Closed
2 of 3 tasks
panpan0000 opened this issue Dec 21, 2023 · 10 comments
Labels
kind/support Categorizes issue or PR as a support question.

Comments

@panpan0000 (Contributor) commented Dec 21, 2023:

What would you like to be added:

Reclaim the resource quota when Kubeflow jobs (PyTorchJob, TFJob, etc.) complete.
Currently, when the Job CR still exists after its pods have completed, the queue's flavorUsage remains occupied and blocks incoming pods/jobs.

#(1) show queue usage before the job runs
kubectl  get localqueue -o yaml
    flavorUsage:
    - name: default-flavor
      resources:
      - name: cpu
        total: "1"
      - name: memory
        total: 2Gi
      - name: nvidia.com/gpu
        total: "1"

#(2) run a PyTorchJob with 3 pods that each sleep for 1s, requesting 4 CPU, 2Gi memory, and 1 GPU per pod
 kubectl  get po  -w
NAME                 READY   STATUS      RESTARTS   AGE
job-sleep-master-0   0/1     Completed   0          14s
job-sleep-worker-0   0/1     Completed   0          16s
job-sleep-worker-1   0/1     Completed   0          15s

 kubectl  get pytorchjob
NAME        STATE       AGE
job-sleep   Succeeded   4m32s


#(3) show queue usage again after the job completed
kubectl  get localqueue -o yaml
    flavorUsage:
    - name: default-flavor
      resources:
      - name: cpu
        total: "13"
      - name: memory
        total: 8Gi
      - name: nvidia.com/gpu
        total: "4"
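
For anyone hitting a similar symptom, a rough way to see what is still holding the quota is to inspect the Workload objects Kueue creates for each admitted job. This is only a sketch; the workload name below is a placeholder for whatever name Kueue generated for your job.

# list the Workloads that back the quota accounting
kubectl get workloads -n default

# a finished job's Workload should eventually carry a Finished condition;
# while the quota is still counted, it only shows Admitted
kubectl get workload <generated-workload-name> -n default -o jsonpath='{.status.conditions}'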

Why is this needed:

It makes no sense to delete the job right after it finishes; we may still need the PyTorchJob CR for other purposes.
However, kueue should reclaim the resources and reduce the flavorUsage once the job has completed.

Relevant issue: #1149

Completion requirements:

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.

panpan0000 added the kind/feature (Categorizes issue or PR as related to a new feature) label on Dec 21, 2023.
@kerthcet (Contributor) commented:
cc @kerthcet @B1F030

@anishasthana (Contributor) commented:
@panpan0000 Do you have a sample job CR I could use for testing? I'd be happy to take a stab at this.

@panpan0000 (Contributor, Author) commented Dec 22, 2023:

It's very easy to reproduce the issue, but it requires the training-operator from Kubeflow to be installed:

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  annotations:
  labels:
    kueue.x-k8s.io/queue-name: $YOUR_LOCAL_QUEUE_NAME_HERE
  name: echo-job
  namespace: default
spec:
  pytorchReplicaSpecs:
    Worker:
      replicas: 3
      template:
        metadata: {}
        spec:
          containers:
          - command: ["sleep", "1"]
            image: python:3.12.0
            name: pytorch
            resources:
              limits:
                cpu: "1"
                memory: 1Gi
                nvidia.com/gpu: 1
              requests:
                cpu: "1"
                memory: 1Gi
                nvidia.com/gpu: 1
  runPolicy:
    suspend: false
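
As a usage note (a sketch, not part of the original comment): assuming the manifest above is saved as pytorchjob.yaml and the queue-name placeholder is replaced with a real LocalQueue, the quota behavior can be observed like this:

# apply the sample and watch the Workload that Kueue creates for it
kubectl apply -f pytorchjob.yaml
kubectl get workloads -n default -w

# after the job succeeds, the LocalQueue usage should drop back to its previous values
kubectl get localqueue $YOUR_LOCAL_QUEUE_NAME_HERE -n default -o jsonpath='{.status.flavorUsage}'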

@tenzen-y (Member) commented:
@panpan0000 I could not reproduce this issue. How did you install kueue and the training-operator? Also, which versions of kueue, training-operator, and Kubernetes are you using?

@panpan0000 (Contributor, Author) commented:
kueue 0.5.1
training-operator: v1.7.0
k8s: v1.27.5

@tenzen-y (Member) commented:

Also, how did you install those components? Could you provide reproduction steps?

@panpan0000 (Contributor, Author) commented:
It's weird... I just tried again and the issue is gone.
Maybe I was using kueue 0.5.0 at that time.
Sorry for the confusion; I will reopen this if the issue happens again.

Sorry again @tenzen-y

@tenzen-y (Member) commented:

No problem :)

/remove-kind feature
/kind support

k8s-ci-robot added the kind/support (Categorizes issue or PR as a support question) label and removed the kind/feature label on Dec 27, 2023.
@B1F030 (Member) commented Dec 29, 2023:

I tried with Kubernetes v1.27.3.
To install the Kubeflow training-operator:
kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone"
Under both kueue v0.5.0 and v0.5.1 I failed to reproduce it; the resource quota reclaim works well.
Maybe this situation was caused by some other configuration, so I'm just going to record the process here.
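
For completeness, a sketch of a full setup for this kind of reproduction attempt (the kueue manifest URL below is the standard release artifact; adjust the version as needed):

# install kueue v0.5.1 from its release manifests
kubectl apply --server-side -f https://github.com/kubernetes-sigs/kueue/releases/download/v0.5.1/manifests.yaml

# install the Kubeflow training-operator (standalone overlay)
kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone"

# then create a ResourceFlavor, ClusterQueue, and LocalQueue, and submit the PyTorchJob sample above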

@tenzen-y (Member) commented:

It's a great recording :) Thanks.
