Support Kubeflow Jobs type for resource quota reclaim #1503

Closed
2 of 3 tasks
panpan0000 opened this issue Dec 21, 2023 · 10 comments
Labels
kind/support Categorizes issue or PR as a support question.

Comments

@panpan0000 (Contributor) commented Dec 21, 2023:

What would you like to be added:

Reclaim the resource quota when Kubeflow jobs (PyTorchJob, TFJob, etc.) complete.
Currently, when the Job CR still exists after its pods have completed, the queue's flavorUsage remains occupied and blocks incoming pods/jobs.

#(1) show queue usage before the job runs
kubectl  get localqueue -o yaml
    flavorUsage:
    - name: default-flavor
      resources:
      - name: cpu
        total: "1"
      - name: memory
        total: 2Gi
      - name: nvidia.com/gpu
        total: "1"

#(2) run a PyTorchJob with 3 pods that each sleep for 1s, requesting 4 CPU, 2Gi memory, and 1 GPU per pod
 kubectl  get po  -w
NAME                 READY   STATUS      RESTARTS   AGE
job-sleep-master-0   0/1     Completed   0          14s
job-sleep-worker-0   0/1     Completed   0          16s
job-sleep-worker-1   0/1     Completed   0          15s

 kubectl  get pytorchjob
NAME        STATE       AGE
job-sleep   Succeeded   4m32s


#(3) show queue usage again after the job completed
kubectl  get localqueue -o yaml
    flavorUsage:
    - name: default-flavor
      resources:
      - name: cpu
        total: "13"
      - name: memory
        total: 8Gi
      - name: nvidia.com/gpu
        total: "4"
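
For anyone hitting a similar symptom, a rough way to see what is still holding the quota is to inspect the Workload objects Kueue creates for each admitted job. This is only a sketch; the workload name below is a placeholder for whatever name Kueue generated for your job.

# list the Workloads that back the quota accounting
kubectl get workloads -n default

# a finished job's Workload should eventually carry a Finished condition;
# while the quota is still counted, it only shows Admitted
kubectl get workload <generated-workload-name> -n default -o jsonpath='{.status.conditions}'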

Why is this needed:

It makes no sense to delete the job right after it finishes; we may still need the PyTorchJob CR for other purposes.
However, kueue should reclaim the resources and reduce the flavorUsage once the job has completed.

Relevant issue: #1149

Completion requirements:

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.

panpan0000 added the kind/feature (Categorizes issue or PR as related to a new feature) label on Dec 21, 2023.
@kerthcet (Contributor) commented:
cc @kerthcet @B1F030

@anishasthana (Contributor) commented:
@panpan0000 Do you have a sample job CR I could use for testing? I'd be happy to take a stab at this.

@panpan0000 (Contributor, Author) commented Dec 22, 2023:

It's very easy to reproduce the issue, but it requires the training-operator from Kubeflow to be installed:

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  annotations:
  labels:
    kueue.x-k8s.io/queue-name: $YOUR_LOCAL_QUEUE_NAME_HERE
  name: echo-job
  namespace: default
spec:
  pytorchReplicaSpecs:
    Worker:
      replicas: 3
      template:
        metadata: {}
        spec:
          containers:
          - command: ["sleep", "1"]
            image: python:3.12.0
            name: pytorch
            resources:
              limits:
                cpu: "1"
                memory: 1Gi
                nvidia.com/gpu: 1
              requests:
                cpu: "1"
                memory: 1Gi
                nvidia.com/gpu: 1
  runPolicy:
    suspend: false
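
As a usage note (a sketch, not part of the original comment): assuming the manifest above is saved as pytorchjob.yaml and the queue-name placeholder is replaced with a real LocalQueue, the quota behavior can be observed like this:

# apply the sample and watch the Workload that Kueue creates for it
kubectl apply -f pytorchjob.yaml
kubectl get workloads -n default -w

# after the job succeeds, the LocalQueue usage should drop back to its previous values
kubectl get localqueue $YOUR_LOCAL_QUEUE_NAME_HERE -n default -o jsonpath='{.status.flavorUsage}'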

@tenzen-y (Member) commented:
@panpan0000 I could not reproduce this issue. How did you install kueue and the training-operator? Also, which versions of kueue, training-operator, and Kubernetes are you using?

@panpan0000 (Contributor, Author) commented:
kueue 0.5.1
training-operator: v1.7.0
k8s: v1.27.5

@tenzen-y (Member) commented:

Also, how did you install those components? Could you provide reproduction steps?

@panpan0000 (Contributor, Author) commented:
It's weird... I just tried again and the issue is gone.
Maybe I was using kueue 0.5.0 at that time.
Sorry for the confusion; I will reopen this if the issue happens again.

Sorry again @tenzen-y

@tenzen-y (Member) commented:

No problem :)

/remove-kind feature
/kind support

k8s-ci-robot added the kind/support (Categorizes issue or PR as a support question) label and removed the kind/feature label on Dec 27, 2023.
@B1F030 (Member) commented Dec 29, 2023:

I tried with Kubernetes v1.27.3.
To install the Kubeflow training-operator:
kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone"
Under both kueue v0.5.0 and v0.5.1 I failed to reproduce it; the resource quota reclaim works well.
Maybe this situation was caused by some other configuration, so I'm just going to record the process here.
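
For completeness, a sketch of a full setup for this kind of reproduction attempt (the kueue manifest URL below is the standard release artifact; adjust the version as needed):

# install kueue v0.5.1 from its release manifests
kubectl apply --server-side -f https://github.com/kubernetes-sigs/kueue/releases/download/v0.5.1/manifests.yaml

# install the Kubeflow training-operator (standalone overlay)
kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone"

# then create a ResourceFlavor, ClusterQueue, and LocalQueue, and submit the PyTorchJob sample above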

@tenzen-y (Member) commented:

It's a great recording :) Thanks.
