When preempting, the released resources cannot meet the scheduling needs #1309

Closed
hanlaipeng opened this issue Nov 6, 2023 · 11 comments
Labels: kind/bug, triage/needs-information

@hanlaipeng

What happened:
I created two ClusterQueues and set reclaimWithinCohort = Any. When preemption happens, the released resources cannot meet the scheduling needs.
For example, when I submit one task of 1 pod requesting 2 GPUs, I want the task to get 2 GPUs on the same machine. However, the resources released after preemption are 0.5 GPUs on one machine and 1.5 GPUs on the other.

(Screenshot attached: 2023-11-06 21:34:31)

What you expected to happen:

I want this task to get 2 GPUs on the same machine.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
  • Kueue version (use git describe --tags --dirty --always): v0.4.1
  • Cloud provider or hardware configuration:
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
hanlaipeng added the kind/bug label on Nov 6, 2023
@kerthcet (Contributor) commented Nov 7, 2023

cc @B1F030 can you help take a look?

@B1F030 (Member) commented Nov 7, 2023

Could you please provide the YAML of the ResourceFlavor, ClusterQueue, and Job?
/triage needs-information
I guess you need to set up an only-a clusterqueue.
For example, there are two clusterqueues, gpu-a-cq and gpu-b-cq; they are in the same cohort so that they can share resources with each other.
When preemption happens, everything goes right. But when you want to create a workload that can only run on gpu-a-cq, you will need to create another clusterqueue, say only-a-cq, that does not belong to any cohort.
If you submit the workload to the only-a-queue, the workload will only use gpu-a resources. Is that what you want?
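
For illustration, a minimal sketch of what such a cohort-less clusterqueue could look like; the name only-a-cq follows the suggestion above, and the flavor name and quota values are only placeholders, not taken from this thread:

apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "only-a-cq" # illustrative name
spec:
  # no cohort field, so this queue neither borrows from nor lends to other queues
  namespaceSelector: {} # match all.
  queueingStrategy: StrictFIFO
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: "gpu-hlp" # placeholder flavor name
      resources:
      - name: "cpu"
        nominalQuota: 57
      - name: "memory"
        nominalQuota: 209Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 8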

k8s-ci-robot added the triage/needs-information label on Nov 7, 2023
@hanlaipeng (Author)

Could you please provide the YAML of the ResourceFlavor, ClusterQueue, and Job? […]

gpu-a-cq clusterqueue:

spec:
  cohort: gpu-hlp
  namespaceSelector: {}
  preemption:
    reclaimWithinCohort: Any
    withinClusterQueue: Never
  queueingStrategy: StrictFIFO
  resourceGroups:
  - coveredResources:
    - cpu
    - memory
    - nvidia.com/gpu
    flavors:
    - name: gpu-hlp
      resources:
      - borrowingLimit: "0"
        name: cpu
        nominalQuota: "57"
      - borrowingLimit: "0"
        name: memory
        nominalQuota: 209Gi
      - borrowingLimit: "0"
        name: nvidia.com/gpu
        nominalQuota: "8"

gpu-b-cq clusterqueue:

spec:
  cohort: gpu-hlp
  namespaceSelector: {}
  preemption:
    reclaimWithinCohort: Never
    withinClusterQueue: Never
  queueingStrategy: StrictFIFO
  resourceGroups:
  - coveredResources:
    - cpu
    - memory
    - nvidia.com/gpu
    flavors:
    - name: gpu-hlp
      resources:
      - borrowingLimit: "57"
        name: cpu
        nominalQuota: "0"
      - borrowingLimit: 209Gi
        name: memory
        nominalQuota: 0
      - borrowingLimit: "8"
        name: nvidia.com/gpu
        nominalQuota: "0"

resourceflavor:

apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  creationTimestamp: "2023-08-30T09:17:46Z"
  finalizers:
  - kueue.x-k8s.io/resource-in-use
  generation: 1
  name: gpu-hlp
  resourceVersion: "3121760923"
  selfLink: /apis/kueue.x-k8s.io/v1beta1/resourceflavors/gpu-hlp
  uid: e48e90c9-5fdd-41d6-9f77-11cea43b5125
spec: {}

The resource pool has two 4-GPU nodes. I submit a 1-pod job requesting 2 GPUs to gpu-a-cq; however, the resources released after preemption are 0.5 GPUs on one machine and 1.5 GPUs on the other, which cannot meet the job's scheduling needs.

@B1F030 (Member) commented Nov 7, 2023

Could you also paste the Job YAML?

@hanlaipeng (Author)

Could you also paste the Job YAML?

OK, it is just a sample pod YAML; the resources are:

resources:
  limits:
    cpu: 2
    memory: 5Gi
    nvidia.com/gpu: 2
  requests:
    cpu: 2
    memory: 5Gi
    nvidia.com/gpu: 2

Kueue's preemption strategy is to preempt workloads with lower priorities first; among workloads with the same priority, those that started more recently are preempted first.

@B1F030 (Member) commented Nov 7, 2023

Since your ResourceFlavor has a nil spec, I recommend using default-flavor.

Would you like to try the YAMLs below and see if the problem happens again?
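
For reference, a minimal default-flavor could be an empty ResourceFlavor, as in the Kueue documentation; this is only a sketch, in case the flavor does not exist in the cluster yet:

default-flavor.yaml

apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: "default-flavor"

kubectl create -f default-flavor.yaml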

gpu-a-cq.yaml

apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "gpu-a-cq"
spec:
  namespaceSelector: {} # match all.
  cohort: "gpu-ab"
  queueingStrategy: StrictFIFO
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: "default-flavor"
      resources:
      - name: "cpu"
        nominalQuota: 57
        borrowingLimit: 57
      - name: "memory"
        nominalQuota: 209Gi
        borrowingLimit: 209Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 8
        borrowingLimit: 8
  preemption:
    reclaimWithinCohort: Any
    withinClusterQueue: LowerPriority

kubectl create -f gpu-a-cq.yaml

gpu-b-cq.yaml

apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "gpu-b-cq"
spec:
  namespaceSelector: {} # match all.
  cohort: "gpu-ab"
  queueingStrategy: StrictFIFO
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: "default-flavor"
      resources:
      - name: "cpu"
        nominalQuota: 57
        borrowingLimit: 57
      - name: "memory"
        nominalQuota: 209Gi
        borrowingLimit: 209Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 8
        borrowingLimit: 8
  preemption:
    reclaimWithinCohort: Any
    withinClusterQueue: LowerPriority

kubectl create -f gpu-b-cq.yaml

localqueue-a.yaml

apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: gpu-a-queue
spec:
  clusterQueue: gpu-a-cq

kubectl create -f localqueue-a.yaml

localqueue-b.yaml

apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: gpu-b-queue
spec:
  clusterQueue: gpu-b-cq

kubectl create -f localqueue-b.yaml

sample-job.yaml

apiVersion: batch/v1
kind: Job
metadata:
  generateName: sample-job-
  labels:
    kueue.x-k8s.io/queue-name: gpu-a-queue
spec:
  suspend: true
  template:
    spec:
      containers:
      - name: dummy-job
        image: gcr.io/k8s-staging-perf-tests/sleep:latest
        imagePullPolicy: IfNotPresent
        args: ["60s"]
        resources:
          limits:
            cpu: 2
            memory: "5Gi"
            nvidia.com/gpu: 2
          requests:
            cpu: 2
            memory: "5Gi"
            nvidia.com/gpu: 2
      restartPolicy: Never

kubectl create -f sample-job.yaml

@hanlaipeng (Author) commented Jan 8, 2024

I have worked around this problem by adding node resource scheduling in our internal environment. Can this problem be solved here as well? Gratefully, @kerthcet

@kerthcet (Contributor) commented Jan 8, 2024

Does the node resource scheduling mean the NodeResourcesFit plugin in kube-scheduler?

@hanlaipeng (Author)

Does the node resource scheduling mean the NodeResourcesFit plugin in kube-scheduler?

Yes, we can add this plugin to solve the problem of the released resources not meeting the scheduling needs after preemption.
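
For readers who want to try the same approach, a generic sketch of tuning the NodeResourcesFit plugin through a KubeSchedulerConfiguration is shown below; the MostAllocated scoring strategy and the resource weights are illustrative choices, not the exact internal setup described in this thread:

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  pluginConfig:
  - name: NodeResourcesFit
    args:
      scoringStrategy:
        # prefer packing GPU pods onto already-used nodes instead of leaving fragments
        type: MostAllocated
        resources:
        - name: nvidia.com/gpu
          weight: 5
        - name: cpu
          weight: 1
        - name: memory
          weight: 1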

@kerthcet (Contributor) commented Jan 9, 2024

Generally, Kueue cannot solve this problem, since Kueue and kube-scheduler are two different components and are not aware of each other, so let's close this for now. Thanks for your feedback.

And waitForPodsReady in Kueue may help you here as well: the Job will be requeued after a period of not being ready.
/close
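
A minimal sketch of enabling waitForPodsReady in the Kueue manager Configuration (the timeout value is illustrative, and the available fields may differ between Kueue versions):

apiVersion: config.kueue.x-k8s.io/v1beta1
kind: Configuration
waitForPodsReady:
  enable: true
  # requeue the workload if its pods are not all ready within this period (illustrative value)
  timeout: 5m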

@k8s-ci-robot (Contributor)

@kerthcet: Closing this issue.

In response to this:

Generally, Kueue cannot solve this problem, since Kueue and kube-scheduler are two different components and are not aware of each other, so let's close this for now. Thanks for your feedback.

And waitForPodsReady in Kueue may help you here as well: the Job will be requeued after a period of not being ready.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
