When preempting, the released resources cannot meet the scheduling needs #1309

Closed
hanlaipeng opened this issue Nov 6, 2023 · 11 comments
Labels: kind/bug, triage/needs-information

@hanlaipeng

What happened:
I created two ClusterQueues and set reclaimWithinCohort = Any. When preemption happens, the released resources cannot meet the scheduling needs.
For example, when I submit one task of 1 pod requesting 2 GPUs, I want the task to get 2 GPUs on the same machine. However, the resources released after preemption are 0.5 GPUs on one machine and 1.5 GPUs on the other.

(Screenshot attached: 2023-11-06 21:34:31)

What you expected to happen:

I want this task to get 2 GPUs on the same machine.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
  • Kueue version (use git describe --tags --dirty --always): v0.4.1
  • Cloud provider or hardware configuration:
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
hanlaipeng added the kind/bug label on Nov 6, 2023
@kerthcet (Contributor) commented Nov 7, 2023

cc @B1F030 can you help take a look?

@B1F030 (Member) commented Nov 7, 2023

Could you please provide the YAML of the ResourceFlavor, ClusterQueue, and Job?
/triage needs-information
I guess you need to set up an only-a clusterqueue.
For example, there are two clusterqueues, gpu-a-cq and gpu-b-cq; they are in the same cohort so that they can share resources with each other.
When preemption happens, everything goes right. But when you want to create a workload that can only run on gpu-a-cq, you will need to create another clusterqueue, say only-a-cq, that does not belong to any cohort.
If you submit the workload to the only-a-queue, the workload will only use gpu-a resources. Is that what you want?
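
For illustration, a minimal sketch of what such a cohort-less clusterqueue could look like; the name only-a-cq follows the suggestion above, and the flavor name and quota values are only placeholders, not taken from this thread:

apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "only-a-cq" # illustrative name
spec:
  # no cohort field, so this queue neither borrows from nor lends to other queues
  namespaceSelector: {} # match all.
  queueingStrategy: StrictFIFO
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: "gpu-hlp" # placeholder flavor name
      resources:
      - name: "cpu"
        nominalQuota: 57
      - name: "memory"
        nominalQuota: 209Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 8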

k8s-ci-robot added the triage/needs-information label on Nov 7, 2023
@hanlaipeng (Author)

Could you please provide the YAML of the ResourceFlavor, ClusterQueue, and Job? […]

gpu-a-cq clusterqueue:

spec:
  cohort: gpu-hlp
  namespaceSelector: {}
  preemption:
    reclaimWithinCohort: Any
    withinClusterQueue: Never
  queueingStrategy: StrictFIFO
  resourceGroups:
  - coveredResources:
    - cpu
    - memory
    - nvidia.com/gpu
    flavors:
    - name: gpu-hlp
      resources:
      - borrowingLimit: "0"
        name: cpu
        nominalQuota: "57"
      - borrowingLimit: "0"
        name: memory
        nominalQuota: 209Gi
      - borrowingLimit: "0"
        name: nvidia.com/gpu
        nominalQuota: "8"

gpu-b-cq clusterqueue:

spec:
  cohort: gpu-hlp
  namespaceSelector: {}
  preemption:
    reclaimWithinCohort: Never
    withinClusterQueue: Never
  queueingStrategy: StrictFIFO
  resourceGroups:
  - coveredResources:
    - cpu
    - memory
    - nvidia.com/gpu
    flavors:
    - name: gpu-hlp
      resources:
      - borrowingLimit: "57"
        name: cpu
        nominalQuota: "0"
      - borrowingLimit: 209Gi
        name: memory
        nominalQuota: 0
      - borrowingLimit: "8"
        name: nvidia.com/gpu
        nominalQuota: "0"

resourceflavor:

apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  creationTimestamp: "2023-08-30T09:17:46Z"
  finalizers:
  - kueue.x-k8s.io/resource-in-use
  generation: 1
  name: gpu-hlp
  resourceVersion: "3121760923"
  selfLink: /apis/kueue.x-k8s.io/v1beta1/resourceflavors/gpu-hlp
  uid: e48e90c9-5fdd-41d6-9f77-11cea43b5125
spec: {}

The resource pool has two 4-GPU nodes. I submit a 1-pod job requesting 2 GPUs to gpu-a-cq; however, the resources released after preemption are 0.5 GPUs on one machine and 1.5 GPUs on the other, which cannot meet the job's scheduling needs.

@B1F030 (Member) commented Nov 7, 2023

Could you also paste the Job YAML?

@hanlaipeng (Author)

Could you also paste the Job YAML?

OK, it is just a sample pod YAML; the resources are:

resources:
  limits:
    cpu: 2
    memory: 5Gi
    nvidia.com/gpu: 2
  requests:
    cpu: 2
    memory: 5Gi
    nvidia.com/gpu: 2

Kueue's preemption strategy is to preempt workloads with lower priorities first; among workloads with the same priority, those that started more recently are preempted first.

@B1F030 (Member) commented Nov 7, 2023

Since your ResourceFlavor has a nil spec, I recommend using default-flavor.

Would you like to try the YAMLs below and see if the problem happens again?
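
For reference, a minimal default-flavor could be an empty ResourceFlavor, as in the Kueue documentation; this is only a sketch, in case the flavor does not exist in the cluster yet:

default-flavor.yaml

apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: "default-flavor"

kubectl create -f default-flavor.yaml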

gpu-a-cq.yaml

apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "gpu-a-cq"
spec:
  namespaceSelector: {} # match all.
  cohort: "gpu-ab"
  queueingStrategy: StrictFIFO
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: "default-flavor"
      resources:
      - name: "cpu"
        nominalQuota: 57
        borrowingLimit: 57
      - name: "memory"
        nominalQuota: 209Gi
        borrowingLimit: 209Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 8
        borrowingLimit: 8
  preemption:
    reclaimWithinCohort: Any
    withinClusterQueue: LowerPriority

kubectl create -f gpu-a-cq.yaml

gpu-b-cq.yaml

apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "gpu-b-cq"
spec:
  namespaceSelector: {} # match all.
  cohort: "gpu-ab"
  queueingStrategy: StrictFIFO
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: "default-flavor"
      resources:
      - name: "cpu"
        nominalQuota: 57
        borrowingLimit: 57
      - name: "memory"
        nominalQuota: 209Gi
        borrowingLimit: 209Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 8
        borrowingLimit: 8
  preemption:
    reclaimWithinCohort: Any
    withinClusterQueue: LowerPriority

kubectl create -f gpu-b-cq.yaml

localqueue-a.yaml

apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: gpu-a-queue
spec:
  clusterQueue: gpu-a-cq

kubectl create -f localqueue-a.yaml

localqueue-b.yaml

apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: gpu-b-queue
spec:
  clusterQueue: gpu-b-cq

kubectl create -f localqueue-b.yaml

sample-job.yaml

apiVersion: batch/v1
kind: Job
metadata:
  generateName: sample-job-
  labels:
    kueue.x-k8s.io/queue-name: gpu-a-queue
spec:
  suspend: true
  template:
    spec:
      containers:
      - name: dummy-job
        image: gcr.io/k8s-staging-perf-tests/sleep:latest
        imagePullPolicy: IfNotPresent
        args: ["60s"]
        resources:
          limits:
            cpu: 2
            memory: "5Gi"
            nvidia.com/gpu: 2
          requests:
            cpu: 2
            memory: "5Gi"
            nvidia.com/gpu: 2
      restartPolicy: Never

kubectl create -f sample-job.yaml

@hanlaipeng (Author) commented Jan 8, 2024

I have worked around this problem by adding node resource scheduling in our internal environment. Can this problem be solved here as well? Gratefully, @kerthcet

@kerthcet (Contributor) commented Jan 8, 2024

Does the node resource scheduling mean the NodeResourcesFit plugin in kube-scheduler?

@hanlaipeng (Author)

Does the node resource scheduling mean the NodeResourcesFit plugin in kube-scheduler?

Yes, we can add this plugin to solve the problem of the released resources not meeting the scheduling needs after preemption.
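
For readers who want to try the same approach, a generic sketch of tuning the NodeResourcesFit plugin through a KubeSchedulerConfiguration is shown below; the MostAllocated scoring strategy and the resource weights are illustrative choices, not the exact internal setup described in this thread:

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  pluginConfig:
  - name: NodeResourcesFit
    args:
      scoringStrategy:
        # prefer packing GPU pods onto already-used nodes instead of leaving fragments
        type: MostAllocated
        resources:
        - name: nvidia.com/gpu
          weight: 5
        - name: cpu
          weight: 1
        - name: memory
          weight: 1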

@kerthcet (Contributor) commented Jan 9, 2024

Generally, Kueue cannot solve this problem, since Kueue and kube-scheduler are two different components and are not aware of each other, so let's close this for now. Thanks for your feedback.

And waitForPodsReady in Kueue may help you here as well: the Job will be requeued after a period of not being ready.
/close
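
A minimal sketch of enabling waitForPodsReady in the Kueue manager Configuration (the timeout value is illustrative, and the available fields may differ between Kueue versions):

apiVersion: config.kueue.x-k8s.io/v1beta1
kind: Configuration
waitForPodsReady:
  enable: true
  # requeue the workload if its pods are not all ready within this period (illustrative value)
  timeout: 5m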

@k8s-ci-robot (Contributor)

@kerthcet: Closing this issue.

In response to this:

Generally, Kueue cannot solve this problem, since Kueue and kube-scheduler are two different components and are not aware of each other, so let's close this for now. Thanks for your feedback.

And waitForPodsReady in Kueue may help you here as well: the Job will be requeued after a period of not being ready.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
