
[preemption/reclaim] preemption/reclaim does not work properly when there is a gang job. #446

Closed
runqch opened this issue Oct 17, 2018 · 3 comments

runqch commented Oct 17, 2018

Is this a BUG REPORT or FEATURE REQUEST?:

Uncomment only one, leave it on its own line:

/kind bug

What happened:
Preemption/reclaim does not work properly.

What you expected to happen:

  1. A job should not preempt other jobs if it still cannot run after the preemption.
  2. After resources are released, they should be usable by other jobs.

How to reproduce it (as minimally and precisely as possible):
ENV: 60 cores

  1. submit 1st job to occupy 60 cores, with minMember=1
  2. submit 2nd job, requiring 60 cores, with minMember=60 (gang job)
  3. continue to submit 3rd job, requiring 60 cores, with minMember=1
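
(The 60-core figure can be double-checked up front with plain kubectl; this is only a sanity check and nothing in it is kube-batch-specific:)

 kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.cpu}{"\n"}{end}'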

1st job:

 runqch@ib22b10-534: cat job1.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: qj-1
spec:
  backoffLimit: 60
  completions: 60
  parallelism: 60
  template:
    metadata:
      annotations:
        scheduling.k8s.io/group-name: qj-1
    spec:
      containers:
      - image: busybox
        imagePullPolicy: IfNotPresent
        name: busybox
        command:
           - sleep
           - "300"
        resources:
          requests:
            cpu: "1"
      restartPolicy: Never
      schedulerName: kube-batch
---
apiVersion: scheduling.incubator.k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: qj-1
spec:
  minMember: 1   

 runqch@ib22b10-547: kubectl create -f ./job1.yaml
job.batch/qj-1 created
podgroup.scheduling.incubator.k8s.io/qj-1 created

 runqch@ib22b10-557: kubectl get pods | grep qj-1 | wc -l
60
 runqch@ib22b10-558: kubectl get pods | grep qj-1 | grep Running | wc -l   <== all 60 pods running
60

2nd job:

 runqch@ib22b10-559: cat job2.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: qj-2
spec:
  backoffLimit: 60
  completions: 60
  parallelism: 60
  template:
    metadata:
      annotations:
        scheduling.k8s.io/group-name: qj-2
    spec:
      containers:
      - image: busybox
        imagePullPolicy: IfNotPresent
        name: busybox
        command:
           - sleep
           - "2000"
        resources:
          requests:
            cpu: "1"
      restartPolicy: Never
      schedulerName: kube-batch
---
apiVersion: scheduling.incubator.k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: qj-2
spec:
  minMember: 60

 runqch@ib22b10-560: kubectl create -f ./job2.yaml
job.batch/qj-2 created
podgroup.scheduling.incubator.k8s.io/qj-2 created


 runqch@ib22b10-564: kubectl get pods | grep qj-2 | grep Running | wc -l    
0
 runqch@ib22b10-565: kubectl get pods | grep qj-1 | grep Running | wc -l  
30
 runqch@ib22b10-563: kubectl get pods | grep qj-2 | wc -l   
60

===>>> From the above, we can see that job1 was preempted by job2: 30 cores were freed from job1, but job2 still cannot run because of its minMember restriction. The expected behavior is that a job should not preempt another job if it still cannot run after the preemption.
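
What I would expect is a gang-aware feasibility check before any victim is evicted: only preempt if the freed resources would actually let the preemptor's PodGroup reach its minMember. A minimal sketch of that idea is below; all type and function names are made up for illustration and are not kube-batch's real API, and the "30 preemptible cores" figure is simply what this repro produced.

package main

import "fmt"

// Illustrative types only; not kube-batch's actual data structures.
type PodGroup struct {
	Name      string
	MinMember int
	CPUPerPod int // requested cores per pod (simplified to CPU only)
}

type Cluster struct {
	FreeCPU        int // cores currently unallocated
	PreemptibleCPU int // cores that could be freed by evicting victim pods
}

// canGangRunAfterPreemption returns true only if evicting victims would
// actually let the gang reach MinMember; otherwise preemption is pointless
// and should be skipped entirely.
func canGangRunAfterPreemption(pg PodGroup, c Cluster) bool {
	needed := pg.MinMember * pg.CPUPerPod
	return c.FreeCPU+c.PreemptibleCPU >= needed
}

func main() {
	// The scenario above: qj-2 needs minMember=60 at 1 core per pod,
	// but only 30 cores would be freed, so preemption should be skipped.
	job2 := PodGroup{Name: "qj-2", MinMember: 60, CPUPerPod: 1}
	cluster := Cluster{FreeCPU: 0, PreemptibleCPU: 30}

	if !canGangRunAfterPreemption(job2, cluster) {
		fmt.Println("skip preemption: qj-2 cannot reach minMember even after evicting victims")
	}
}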

After waiting a while, an even worse condition appears: once 30 pods of job1 have completed, the released cores should be reusable by job1's remaining 30 pods, but in fact only 1 pod continues to run. Weird.

 runqch@ib22b10-573: kubectl get pods | grep qj-1 | grep Running | wc -l
1
 runqch@ib22b10-574: kubectl get pods | grep qj-1 | grep Completed | wc -l
30
 runqch@ib22b10-575: kubectl get pods | grep qj-1 | grep Pending | wc -l
29
 runqch@ib22b10-576: kubectl get pods | grep qj-2 | grep Pending | wc -l
60
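
For debugging, the scheduler's view while the cluster is stuck in this state can be inspected as follows. The PodGroup objects are ordinary CRD resources and the pending pods carry scheduling events; only the kube-batch namespace/deployment name below is an assumption and needs to be adjusted to the actual deployment:

 kubectl get podgroup qj-1 -o yaml
 kubectl get podgroup qj-2 -o yaml
 kubectl describe pod <one-pending-qj-1-pod> | tail -n 20
 kubectl logs -n kube-system deployment/kube-batch | tail -n 100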

3rd job:

 runqch@ib22b10-586: cat job3.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: qj-3
spec:
  backoffLimit: 60
  completions: 60
  parallelism: 60
  template:
    metadata:
      annotations:
        scheduling.k8s.io/group-name: qj-3
    spec:
      containers:
      - image: busybox
        imagePullPolicy: IfNotPresent
        name: busybox
        command:
           - sleep
           - "2000"
        resources:
          requests:
            cpu: "1"
      restartPolicy: Never
      schedulerName: kube-batch
---
apiVersion: scheduling.incubator.k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: qj-3
spec:
  minMember: 1

 runqch@ib22b10-577: kubectl create -f ./job3.yaml
job.batch/qj-3 created
podgroup.scheduling.incubator.k8s.io/qj-3 created

 runqch@ib22b10-578: kubectl get pods | grep qj-3 | grep Pending | wc -l
60
 runqch@ib22b10-582: kubectl get pods | grep qj-2 | grep Pending | wc -l
60
 runqch@ib22b10-583: kubectl get pods | grep qj-1 | grep Running | wc -l
1

>>> The leftover 30 cores could be used by either job2 or job3, but the resources just sit idle.
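
The idle capacity is also visible directly in the node summary (standard kubectl; the exact layout of this section varies between versions), which should show CPU requests well below the 60 allocatable cores while the pods stay Pending:

 kubectl describe nodes | grep -A 8 'Allocated resources'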

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): v1.11.3
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release): Linux ib22b10 3.10.0-862.6.3.el7.x86_64 #1 SMP Fri Jun 15 17:57:37 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

k82cn commented Oct 17, 2018

thanks very much for your report :) That's an issue that we did not handle well, let me fix it.

/assign


k82cn commented Oct 17, 2018

/kind bug
/sig scheduling
/milestone v0.3

k8s-ci-robot added the kind/bug and sig/scheduling labels on Oct 17, 2018
k82cn added this to the v0.3 milestone on Oct 18, 2018

k82cn commented Dec 21, 2018

fixed by #457 #505

k82cn closed this as completed on Dec 21, 2018