Blog about pod scheduling readiness #37436

Merged
128 changes: 128 additions & 0 deletions content/en/blog/_posts/2022-12-26-pod-scheduling-readiness.md
@@ -0,0 +1,128 @@
---
layout: blog
title: "Kubernetes 1.26: Pod Scheduling Readiness"
date: 2022-12-26
slug: pod-scheduling-readiness-alpha
---

**Authors:** Wei Huang (Apple), Abdullah Gharaibeh (Google)

Kubernetes 1.26 introduced a new Pod feature: _scheduling gates_. In Kubernetes, scheduling gates
are keys that tell the scheduler when a Pod is ready to be considered for scheduling.

## What problem does it solve?

When a Pod is created, the scheduler repeatedly attempts to find a node that fits it. This
retry loop continues until the scheduler either finds a node for the Pod, or the Pod gets deleted.

Pods that remain unschedulable for long periods of time (e.g., ones that are blocked on some external event)
waste scheduling cycles. A scheduling cycle may take around 20 ms or more, depending on the complexity of
the Pod's scheduling constraints. At scale, those wasted cycles significantly impact the
scheduler's performance. See the arrows in the "Scheduler" box below.

{{< mermaid >}}
graph LR;
pod((New Pod))-->queue
subgraph Scheduler
queue(scheduler queue)
sched_cycle[/scheduling cycle/]
schedulable{schedulable?}

queue==>|Pop out|sched_cycle
sched_cycle==>schedulable
schedulable==>|No|queue
subgraph note [Cycles wasted on repeatedly rescheduling 'unready' Pods]
end
end

classDef plain fill:#ddd,stroke:#fff,stroke-width:1px,color:#000;
classDef k8s fill:#326ce5,stroke:#fff,stroke-width:1px,color:#fff;
classDef Scheduler fill:#fff,stroke:#bbb,stroke-width:2px,color:#326ce5;
classDef note fill:#edf2ae,stroke:#fff,stroke-width:1px;
class queue,sched_cycle,schedulable k8s;
class pod plain;
class note note;
class Scheduler Scheduler;
{{< /mermaid >}}
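
To get a feel for the scale, here is a back-of-the-envelope estimate. The 20 ms figure comes from the text above; the number of unready Pods and the number of retries per Pod are made-up inputs, and the real scheduler applies backoff between attempts, so treat this as a rough sketch rather than a measurement.

```python
# Rough estimate of scheduler time spent on attempts that cannot succeed.
# The Pod count and retry count used below are assumptions for illustration.
CYCLE_SECONDS = 0.020  # ~20 ms per scheduling cycle, the figure quoted above


def wasted_seconds(unready_pods: int, attempts_per_pod: int,
                   cycle_seconds: float = CYCLE_SECONDS) -> float:
    """Total time spent on scheduling cycles for Pods that are not ready."""
    return unready_pods * attempts_per_pod * cycle_seconds


if __name__ == "__main__":
    # 5,000 unready Pods, each retried 10 times while it waits.
    print(f"{wasted_seconds(5_000, 10):.0f} s of scheduling time")
```

Under these assumed inputs the scheduler spends on the order of a thousand seconds on work that cannot place a single Pod.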

Scheduling gates help address this problem. They let you declare that newly created Pods are not
ready for scheduling. When scheduling gates are present on a Pod, the scheduler ignores the Pod
and so avoids unnecessary scheduling attempts. Those Pods are also ignored by Cluster
Autoscaler if you have it installed in the cluster.

Clearing the gates is the responsibility of external controllers with knowledge of when the Pod
should be considered for scheduling (e.g., a quota manager).

{{< mermaid >}}
graph LR;
pod((New Pod))-->queue
subgraph Scheduler
queue(scheduler queue)
sched_cycle[/scheduling cycle/]
schedulable{schedulable?}
popout{Pop out?}

queue==>|PreEnqueue check|popout
popout-->|Yes|sched_cycle
popout==>|No|queue
sched_cycle-->schedulable
schedulable-->|No|queue
subgraph note [A knob to gate a Pod's scheduling]
end
end

classDef plain fill:#ddd,stroke:#fff,stroke-width:1px,color:#000;
classDef k8s fill:#326ce5,stroke:#fff,stroke-width:1px,color:#fff;
classDef Scheduler fill:#fff,stroke:#bbb,stroke-width:2px,color:#326ce5;
classDef note fill:#edf2ae,stroke:#fff,stroke-width:1px;
classDef popout fill:#f96,stroke:#fff,stroke-width:1px;
class queue,sched_cycle,schedulable k8s;
class pod plain;
class note note;
class popout popout;
class Scheduler Scheduler;
{{< /mermaid >}}

## How does it work?

In general, scheduling gates work much like finalizers. Pods with a non-empty
`spec.schedulingGates` field show the status `SchedulingGated` and are blocked from
scheduling. More than one gate can be added, but all of them must be added at Pod
creation (e.g., as part of the spec or via a mutating webhook).

```
NAME READY STATUS RESTARTS AGE
test-pod 0/1 SchedulingGated 0 10s
```
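
A gated Pod such as `test-pod` could be declared along the following lines. This is a minimal sketch: the gate names and the container image are placeholders chosen for illustration, and the script only prints a JSON manifest instead of talking to a cluster.

```python
import json


def gated_pod(name: str, gates: list[str]) -> dict:
    """Build a Pod manifest whose spec.schedulingGates lists the given gates."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Each gate is an object with a single "name" field.
            "schedulingGates": [{"name": gate} for gate in gates],
            "containers": [
                # Placeholder container; the image is an assumption.
                {"name": "pause", "image": "registry.k8s.io/pause:3.6"},
            ],
        },
    }


if __name__ == "__main__":
    # Both gate names are illustrative, not defined by Kubernetes.
    pod = gated_pod("test-pod",
                    ["example.com/quota-check", "example.com/other-check"])
    print(json.dumps(pod, indent=2))
```

Saving the output to a file and passing it to `kubectl apply -f` on a cluster with the feature enabled should yield a Pod that stays in `SchedulingGated` until its gates are removed.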

To clear the gates, you update the Pod by removing all of the items from the Pod's `schedulingGates`
field. The gates do not need to be removed all at once, but the scheduler only starts to consider
the Pod for scheduling once every gate has been removed.
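
The "all gates must be gone" rule can be sketched on a plain dictionary that mirrors the Pod spec. The helper names `remove_gate` and `is_gated` are hypothetical; against a real cluster, each removal would be an update or patch of the Pod object.

```python
import copy


def remove_gate(pod: dict, gate: str) -> dict:
    """Return a copy of the Pod with one scheduling gate removed."""
    updated = copy.deepcopy(pod)
    spec = updated["spec"]
    spec["schedulingGates"] = [
        g for g in spec.get("schedulingGates", []) if g["name"] != gate
    ]
    return updated


def is_gated(pod: dict) -> bool:
    """A Pod is held back while at least one gate remains."""
    return bool(pod["spec"].get("schedulingGates"))


if __name__ == "__main__":
    # Illustrative gate names; two controllers each own one gate.
    pod = {"spec": {"schedulingGates": [
        {"name": "example.com/quota-check"},
        {"name": "example.com/other-check"},
    ]}}
    pod = remove_gate(pod, "example.com/quota-check")
    print(is_gated(pod))  # True: one gate is still present
    pod = remove_gate(pod, "example.com/other-check")
    print(is_gated(pod))  # False: the scheduler may now consider the Pod
```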

Under the hood, scheduling gates are implemented as a scheduler plugin at PreEnqueue, a new scheduler
framework extension point that is invoked before a Pod is added to the scheduler's active queue.
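
The effect of a PreEnqueue check can be illustrated with a toy model. Everything here (`ToyScheduler`, `Pod.fits`, the single `active` queue) is a simplification invented for this sketch and does not reflect the real kube-scheduler data structures. The point is only that a gated Pod never consumes a scheduling cycle.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Pod:
    name: str
    gates: set[str] = field(default_factory=set)
    fits: bool = True  # whether some node can host the Pod (simplified)


class ToyScheduler:
    """Toy model of the scheduling queue, for illustration only."""

    def __init__(self) -> None:
        self.active: deque[Pod] = deque()  # Pods eligible for scheduling cycles
        self.gated: list[Pod] = []         # Pods held back by the check
        self.cycles = 0                    # scheduling cycles spent so far

    def pre_enqueue(self, pod: Pod) -> bool:
        # Admit the Pod only if no scheduling gates remain.
        return not pod.gates

    def add(self, pod: Pod) -> None:
        (self.active if self.pre_enqueue(pod) else self.gated).append(pod)

    def clear_gate(self, pod: Pod, gate: str) -> None:
        pod.gates.discard(gate)
        if pod in self.gated and self.pre_enqueue(pod):
            self.gated.remove(pod)
            self.active.append(pod)

    def run_once(self) -> list[str]:
        """Run one scheduling cycle for each Pod currently in the active queue."""
        scheduled = []
        for _ in range(len(self.active)):
            pod = self.active.popleft()
            self.cycles += 1
            if pod.fits:
                scheduled.append(pod.name)
            else:
                self.active.append(pod)  # unschedulable: retry later
        return scheduled


if __name__ == "__main__":
    s = ToyScheduler()
    s.add(Pod("ready"))
    waiting = Pod("waiting", gates={"example.com/quota-check"})
    s.add(waiting)
    print(s.run_once(), s.cycles)  # ['ready'] 1
    print(s.run_once(), s.cycles)  # [] 1  (the gated Pod costs nothing)
    s.clear_gate(waiting, "example.com/quota-check")
    print(s.run_once(), s.cycles)  # ['waiting'] 2
```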

## Use cases

An important use case this feature enables is dynamic quota management. Kubernetes supports
[ResourceQuota](/docs/concepts/policy/resource-quotas/); however, the API server enforces quota at
the time you attempt Pod creation. For example, if a new Pod exceeds the CPU quota, it gets rejected.
The API server doesn't queue the Pod; therefore, whoever created the Pod needs to keep retrying
the creation. This means either a delay between resources becoming available and the Pod
actually running, or load on the API server and scheduler due to constant attempts.

Scheduling gates allow an external quota manager to address the above limitation of ResourceQuota.
Specifically, the manager could add an `example.com/quota-check` scheduling gate to all Pods created in the
cluster (using a mutating webhook). The manager would then remove the gate when there is quota to
start the Pod.
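
A minimal sketch of that flow, assuming a single CPU budget tracked in millicores. The `QuotaManager` class and its methods are hypothetical and operate on plain dictionaries; a real manager would run as a mutating webhook plus a controller that updates Pods through the API server.

```python
GATE = "example.com/quota-check"


class QuotaManager:
    """Hypothetical external quota manager with one CPU budget (millicores)."""

    def __init__(self, cpu_limit_millis: int) -> None:
        self.cpu_limit = cpu_limit_millis
        self.cpu_used = 0

    def mutate(self, pod: dict) -> dict:
        """What the mutating webhook would do at creation: add the gate."""
        gates = pod["spec"].setdefault("schedulingGates", [])
        if {"name": GATE} not in gates:
            gates.append({"name": GATE})
        return pod

    def reconcile(self, pod: dict, cpu_request_millis: int) -> bool:
        """Remove the gate if the request fits in the remaining quota."""
        gates = pod["spec"].get("schedulingGates", [])
        if {"name": GATE} not in gates:
            return False  # not gated by this manager; nothing to do
        if self.cpu_used + cpu_request_millis > self.cpu_limit:
            return False  # still over quota; leave the Pod gated
        self.cpu_used += cpu_request_millis
        gates.remove({"name": GATE})
        return True

    def release(self, cpu_request_millis: int) -> None:
        """Return quota when a Pod finishes."""
        self.cpu_used = max(0, self.cpu_used - cpu_request_millis)


if __name__ == "__main__":
    manager = QuotaManager(cpu_limit_millis=1000)
    first = manager.mutate({"spec": {}})
    second = manager.mutate({"spec": {}})
    print(manager.reconcile(first, 600))   # True: fits, gate removed
    print(manager.reconcile(second, 600))  # False: would exceed the budget
    manager.release(600)                   # the first Pod finishes
    print(manager.reconcile(second, 600))  # True: quota is available again
```

Unlike the retry-on-rejection pattern, the second Pod already exists in the cluster while it waits, so it can start as soon as quota frees up without anyone re-creating it.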

## What's next?

To use this feature, the `PodSchedulingReadiness` feature gate must be enabled in both the API server
and the scheduler (for example, with `--feature-gates=PodSchedulingReadiness=true`). You're more than
welcome to test it out and tell us (SIG Scheduling) what you think!

## Additional resources

- [Pod Scheduling Readiness](https://kubernetes.io/docs/concepts/scheduling-eviction/pod-scheduling-readiness/)
in the Kubernetes documentation
- [Kubernetes Enhancement Proposal](https://github.com/kubernetes/enhancements/blob/master/keps/sig-scheduling/3521-pod-scheduling-readiness/README.md)