Doc for Alpha feature PodSchedulingReadiness #37675

Merged
1 change: 1 addition & 0 deletions content/en/docs/concepts/scheduling-eviction/_index.md
@@ -28,6 +28,7 @@ of terminating one or more Pods on Nodes.
* [Scheduling Framework](/docs/concepts/scheduling-eviction/scheduling-framework)
* [Scheduler Performance Tuning](/docs/concepts/scheduling-eviction/scheduler-perf-tuning/)
* [Resource Bin Packing for Extended Resources](/docs/concepts/scheduling-eviction/resource-bin-packing/)
* [Pod Scheduling Readiness](/docs/concepts/scheduling-eviction/pod-scheduling-readiness/)

## Pod Disruption

@@ -0,0 +1,110 @@
---
title: Pod Scheduling Readiness
content_type: concept
weight: 40
---

<!-- overview -->

{{< feature-state for_k8s_version="v1.26" state="alpha" >}}

Pods were considered ready for scheduling once created. The Kubernetes scheduler
does its due diligence to find nodes to place all pending Pods. However, in
practice, some Pods may stay in a "miss-essential-resources" state for a long period.
These Pods churn the scheduler (and downstream integrators like Cluster Autoscaler)
unnecessarily.

By specifying or removing a Pod's `.spec.schedulingGates`, you can control when a Pod is ready
to be considered for scheduling.

<!-- body -->

## Configuring Pod schedulingGates

The `schedulingGates` field contains a list of gates, each identified by a `name` string, and each gate
is treated as a criterion that the Pod must satisfy before it is considered schedulable. This field can be
initialized only when a Pod is created (either by the client, or mutated during admission). After creation,
each scheduling gate can be removed in arbitrary order, but adding a new scheduling gate is disallowed.

{{<mermaid>}}
stateDiagram-v2
s1: pod created
s2: pod scheduling gated
s3: pod scheduling ready
s4: pod running
if: empty scheduling gates?
state if <<choice>>
[*] --> s1
s1 --> if
s2 --> if: scheduling gate removed
if --> s2: no
if --> s3: yes
s3 --> s4
s4 --> [*]
{{< /mermaid >}}
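
The update rule described above (gates may be removed in any order after creation, but never added) can be modeled with a small shell sketch. The function name `gates_update_allowed` and its space-separated argument format are hypothetical, chosen only for illustration; this is not the API server's validation code, and it ignores other checks a real server may perform.

```bash
# Hypothetical model of the update rule (not the API server's actual code):
# an update is allowed only if every gate in the new list already existed.
# $1 = space-separated old gate names, $2 = space-separated new gate names.
gates_update_allowed() {
  for gate in $2; do
    case " $1 " in
      *" $gate "*) ;;   # gate was already present: keeping it is fine
      *) return 1 ;;    # gate was not present before: additions are rejected
    esac
  done
  return 0
}

gates_update_allowed "foo bar" "bar" && echo "removing foo: allowed"
gates_update_allowed "foo bar" "foo bar baz" || echo "adding baz: rejected"
gates_update_allowed "foo bar" "" && echo "removing all gates: allowed"
```

An empty new list passes the check, which corresponds to the "empty scheduling gates" branch of the diagram where the Pod becomes ready for scheduling.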

## Usage example

To mark a Pod as not ready for scheduling, you can create it with one or more scheduling gates like this:

{{< codenew file="pods/pod-with-scheduling-gates.yaml" >}}

After the Pod's creation, you can check its state using:

```bash
kubectl get pod test-pod
```

The output shows that the Pod is in the `SchedulingGated` state:

```none
NAME       READY   STATUS            RESTARTS   AGE
test-pod   0/1     SchedulingGated   0          7s
```

You can also check its `schedulingGates` field by running:

```bash
kubectl get pod test-pod -o jsonpath='{.spec.schedulingGates}'
```

The output is:

```none
[{"name":"foo"},{"name":"bar"}]
```

To inform the scheduler that this Pod is ready for scheduling, you can remove its `schedulingGates` entirely
by re-applying a modified manifest:

{{< codenew file="pods/pod-without-scheduling-gates.yaml" >}}

You can check whether `schedulingGates` has been cleared by running:

```bash
kubectl get pod test-pod -o jsonpath='{.spec.schedulingGates}'
```

The output is expected to be empty. You can then check the Pod's latest status by running:

```bash
kubectl get pod test-pod -o wide
```

Given that test-pod doesn't request any CPU or memory resources, its state is expected to
transition from `SchedulingGated` to `Running`:

```none
NAME       READY   STATUS    RESTARTS   AGE   IP         NODE
test-pod   1/1     Running   0          15s   10.0.0.4   node-2
```
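
The example above clears every gate at once. Because gates can be removed individually and in any order, a single gate could also be removed with a JSON Patch. The sketch below only builds and prints such a patch; `gate_removal_patch` is a hypothetical helper, and the `kubectl patch` invocation is left commented out so the block runs without a cluster. It assumes the `test-pod` from this page and a cluster with the `PodSchedulingReadiness` feature gate enabled.

```bash
# Hypothetical helper: print a JSON Patch that removes the scheduling gate
# at the given zero-based index of .spec.schedulingGates.
gate_removal_patch() {
  printf '[{"op":"remove","path":"/spec/schedulingGates/%s"}]\n' "$1"
}

# Print the patch that would remove the first gate ("foo" in the example Pod).
gate_removal_patch 0

# Against a real cluster, the patch could be applied like this:
# kubectl patch pod test-pod --type=json -p "$(gate_removal_patch 0)"
```

Removing only one of the two gates leaves the Pod in `SchedulingGated`; it becomes ready for scheduling only once the list is empty.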

## Observability

The metric `scheduler_pending_pods` comes with a new `queue` label value, `"gated"`, to distinguish whether
a Pod has been tried for scheduling and found unschedulable, or has been explicitly marked as not ready for
scheduling. You can use `scheduler_pending_pods{queue="gated"}` to check the metric result.
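
As a rough sketch of reading this metric without a dashboard, the snippet below filters Prometheus text-format output for the gated queue. The helper name `gated_pending_pods` is hypothetical, and the sample lines are made-up values for illustration, not real scheduler output. It assumes the series carries only the `queue` label; how you obtain the scheduler's metrics text depends on your cluster setup.

```bash
# Hypothetical helper: read Prometheus text-format metrics on stdin and print
# the value of scheduler_pending_pods for the "gated" queue.
gated_pending_pods() {
  awk '$1 == "scheduler_pending_pods{queue=\"gated\"}" { print $2 }'
}

# Illustrative sample only; these values are made up.
cat <<'EOF' | gated_pending_pods
scheduler_pending_pods{queue="active"} 0
scheduler_pending_pods{queue="backoff"} 0
scheduler_pending_pods{queue="gated"} 1
scheduler_pending_pods{queue="unschedulable"} 2
EOF
```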

## {{% heading "whatsnext" %}}

* Read the [PodSchedulingReadiness KEP](https://github.com/kubernetes/enhancements/blob/master/keps/sig-scheduling/3521-pod-scheduling-readiness) for more details.
@@ -152,6 +152,7 @@ For a reference to old feature gates that are removed, please refer to
| `PodDeletionCost` | `true` | Beta | 1.22 | |
| `PodDisruptionConditions` | `false` | Alpha | 1.25 | - |
| `PodHasNetworkCondition` | `false` | Alpha | 1.25 | |
| `PodSchedulingReadiness` | `false` | Alpha | 1.26 | |
Contributor: Please add a description for this gate (at the end of this page) as we do for other features.

Member Author: will do. For now it's a placeholder.

| `ProbeTerminationGracePeriod` | `false` | Alpha | 1.21 | 1.21 |
| `ProbeTerminationGracePeriod` | `false` | Beta | 1.22 | 1.24 |
| `ProbeTerminationGracePeriod` | `true` | Beta | 1.25 | |
@@ -652,6 +653,7 @@ Each feature gate is designed for enabling/disabling a specific feature:
pod stats from the CRI container runtime rather than gathering them from cAdvisor.
- `PodDisruptionConditions`: Enables support for appending a dedicated pod condition indicating that the pod is being deleted due to a disruption.
- `PodHasNetworkCondition`: Enable the kubelet to mark the [PodHasNetwork](/docs/concepts/workloads/pods/pod-lifecycle/#pod-has-network) condition on pods.
- `PodSchedulingReadiness`: Enable setting the `schedulingGates` field to control a Pod's [scheduling readiness](/docs/concepts/scheduling-eviction/pod-scheduling-readiness).
- `PodSecurity`: Enables the `PodSecurity` admission plugin.
- `PreferNominatedNode`: This flag tells the scheduler whether the nominated
nodes will be checked first before looping through all the other nodes in
11 changes: 11 additions & 0 deletions content/en/examples/pods/pod-with-scheduling-gates.yaml
@@ -0,0 +1,11 @@
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  schedulingGates:
  - name: foo
  - name: bar
  containers:
  - name: pause
    image: registry.k8s.io/pause:3.6
8 changes: 8 additions & 0 deletions content/en/examples/pods/pod-without-scheduling-gates.yaml
@@ -0,0 +1,8 @@
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  containers:
  - name: pause
    image: registry.k8s.io/pause:3.6