Merge pull request #1069 from tallclair/runtimeclass-sched
Runtimeclass scheduling updates
k8s-ci-robot authored May 31, 2019
2 parents 1d5bf18 + ce0daf8 commit 02d49b9
Showing 1 changed file with 55 additions and 59 deletions: keps/sig-node/runtime-class-scheduling.md
@@ -90,11 +90,11 @@ date.

## Proposal

A new optional `Scheduling` structure will be added to the RuntimeClass API. The
scheduling struct includes both a `NodeSelector` and `Tolerations` that control
how a pod using that RuntimeClass is scheduled. The NodeSelector rules are
applied during scheduling, but the Tolerations are added to the PodSpec during
admission by the new RuntimeClass admission controller.

### User Stories

@@ -134,16 +134,17 @@ requirements about which trusted services are run on that node.

### API

The RuntimeClass definition is augmented with an optional `Scheduling` struct:

```go
type Scheduling struct {
	// nodeSelector lists labels that must be present on nodes that support this
	// RuntimeClass. Pods using this RuntimeClass can only be scheduled to a
	// node matched by this selector. The RuntimeClass nodeSelector is merged
	// with a pod's existing nodeSelector. Any conflicts will cause the pod to
	// be rejected in admission.
	// +optional
	NodeSelector map[string]string

	// tolerations adds tolerations to pods running with this RuntimeClass.
	// +optional
	Tolerations []corev1.Toleration
}
```
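For illustration only, a sketch of how a RuntimeClass for a sandboxed runtime
might populate this struct. The label and taint keys
(`sandboxed.example.com/...`) are hypothetical examples, not part of the
proposal, and the type is redeclared locally since the proposed API is not yet
published:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// Scheduling mirrors the proposed RuntimeClass field for this example only.
type Scheduling struct {
	NodeSelector map[string]string
	Tolerations  []corev1.Toleration
}

func main() {
	// Require nodes labeled as supporting the sandboxed runtime, and tolerate
	// a taint used to reserve those nodes for sandboxed workloads.
	s := Scheduling{
		NodeSelector: map[string]string{"sandboxed.example.com/runtime": "gvisor"},
		Tolerations: []corev1.Toleration{{
			Key:      "sandboxed.example.com/dedicated",
			Operator: corev1.TolerationOpExists,
			Effect:   corev1.TaintEffectNoSchedule,
		}},
	}
	fmt.Printf("%+v\n", s)
}
```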
@@ -159,11 +160,12 @@ expressive set of requirements and preferences. NodeSelectorRequirements are a
small subset of the NodeAffinity rules that place intersecting requirements on
a NodeSelectorTerm.

Since the RuntimeClass scheduling rules represent hard requirements (the node
supports the RuntimeClass or it doesn't), the scheduling API should not include
preferences, ruling out NodeAffinity. The NodeSelector type is much more
expressive than the `map[string]string` selector, but the top-level union logic
makes merging NodeSelectors messy: because the top-level NodeSelectorTerms are
ORed together, ANDing a pod's selector with a RuntimeClass selector requires a
cross-product of their terms. For simplicity, we went with the simple
`map[string]string` requirements.

[NodeAffinity]: https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity

@@ -175,7 +177,7 @@ from nodes. If every pod had a RuntimeClass and every RuntimeClass had a strict
NodeSelector, then RuntimeClasses could use non-overlapping selectors in place
of taints & tolerations. However the same could be said of regular pod
selectors, yet taints & tolerations are still a useful addition. Examples of
[use cases](#user-stories) for including tolerations in RuntimeClass scheduling
include:

- Tainting Windows nodes with `windows.microsoft.com` to keep default linux pods
Expand All @@ -186,53 +188,26 @@ inculde:
RuntimeClasses would need a toleration to enable them to be run on those
nodes.

### RuntimeClass Admission Controller

The RuntimeClass admission controller is a new default-enabled in-tree admission
plugin. The role of the controller for scheduling is to merge the scheduling
rules from the RuntimeClass into the PodSpec. Eventually, the controller's
responsibilities may grow, such as to merge in [pod overhead][] or validate
feature compatibility.

Merging the RuntimeClass NodeSelector into the PodSpec NodeSelector is handled
by adding the key-value pairs from the RuntimeClass version to the PodSpec
version. If both have the same key with a different value, then the admission
controller will reject the pod.
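A minimal sketch of that merge (illustrative only, not the actual admission
plugin code):

```go
package admission

import "fmt"

// mergeNodeSelector adds the RuntimeClass nodeSelector entries to the pod's
// nodeSelector, and reports a conflict when both specify the same key with
// different values (which causes the pod to be rejected).
func mergeNodeSelector(podSel, rcSel map[string]string) (map[string]string, error) {
	merged := make(map[string]string, len(podSel)+len(rcSel))
	for k, v := range podSel {
		merged[k] = v
	}
	for k, v := range rcSel {
		if existing, ok := merged[k]; ok && existing != v {
			return nil, fmt.Errorf("conflicting nodeSelector for key %q: pod has %q, RuntimeClass has %q", k, existing, v)
		}
		merged[k] = v
	}
	return merged, nil
}
```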

Merging tolerations is straightforward, as we want to _union_ the RuntimeClass
tolerations with the existing tolerations, which matches the default toleration
composition logic. This means that RuntimeClass tolerations can simply be
appended to the existing tolerations, but an [existing
utility][merge-tolerations] can reduce duplicates by merging equivalent
tolerations.
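A sketch of that union, using the `MatchToleration` helper on
`corev1.Toleration` to skip equivalent entries (illustrative only; the in-tree
[merge-tolerations] utility may differ in detail):

```go
package admission

import corev1 "k8s.io/api/core/v1"

// appendTolerations unions the RuntimeClass tolerations into the pod's
// tolerations, skipping any that are equivalent to one the pod already has.
func appendTolerations(podTols, rcTols []corev1.Toleration) []corev1.Toleration {
	merged := append([]corev1.Toleration{}, podTols...)
	for i := range rcTols {
		duplicate := false
		for j := range merged {
			if merged[j].MatchToleration(&rcTols[i]) {
				duplicate = true
				break
			}
		}
		if !duplicate {
			merged = append(merged, rcTols[i])
		}
	}
	return merged
}
```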

Merging tolerations during admission rather than scheduling has several
advantages:

1. Unlike the NodeSelector, Tolerations are also checked when managing
[TaintBasedEvictions]. As a RuntimeClass could be deleted after a pod has
already been created, the tolerations should be mixed into the PodSpec to
ensure that they stay attached to the running pod.
2. Merging tolerations in admission enables other admission controllers (such as
a [scheduling policy][SchedulingPolicy] controller) to perform validation
based on those tolerations. This matters less for the NodeSelector, as that
is narrowing the scheduling domain rather than expanding it.

If the pod's referenced RuntimeClass does not exist, the admission controller
will reject the pod. This is necessary to ensure the pod is run with the
expected behavior.
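Tying the pieces together, a rough sketch of the admission flow under the same
assumptions (a stand-in `runtimeClassScheduling` type and the helper sketches
above in the same package; not the actual plugin implementation):

```go
package admission

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// runtimeClassScheduling is a stand-in for the proposed Scheduling struct.
type runtimeClassScheduling struct {
	NodeSelector map[string]string
	Tolerations  []corev1.Toleration
}

// admitPod resolves the pod's RuntimeClass and folds its scheduling rules into
// the PodSpec, rejecting the pod when the RuntimeClass is missing or the
// nodeSelectors conflict.
func admitPod(pod *corev1.Pod, lookup func(name string) (*runtimeClassScheduling, bool)) error {
	if pod.Spec.RuntimeClassName == nil {
		return nil // pod does not use a RuntimeClass
	}
	sched, found := lookup(*pod.Spec.RuntimeClassName)
	if !found {
		return fmt.Errorf("RuntimeClass %q not found", *pod.Spec.RuntimeClassName)
	}
	if sched == nil {
		return nil // RuntimeClass has no scheduling rules
	}
	merged, err := mergeNodeSelector(pod.Spec.NodeSelector, sched.NodeSelector)
	if err != nil {
		return err
	}
	pod.Spec.NodeSelector = merged
	pod.Spec.Tolerations = appendTolerations(pod.Spec.Tolerations, sched.Tolerations)
	return nil
}
```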
@@ -247,13 +222,13 @@ to label nodes is very environment dependent. Here are several examples:

- If runtimes are set up as part of node setup, then the node template should
include the appropriate labels & taints.
- If runtimes are installed through a DaemonSet, then the scheduling should match
that of the DaemonSet.
- If runtimes are manually installed, or installed through some external
process, that same process should apply an appropriate label to the node.

If the RuntimeClass scheduling rules have security implications, special care
should be taken when choosing labels. In particular, labels with the
`[*.]node-restriction.kubernetes.io/` prefix cannot be added with the node's
identity, and labels with the `[*.]k8s.io/` or `[*.]kubernetes.io/` prefixes
cannot be modified by the node. For more details, see [Bounding Self-Labeling
@@ -278,9 +253,29 @@ the alpha phase. This means the feature is expected to be beta quality at launch

## Alternatives

### Scheduler

A new scheduler predicate could manage the RuntimeClass scheduling. It would
look up the RuntimeClass associated with the pod being scheduled. If there is no
RuntimeClass, or the RuntimeClass does not include a scheduling struct, then the
predicate would permit the pod to be scheduled to the evaluated node. Otherwise,
it would check whether the NodeSelector matches the node.

Adding a dedicated RuntimeClass predicate rather than mixing the rules into the
NodeAffinity evaluation means that in the event a pod is unschedulable there
would be a clear explanation of why. For example:

```
0/10 Nodes are available: 5 nodes do not have enough memory, 5 nodes don't match RuntimeClass
```

If the pod's referenced RuntimeClass does not exist at scheduling time, the
RuntimeClass predicate would fail. The scheduler would periodically retry, and
once the RuntimeClass is (re)created, the pod would be scheduled.
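A minimal sketch of such a predicate, using a stand-in for the proposed
scheduling struct (illustrative only; this alternative was not adopted):

```go
package predicates

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
)

// runtimeClassScheduling is a stand-in for the proposed Scheduling struct.
type runtimeClassScheduling struct {
	NodeSelector map[string]string
}

// fitsRuntimeClass returns true when the pod's RuntimeClass places no
// scheduling constraints, or when the node's labels satisfy the RuntimeClass
// nodeSelector.
func fitsRuntimeClass(sched *runtimeClassScheduling, node *corev1.Node) bool {
	if sched == nil || len(sched.NodeSelector) == 0 {
		return true
	}
	sel := labels.SelectorFromSet(labels.Set(sched.NodeSelector))
	return sel.Matches(labels.Set(node.Labels))
}
```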

### RuntimeController Mix-in

Rather than resolving scheduling in the scheduler, the `NodeSelectorTerm`
rules and `Tolerations` are mixed in to the PodSpec. The mix-in happens in the
mutating admission phase, and is performed by a new `RuntimeController` built-in
admission plugin. The same admission controller is shared with the [Pod
@@ -292,7 +287,7 @@ Overhead][] proposal.

RuntimeController is a new in-tree admission plugin that should eventually be
enabled on almost every Kubernetes cluster. The role of the controller for
scheduling is to merge the scheduling constraints from the RuntimeClass into the
PodSpec. Eventually, the controller's responsibilities may grow, such as to
merge in [pod overhead][] or validate feature compatibility.

@@ -301,12 +296,13 @@ support](#runtimeclass-aware-scheduling).

#### Mix-in

The RuntimeClass scheduling rules are merged with the pod's NodeSelector &
Tolerations.

**NodeSelectorRequirements**

To avoid multiplicative scaling of the NodeSelectorTerms, the
`RuntimeClass.Scheduling.NodeSelector *v1.NodeSelector` field is replaced with
`NodeSelectorTerm *v1.NodeSelectorTerm`.

The term's NodeSelectorRequirements are merged into the pod's node affinity
@@ -363,7 +359,7 @@ status. Taking this approach would require native RuntimeClass-aware scheduling.

The visibility advantage could be achieved through different methods. For
example, a special request or tool could be implemented that would list all the
nodes that match a RuntimeClass's scheduling rules.

### Scheduling Policy
