Skip to content

Commit

Permalink
Proposal: Label Sync Between MachineDeployment and underlying
Browse files Browse the repository at this point in the history
Kubernetes Nodes
  • Loading branch information
Arvinderpal committed Mar 13, 2022
1 parent f8f6583 commit 338e17b
Showing 1 changed file with 161 additions and 0 deletions.
161 changes: 161 additions & 0 deletions docs/proposals/20220210-md-node-label-sync.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,161 @@
---

title: Label Sync Between MachineDeployment and underlying Kubernetes Nodes
authors:
- "@arvinderpal"
- "@enxebre"
reviewers:
- "@fabriziopandini"
- "@neolit123"
- "@sbueringer"
- "@vincepri"
- "@randomvariable"
creation-date: 2022-02-10
last-updated: 2022-02-26
status: implementable
replaces:
superseded-by:

---

# Label Sync Between MachineDeployment and underlying Kubernetes Nodes

## Glossary

Refer to the [Cluster API Book Glossary](https://cluster-api.sigs.k8s.io/reference/glossary.html).

## Summary

Labels on worker nodes are commonly used for node role assignment and workload placement. This document discusses a means by which labels placed on MachineDeployments can be kept in sync with labels on the underlying Kubernetes Nodes.

## Motivation

Kubelet supports self-labeling of nodes via the `--node-labels` flag. CAPI users can specify these labels via `kubeletExtraArgs.node-labels` in the KubeadmConfigTemplate. There are a few shortcomings:
- Security concerns have restricted the allowed set of labels; In particular, kubelet's ability to self label in the `*.kubernetes.io/` and `*.k8s.io/` namespaces is mostly prohibited. Using prefixes outside these namespaces (i.e. unrestricted prefixes) is discouraged.
- Labels are only applicable at creation time. Adding/Removing a label would require a MachineDeployment rollout. This is undesirable especially in baremetal environments where provisioning can be time consuming.

The documented approach for specifying restricted labels in CAPI is to utilize kubectl. This introduces the potential for human error and also goes against the general declarative model adopted by CAPI.

Users could also implement their own label synchronizer in their tooling, but this may not be practical for most.

### Goals

- Support synchronizing labels that fall under the following restricted prefixes: `node-role.kubernetes.io/*` and `node-restriction.kubernetes.io/*`
- Support synchronizing labels that fall under a CAPI specific prefix: `node.cluster.k8s.io/*`

### Non-Goals/Future Work

- Label sync for control-plane nodes.
- Support for arbitrary/user-specified label prefixes.

## Proposal

CAPI will define its own label prefix which falls within the [restricted namespace *.k8s.io/](https://github.com/kubernetes/enhancements/tree/master/keps/sig-auth/279-limit-node-access):

```
node.cluster.k8s.io/*
```

Labels placed on the Machine that match the above prefix will be authoritative -- Cluster API will act as the ultimate authority for anything under the prefix. Labels with the above prefix should be set only within Cluster API CRDs.

In addition to the above, CAPI will support one-time application of the following labels during Machine creation:

```
node-role.kubernetes.io/*
node-restriction.kubernetes.io/*
```

The `node-role.kubernetes.io` prefix has been used widely in the past to identify the role of a node (e.g. `node-role.kubernetes.io/worker=''`). For example, `kubectl get node` looks for this specific label when displaying the role to the user.

The use of the `node-restriction.kubernetes.io/*` prefix is recommended in the [Kubernetes docs](https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/#noderestriction) for things such as workload placement.

It's important to note that the these specific labels are *not* kept in sync, but rather applied only once after the Machine and its corresponding Node are first created. Taking authoritative ownership of these prefixes would preclude other entities from adding and removing such labels on the Node.

### User Stories

#### Story 1

As a cluster admin/user, I would like a declarative and secure means by which to assign roles to my nodes. For example, when I do `kubectl get nodes`, I want the ROLES column to list my assigned roles for the nodes.

#### Story 2

As a cluster admin/user, for the purpose of workload placement, I would like a declarative and secure means by which I can add/remove labels on groups of nodes. For example, I want to run my outward facing nginx pods on a specific set of nodes (e.g. md-secure-zone) to reduce exposure to the company's publicly trusted server certificates.

### Implementation Details/Notes/Constraints

#### Label propogation from MachineDeployment to Machine

Utilize CAPI metadata Propagation. CAPI supports [metadata propagation](https://cluster-api.sigs.k8s.io/developer/architecture/controllers/metadata-propagation.html?highlight=propagation#metadata-propagation) across core API resources. For the MachineDeployment types, template labels propagate to MachineSets top-level and MachineSets template metadata:
```
MachineDeployment.spec.template.metadata.labels => MachineSet.labels, MachineSet.spec.template.metadata.labels
```
Similarly, on MachineSet, template labels propagate to Machines:
```
MachineSet.spec.template.metadata.labels => Machine.labels
```

As of this writing, changing the `MachineDeployment.spec.template.metadata.labels` will trigger a rollout. This is undesirable. CAPI will be updated to instead ignore updates of labels that fall within the CAPI prefix. There is precedence for this with the handling of `MachineDeploymentUniqueLabel`.

#### Synchronization of CAPI Labels

Labels that fall under the CAPI specific prefix `node.cluster.k8s.io/*` will need to be kept in sync between the Machine and Node objects. Synchronization must handle these two scenarios:
(A) Label is added/removed on a Machine and must be added/removed on the corresponding Node.
(B) Label is added/removed on the Node and must be removed/re-added on the Node to bring it in sync with the labels on Machine.


#### Machine Controller Changes

On each Machine reconcile event, the machine controller will fetch the corresponding Node (i.e. Machine.Status.NodeRef). If the Machine has labels specified, then the controller will ensure those labels are applied to the Node.

Similarly, today the machine controller watches Nodes within the workload cluster, where changes to the Node trigger a reconcile of its corresponding Machine object. This should allow us to capture changes made directly to the labels on the Node (e.g. user accidently deletes a label that falls under CAPI's managed prefix).

#### One-time Apply of Standard Kubernetes Labels

The machine controller will apply the standard kubernetes labels, if specified, to the Node immediately after the Node enters the Ready state (ProviderID is set...).

#### Delay between Node Create and Label Sync

The Node object is first created when kubeadm joins a node to the workload cluster (i.e. kubelet is up and running). There may be a delay (potentially several seconds) before the LSC kicks in to apply the labels on the Node.

Kubernetes supports both equality and inequality requirements in label selection. In an equality based selection, the user wants to place a workload on node(s) matching a specific label (e.g. Node.Labels contains `my.prefix/foo=bar`). The delay in applying the label on the node, may cause a subsequent delay in the placement of the workload, but this is likely acceptable.

In an inequality based selection, the user wants to place a workload on node(s) that do not contain a specific label (e.g. Node.Labels not contain `my.prefix/foo=bar`). The case is potentially problematic because it relies on the absence of a label and this can occur if the pod scheduler runs during the delay interval.

One way to address this is to use kubelet's `--register-with-taints` flag. We could use that to taint a newly minted Node with say `cluster.x-k8s.io=label-sync-pending:NoSchedule`. Assuming workloads don't have this specific toleration, then nothing should be scheduled. When the LSC synchronizes labels, it removes this specific taint.

KubeadmConfigTemplate provides the means to set taints on nodes (see JoinConfiguration.NodeRegistration.Taints). We have some options as to how best to “insert” this specific taint:
(1) Leave it to the user to specify this exact taint in their KubeadmConfigTemplate manifests
(2) CABPK inserts it by default in all join configurations.

Option (1) is the simplest and could be documented as so:

```
If you utilize inequality based selection for workload placement, to prevent unintended scheduling of pods during the initial node startup phase, it is recommend that you specify the following taint in your KubeadmConfigTemplate:
`cluster.x-k8s.io=label-sync-pending:NoSchedule`
```

## Alternatives

### Label propogation from MachineDeployment to Machine

Option #1: Introduce a new field Spec.NodeLifecycle.Labels

Instead of using the top-level metadata, a new field would be introduced:

```
MachineDeployment.Spec.NodeLifecycle.Labels => MachineSet.Spec.NodeLifecycle.Labels => Machine.Spec.NodeLifecycle.Labels
```

The benefit being that it provides a clear indication to the user that these labels will be synced to the Kubernetes Node(s). This will, however, require api changes.

Option #2: Propogate labels in the top-level metadata

We could propogate labels in the top-level metadata. These are however are not currently propagated. One option is to only propogate top-level labels matching the aforementioned prefixes. The propogation would follow a similar path to that described earlier:

```
MachineDeployment.labels => MachineSet.labels => Machine.labels
```

## Implementation History

- [ ] MM/DD/YYYY: Proposed idea in an issue or [community meeting]

0 comments on commit 338e17b

Please sign in to comment.