# Disruption Termination Grace Period

## Motivation
Users are requesting the ability to control how long Karpenter will wait for the deletion of an individual node to complete before forcefully terminating the node regardless of pod status ([#743](https://github.com/kubernetes-sigs/karpenter/issues/743)). This supports two primary use cases:
* Cluster admins who want to ensure that nodes are cycled after a given period of time, regardless of user-defined disruption controls (such as PDBs or preStop hooks) that might prevent eviction of a pod beyond the configured limit. This could be to satisfy security requirements or for convenience.
* Cluster admins who want to allow users to protect long-running jobs from being interrupted by node disruption, up to a configured limit.

This design builds on the existing disruption-controls design, adding a new `terminationGracePeriodSeconds` field to the `v1beta1/NodePool` `Disruption` spec. This field fits intuitively alongside the other disruption controls that already exist within the disruption block.
## Proposed Spec

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec: # This is not a complete NodePool Spec.
  disruption:
    consolidationPolicy: WhenUnderutilized || WhenEmpty
    consolidateAfter: 10m || Never # metav1.Duration
    expireAfter: 10m || Never # Equivalent to v1alpha5 TTLSecondsUntilExpired
    terminationGracePeriodSeconds: 24h || nil
```
```
$ kubectl explain nodepool.spec.disruption.terminationGracePeriodSeconds
KIND: NodePool
VERSION: v1beta1

FIELD: terminationGracePeriodSeconds <integer>

DESCRIPTION:
     Optional duration in seconds the node is given to terminate gracefully. May be
     decreased in delete request. Value must be a non-negative integer. The value
     zero indicates that all pods should be terminated immediately, bypassing eviction
     (including any PDBs). If this value is nil, deletion of the node will wait indefinitely
     for pods to terminate gracefully before finalizing cleanup. Set this value to ensure
     that nodes are deleted after a certain amount of time if your cluster operating
     practices require it. It can also be used to provide a hard limit on how long users
     can extend the run time of jobs during node deletion. By default (nil), node deletion
     waits for the graceful termination of all pods.
```
## Code Definition
```go
type Disruption struct {
	// ... other Disruption fields elided ...

	// TerminationGracePeriodSeconds is a nillable duration, parsed as a metav1.Duration.
	// A nil value means there is no timeout before pods will be forcibly evicted during node deletion.
	TerminationGracePeriodSeconds *NillableDuration `json:"terminationGracePeriodSeconds" hash:"ignore"`
}
```

## Validation/Defaults
The `terminationGracePeriodSeconds` field accepts a common duration string and defaults to `nil`. Omitting the field results in the default value, which instructs the controller to wait indefinitely for pods to drain gracefully, maintaining the existing behavior.
A value of `0` instructs the controller to evict all pods by force immediately, matching the behavior of `pod.spec.terminationGracePeriodSeconds`. The sketch below illustrates how the controller might interpret each case.
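
The following is a minimal sketch, not the proposed implementation, of how a controller could interpret the three cases (nil, zero, and a positive duration); the `drainDeadline` helper and its signature are illustrative assumptions.
```go
package disruption

import "time"

// drainDeadline sketches how the field could be interpreted during node deletion:
// a nil grace period means "wait indefinitely" (today's behavior), zero means the
// deadline is the deletion timestamp itself (force-drain immediately), and a
// positive duration sets a hard deadline relative to the deletion timestamp.
func drainDeadline(deletionTimestamp time.Time, gracePeriod *time.Duration) (time.Time, bool) {
	if gracePeriod == nil {
		return time.Time{}, false // no deadline; wait for graceful termination
	}
	return deletionTimestamp.Add(*gracePeriod), true
}
```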

## Prior Art
This is the field from CAPI's MachineDeployment spec which implements similar behavior:
```
$ kubectl explain machinedeployment.spec.template.spec.nodeDrainTimeout
KIND: MachineDeployment
VERSION: cluster.x-k8s.io/v1beta1
FIELD: nodeDrainTimeout <string>
DESCRIPTION:
     NodeDrainTimeout is the total amount of time that the controller will spend
     on draining a node. The default value is 0, meaning that the node can be
     drained without any time limitations. NOTE: NodeDrainTimeout is different
     from `kubectl drain --timeout`
```

```
$ kubectl explain pod.spec.terminationGracePeriodSeconds
KIND: Pod
VERSION: v1
FIELD: terminationGracePeriodSeconds <integer>
DESCRIPTION:
     Optional duration in seconds the pod needs to terminate gracefully. May be
     decreased in delete request. Value must be non-negative integer. The value
     zero indicates stop immediately via the kill signal (no opportunity to shut
     down). If this value is nil, the default grace period will be used instead.
     The grace period is the duration in seconds after the processes running in
     the pod are sent a termination signal and the time when the processes are
     forcibly halted with a kill signal. Set this value longer than the expected
     cleanup time for your process. Defaults to 30 seconds.
```

## Implementation

### Termination Grace Period Expiration Detection
Because Karpenter already ensures that nodes and their NodeClaims are in a deleting state before performing a node drain during node deletion, we should be able to leverage the existing `deletionTimestamp` fields to avoid the need for an additional annotation or other tracking label.
I believe we should use the NodeClaim `deletionTimestamp` specifically to avoid depending on the Node's `deletionTimestamp`, because [Kubernetes docs](https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/) recommend draining a node *before* deleting it, which is counter to how Karpenter behaves today (relying on Node finalizers to handle cleanup).

If drains are not already periodically requeued, we may need to modify the current drain wait logic to periodically check whether a node has exceeded its termination grace period, as sketched below.
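
A minimal sketch of that expiration check, assuming the grace period is resolved from the owning NodePool and compared against the NodeClaim's `deletionTimestamp`; the function name and signature are illustrative rather than Karpenter's actual API.
```go
package termination

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// gracePeriodExpired reports whether the configured termination grace period has
// elapsed since the NodeClaim entered a deleting state. A nil deletionTimestamp
// (not deleting) or a nil gracePeriod (feature not configured) never expires.
func gracePeriodExpired(deletionTimestamp *metav1.Time, gracePeriod *time.Duration, now time.Time) bool {
	if deletionTimestamp == nil || gracePeriod == nil {
		return false
	}
	return now.After(deletionTimestamp.Add(*gracePeriod))
}
```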

### Termination Grace Period Expiration Behavior
1. Node deletion occurs (user initiated, node rotation from drift, etc.).
2. The NodeClaim `deletionTimestamp` is set.
3. The standard node drain process begins.
4. Time passes and the configured terminationGracePeriod is exceeded for the current node.
5. Remaining pods are forcibly drained, ignoring eviction (bypassing PDBs, preStop hooks, pod terminationGracePeriod, etc.).
6. The node is terminated at the cloud provider, regardless of the status of remaining pods (such as daemonsets that couldn't be force drained).

"Force drain" behavior should be similar to using kubectl to bypass standard eviction protections, as shown below.
This [official mechanism](https://kubernetes.io/blog/2022/12/16/kubernetes-1-26-non-graceful-node-shutdown-beta/#how-does-it-work) is also worth considering as a way to forcibly terminate all pods on the node, though it is beta in 1.26 and may not be available in all clusters.
```
$ kubectl drain my-node --grace-period=0 --disable-eviction=true
$ kubectl drain --help
Options:
    --grace-period=-1:
        Period of time in seconds given to each pod to terminate gracefully. If negative, the default value specified
        in the pod will be used.
    --disable-eviction=false:
        Force drain to use delete, even if eviction is supported. This will bypass checking PodDisruptionBudgets, use
        with caution.
```