diff --git a/designs/disruption-termination-grace-period.md b/designs/disruption-termination-grace-period.md
new file mode 100644
index 0000000000..84145b5376
--- /dev/null
+++ b/designs/disruption-termination-grace-period.md
@@ -0,0 +1,123 @@
+# Disruption Termination Grace Period
+
+## Motivation
+Users are requesting the ability to control how long Karpenter will wait for the deletion of an individual node to complete before forcefully terminating the node regardless of pod status ([#743](https://github.com/kubernetes-sigs/karpenter/issues/743)). This supports two primary use cases:
+* Cluster admins who want to ensure that nodes are cycled after a given period of time, regardless of user-defined disruption controls (such as PDBs or preStop hooks) that might prevent eviction of a pod beyond the configured limit. This could be to satisfy security requirements or for convenience.
+* Cluster admins who want to allow users to protect long-running jobs from being interrupted by node disruption, up to a configured limit.
+
+This design builds on the existing disruption-controls design, adding a new `terminationGracePeriodSeconds` field to the `v1beta1/NodePool` `Disruption` spec. The field is intuitively related to the other disruption controls that already exist within the disruption block.
+
+## Proposed Spec
+
+```yaml
+apiVersion: karpenter.sh/v1beta1
+kind: NodePool
+metadata:
+  name: default
+spec: # This is not a complete NodePool Spec.
+  disruption:
+    consolidationPolicy: WhenUnderutilized || WhenEmpty
+    consolidateAfter: 10m || Never # metav1.Duration
+    expireAfter: 10m || Never # Equivalent to v1alpha5 TTLSecondsUntilExpired
+    terminationGracePeriodSeconds: 24h || nil
+```
+
+```
+$ kubectl explain nodepool.spec.disruption.terminationGracePeriodSeconds
+KIND: NodePool
+VERSION: v1beta1
+
+FIELD: terminationGracePeriodSeconds
+
+DESCRIPTION:
+    Optional duration in seconds the node is provided to terminate gracefully. May be
+    decreased in delete request. Value must be non-negative integer. The value
+    zero indicates that all pods should be terminated immediately, bypassing eviction
+    (including any PDBs). If this value is nil, deletion of the node will wait indefinitely
+    for pods to terminate gracefully before finalizing cleanup. Set this value to ensure
+    that nodes are deleted after a certain amount of time if your cluster operation
+    practices need it. Can also be used to provide a hard limit on how long users can
+    extend the run time of jobs during node deletion. By default (nil), node deletion
+    waits for the graceful termination of all pods.
+```
+
+## Code Definition
+
+```go
+type Disruption struct {
+	{...}
+	// TerminationGracePeriodSeconds is a nillable duration, parsed as a metav1.Duration.
+	// A nil value means there is no timeout before pods will be forcibly evicted during node deletion.
+	TerminationGracePeriodSeconds *NillableDuration `json:"terminationGracePeriodSeconds" hash:"ignore"`
+}
+```
+
+## Validation/Defaults
+The `terminationGracePeriodSeconds` field accepts a common duration string and defaults to `nil`. Omitting the field results in the default value, which instructs the controller to wait indefinitely for pods to drain gracefully and maintains the existing behavior.
+A value of `0` instructs the controller to evict all pods by force immediately, matching the behavior of `pod.spec.terminationGracePeriodSeconds`.
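+
+The nil / zero / positive semantics could be read roughly as in the sketch below, treating the resolved value as a `*metav1.Duration` (the NillableDuration is parsed as one). The helper name and signature are illustrative only, not part of the proposal.
+
+```go
+package disruption
+
+import (
+	"time"
+
+	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
+)
+
+// gracePeriodDeadline returns the time after which remaining pods may be forcibly
+// evicted, and false when no deadline applies (nil value: wait indefinitely).
+func gracePeriodDeadline(deletionTimestamp metav1.Time, gracePeriod *metav1.Duration) (time.Time, bool) {
+	if gracePeriod == nil {
+		// nil: preserve today's behavior and wait indefinitely for graceful eviction.
+		return time.Time{}, false
+	}
+	// zero: the deadline is the deletion timestamp itself, so eviction is bypassed immediately.
+	// positive: force eviction once the grace period has elapsed since deletion began.
+	return deletionTimestamp.Add(gracePeriod.Duration), true
+}
+```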
+
+## Prior Art
+The following fields implement similar behavior: CAPI's MachineDeployment `nodeDrainTimeout` and the core `pod.spec.terminationGracePeriodSeconds`.
+
+```
+$ kubectl explain machinedeployment.spec.template.spec.nodeDrainTimeout
+KIND: MachineDeployment
+VERSION: cluster.x-k8s.io/v1beta1
+
+FIELD: nodeDrainTimeout
+
+DESCRIPTION:
+    NodeDrainTimeout is the total amount of time that the controller will spend
+    on draining a node. The default value is 0, meaning that the node can be
+    drained without any time limitations. NOTE: NodeDrainTimeout is different
+    from `kubectl drain --timeout`
+```
+
+```
+$ kubectl explain pod.spec.terminationGracePeriodSeconds
+KIND: Pod
+VERSION: v1
+
+FIELD: terminationGracePeriodSeconds
+
+DESCRIPTION:
+    Optional duration in seconds the pod needs to terminate gracefully. May be
+    decreased in delete request. Value must be non-negative integer. The value
+    zero indicates stop immediately via the kill signal (no opportunity to shut
+    down). If this value is nil, the default grace period will be used instead.
+    The grace period is the duration in seconds after the processes running in
+    the pod are sent a termination signal and the time when the processes are
+    forcibly halted with a kill signal. Set this value longer than the expected
+    cleanup time for your process. Defaults to 30 seconds.
+```
+
+## Implementation
+
+### Termination Grace Period Expiration Detection
+Because Karpenter already ensures that nodes and their NodeClaims are in a deleting state before performing a node drain during node deletion, we should be able to leverage the existing `deletionTimestamp` fields and avoid the need for an additional annotation or other tracking label.
+I believe we should use the NodeClaim `deletionTimestamp` specifically, rather than depending on the Node's `deletionTimestamp`, because [Kubernetes docs](https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/) recommend draining a node *before* deleting it, which is counter to how Karpenter behaves today (relying on Node finalizers to handle cleanup).
+
+If drains are not already periodically requeued, we may need to modify the current drain wait logic to periodically check whether a node has exceeded its termination grace period.
+
+### Termination Grace Period Expiration Behavior
+1. Node deletion occurs (user initiated, node rotation from drift, etc.).
+2. The NodeClaim `deletionTimestamp` is set.
+3. The standard node drain process begins.
+4. Time passes, and the configured termination grace period is exceeded for the current node.
+5. Remaining pods are forcibly drained, ignoring eviction (bypassing PDBs, preStop hooks, pod terminationGracePeriod, etc.).
+6. The node is terminated at the cloud provider, regardless of the status of remaining pods (such as daemonsets that couldn't be force drained).
+
+"Force Drain" behavior should be similar to using kubectl to bypass standard eviction protections.
+This [official mechanism](https://kubernetes.io/blog/2022/12/16/kubernetes-1-26-non-graceful-node-shutdown-beta/#how-does-it-work) is also worth considering as a way to forcibly terminate all pods on the node; it's beta in 1.26, so it may not be available in all clusters.
+```
+$ kubectl drain my-node --grace-period=0 --disable-eviction=true
+
+$ kubectl drain --help
+Options:
+    --grace-period=-1:
+	Period of time in seconds given to each pod to terminate gracefully. If negative, the default value specified
+	in the pod will be used.
+
+    --disable-eviction=false:
+	Force drain to use delete, even if eviction is supported. This will bypass checking PodDisruptionBudgets, use
+	with caution.
+```
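+
+For illustration, below is a minimal sketch of a force-drain step under these assumptions: remaining pods are deleted directly with a zero grace period rather than evicted, the rough equivalent of `--disable-eviction --grace-period=0`. The `forceDrain` helper and the field-selector listing are hypothetical, not Karpenter's actual implementation.
+
+```go
+package termination
+
+import (
+	"context"
+
+	apierrors "k8s.io/apimachinery/pkg/api/errors"
+	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
+	"k8s.io/client-go/kubernetes"
+)
+
+// forceDrain deletes every pod still bound to the node with a zero grace period,
+// bypassing eviction (and therefore PodDisruptionBudgets) entirely.
+func forceDrain(ctx context.Context, kubeClient kubernetes.Interface, nodeName string) error {
+	pods, err := kubeClient.CoreV1().Pods("").List(ctx, metav1.ListOptions{
+		FieldSelector: "spec.nodeName=" + nodeName,
+	})
+	if err != nil {
+		return err
+	}
+	zero := int64(0)
+	for _, pod := range pods.Items {
+		// Delete (not Evict) so that PDBs cannot block removal of the pod.
+		if err := kubeClient.CoreV1().Pods(pod.Namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{
+			GracePeriodSeconds: &zero,
+		}); err != nil && !apierrors.IsNotFound(err) {
+			return err
+		}
+	}
+	return nil
+}
+```
+
+Whether a direct delete like this or the non-graceful node shutdown mechanism linked above is the better fit can be decided during implementation.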