# Disruption Termination Grace Period

## Motivation
Users are requesting the ability to control how long Karpenter will wait for the deletion of an individual node to complete before forcefully terminating the node regardless of pod status ([#743](https://github.com/kubernetes-sigs/karpenter/issues/743)). This supports two primary use cases.
* Cluster admins who want to ensure that nodes are cycled after a given period of time, regardless of user-defined disruption controls (such as PDBs or preStop hooks) that might prevent eviction of a pod beyond the configured limit. This could be to satisfy security requirements or for convenience.
* Cluster admins who want to allow users to protect long-running jobs from being interrupted by node disruption, up to a configured limit.

This design builds on the existing disruption-controls design, adding a new `terminationGracePeriodSeconds` field to the `v1beta1/NodePool` `Disruption` spec. The field fits naturally alongside the other disruption controls that already exist within the disruption block.

## Proposed Spec

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec: # This is not a complete NodePool Spec.
  disruption:
    consolidationPolicy: WhenUnderutilized || WhenEmpty
    consolidateAfter: 10m || Never # metav1.Duration
    expireAfter: 10m || Never # Equivalent to v1alpha5 TTLSecondsUntilExpired
    terminationGracePeriodSeconds: 24h || nil
```
```
$ kubectl explain nodepool.spec.disruption.terminationGracePeriodSeconds
KIND: NodePool
VERSION: v1beta1

FIELD: terminationGracePeriodSeconds <integer>

DESCRIPTION:
Optional duration in seconds the node is provided to terminate gracefully. May be
decreased in a delete request. Value must be a non-negative integer. The value
zero indicates that all pods should be terminated immediately, bypassing eviction
(including any PDBs). If this value is nil, deletion of the node will wait indefinitely
for pods to terminate gracefully before finalizing cleanup. Set this value to ensure
that nodes are deleted after a certain amount of time if your cluster operation
practices need it. Can also be used to provide a hard limit for users to extend the
run time of jobs during node deletion. By default (nil), node deletion waits for
the graceful termination of all pods.
```
## Code Definition
```go
type Disruption struct {
    {...}
    // TerminationGracePeriodSeconds is a nillable duration, parsed as a metav1.Duration.
    // A nil value means there is no timeout; node deletion waits indefinitely for pods to terminate gracefully.
    TerminationGracePeriodSeconds *NillableDuration `json:"terminationGracePeriodSeconds" hash:"ignore"`
}
```

## Validation/Defaults
The `terminationGracePeriodSeconds` field accepts a common duration string and defaults to `nil`. Omitting the field results in the default value, instructing the controller to wait indefinitely for pods to drain gracefully and maintaining the existing behavior.
A value of `0` instructs the controller to evict all pods by force immediately, matching the behavior of `pod.spec.terminationGracePeriodSeconds`.
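
As a rough sketch of how the `nil` default falls out of deserialization, the snippet below is illustrative only; the `NillableDuration` shown here stands in for whatever type Karpenter already defines and may differ in detail.
```go
package v1beta1

import (
	"encoding/json"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// NillableDuration is sketched here only to show the intended semantics; the
// real type referenced in the code definition above may differ.
type NillableDuration struct {
	*metav1.Duration
}

// UnmarshalJSON parses duration strings such as "24h" or "0s". When the field
// is omitted from the NodePool spec, this method is never invoked, so the
// pointer stays nil and the controller interprets it as "wait indefinitely".
func (d *NillableDuration) UnmarshalJSON(b []byte) error {
	var s string
	if err := json.Unmarshal(b, &s); err != nil {
		return err
	}
	parsed, err := time.ParseDuration(s)
	if err != nil {
		return err
	}
	d.Duration = &metav1.Duration{Duration: parsed}
	return nil
}
```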

## Prior Art
These are the fields from CAPI's MachineDeployment spec and the core Pod spec that implement similar behavior.
```
$ kubectl explain machinedeployment.spec.template.spec.nodeDrainTimeout
KIND: MachineDeployment
VERSION: cluster.x-k8s.io/v1beta1
FIELD: nodeDrainTimeout <string>
DESCRIPTION:
NodeDrainTimeout is the total amount of time that the controller will spend
on draining a node. The default value is 0, meaning that the node can be
drained without any time limitations. NOTE: NodeDrainTimeout is different
from `kubectl drain --timeout`
```

```
$ kubectl explain pod.spec.terminationGracePeriodSeconds
KIND: Pod
VERSION: v1
FIELD: terminationGracePeriodSeconds <integer>
DESCRIPTION:
Optional duration in seconds the pod needs to terminate gracefully. May be
decreased in delete request. Value must be non-negative integer. The value
zero indicates stop immediately via the kill signal (no opportunity to shut
down). If this value is nil, the default grace period will be used instead.
The grace period is the duration in seconds after the processes running in
the pod are sent a termination signal and the time when the processes are
forcibly halted with a kill signal. Set this value longer than the expected
cleanup time for your process. Defaults to 30 seconds.
```

## Implementation

### Termination Grace Period Expiration Detection
Because Karpenter already ensures that nodes and their NodeClaims are in a deleting state before performing a node drain during node deletion, we should be able to leverage the existing `deletionTimestamp` fields to avoid the need for an additional annotation or other tracking label.
I believe we should use the NodeClaim `deletionTimestamp` specifically to avoid depending on the Node's `deletionTimestamp`, because [Kubernetes docs](https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/) recommend draining a node *before* deleting it, which is counter to how Karpenter behaves today (relying on Node finalizers to handle cleanup).

If drains are not already periodically requeued, we may need to modify the current drain wait logic to periodically check whether a node has exceeded its termination grace period.
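
A minimal sketch of the deadline check, assuming the NodeClaim's `deletionTimestamp` is set and the NodePool's `terminationGracePeriodSeconds` has been resolved to a `*time.Duration`; the function names here are hypothetical, not the actual termination controller code.
```go
package termination

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// drainDeadline returns the point in time after which remaining pods should be
// forcibly drained, or nil if node deletion should wait indefinitely for a
// graceful drain (the current behavior).
func drainDeadline(deletionTimestamp *metav1.Time, gracePeriod *time.Duration) *time.Time {
	if deletionTimestamp == nil || gracePeriod == nil {
		return nil
	}
	deadline := deletionTimestamp.Add(*gracePeriod)
	return &deadline
}

// shouldForceDrain is evaluated on each requeue of the drain loop; once the
// deadline has passed, eviction is abandoned in favor of forced deletion.
func shouldForceDrain(now time.Time, deadline *time.Time) bool {
	return deadline != nil && now.After(*deadline)
}
```
If the existing drain logic does not already requeue on an interval, the requeue delay could be set to the time remaining until this deadline so the force path triggers promptly once the grace period expires.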

### Termination Grace Period Expiration Behavior
1. Node deletion occurs (user initiated, node rotation from drift, etc).
2. NodeClaim deletionTimestamp is set.
3. Standard node drain process begins.
4. Time passes, and the configured terminationGracePeriod is exceeded for the current node.
5. Remaining pods are forcibly drained, ignoring eviction (bypassing PDBs, preStop hooks, pod terminationGracePeriod, etc.).
6. Node is terminated at the cloud provider, regardless of the status of remaining pods (such as daemonsets that couldn't be force drained).

"Force Drain" behavior should be similar to using kubectl to bypass standard eviction protections.
This [official mechanism](https://kubernetes.io/blog/2022/12/16/kubernetes-1-26-non-graceful-node-shutdown-beta/#how-does-it-work) is worth considering as a way to forcibly terminate all pods on the node. It's beta in 1.26, so it may not be available in all clusters.
```
$ kubectl drain my-node --grace-period=0 --disable-eviction=true
$ kubectl drain --help
Options:
--grace-period=-1:
Period of time in seconds given to each pod to terminate gracefully. If negative, the default value specified
in the pod will be used.
--disable-eviction=false:
Force drain to use delete, even if eviction is supported. This will bypass checking PodDisruptionBudgets, use
with caution.
```
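
For comparison, a sketch of what the programmatic equivalent of this force path could look like using client-go, deleting pods directly with a zero grace period so that the eviction API (and therefore PDB checks) is bypassed. The `forceDrain` helper is an assumption for illustration, and details such as skipping DaemonSet-managed and mirror pods are deliberately omitted.
```go
package termination

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// forceDrain deletes every pod still bound to the node with gracePeriodSeconds=0.
// Deleting (rather than evicting) skips PodDisruptionBudget checks, and the zero
// grace period skips preStop hooks and the pod's own terminationGracePeriodSeconds.
func forceDrain(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{
		FieldSelector: fmt.Sprintf("spec.nodeName=%s", nodeName),
	})
	if err != nil {
		return err
	}
	gracePeriod := int64(0)
	for _, pod := range pods.Items {
		if err := client.CoreV1().Pods(pod.Namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{
			GracePeriodSeconds: &gracePeriod,
		}); err != nil {
			return err
		}
	}
	return nil
}
```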
