# Disruption Termination Grace Period

## Motivation
Users are requesting the ability to control how long Karpenter will wait for the deletion of an individual node to complete before forcefully terminating the node regardless of pod status ([#743](https://github.com/kubernetes-sigs/karpenter/issues/743)). This supports two primary use cases:
* Cluster admins who want to ensure that nodes are cycled after a given period of time, regardless of user-defined disruption controls (such as PDBs or preStop hooks) that might prevent eviction of a pod beyond the configured limit. This could be to satisfy security requirements or for convenience.
* Cluster admins who want to allow users to protect long-running jobs from being interrupted by node disruption, up to a configured limit.

This design builds on the existing disruption-controls design, adding a new `terminationGracePeriodSeconds` field to the `v1beta1/NodePool` `Disruption` spec. This field fits intuitively alongside the other disruption controls that already exist within the disruption block.
## Proposed Spec

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec: # This is not a complete NodePool Spec.
  disruption:
    consolidationPolicy: WhenUnderutilized || WhenEmpty
    consolidateAfter: 10m || Never # metav1.Duration
    expireAfter: 10m || Never # Equivalent to v1alpha5 TTLSecondsUntilExpired
    terminationGracePeriodSeconds: 24h || nil
```
```
$ kubectl explain nodepool.spec.disruption.terminationGracePeriodSeconds
KIND: NodePool
VERSION: v1beta1

FIELD: terminationGracePeriodSeconds <integer>

DESCRIPTION:
     Optional duration in seconds the node is given to terminate gracefully. May be
     decreased in delete request. Value must be a non-negative integer. The value
     zero indicates that all pods should be terminated immediately, bypassing eviction
     (including any PDBs). If this value is nil, deletion of the node will wait indefinitely
     for pods to terminate gracefully before finalizing cleanup. Set this value to ensure
     that nodes are deleted after a certain amount of time if your cluster operating
     practices require it. It can also be used to provide a hard limit on how long users
     can extend the run time of jobs during node deletion. By default (nil), node deletion
     waits for the graceful termination of all pods.
```
## Code Definition
```go
type Disruption struct {
	// ... other Disruption fields elided ...

	// TerminationGracePeriodSeconds is a nillable duration, parsed as a metav1.Duration.
	// A nil value means there is no timeout before pods will be forcibly evicted during node deletion.
	TerminationGracePeriodSeconds *NillableDuration `json:"terminationGracePeriodSeconds" hash:"ignore"`
}
```

## Validation/Defaults
The `terminationGracePeriodSeconds` field accepts a common duration string and defaults to `nil`. Omitting the field results in the default value, which instructs the controller to wait indefinitely for pods to drain gracefully, maintaining the existing behavior.
A value of `0` instructs the controller to evict all pods by force immediately, matching the behavior of `pod.spec.terminationGracePeriodSeconds`. The sketch below illustrates how the controller might interpret each case.
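
The following is a minimal sketch, not the proposed implementation, of how a controller could interpret the three cases (nil, zero, and a positive duration); the `drainDeadline` helper and its signature are illustrative assumptions.
```go
package disruption

import "time"

// drainDeadline sketches how the field could be interpreted during node deletion:
// a nil grace period means "wait indefinitely" (today's behavior), zero means the
// deadline is the deletion timestamp itself (force-drain immediately), and a
// positive duration sets a hard deadline relative to the deletion timestamp.
func drainDeadline(deletionTimestamp time.Time, gracePeriod *time.Duration) (time.Time, bool) {
	if gracePeriod == nil {
		return time.Time{}, false // no deadline; wait for graceful termination
	}
	return deletionTimestamp.Add(*gracePeriod), true
}
```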

## Prior Art
This is the field from CAPI's MachineDeployment spec which implements similar behavior:
```
$ kubectl explain machinedeployment.spec.template.spec.nodeDrainTimeout
KIND: MachineDeployment
VERSION: cluster.x-k8s.io/v1beta1
FIELD: nodeDrainTimeout <string>
DESCRIPTION:
     NodeDrainTimeout is the total amount of time that the controller will spend
     on draining a node. The default value is 0, meaning that the node can be
     drained without any time limitations. NOTE: NodeDrainTimeout is different
     from `kubectl drain --timeout`
```

```
$ kubectl explain pod.spec.terminationGracePeriodSeconds
KIND: Pod
VERSION: v1
FIELD: terminationGracePeriodSeconds <integer>
DESCRIPTION:
     Optional duration in seconds the pod needs to terminate gracefully. May be
     decreased in delete request. Value must be non-negative integer. The value
     zero indicates stop immediately via the kill signal (no opportunity to shut
     down). If this value is nil, the default grace period will be used instead.
     The grace period is the duration in seconds after the processes running in
     the pod are sent a termination signal and the time when the processes are
     forcibly halted with a kill signal. Set this value longer than the expected
     cleanup time for your process. Defaults to 30 seconds.
```

## Implementation

### Termination Grace Period Expiration Detection
Because Karpenter already ensures that nodes and their NodeClaims are in a deleting state before performing a node drain during node deletion, we should be able to leverage the existing `deletionTimestamp` fields to avoid the need for an additional annotation or other tracking label.
I believe we should use the NodeClaim `deletionTimestamp` specifically to avoid depending on the Node's `deletionTimestamp`, because [Kubernetes docs](https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/) recommend draining a node *before* deleting it, which is counter to how Karpenter behaves today (relying on Node finalizers to handle cleanup).

If drains are not already periodically requeued, we may need to modify the current drain wait logic to periodically check whether a node has exceeded its termination grace period, as sketched below.
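
A minimal sketch of that expiration check, assuming the grace period is resolved from the owning NodePool and compared against the NodeClaim's `deletionTimestamp`; the function name and signature are illustrative rather than Karpenter's actual API.
```go
package termination

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// gracePeriodExpired reports whether the configured termination grace period has
// elapsed since the NodeClaim entered a deleting state. A nil deletionTimestamp
// (not deleting) or a nil gracePeriod (feature not configured) never expires.
func gracePeriodExpired(deletionTimestamp *metav1.Time, gracePeriod *time.Duration, now time.Time) bool {
	if deletionTimestamp == nil || gracePeriod == nil {
		return false
	}
	return now.After(deletionTimestamp.Add(*gracePeriod))
}
```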

### Termination Grace Period Expiration Behavior
1. Node deletion occurs (user initiated, node rotation from drift, etc.).
2. The NodeClaim `deletionTimestamp` is set.
3. The standard node drain process begins.
4. Time passes and the configured terminationGracePeriod is exceeded for the current node.
5. Remaining pods are forcibly drained, ignoring eviction (bypassing PDBs, preStop hooks, pod terminationGracePeriod, etc.).
6. The node is terminated at the cloud provider, regardless of the status of remaining pods (such as daemonsets that couldn't be force drained).

"Force drain" behavior should be similar to using kubectl to bypass standard eviction protections, as shown below.
This [official mechanism](https://kubernetes.io/blog/2022/12/16/kubernetes-1-26-non-graceful-node-shutdown-beta/#how-does-it-work) is also worth considering as a way to forcibly terminate all pods on the node, though it is beta in 1.26 and may not be available in all clusters.
```
$ kubectl drain my-node --grace-period=0 --disable-eviction=true
$ kubectl drain --help
Options:
    --grace-period=-1:
        Period of time in seconds given to each pod to terminate gracefully. If negative, the default value specified
        in the pod will be used.
    --disable-eviction=false:
        Force drain to use delete, even if eviction is supported. This will bypass checking PodDisruptionBudgets, use
        with caution.
```