RFC: Karpenter Node Auto Repair

kubernetes-sigs · Oct 31, 2024 · ef99874 · ef99874
1 parent 3f47544
commit ef99874
Showing 1 changed file with 95 additions and 0 deletions.
diff --git a/designs/node-repair.md b/designs/node-repair.md
@@ -0,0 +1,95 @@
+# Node Auto Repair 
+
+## Problem Statement
+
+Nodes can experience failure modes that degrade the underlying hardware, filesystem, or container environment. Some of these failure modes are surfaced through the Node object (consider referencing some of them) while others are not surfaced at all (consider referencing some of these). Node Problem Detector offers a way to surface these failures as additional status conditions on the Node object.
+
+In either case, even when a status condition on the Node indicates that the Node is unhealthy, Karpenter does not currently react to it in any way.
+
+* Mega Issue: https://github.com/kubernetes-sigs/karpenter/issues/750
+    * Related (Unreachable): https://github.com/aws/karpenter-provider-aws/issues/2570
+    * Related (Remove by taints): https://github.com/aws/karpenter-provider-aws/issues/2544
+    * Related (Known resources are not registered), fixed by v0.28.0: https://github.com/aws/karpenter-provider-aws/issues/3794
+    * Related (Stuck on NotReady): https://github.com/aws/karpenter-provider-aws/issues/2439
+
+#### Out of scope
+
+The solution proposed below is intentionally opinionated and will not expose configuration to customers. The team does not yet have enough data to determine the right configuration surface for users. The implementation will not consider budgets, an API surface for repairing unhealthy nodes, or an option to reboot nodes instead of replacing them. The solution will also not provide any Karpenter-based approach to AZ resiliency. **The feature will be gated behind an alpha feature flag so that customer feedback can be gathered and changes can be made in the future.**
+
+### Option 1 (recommended): Unhealthy condition set cloud provider interface
+
+```
+type RepairPolicy struct {
+    // Type of unhealthy state that is found on the node
+    Type metav1.ConditionType 
+    // Status condition of when a node is unhealthy
+    Status metav1.ConditionStatus
+    // TolerationDuration is the duration the controller will wait
+    // before attempting to terminate nodes that are marked for repair.
+    TolerationDuration time.Duration
+}
+
+type CloudProvider interface {
+  ...
+    // RepairPolicy is for CloudProviders to define a set of unhealthy conditions for Karpenter
+    // to monitor on the node. Customer will need 
+    RepairPolicy() []v1.RepairPolicy
+  ...
+}
+```
+
+The `RepairPolicy` will be a set of conditions that the Karpenter controller will watch. The cloud provider can define compatibility with any node diagnostic agent by tracking a list of unhealthy node condition types, each with a duration to wait before an unhealthy state is considered terminal:
+
+1. A diagnostic agent, such as [node problem detector](https://github.com/kubernetes/node-problem-detector), adds a status condition to a node
+2. Karpenter reconciles nodes and matches unhealthy node conditions against the unhealthy conditions defined by the cloud provider
+3. Matching unhealthy conditions are added to the NodeClaim
+4. The disruption controller forcefully terminates the NodeClaim once the node has been in an unhealthy state for the `TolerationDuration` of the unhealthy condition
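
The matching in steps 2 to 4 can be sketched roughly as follows. This is a minimal, dependency-free illustration, not the proposed implementation: plain strings stand in for the Kubernetes condition types, and `ObservedCondition`, its `UnhealthyFor` field, and `MatchRepairPolicy` are hypothetical names introduced only for this example. A real controller would presumably derive the elapsed time from the condition's last transition time.

```go
package main

import (
	"fmt"
	"time"
)

// RepairPolicy mirrors the struct proposed above. Plain strings replace the
// Kubernetes condition types so the sketch compiles without k8s.io packages
// (an assumption made for illustration only).
type RepairPolicy struct {
	Type               string
	Status             string
	TolerationDuration time.Duration
}

// ObservedCondition is a hypothetical, simplified view of a node status
// condition, including how long the node has been in that state.
type ObservedCondition struct {
	Type         string
	Status       string
	UnhealthyFor time.Duration
}

// MatchRepairPolicy returns the first policy whose type and status match an
// observed condition that has persisted for at least TolerationDuration.
func MatchRepairPolicy(conds []ObservedCondition, policies []RepairPolicy) (RepairPolicy, bool) {
	for _, p := range policies {
		for _, c := range conds {
			if c.Type == p.Type && c.Status == p.Status && c.UnhealthyFor >= p.TolerationDuration {
				return p, true
			}
		}
	}
	return RepairPolicy{}, false
}

func main() {
	policies := []RepairPolicy{
		{Type: "Ready", Status: "False", TolerationDuration: 30 * time.Minute},
	}
	conds := []ObservedCondition{
		{Type: "Ready", Status: "False", UnhealthyFor: 45 * time.Minute},
	}
	if p, ok := MatchRepairPolicy(conds, policies); ok {
		fmt.Printf("terminate: %s=%s exceeded %s\n", p.Type, p.Status, p.TolerationDuration)
	}
}
```

A condition that matches on type and status but has not yet exceeded its toleration yields no match, so the node is left alone until a later reconcile.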
+
+### Option 2: Cloud provider interface for determining node health
+
+```
+type UnhealthyReason string
+
+type CloudProvider interface {
+  ...
+    // IsHealthy returns whether a NodeClaim is considered operating as expected
+    IsHealthy(context.Context, *v1.NodeClaim) (UnhealthyReason, error)
+  ...
+}
+```
+
+The `IsHealthy` cloud provider method allows each implementation of Karpenter to determine the health of a NodeClaim and return an `UnhealthyReason` indicating that action on the node is required. `IsHealthy` lets cloud providers decide when, and on what condition, to clean up the node. Cloud providers will also need to account for short-lived failures before determining that a node is unhealthy. `IsHealthy` follows the format defined by `IsDrifted`.
+The main drawback of this option is that providers will need to implement much more undifferentiated logic, without the ability to model duration and remediation on nodes. The reconciliation steps for replacing an unhealthy node will look like:
+
+1. Karpenter reconciles the NodeClaim and calls the cloud provider's `IsHealthy`
+2. If the node is determined to be unhealthy, a status condition is added to the NodeClaim
+3. The disruption controller force terminates the NodeClaim immediately
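
As a rough sketch of this flow, the snippet below wires a fake provider into a reconcile step. Everything other than `UnhealthyReason` and the `IsHealthy` signature is hypothetical: `NodeClaim` is a simplified stand-in for `v1.NodeClaim`, and `HealthChecker`, `fakeProvider`, and `reconcile` exist only for illustration. It also assumes that an empty `UnhealthyReason` means the NodeClaim is healthy, which the RFC does not state explicitly.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

type UnhealthyReason string

// NodeClaim is a hypothetical, simplified stand-in for v1.NodeClaim.
type NodeClaim struct {
	Name       string
	Conditions []string
}

// HealthChecker is the part of the CloudProvider interface relevant here.
type HealthChecker interface {
	IsHealthy(context.Context, *NodeClaim) (UnhealthyReason, error)
}

// fakeProvider is a hypothetical provider that flags NodeClaims by name.
type fakeProvider struct {
	unhealthy map[string]UnhealthyReason
}

func (f fakeProvider) IsHealthy(_ context.Context, nc *NodeClaim) (UnhealthyReason, error) {
	if nc == nil {
		return "", errors.New("nil NodeClaim")
	}
	// Assumption: an empty reason means the NodeClaim is healthy.
	return f.unhealthy[nc.Name], nil
}

// reconcile asks the provider for a verdict, records a condition on the
// NodeClaim when it is unhealthy, and reports whether it should be force
// terminated.
func reconcile(ctx context.Context, cp HealthChecker, nc *NodeClaim) (bool, error) {
	reason, err := cp.IsHealthy(ctx, nc)
	if err != nil {
		return false, err
	}
	if reason == "" {
		return false, nil
	}
	nc.Conditions = append(nc.Conditions, "Unhealthy: "+string(reason))
	return true, nil
}

func main() {
	cp := fakeProvider{unhealthy: map[string]UnhealthyReason{"node-a": "KernelDeadlock"}}
	nc := &NodeClaim{Name: "node-a"}
	terminate, err := reconcile(context.Background(), cp, nc)
	fmt.Println(terminate, err, nc.Conditions)
}
```

Note that the sketch has nowhere to express a toleration window: any filtering of short-lived failures would have to live inside each provider's `IsHealthy`, which is the drawback described above.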
+
+### Option 3: Cloud provider injected unhealthy condition 
+
+```
+type UnhealthyType metav1.ConditionType 
+type UnhealthyRepairAfter time.Duration
+
+// UnhealthyConditions are the conditions to watch for on nodes inside the cluster
+var UnhealthyConditions = map[UnhealthyType]UnhealthyRepairAfter{
+    "Ready": UnhealthyRepairAfter(30 * time.Minute),
+    "DiskPressure": UnhealthyRepairAfter(10 * time.Minute),
+    ...
+}
+```
+
+The cloud provider hydrates an `UnhealthyConditions` map that holds the conditions to track on nodes. Each key is a condition type, and its `UnhealthyRepairAfter` value dictates how long a node may remain in the cluster with that condition before action is taken against it. Node health validation proceeds as follows:
+
+1. A diagnostic agent adds a status condition to a node
+2. Karpenter reconciles nodes and identifies unhealthy nodes using the `UnhealthyConditions` map
+3. The unhealthy condition is added to the NodeClaim
+4. The disruption controller forcefully terminates the NodeClaim once the node has been in an unhealthy state for the duration specified in the `UnhealthyConditions` map
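
One way a controller might use such a map is to compute how much longer a condition may persist before the node is repaired, for example to decide when to look at the node again. The sketch below assumes plain `string` keys and `time.Duration` values in place of `UnhealthyType` and `UnhealthyRepairAfter`; `timeUntilRepair` is a hypothetical helper, not part of the proposal.

```go
package main

import (
	"fmt"
	"time"
)

// unhealthyConditions uses the two example entries from the map above, with
// plain strings and time.Duration standing in for the proposed types.
var unhealthyConditions = map[string]time.Duration{
	"Ready":        30 * time.Minute,
	"DiskPressure": 10 * time.Minute,
}

// timeUntilRepair returns how much longer a condition may persist before the
// node is repaired. tracked is false when the condition type is not in the
// map; a zero remaining duration means the threshold has been reached.
func timeUntilRepair(condType string, unhealthyFor time.Duration) (remaining time.Duration, tracked bool) {
	repairAfter, ok := unhealthyConditions[condType]
	if !ok {
		return 0, false
	}
	if unhealthyFor >= repairAfter {
		return 0, true
	}
	return repairAfter - unhealthyFor, true
}

func main() {
	remaining, tracked := timeUntilRepair("DiskPressure", 4*time.Minute)
	fmt.Println(remaining, tracked)
}
```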
+
+## Recommended Solution
+
+The recommended approach is for the cloud provider interface to expose a layer of configuration that Karpenter uses to monitor nodes and take action. I propose we implement Option 1. Cloud providers can define any set of status conditions to track on the node, and an interface-based approach allows for easy extensibility. Option 2 is overly generalized, as we don’t yet have use cases that show the need for it. All solutions outlined below will be behind a feature flag to indicate that these features are not stable.
+
+### Forceful termination
+
+For the first iteration, Karpenter will implement forceful termination. Today, graceful termination in Karpenter waits for the pods on a node to be fully drained and for all volume attachments to be deleted from the node. The problem is that a node can get stuck terminating during graceful termination, because pod eviction or volume detachment may be broken. In these cases, users need to take manual action against the node or, if it is set, the termination grace period will force terminate the node. For the first iteration, the recommendation is to force terminate nodes now, and to follow up with configurability for either cloud providers or Karpenter users once we have sufficient data to support the need for graceful termination of unhealthy nodes.