# Node Auto Repair

## Problem Statement

Nodes can experience failure modes that cause degradation to the underlying hardware, filesystem, or container environment. Some of these failure modes are surfaced through the Node object (consider referencing some of them) while others are not surfaced at all (consider referencing some of these). A diagnostic agent such as [Node Problem Detector (NPD)](https://github.com/kubernetes/node-problem-detector) offers a way to surface these failures as additional status conditions on the node object.

In either case, even when a status condition surfaced through the Node indicates that it is unhealthy, Karpenter does not currently react to this unhealthiness in any way.

* Mega Issue: https://github.com/kubernetes-sigs/karpenter/issues/750
* Related (Unreachable): https://github.com/aws/karpenter-provider-aws/issues/2570
* Related (Remove by taints): https://github.com/aws/karpenter-provider-aws/issues/2544
* Related (Known resources are not registered, fixed by v0.28.0): https://github.com/aws/karpenter-provider-aws/issues/3794
* Related (Stuck on NotReady): https://github.com/aws/karpenter-provider-aws/issues/2439

#### Out of scope

The alpha implementation will not consider disruption budgets or an API surface for repairing unhealthy nodes. The team does not have enough data to determine the right level of configuration surface that users will utilize. The goal is to provide an opinionated mechanism that spares customers from having to define what an unhealthy node is, ultimately reducing configuration burden for most customers. **The feature will be gated under an alpha NodeRepair=true feature flag that customers must enable. This will allow for additional feedback from customers, and make it possible to ship subsequent changes that support the features we have considered out of scope for the Alpha stage.**

### Option 1 (recommended): Unhealthy condition set cloud provider interface

```
type RepairPolicy struct {
    // Type of unhealthy state that is found on the node
    Type metav1.ConditionType
    // Status of the condition when a node is considered unhealthy
    Status metav1.ConditionStatus
    // TolerationDuration is the duration the controller will wait
    // before attempting to terminate nodes that are marked for repair.
    TolerationDuration time.Duration
}

type CloudProvider interface {
    ...
    // RepairPolicy is for CloudProviders to define the set of unhealthy conditions for Karpenter
    // to monitor on the node.
    RepairPolicy() []v1.RepairPolicy
    ...
}
```

The `RepairPolicy` will be a set of conditions that the Karpenter controller will watch. On any given node, multiple unhealthy conditions may exist simultaneously; in those cases we will choose the shortest `TolerationDuration` among the matching conditions. The cloud provider can define compatibility with any node diagnostic agent, and track a list of unhealthy node condition types along with the duration to wait until an unhealthy state is considered terminal:

1. A diagnostic agent will add a status condition onto a node
2. Karpenter will reconcile on nodes and match unhealthy node conditions against the cloud-provider-defined unhealthy conditions
3. The node health controller will forcefully terminate the NodeClaim once the node has been in an unhealthy state for the duration specified by the `TolerationDuration` of the unhealthy condition (see the sketch below)
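
A minimal sketch of this matching step, assuming the `RepairPolicy` type above lands in the `cloudprovider` package; the function name and its placement in a controller are illustrative, not part of the proposed interface:

```
import (
    "time"

    corev1 "k8s.io/api/core/v1"

    "sigs.k8s.io/karpenter/pkg/cloudprovider"
)

// repairEligibleAt returns the earliest time at which the node may be forcefully repaired,
// or false when no cloud-provider repair policy matches an unhealthy condition.
// When multiple policies match, the shortest TolerationDuration wins.
func repairEligibleAt(node *corev1.Node, policies []cloudprovider.RepairPolicy) (time.Time, bool) {
    var earliest time.Time
    matched := false
    for _, policy := range policies {
        for _, cond := range node.Status.Conditions {
            if string(cond.Type) != string(policy.Type) || string(cond.Status) != string(policy.Status) {
                continue
            }
            eligibleAt := cond.LastTransitionTime.Add(policy.TolerationDuration)
            if !matched || eligibleAt.Before(earliest) {
                earliest, matched = eligibleAt, true
            }
        }
    }
    return earliest, matched
}
```

The repair eligibility timestamps shown in the node examples below fall directly out of this computation.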

Example

The example below shows how the AWS Karpenter provider could support Node Problem Detector:

```
func (c *CloudProvider) RepairPolicy() []cloudprovider.RepairPolicy {
    return []cloudprovider.RepairPolicy{
        {
            Type:               "Ready",
            Status:             metav1.ConditionFalse,
            TolerationDuration: 30 * time.Minute,
        },
        {
            Type:               "NetworkUnavailable",
            Status:             metav1.ConditionTrue,
            TolerationDuration: 10 * time.Minute,
        },
        ...
    }
}
```

In the example above, the AWS Karpenter provider supports monitoring and terminating on two node status conditions: the kubelet `Ready` condition and the NPD `NetworkUnavailable` condition. Below are two cases showing when we will act on nodes:

```
apiVersion: v1
kind: Node
metadata:
...
status:
  conditions:
  - lastHeartbeatTime: "2024-11-01T16:29:49Z"
    lastTransitionTime: "2024-11-01T15:02:48Z"
    message: no connection
    reason: Network is not available
    status: "True"
    type: NetworkUnavailable
  ...
# The Node here will be eligible for node repair at 2024-11-01T15:12:48Z
---
apiVersion: v1
kind: Node
metadata:
...
status:
  conditions:
  - lastHeartbeatTime: "2024-11-01T16:29:49Z"
    lastTransitionTime: "2024-11-01T15:02:48Z"
    message: network is available
    reason: NetworkIsAvailable
    status: "False"
    type: NetworkUnavailable
  - lastHeartbeatTime: "2024-11-01T16:29:49Z"
    lastTransitionTime: "2024-11-01T15:02:48Z"
    message: kubelet is not posting ready status
    reason: KubeletNotReady
    status: "False"
    type: Ready
  ...
# The Node here will be eligible for node repair at 2024-11-01T15:32:48Z
```


### Option 2: Cloud provider interface for determining node health

```
type UnhealthyReason string

type CloudProvider interface {
    ...
    // IsHealthy returns whether a NodeClaim is considered operating as expected
    IsHealthy(context.Context, *v1.NodeClaim) (UnhealthyReason, error)
    ...
}
```

The `IsHealthy` cloud provider method allows each implementation of Karpenter to determine the health state of a NodeClaim and return a specified `UnhealthyReason` indicating that action on the node is required. `IsHealthy` lets cloud providers decide when and on which conditions to act when cleaning up a node. Cloud providers will also need to account for short-lived failures before determining that a node is unhealthy. `IsHealthy` follows the format defined by `IsDrifted`.

The main drawback of this option is that providers will need to implement much more undifferentiated logic, without the ability to model toleration durations or remediation on nodes (a hypothetical provider-side sketch follows the example below). The reconciliation steps for replacing an unhealthy node will look like:

1. Karpenter will reconcile on the NodeClaim and make a cloud provider call to IsHealthy
2. If the node is determined to be unhealthy, a status condition will be added to the NodeClaim
3. The disruption controller will force terminate the NodeClaim immediately

```
apiVersion: karpenter.sh/v1
kind: NodeClaim
metadata:
...
status:
  conditions:
  - lastTransitionTime: "2024-11-04T21:06:06Z"
    status: "True"
    reason: <cloud provider returned UnhealthyReason>
    type: Unhealthy
  ...
```
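
To make the drawback concrete, here is a rough, hypothetical sketch of what a provider-side `IsHealthy` might look like. The `kubeClient` field, the hard-coded `NetworkUnavailable` rule with its 10 minute window, and the convention that an empty reason means healthy are all illustrative assumptions, not part of this design:

```
import (
    "context"
    "time"

    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/types"
    "sigs.k8s.io/controller-runtime/pkg/client"

    v1 "sigs.k8s.io/karpenter/pkg/apis/v1"
    "sigs.k8s.io/karpenter/pkg/cloudprovider"
)

// CloudProvider stands in for a provider implementation; the kubeClient field is assumed.
type CloudProvider struct {
    kubeClient client.Client
}

// IsHealthy re-implements condition matching and a "tolerate short-lived failures" window
// inside the provider, since this interface cannot express a toleration duration to Karpenter.
func (c *CloudProvider) IsHealthy(ctx context.Context, nodeClaim *v1.NodeClaim) (cloudprovider.UnhealthyReason, error) {
    node := &corev1.Node{}
    if err := c.kubeClient.Get(ctx, types.NamespacedName{Name: nodeClaim.Status.NodeName}, node); err != nil {
        return "", err
    }
    for _, cond := range node.Status.Conditions {
        // Treat NetworkUnavailable=True that has persisted for more than 10 minutes as unhealthy.
        if cond.Type == corev1.NodeNetworkUnavailable && cond.Status == corev1.ConditionTrue &&
            time.Since(cond.LastTransitionTime.Time) > 10*time.Minute {
            return cloudprovider.UnhealthyReason("NetworkUnavailable"), nil
        }
    }
    // Assumption for this sketch: an empty reason means the NodeClaim is operating as expected.
    return "", nil
}
```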

### Option 3: Cloud provider injected unhealthy condition

```
type UnhealthyType metav1.ConditionType
type UnhealthyRepairAfter time.Duration

// UnhealthyConditions are the conditions to watch for on nodes inside of the cluster
var UnhealthyConditions = map[UnhealthyType]UnhealthyRepairAfter{
    "Ready":        UnhealthyRepairAfter(30 * time.Minute),
    "DiskPressure": UnhealthyRepairAfter(10 * time.Minute),
    ...
}
```

Cloud providers hydrate an `UnhealthyConditions` map that holds the conditions to track on nodes. The map tracks each condition, and `UnhealthyRepairAfter` dictates how long the node can remain in the cluster in that state before action is taken against it (a hydration sketch follows the list below). Node health validation will proceed as follows:

1. A diagnostic agent will add a status condition onto a node
2. Karpenter will reconcile on nodes and identify unhealthy nodes as defined by the `UnhealthyConditions` map
3. An unhealthy condition is added to the NodeClaim
4. The disruption controller will forcefully terminate the NodeClaim once the node has been in an unhealthy state for the duration specified by the `UnhealthyConditions` map
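
A small sketch of how a provider might hydrate the map at startup; the `health` package location and the use of an `init` function are illustrative assumptions:

```
import (
    "time"

    "sigs.k8s.io/karpenter/pkg/health" // hypothetical package holding the shared map
)

// Hypothetical AWS provider hydration: register the node conditions the shared health
// controller should act on, and how long each one is tolerated before repair.
func init() {
    health.UnhealthyConditions[health.UnhealthyType("Ready")] = health.UnhealthyRepairAfter(30 * time.Minute)
    health.UnhealthyConditions[health.UnhealthyType("NetworkUnavailable")] = health.UnhealthyRepairAfter(10 * time.Minute)
}
```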

## Recommended Solution

The recommended approach is for the cloud provider interface to expose a layer of configuration that lets Karpenter monitor nodes and take action. I propose we implement Option 1. The cloud provider can define any set of status conditions to be tracked on the node, and an interface approach allows for easy extensibility. Option 2 risks being over-generalized, as we don't have use cases that show the need. All solutions outlined above will be gated under a `NodeRepair=true` feature flag to indicate that these features are not stable.

### Forceful termination

For a first iteration, Karpenter will implement forceful termination. Today, graceful termination in Karpenter attempts to wait for all pods to be drained from a node and all volume attachments to be deleted from it. This raises the problem that, during graceful termination, a node can get stuck terminating because pod eviction or volume detachment may be broken. In these cases, users need to take manual action against the node. **For the Alpha implementation, the recommendation is to non-gracefully force terminate unhealthy nodes. Furthermore, unhealthy nodes will not respect the customer-configured terminationGracePeriod.**

## Future considerations

The recommended solution includes a limited set of features that will improve the customer experience. There are additional features we would like to include once more user data can help drive the community decision. These include:

* Disruption controllers for unhealthy nodes
* Node Reboot
* Configuration surface for graceful vs force termination
* Additional consideration for availability zone resiliency
