Taint nodes when termination notification is detected #160

diversario · 2020-05-13T08:01:13Z

Similar to how cluster-autoscaler uses taints when marking nodes for scale down, aws-node-termination-handler should taint nodes which will be terminated, in addition to everything it does currently. This issue is slightly similar to #123 but the proposal is to add to the current behavior.

The reason for this is to make it possible to detect programmatically whether a spot node became unschedulable in k8s because of termination rather than because of cluster issues. For example, when using prometheus-operator, the default KubeNodeUnreachable alert looks like this:

kube_node_spec_taint{effect="NoSchedule",job="kube-state-metrics",key="node.kubernetes.io/unreachable"}

This alert often misfires for nodes being scaled down by the cluster-autoscaler. To fix the problem, we can exclude nodes that carry the cluster-autoscaler taint:

kube_node_spec_taint{effect="NoSchedule",job="kube-state-metrics",key="node.kubernetes.io/unreachable"} unless on(node) kube_node_spec_taint{effect="NoSchedule",job="kube-state-metrics",key="ToBeDeletedByClusterAutoscaler"}

If aws-node-termination-handler tainted nodes as well, that taint could be incorporated into the alert in the same way.

I'm happy to PR this if there are no objections.

bwagner5 · 2020-05-13T14:43:40Z

That sounds like a good addition to the project. A PR is certainly welcome!

We already have a node labeling function defined, so it shouldn't be too hard to add a taint function in the node pkg.

Also, should the taint only be applied to spot interruption termination notices (ITNs), or should it also apply to EC2 scheduled maintenance events? Maybe it would make sense to have two separate taints, aws-node-termination-handler/spot-itn and aws-node-termination-handler/scheduled-maintenance?

If we include tainting at this point in the code, we'll have the information needed to differentiate between a maintenance event and a spot ITN. Alternatively, the taint could be added in each drain event's PreDrainTask hook.
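As a rough illustration of the idea above, here is a minimal Go sketch of a taint helper that picks a different taint key per event type. It is not the project's actual code: the `Taint` struct is a local stand-in for the Kubernetes taint type so the example compiles with the standard library only, and `EventKind`, `taintKeyFor`, and `AddTaint` are hypothetical names. The two taint keys are the ones floated in this comment, not confirmed final values.

```go
package main

import "fmt"

// Taint is a local stand-in for the key/value/effect fields of a
// Kubernetes node taint, so this sketch needs no external modules.
type Taint struct {
	Key    string
	Value  string
	Effect string
}

// Hypothetical taint keys, taken from the suggestion in this thread.
const (
	SpotITNTaintKey              = "aws-node-termination-handler/spot-itn"
	ScheduledMaintenanceTaintKey = "aws-node-termination-handler/scheduled-maintenance"
)

// EventKind is a hypothetical enum for the two event sources discussed here.
type EventKind int

const (
	SpotITN EventKind = iota
	ScheduledMaintenance
)

// taintKeyFor maps an event kind to the taint key that should mark the node.
func taintKeyFor(kind EventKind) (string, error) {
	switch kind {
	case SpotITN:
		return SpotITNTaintKey, nil
	case ScheduledMaintenance:
		return ScheduledMaintenanceTaintKey, nil
	default:
		return "", fmt.Errorf("unknown event kind %d", kind)
	}
}

// AddTaint returns the taint list with a NoSchedule taint for key appended.
// If a NoSchedule taint with that key is already present, the list is
// returned unchanged, so repeated notifications do not duplicate the taint.
func AddTaint(taints []Taint, key, value string) []Taint {
	for _, t := range taints {
		if t.Key == key && t.Effect == "NoSchedule" {
			return taints
		}
	}
	return append(taints, Taint{Key: key, Value: value, Effect: "NoSchedule"})
}

func main() {
	key, err := taintKeyFor(SpotITN)
	if err != nil {
		panic(err)
	}
	taints := AddTaint(nil, key, "")
	taints = AddTaint(taints, key, "") // second call is a no-op
	fmt.Println(len(taints), taints[0].Key)
}
```

In a real implementation the updated taint list would have to be written back to the node object through the Kubernetes API; that step is left out here. With separate keys, an alert rule could exclude spot interruptions and scheduled maintenance independently.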

bwagner5 added the "Type: Enhancement" label May 13, 2020

diversario added a commit to diversario/aws-node-termination-handler that referenced this issue May 15, 2020

Taint nodes on spot and scheduled events

77ff34a

Fixes aws#160. Signed-off-by: Ilya Shaisultanov <[email protected]>

diversario mentioned this issue May 15, 2020

Taint nodes on spot and scheduled events #162

Merged

bwagner5 closed this as completed in 0d528d6 May 18, 2020
