Taint nodes when termination notification is detected #160

Closed
diversario opened this issue May 13, 2020 · 1 comment
Labels
Type: Enhancement New feature or request

Comments

@diversario
Contributor

diversario commented May 13, 2020

Similar to how cluster-autoscaler uses taints when marking nodes for scale down, aws-node-termination-handler should taint nodes which will be terminated, in addition to everything it does currently. This issue is slightly similar to #123 but the proposal is to add to the current behavior.

The reason for this is to make it possible to detect programmatically whether a spot node became unschedulable in k8s because of termination or because of cluster issues. For example, when using prometheus-operator, the default KubeNodeUnreachable alert expression looks like this:

kube_node_spec_taint{effect="NoSchedule",job="kube-state-metrics",key="node.kubernetes.io/unreachable"}

This often misfires for nodes being scaled down by the cluster-autoscaler. To fix that, we can exclude nodes that carry the cluster-autoscaler taint:

kube_node_spec_taint{effect="NoSchedule",job="kube-state-metrics",key="node.kubernetes.io/unreachable"} unless on(node) kube_node_spec_taint{effect="NoSchedule",job="kube-state-metrics",key="ToBeDeletedByClusterAutoscaler"}

If aws-node-termination-handler tainted nodes as well, that taint could be incorporated into the alert in the same way.

I'm happy to PR this if there are no objections.

@bwagner5
Contributor

That sounds like a good addition to the project. A PR is certainly welcome!

We have a node labeling function already defined, so it shouldn't be too hard to add a taint function in the node pkg.

Also, should the taint only be applied to spot interruption termination notices (ITNs) or should it also apply to EC2 scheduled maintenance events? Maybe it would make sense to have an aws-node-termination-handler/spot-itn and aws-node-termination-handler/scheduled-maintenance?

If we include tainting at this point in the code, we'll have the information needed to differentiate between a maintenance event and a spot ITN. Alternatively, the taint could be added in each drain event's PreDrainTask hook.
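To make the two-key idea concrete, here is a small sketch of choosing the taint key from the kind of event. The two key strings are the ones suggested above; the `EventKind` type, its constants, and `taintKeyFor` are hypothetical names for illustration and do not reflect how the handler actually models events.

```go
package main

import (
	"errors"
	"fmt"
)

// EventKind is a hypothetical enumeration of the event types discussed
// in this thread; the real project may represent events differently.
type EventKind int

const (
	SpotITN EventKind = iota
	ScheduledMaintenance
)

// Taint keys as proposed in the comment above.
const (
	spotITNTaintKey              = "aws-node-termination-handler/spot-itn"
	scheduledMaintenanceTaintKey = "aws-node-termination-handler/scheduled-maintenance"
)

var errUnknownEventKind = errors.New("unknown event kind")

// taintKeyFor maps an event kind to the taint key that would mark the node.
func taintKeyFor(kind EventKind) (string, error) {
	switch kind {
	case SpotITN:
		return spotITNTaintKey, nil
	case ScheduledMaintenance:
		return scheduledMaintenanceTaintKey, nil
	default:
		return "", errUnknownEventKind
	}
}

func main() {
	for _, kind := range []EventKind{SpotITN, ScheduledMaintenance} {
		key, err := taintKeyFor(kind)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(key)
	}
}
```

With separate keys, an alert rule could exclude spot interruptions and scheduled maintenance independently, at the cost of one more `unless` clause per key.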

@bwagner5 bwagner5 added the Type: Enhancement New feature or request label May 13, 2020
diversario added a commit to diversario/aws-node-termination-handler that referenced this issue May 15, 2020
Fixes aws#160.

Signed-off-by: Ilya Shaisultanov <[email protected]>