Failing to drain/cordon causes CrashLoopBackOff #272

Closed
blakestoddard opened this issue Oct 29, 2020 · 24 comments · Fixed by #287
Labels
Type: Bug Something isn't working

Comments

@blakestoddard
Contributor

blakestoddard commented Oct 29, 2020

[Queue processor specific]

Running into an issue here around the behavior of exiting with an error code of 1 when a message is pulled from SQS and the cordoning or draining of that node then fails.

In our use case, we have multiple EKS clusters in an account in a region. When approaching how to handle this with NTH, I was going to run per-cluster SQS queues with CloudWatch events + rules tailored to the proper cluster/ASG combo. This works for ASG events, but EC2 termination events do not have the same filtering capacity -- meaning you end up with events in SQS queues even though those instances may not be in the cluster. When NTH processes one of those, it will exit(1) because there is no node in the cluster to cordon or drain. After enough of those exits, Kubernetes will mark the pod as CrashLoopBackOff and you lose all NTH capabilities until the back-off period expires and Kubernetes starts the pod again.

For an NTH pod that I've been running for a week, it's seen 600+ crashes:

aws-node-termination-handler-86d9c789f4-6c9pz   0/1     CrashLoopBackOff     607        6d2h
2020/10/29 17:29:57 ??? Trying to get token from IMDSv2
2020/10/29 17:29:57 ??? Got token from IMDSv2
2020/10/29 17:29:57 ??? Startup Metadata Retrieved metadata={"accountId":"0xxxxxxxxx","availabilityZone":"us-east-1b","instanceId":"i-0xxxxxxxx","instanceType":"c5.2xlarge","localHostname":"ip-10-xxx-xxx-xxx.ec2.internal","privateIp":"10.xxx.xxx.xxx","publicHostname":"","publicIp":"","region":"us-east-1"}
2020/10/29 17:29:57 ??? aws-node-termination-handler arguments: 
	dry-run: false,
	node-name: ip-10-xxx-xxx-xxx.ec2.internal,
	metadata-url: http://169.254.169.254,
	kubernetes-service-host: 172.20.0.1,
	kubernetes-service-port: 443,
	delete-local-data: true,
	ignore-daemon-sets: true,
	pod-termination-grace-period: -1,
	node-termination-grace-period: 120,
	enable-scheduled-event-draining: false,
	enable-spot-interruption-draining: false,
	enable-sqs-termination-draining: true,
	metadata-tries: 3,
	cordon-only: false,
	taint-node: false,
	json-logging: false,
	log-level: INFO,
	webhook-proxy: ,
	webhook-headers: <not-displayed>,
	webhook-url: ,
	webhook-template: <not-displayed>,
	uptime-from-file: ,
	enable-prometheus-server: false,
	prometheus-server-port: 9092,
	aws-region: us-east-1,
	queue-url: https://sqs.us-east-1.amazonaws.com/0xxxxxxxxxx/node_termination_handler,
	check-asg-tag-before-draining: true,
	aws-endpoint: ,

2020/10/29 17:29:57 ??? Started watching for interruption events
2020/10/29 17:29:57 ??? Kubernetes AWS Node Termination Handler has started successfully!
2020/10/29 17:29:57 ??? Started watching for event cancellations
2020/10/29 17:29:57 ??? Started monitoring for events event_type=SQS_TERMINATE
2020/10/29 17:30:00 ??? Adding new event to the event store event={"Description":"Spot Interruption event received. Instance will be interrupted at 2020-10-29 17:28:15 +0000 UTC \n","Drained":false,"EndTime":"0001-01-01T00:00:00Z","EventID":"spot-itn-event-12345","InstanceID":"i-0xxxxxxxxxx","Kind":"SQS_TERMINATE","NodeName":"ip-10-xxx-xxx-xxx.ec2.internal","StartTime":"2020-10-29T17:28:15Z","State":""}
2020/10/29 17:30:01 ??? Cordoning the node
2020/10/29 17:30:01 WRN Error when trying to list Nodes w/ label, falling back to direct Get lookup of node
2020/10/29 17:30:01 ??? There was a problem while trying to cordon and drain the node error="nodes \"ip-10-xxx-xxx-xxx.ec2.internal\" not found"
@bwagner5
Contributor

There are a couple of options that could mitigate this behavior:

  1. Extend the ASG tagging that determines whether NTH should manage an ASG to EC2 instance tags, and only cordon/drain if the tag is present. (pro: pretty simple to implement; con: EC2 API calls would increase)

  2. The cluster-api machine spec implementation for AWS has a separate process that adds instance IDs to the EventBridge rules so that only events for instances in your cluster are sent. I'm not sure this would completely solve the problem, since if you have nodes managed by an ASG, NTH is racing with itself on the cordon/drain action between the lifecycle hook and the EC2 status change.

  3. As you suggested, if the node isn't found in the cluster, we just don't exit (sketched below). This might be okay; I'd have to think through the ramifications if there were an issue causing API calls to surge or if the process would benefit from restarting (maybe resetting SDK clients).
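For what it's worth, a minimal Go sketch of option 3, assuming hypothetical interruptionEvent and drainNode stand-ins for NTH's internals (not the project's actual code): a Kubernetes "not found" error is treated as a skippable condition rather than a fatal one.

```go
package main

import (
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// interruptionEvent and drainNode are hypothetical stand-ins for NTH's internals.
type interruptionEvent struct {
	EventID  string
	NodeName string
}

// drainNode would cordon and drain the node via the Kubernetes API and return
// the API error (e.g. a NotFound error) on failure.
func drainNode(nodeName string) error {
	return nil
}

// handleEvent treats "node not found" as non-fatal: log and skip the event
// instead of calling os.Exit(1) and crash-looping the pod.
func handleEvent(ev interruptionEvent) error {
	if err := drainNode(ev.NodeName); err != nil {
		if apierrors.IsNotFound(err) {
			fmt.Printf("node %q not found in cluster, skipping event %s\n", ev.NodeName, ev.EventID)
			return nil
		}
		return err
	}
	return nil
}

func main() {
	_ = handleEvent(interruptionEvent{EventID: "spot-itn-event-12345", NodeName: "ip-10-0-0-1.ec2.internal"})
}
```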

@bwagner5 bwagner5 added the Type: Bug Something isn't working label Oct 29, 2020
@blakestoddard
Contributor Author

blakestoddard commented Oct 29, 2020

I think additional ASG-propagated instance tagging could be a solution. Or perhaps NTHManagedASG becomes configurable?

NTHManagedASG = "aws-node-termination-handler/managed"
This would give you additional control over which NTH installs manage which EC2 instances, without additional calls to the EC2 API beyond what you do now. I don't hate the default exit(1) behavior -- it's not the root cause, it just gets exacerbated by this issue.
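A minimal sketch of what making that key configurable could look like, assuming a hypothetical managed-asg-tag flag and MANAGED_ASG_TAG environment variable (not confirmed project names), with the current constant kept as the default:

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// defaultManagedTag mirrors the existing NTHManagedASG constant.
const defaultManagedTag = "aws-node-termination-handler/managed"

// getEnv returns the environment variable's value if set, else the fallback.
func getEnv(key, fallback string) string {
	if v, ok := os.LookupEnv(key); ok {
		return v
	}
	return fallback
}

func main() {
	// The flag name and env var below are hypothetical; the point is that each
	// NTH install could look for its own tag key instead of a hard-coded one.
	managedTag := flag.String("managed-asg-tag",
		getEnv("MANAGED_ASG_TAG", defaultManagedTag),
		"tag key that marks an ASG (or instance) as managed by this NTH install")
	flag.Parse()
	fmt.Println("checking for tag key:", *managedTag)
}
```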

@blakestoddard
Contributor Author

Unconsidered here is what happens to that SQS message? Does it stay in the queue forever? Does it get dropped if it's found to not be managed by this particular NTH instance?

If it's the latter, I think you could get away with per-cluster SQS queues/events as a path for running multiple NTH installs in a single account.

@bwagner5
Contributor

Unconsidered here is what happens to that SQS message? Does it stay in the queue forever? Does it get dropped if it's found to not be managed by this particular NTH instance?

It will not be dropped immediately, but the queue should be set to discard messages after a time period; the NTH readme sets it at 5 mins, I believe. That way, if there is a problem with the EC2 API or the K8s API, or a node is having problems, the message will get retried.

@blakestoddard
Contributor Author

Oh sweet, then you could still run multiple NTH installs out of a single queue, as long as NTH isn't deleting items that it fails to process!

@bwagner5
Contributor

Well, I'm not sure it's still a great idea. It doesn't delete the message, but depending on the visibility timeout setting on the queue (I think it's set at 20 sec), you could get unlucky and have the wrong NTHs pulling it every time until it hits the 300 sec deletion time.
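For reference, a sketch of the queue attributes being discussed here, using aws-sdk-go; the 300 s retention and 20 s visibility timeout are the values mentioned above, and the queue name is taken from the log output rather than from any official setup docs.

```go
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/sqs"
)

func main() {
	sess := session.Must(session.NewSession(aws.NewConfig().WithRegion("us-east-1")))
	client := sqs.New(sess)

	// MessageRetentionPeriod is the ~5 minute window after which an unprocessed
	// message is discarded; VisibilityTimeout is how long a received-but-unfinished
	// message stays hidden before another NTH can pull it again.
	out, err := client.CreateQueue(&sqs.CreateQueueInput{
		QueueName: aws.String("node_termination_handler"),
		Attributes: map[string]*string{
			"MessageRetentionPeriod": aws.String("300"),
			"VisibilityTimeout":      aws.String("20"),
		},
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("queue URL:", *out.QueueUrl)
}
```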

@blakestoddard
Contributor Author

blakestoddard commented Nov 6, 2020

I put together a rev of the customizable tag in https://github.com/blakestoddard/aws-node-termination-handler/tree/managed-asg-tag and am running it in four of our clusters (none are prod). I'll check in in a few days to see if it's still being crash-happy.

@paalkr

paalkr commented Nov 10, 2020

I would just like to drop in and say that I'm facing the exact same issue. We are running several K8s clusters and Auto Scaling groups in the same VPC. A dedicated SQS queue and a set of rules are deployed per cluster, but the EC2 events cannot be filtered based on the Auto Scaling groups and end up in all queues.

I think a solution where NTH just skips processing nodes that are not part of the cluster would work. But there might be other, better solutions, and ideally the filters should make sure that only relevant messages end up in a queue.

@bwagner5
Contributor

@paalkr

paalkr commented Nov 11, 2020

Yes, that's a possibility for the ASG events like
EC2 Instance-terminate Lifecycle Action

But unfortunately it's not possible for the EC2 events like
EC2 Instance State-change Notification
EC2 Spot Instance Interruption Warning
EC2 Instance Rebalance Recommendation

Creating dedicated rules per Auto Scaling group could also, in our use case, quite easily result in more than 100 EventBridge rules on the default event bus, which is a hard limit. And it's not possible to use a custom event bus for AWS resource events.
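To make the filtering gap concrete, here are two illustrative EventBridge event patterns (assumptions for this discussion, not NTH's shipped rules): the ASG lifecycle event carries AutoScalingGroupName in its detail and can be scoped to one cluster's ASGs, while the EC2 spot interruption warning only identifies the instance, so there is nothing cluster-specific to match on.

```go
package main

import "fmt"

// ASG lifecycle events can be filtered per Auto Scaling group (the group name
// "my-cluster-workers" is a placeholder):
const asgLifecyclePattern = `{
  "source": ["aws.autoscaling"],
  "detail-type": ["EC2 Instance-terminate Lifecycle Action"],
  "detail": { "AutoScalingGroupName": ["my-cluster-workers"] }
}`

// Spot interruption (and rebalance / state-change) events only identify the
// instance, so every cluster's queue receives every instance's events:
const spotInterruptionPattern = `{
  "source": ["aws.ec2"],
  "detail-type": ["EC2 Spot Instance Interruption Warning"]
}`

func main() {
	fmt.Println(asgLifecyclePattern)
	fmt.Println(spotInterruptionPattern)
}
```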

@bwagner5
Contributor

Ah gotcha, sorry, I lost that context from the original post. We will try to get the instance tag check in soon. I think it would be a good idea to fall back to the instance tag check for ASGs as well, just in case there are rate-limit issues with the ASG describe-tags call.

@blakestoddard
Contributor Author

Do you have any interest in a PR for my approach, too, @bwagner5? There are no additional API calls needed (and it would be a general nice-to-have anyway, since the current tag key is pretty broad).

main...blakestoddard:managed-asg-tag

@bwagner5
Contributor

I'm hesitant to add any more configuration. I'm open to feedback, but it seems simpler to stick with the default key and not allow it to be configurable. I'm not seeing much value added by making that tag configurable -- is there a specific reason?

I was thinking that aws-node-termination-handler/managed would be the general tag key for any resource that NTH is managing, whether at the instance level or at the ASG level where it applies to multiple instances.

@paalkr

paalkr commented Nov 12, 2020

So is the plan to just skip processing events for instances that are added to the queue, if the instance is not recognized as a node attached to the current cluster?

@bwagner5
Contributor

So is the plan to just skip processing events for instances that are added to the queue, if the instance is not recognized as a node attached to the current cluster?

Skip processing events for nodes that are not tagged with aws-node-termination-handler/managed or reside in an ASG tagged with that tag.
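A rough sketch of what that instance-level tag check could look like with aws-sdk-go; isManaged and the hard-coded instance ID are illustrative only, and the real implementation may also consult the instance's ASG tags:

```go
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

const managedTag = "aws-node-termination-handler/managed"

// isManaged reports whether the instance carries the managed tag, whether set
// directly or propagated down from its ASG.
func isManaged(client *ec2.EC2, instanceID string) (bool, error) {
	out, err := client.DescribeTags(&ec2.DescribeTagsInput{
		Filters: []*ec2.Filter{
			{Name: aws.String("resource-id"), Values: []*string{aws.String(instanceID)}},
			{Name: aws.String("key"), Values: []*string{aws.String(managedTag)}},
		},
	})
	if err != nil {
		return false, err
	}
	return len(out.Tags) > 0, nil
}

func main() {
	sess := session.Must(session.NewSession())
	managed, err := isManaged(ec2.New(sess), "i-0123456789abcdef0")
	if err != nil {
		log.Fatal(err)
	}
	// Events for unmanaged instances would simply be skipped instead of drained.
	fmt.Println("managed:", managed)
}
```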

@blakestoddard
Contributor Author

blakestoddard commented Nov 12, 2020

is there a specific reason?

Yeah, my approach here doesn't require additional API calls, versus also checking the instance tags (and you don't have to worry about propagating that tag to existing instances if you add it to an existing ASG). There are fewer feedback loops to carry through by not having to check tags in two spots. And even with an instance tag check, you'd still want the tag to be configurable to some degree or it doesn't solve this issue at all, since having the same tag on all instances doesn't help determine whether an instance belongs to cluster X or cluster Y.

I think it's unfair to set this up in such a way that it only works well if there is one cluster per region per account, and leaving the verified tag (whether at the instance level or on the ASG) unconfigurable would currently result in exactly that.

@paalkr

paalkr commented Nov 12, 2020

So is the plan to just skip processing events for instances that are added to the queue, if the instance is not recognized as a node attached to the current cluster?

Skip processing events for nodes that are not tagged with aws-node-termination-handler/managed or reside in an ASG tagged with that tag.

That is not a guarantee that the instance is part of the current cluster. We run several Kops clusters in the same VPC in the same region.

And as I mentioned earlier, these events cannot be filtered by an EventBridge rule. It's not possible for the rule to determine which ASG the instance belongs to, so all events for all instances will enter all queues.
EC2 Instance State-change Notification
EC2 Spot Instance Interruption Warning
EC2 Instance Rebalance Recommendation

@bwagner5
Contributor

👍 ah gotcha (both of you 🙂 ). I'm cool with the tag configuration approach then.

@paalkr Do you think that would work for your case as well (specifying different tags for each of your clusters)?

aws-node-termination-handler/managed/cluster1
aws-node-termination-handler/managed/cluster2

@paalkr

paalkr commented Nov 12, 2020

Yes, that would work. Or even better maybe
aws-node-termination-handler/managed: cluster1
aws-node-termination-handler/managed: cluster2

@bwagner5
Contributor

Feel free to PR that change then @blakestoddard 😀

@blakestoddard
Contributor Author

PR for the configurable tag change is opened. I can change it to look for a value, too, but that's a more significant code change that I'm not sure is worth the complexity.

@paalkr

paalkr commented Nov 12, 2020

Excellent! Key or value, it doesn't really matter. Both will work :)

@universam1
Contributor

The original problem is not solved yet - the filter is great, but it does not apply to all conditions, as mentioned in #307.

@bwagner5 bwagner5 reopened this Dec 4, 2020
@haugenj
Contributor

haugenj commented Dec 14, 2020

This should be fixed with the release of v1.11.1! If not, please reopen and we'll take another look (and maybe fully remove those os.Exit(1)s).
