Failing to drain/cordon causes CrashLoopBackOff #272

Closed
blakestoddard opened this issue Oct 29, 2020 · 24 comments · Fixed by #287
Labels
Type: Bug Something isn't working

Comments

@blakestoddard
Contributor

blakestoddard commented Oct 29, 2020

[Queue processor specific]

Running into an issue here around the behavior of exiting with an error code of 1 when a message is pulled from SQS and the cordoning or draining of that node then fails.

In our use case, we have multiple EKS clusters in an account in a region. When approaching how to handle this with NTH, I was going to run per-cluster SQS queues with CloudWatch events + rules tailored to the proper cluster/ASG combo. This works for ASG events, but EC2 termination events do not have the same filtering capacity -- meaning you end up with events in SQS queues even though those instances may not be in the cluster. When NTH processes one of those, it will exit(1) because there is no node in the cluster to cordon or drain. After enough of those exits, Kubernetes will mark the pod as CrashLoopBackOff and you lose all NTH capabilities until the back-off period expires and Kubernetes starts the pod again.

For an NTH pod that I've been running for a week, it's seen 600+ crashes:

aws-node-termination-handler-86d9c789f4-6c9pz   0/1     CrashLoopBackOff     607        6d2h
2020/10/29 17:29:57 ??? Trying to get token from IMDSv2
2020/10/29 17:29:57 ??? Got token from IMDSv2
2020/10/29 17:29:57 ??? Startup Metadata Retrieved metadata={"accountId":"0xxxxxxxxx","availabilityZone":"us-east-1b","instanceId":"i-0xxxxxxxx","instanceType":"c5.2xlarge","localHostname":"ip-10-xxx-xxx-xxx.ec2.internal","privateIp":"10.xxx.xxx.xxx","publicHostname":"","publicIp":"","region":"us-east-1"}
2020/10/29 17:29:57 ??? aws-node-termination-handler arguments: 
	dry-run: false,
	node-name: ip-10-xxx-xxx-xxx.ec2.internal,
	metadata-url: http://169.254.169.254,
	kubernetes-service-host: 172.20.0.1,
	kubernetes-service-port: 443,
	delete-local-data: true,
	ignore-daemon-sets: true,
	pod-termination-grace-period: -1,
	node-termination-grace-period: 120,
	enable-scheduled-event-draining: false,
	enable-spot-interruption-draining: false,
	enable-sqs-termination-draining: true,
	metadata-tries: 3,
	cordon-only: false,
	taint-node: false,
	json-logging: false,
	log-level: INFO,
	webhook-proxy: ,
	webhook-headers: <not-displayed>,
	webhook-url: ,
	webhook-template: <not-displayed>,
	uptime-from-file: ,
	enable-prometheus-server: false,
	prometheus-server-port: 9092,
	aws-region: us-east-1,
	queue-url: https://sqs.us-east-1.amazonaws.com/0xxxxxxxxxx/node_termination_handler,
	check-asg-tag-before-draining: true,
	aws-endpoint: ,

2020/10/29 17:29:57 ??? Started watching for interruption events
2020/10/29 17:29:57 ??? Kubernetes AWS Node Termination Handler has started successfully!
2020/10/29 17:29:57 ??? Started watching for event cancellations
2020/10/29 17:29:57 ??? Started monitoring for events event_type=SQS_TERMINATE
2020/10/29 17:30:00 ??? Adding new event to the event store event={"Description":"Spot Interruption event received. Instance will be interrupted at 2020-10-29 17:28:15 +0000 UTC \n","Drained":false,"EndTime":"0001-01-01T00:00:00Z","EventID":"spot-itn-event-12345","InstanceID":"i-0xxxxxxxxxx","Kind":"SQS_TERMINATE","NodeName":"ip-10-xxx-xxx-xxx.ec2.internal","StartTime":"2020-10-29T17:28:15Z","State":""}
2020/10/29 17:30:01 ??? Cordoning the node
2020/10/29 17:30:01 WRN Error when trying to list Nodes w/ label, falling back to direct Get lookup of node
2020/10/29 17:30:01 ??? There was a problem while trying to cordon and drain the node error="nodes \"ip-10-xxx-xxx-xxx.ec2.internal\" not found"
@bwagner5
Contributor

There are a couple of options that could mitigate this behavior:

  1. Extend the ASG tagging that determines whether NTH should manage an ASG to EC2 instance tags, and only cordon/drain if the tag is present. (pro: pretty simple to implement; con: EC2 API calls would increase)

  2. The cluster-api machine spec implementation for AWS has a separate process that adds instance IDs to the EventBridge rules so that only events for instances in your cluster are sent. I'm not sure this would completely solve the problem, since if you have nodes managed by an ASG, NTH is racing with itself on the cordon/drain action between the lifecycle hook and the EC2 status change.

  3. As you suggested, if the node isn't found in the cluster, we just don't exit (sketched below). This might be okay; I'd have to think through the ramifications if there were an issue causing API calls to surge or if the process would benefit from restarting (maybe resetting SDK clients).
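For what it's worth, a minimal Go sketch of option 3, assuming hypothetical interruptionEvent and drainNode stand-ins for NTH's internals (not the project's actual code): a Kubernetes "not found" error is treated as a skippable condition rather than a fatal one.

```go
package main

import (
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// interruptionEvent and drainNode are hypothetical stand-ins for NTH's internals.
type interruptionEvent struct {
	EventID  string
	NodeName string
}

// drainNode would cordon and drain the node via the Kubernetes API and return
// the API error (e.g. a NotFound error) on failure.
func drainNode(nodeName string) error {
	return nil
}

// handleEvent treats "node not found" as non-fatal: log and skip the event
// instead of calling os.Exit(1) and crash-looping the pod.
func handleEvent(ev interruptionEvent) error {
	if err := drainNode(ev.NodeName); err != nil {
		if apierrors.IsNotFound(err) {
			fmt.Printf("node %q not found in cluster, skipping event %s\n", ev.NodeName, ev.EventID)
			return nil
		}
		return err
	}
	return nil
}

func main() {
	_ = handleEvent(interruptionEvent{EventID: "spot-itn-event-12345", NodeName: "ip-10-0-0-1.ec2.internal"})
}
```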

@bwagner5 bwagner5 added the Type: Bug Something isn't working label Oct 29, 2020
@blakestoddard
Contributor Author

blakestoddard commented Oct 29, 2020

I think additional ASG-propagated instance tagging could be a solution. Or perhaps NTHManagedASG becomes configurable?

NTHManagedASG = "aws-node-termination-handler/managed"
This would give you additional control over which NTH installs manage which EC2 instances, without additional calls to the EC2 API beyond what you do now. I don't hate the default exit(1) behavior -- it's not the root cause, it just gets exacerbated by this issue.
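A minimal sketch of what making that key configurable could look like, assuming a hypothetical managed-asg-tag flag and MANAGED_ASG_TAG environment variable (not confirmed project names), with the current constant kept as the default:

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// defaultManagedTag mirrors the existing NTHManagedASG constant.
const defaultManagedTag = "aws-node-termination-handler/managed"

// getEnv returns the environment variable's value if set, else the fallback.
func getEnv(key, fallback string) string {
	if v, ok := os.LookupEnv(key); ok {
		return v
	}
	return fallback
}

func main() {
	// The flag name and env var below are hypothetical; the point is that each
	// NTH install could look for its own tag key instead of a hard-coded one.
	managedTag := flag.String("managed-asg-tag",
		getEnv("MANAGED_ASG_TAG", defaultManagedTag),
		"tag key that marks an ASG (or instance) as managed by this NTH install")
	flag.Parse()
	fmt.Println("checking for tag key:", *managedTag)
}
```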

@blakestoddard
Contributor Author

Unconsidered here is what happens to that SQS message? Does it stay in the queue forever? Does it get dropped if it's found to not be managed by this particular NTH instance?

If it's the latter, I think you could get away with per-cluster SQS queues/events as a path for running multiple NTH installs in a single account.

@bwagner5
Contributor

Unconsidered here is what happens to that SQS message? Does it stay in the queue forever? Does it get dropped if it's found to not be managed by this particular NTH instance?

It will not be dropped immediately, but the queue should be set to discard messages after a time period; the NTH readme sets it at 5 mins, I believe. That way, if there is a problem with the EC2 API or the K8s API, or a node is having problems, the message will get retried.

@blakestoddard
Contributor Author

Oh sweet, then you could still run multiple NTH installs out of a single queue, as long as NTH isn't deleting items that it fails to process!

@bwagner5
Contributor

Well, I'm not sure it's still a great idea. It doesn't delete the message, but depending on the visibility timeout setting on the queue (I think it's set at 20 sec), you could get unlucky and have the wrong NTHs pulling it every time until it hits the 300 sec deletion time.
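For reference, a sketch of the queue attributes being discussed here, using aws-sdk-go; the 300 s retention and 20 s visibility timeout are the values mentioned above, and the queue name is taken from the log output rather than from any official setup docs.

```go
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/sqs"
)

func main() {
	sess := session.Must(session.NewSession(aws.NewConfig().WithRegion("us-east-1")))
	client := sqs.New(sess)

	// MessageRetentionPeriod is the ~5 minute window after which an unprocessed
	// message is discarded; VisibilityTimeout is how long a received-but-unfinished
	// message stays hidden before another NTH can pull it again.
	out, err := client.CreateQueue(&sqs.CreateQueueInput{
		QueueName: aws.String("node_termination_handler"),
		Attributes: map[string]*string{
			"MessageRetentionPeriod": aws.String("300"),
			"VisibilityTimeout":      aws.String("20"),
		},
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("queue URL:", *out.QueueUrl)
}
```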

@blakestoddard
Contributor Author

blakestoddard commented Nov 6, 2020

I put together a rev of the customizable tag in https://github.com/blakestoddard/aws-node-termination-handler/tree/managed-asg-tag and am running it in four of our clusters (none are prod). I'll check in in a few days to see if it's still being crash-happy.

@paalkr

paalkr commented Nov 10, 2020

I would just like to drop in and say that I'm facing the exact same issue. We are running several K8s clusters and Auto Scaling groups in the same VPC. A dedicated SQS queue and a set of rules are deployed per cluster, but the EC2 events cannot be filtered based on the Auto Scaling groups and end up in all queues.

I think a solution where NTH just skips processing nodes that are not part of the cluster would work. But there might be other, better solutions, and ideally the filters should make sure that only relevant messages end up in a queue.

@bwagner5
Contributor

@paalkr

paalkr commented Nov 11, 2020

Yes, that's a possibility for the ASG events like
EC2 Instance-terminate Lifecycle Action

But unfortunately it's not possible for the EC2 events like
EC2 Instance State-change Notification
EC2 Spot Instance Interruption Warning
EC2 Instance Rebalance Recommendation

Creating dedicated rules per Auto Scaling group could also, in our use case, quite easily result in more than 100 EventBridge rules on the default event bus, which is a hard limit. And it's not possible to use a custom event bus for AWS resource events.
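To make the filtering gap concrete, here are two illustrative EventBridge event patterns (assumptions for this discussion, not NTH's shipped rules): the ASG lifecycle event carries AutoScalingGroupName in its detail and can be scoped to one cluster's ASGs, while the EC2 spot interruption warning only identifies the instance, so there is nothing cluster-specific to match on.

```go
package main

import "fmt"

// ASG lifecycle events can be filtered per Auto Scaling group (the group name
// "my-cluster-workers" is a placeholder):
const asgLifecyclePattern = `{
  "source": ["aws.autoscaling"],
  "detail-type": ["EC2 Instance-terminate Lifecycle Action"],
  "detail": { "AutoScalingGroupName": ["my-cluster-workers"] }
}`

// Spot interruption (and rebalance / state-change) events only identify the
// instance, so every cluster's queue receives every instance's events:
const spotInterruptionPattern = `{
  "source": ["aws.ec2"],
  "detail-type": ["EC2 Spot Instance Interruption Warning"]
}`

func main() {
	fmt.Println(asgLifecyclePattern)
	fmt.Println(spotInterruptionPattern)
}
```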

@bwagner5
Contributor

Ah gotcha, sorry, I lost that context from the original post. We will try to get the instance tag check in soon. I think it would be a good idea to fall back to the instance tag check for ASGs as well, just in case there are rate-limit issues with the ASG describe-tags call.

@blakestoddard
Contributor Author

Do you have any interest in a PR for my approach, too, @bwagner5? There are no additional API calls needed (and it would be a general nice-to-have anyway, since the current tag key is pretty broad).

main...blakestoddard:managed-asg-tag

@bwagner5
Contributor

I'm hesitant to add any more configuration. I'm open to feedback, but it seems simpler to stick with the default key and not allow it to be configurable. I'm not seeing much value added by making that tag configurable -- is there a specific reason?

I was thinking that aws-node-termination-handler/managed would be the general tag key for any resource that NTH is managing, whether at the instance level or at the ASG level where it applies to multiple instances.

@paalkr

paalkr commented Nov 12, 2020

So is the plan to just skip processing events for instances that are added to the queue, if the instance is not recognized as a node attached to the current cluster?

@bwagner5
Contributor

So is the plan to just skip processing events for instances that are added to the queue, if the instance is not recognized as a node attached to the current cluster?

Skip processing events for nodes that are not tagged with aws-node-termination-handler/managed or reside in an ASG tagged with that tag.
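A rough sketch of what that instance-level tag check could look like with aws-sdk-go; isManaged and the hard-coded instance ID are illustrative only, and the real implementation may also consult the instance's ASG tags:

```go
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

const managedTag = "aws-node-termination-handler/managed"

// isManaged reports whether the instance carries the managed tag, whether set
// directly or propagated down from its ASG.
func isManaged(client *ec2.EC2, instanceID string) (bool, error) {
	out, err := client.DescribeTags(&ec2.DescribeTagsInput{
		Filters: []*ec2.Filter{
			{Name: aws.String("resource-id"), Values: []*string{aws.String(instanceID)}},
			{Name: aws.String("key"), Values: []*string{aws.String(managedTag)}},
		},
	})
	if err != nil {
		return false, err
	}
	return len(out.Tags) > 0, nil
}

func main() {
	sess := session.Must(session.NewSession())
	managed, err := isManaged(ec2.New(sess), "i-0123456789abcdef0")
	if err != nil {
		log.Fatal(err)
	}
	// Events for unmanaged instances would simply be skipped instead of drained.
	fmt.Println("managed:", managed)
}
```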

@blakestoddard
Contributor Author

blakestoddard commented Nov 12, 2020

is there a specific reason?

Yeah, my approach here doesn't require additional API calls, versus also checking the instance tags (and you don't have to worry about propagating that tag to existing instances if you add it to an existing ASG). There are fewer feedback loops to carry through by not having to check tags in two spots. And even with an instance tag check, you'd still want the tag to be configurable to some degree or it doesn't solve this issue at all, since having the same tag on all instances doesn't help determine whether an instance belongs to cluster X or cluster Y.

I think it's unfair to set this up in such a way that it only works well if there is one cluster per region per account, and leaving the verified tag (whether at the instance level or on the ASG) unconfigurable would currently result in exactly that.

@paalkr

paalkr commented Nov 12, 2020

So is the plan to just skip processing events for instances that are added to the queue, if the instance is not recognized as a node attached to the current cluster?

Skip processing events for nodes that are not tagged with aws-node-termination-handler/managed or reside in an ASG tagged with that tag.

That is not a guarantee that the instance is part of the current cluster. We run several Kops clusters in the same VPC in the same region.

And as I mentioned earlier, these events cannot be filtered by an EventBridge rule. It's not possible for the rule to determine which ASG the instance belongs to, so all events for all instances will enter all queues.
EC2 Instance State-change Notification
EC2 Spot Instance Interruption Warning
EC2 Instance Rebalance Recommendation

@bwagner5
Contributor

👍 ah gotcha (both of you 🙂 ). I'm cool with the tag configuration approach then.

@paalkr Do you think that would work for your case as well (specifying different tags for each of your clusters)?

aws-node-termination-handler/managed/cluster1
aws-node-termination-handler/managed/cluster2

@paalkr

paalkr commented Nov 12, 2020

Yes, that would work. Or even better maybe
aws-node-termination-handler/managed: cluster1
aws-node-termination-handler/managed: cluster2

@bwagner5
Contributor

Feel free to PR that change then @blakestoddard 😀

@blakestoddard
Contributor Author

PR for the configurable tag change is opened. I can change it to look for a value, too, but that's a more significant code change that I'm not sure is worth the complexity.

@paalkr

paalkr commented Nov 12, 2020

Excellent! Key or value, it doesn't really matter. Both will work :)

@universam1
Contributor

The original problem is not solved yet - the filter is great, but it does not apply to all conditions, as mentioned in #307.

@bwagner5 bwagner5 reopened this Dec 4, 2020
@haugenj
Contributor

haugenj commented Dec 14, 2020

This should be fixed with the release of v1.11.1! If not, please reopen and we'll take another look (and maybe fully remove those os.Exit(1)s).
