Failing to drain/cordon causes CrashLoopBackOff #272
There are a couple of options that could mitigate this behavior:
I think additional ASG-propagated instance tagging could be a solution. Or perhaps changing the `exit(1)` behavior -- it's not the root cause, it just gets exacerbated by this issue.
Something unconsidered here: what happens to that SQS message? Does it stay in the queue forever? Does it get dropped if it's found to not be managed by this particular NTH instance? If it's the latter, I think you could get away with per-cluster SQS queues/events as a path for running multiple NTH installs in a single account.
It will not be dropped immediately, but the queue should be set to discard messages after a time period. The NTH readme sets it at 5 minutes, I believe. That way, if there is a problem with the EC2 API, the K8s API, or a node is having problems, the message will get retried.
Oh sweet, then you could still run multiple NTH installs out of a single queue, as long as NTH isn't deleting items that it fails to process!
Well, I'm not sure it's a great idea still. It doesn't delete it, but depending on the visibility timeout setting on the SQS queue (I think it's set at 20 sec), you could get unlucky and have the wrong NTH pulling it every time until it hits the 300-second deletion time.
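(For reference, a minimal sketch of the two queue settings being discussed -- a short visibility timeout and a ~300-second retention period -- using the AWS SDK for Go v2. The queue URL and exact numbers are placeholders for illustration, not NTH's actual defaults.)

```go
// Sketch: set a short visibility timeout and a 300-second retention period on
// the NTH queue. With VisibilityTimeout=20 and MessageRetentionPeriod=300, a
// message that keeps being picked up by the "wrong" NTH can be retried roughly
// 300/20 = 15 times before SQS discards it.
package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/sqs"
	"github.com/aws/aws-sdk-go-v2/service/sqs/types"
)

func main() {
	cfg, err := config.LoadDefaultConfig(context.TODO())
	if err != nil {
		log.Fatal(err)
	}
	client := sqs.NewFromConfig(cfg)

	_, err = client.SetQueueAttributes(context.TODO(), &sqs.SetQueueAttributesInput{
		// Assumed queue URL, purely for illustration.
		QueueUrl: aws.String("https://sqs.us-east-1.amazonaws.com/123456789012/nth-queue"),
		Attributes: map[string]string{
			string(types.QueueAttributeNameVisibilityTimeout):      "20",
			string(types.QueueAttributeNameMessageRetentionPeriod): "300",
		},
	})
	if err != nil {
		log.Fatal(err)
	}
}
```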
I put together a rev of the customizable tag in https://github.com/blakestoddard/aws-node-termination-handler/tree/managed-asg-tag and am running it in four of our clusters (none are prod). I'll check in in a few days to see if it's still being crash-happy.
I would just like to drop in and say that I'm facing the exact same issue. We are running several K8s clusters and Auto Scaling groups in the same VPC. A dedicated SQS queue and a set of rules are deployed per cluster, but the EC2 events cannot be filtered based on the Auto Scaling groups and end up in all queues. I think a solution where NTH just skips processing nodes that are not part of the cluster would work. But there might be other, better solutions, and ideally the filters should make sure that only relevant messages end up in a queue.
Good feedback, @paalkr! Can you scope your rule to only send events from certain ASGs? https://docs.aws.amazon.com/eventbridge/latest/APIReference/API_PutRule.html
Yes, that's a possibility for the ASG events, but unfortunately it's not possible for the EC2 events. Creating dedicated rules per Auto Scaling group could also, in our use case, quite easily result in more than 100 EventBridge rules on the default event bus, which is a hard limit. And it's not possible to use a custom event bus for AWS resource events.
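(A minimal sketch of scoping an EventBridge rule to a specific ASG, assuming the AWS SDK for Go v2. This only helps for the Auto Scaling lifecycle events, whose detail payload includes AutoScalingGroupName; EC2 instance state-change events carry only the instance ID, so they cannot be filtered per ASG this way. Rule and ASG names are placeholders.)

```go
// Sketch: create a rule that only matches terminate lifecycle actions for one
// named ASG, so that ASG-level events land only in that cluster's queue.
package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/eventbridge"
)

func main() {
	cfg, err := config.LoadDefaultConfig(context.TODO())
	if err != nil {
		log.Fatal(err)
	}
	client := eventbridge.NewFromConfig(cfg)

	// Event pattern scoped to a single (placeholder) ASG name.
	pattern := `{
	  "source": ["aws.autoscaling"],
	  "detail-type": ["EC2 Instance-terminate Lifecycle Action"],
	  "detail": { "AutoScalingGroupName": ["my-cluster-asg"] }
	}`

	_, err = client.PutRule(context.TODO(), &eventbridge.PutRuleInput{
		Name:         aws.String("nth-asg-terminate-my-cluster"),
		EventPattern: aws.String(pattern),
	})
	if err != nil {
		log.Fatal(err)
	}
}
```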
Ah, gotcha -- sorry, I lost that context from the original post. We will try to get the instance tag check in soon. I think it would be a good idea to fall back to an instance tag check in addition to the ASG check, just in case there are rate-limit issues with the ASG describe-tags call.
Do you have any interest in a PR for my approach, too, @bwagner5? There are no additional API calls needed (and it would just be a general nice-to-have anyway, since the current tag key is pretty broad).
I'm hesitant to add any more configuration. I'm open to feedback, but it seems simpler to stick with the default key and not allow it to be configurable. I'm not seeing much value added from making that tag configurable -- is there a specific reason? I was thinking that…
So is the plan to just skip processing events for instances that are added to the queue, if the instance is not recognized as a node attached to the current cluster?
Skip processing events for nodes that are not tagged with…
Yeah, my approach here doesn't require additional API calls, versus also checking the instance tags (you also don't have to worry about propagating that tag to existing instances if you add it to an existing ASG). There are fewer feedback loops to carry through by not having to check tags in two spots. And even with checking the instance tag, you'd still want that to be configurable to some degree, or else it doesn't solve this issue at all (since having the tag on all instances doesn't help to solve the issue of instance X being in cluster X instead of cluster Y if it's still the same tag across everything). I think it's unfair to set this up in such a way that it only works well if there is one cluster per region per account, and having the tag to verify (whether at the instance level or on the ASG) be unconfigurable would currently result in that.
That is not a guarantee that the instance is part of the current cluster. We do run several Kops clusters in the same VPC in the same region. And as I mentioned earlier, these events cannot be filtered by an EventBridge rule. It's not possible for the rule to determine which ASG the instance belongs to, so all events for all instances will enter all queues.
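(A hedged sketch of the check being discussed, not NTH's actual implementation: look up the instance's ASG and skip the event when the ASG lacks a configurable "managed" tag key. The helper name and parameters are illustrative, using the AWS SDK for Go v2.)

```go
// Sketch: decide whether an SQS event's instance belongs to this cluster's NTH
// by checking its ASG for a configurable tag key, so unmatched events can be
// skipped instead of triggering an exit.
package sketch

import (
	"context"

	"github.com/aws/aws-sdk-go-v2/service/autoscaling"
)

// isManagedInstance returns false when the instance is not in a visible ASG or
// when its ASG lacks the configured tag key, so the caller can skip the event.
func isManagedInstance(ctx context.Context, client *autoscaling.Client, instanceID, managedTagKey string) (bool, error) {
	instances, err := client.DescribeAutoScalingInstances(ctx, &autoscaling.DescribeAutoScalingInstancesInput{
		InstanceIds: []string{instanceID},
	})
	if err != nil {
		return false, err
	}
	if len(instances.AutoScalingInstances) == 0 {
		return false, nil // not in any ASG we can see: treat as "not ours"
	}

	groups, err := client.DescribeAutoScalingGroups(ctx, &autoscaling.DescribeAutoScalingGroupsInput{
		AutoScalingGroupNames: []string{*instances.AutoScalingInstances[0].AutoScalingGroupName},
	})
	if err != nil {
		return false, err
	}
	for _, g := range groups.AutoScalingGroups {
		for _, tag := range g.Tags {
			if tag.Key != nil && *tag.Key == managedTagKey {
				return true, nil
			}
		}
	}
	return false, nil
}
```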
👍 Ah, gotcha (both of you 🙂). I'm cool with the tag configuration approach, then. @paalkr, do you think that would work for your case as well (specifying different tags for each of your clusters)?
Yes, that would work. Or, even better, maybe…
Feel free to PR that change then, @blakestoddard 😀
PR for the configurable tag change opened. I can change it to look for a value, too, but that's a more significant code change that I'm not sure is worth the complexity.
Excellent! Key or value, it doesn't really matter. Both will work :)
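(Illustration only: a tiny helper showing what matching on a tag key, or on a key plus value, could look like. Names are made up, not from the NTH codebase.)

```go
// Sketch: match a tag against a configured key and, optionally, a value.
// An empty wantValue means "match on key only".
package sketch

// tagMatches reports whether a tag (key, value) satisfies the configured filter.
// Example (hypothetical tag names):
//   tagMatches("nth/managed", "cluster-a", "nth/managed", "")          == true
//   tagMatches("nth/managed", "cluster-a", "nth/managed", "cluster-b") == false
func tagMatches(key, value, wantKey, wantValue string) bool {
	if key != wantKey {
		return false
	}
	return wantValue == "" || value == wantValue
}
```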
The original problem is not solved yet -- the filter is great, but it does not apply in all conditions, as mentioned in #307.
This should be fixed with the release of v1.11.1! If not, please reopen and we'll take another look (and maybe fully remove those `os.Exit(1)`s).
[Queue processor specific]
Running into an issue here around the behavior of exiting with an error code of 1 when a message is pulled from SQS and then cordoning or draining of that node fails.

In our use case, we have multiple EKS clusters in an account in a region. When approaching how to handle this with NTH, I was going to run per-cluster SQS queues with CloudWatch events + rules tailored to the proper cluster/ASG combo. This works for ASG events, but EC2 termination events do not have the same filtering capacity -- meaning that you end up with events in SQS queues even though those instances may not be in the cluster. When NTH processes one of those, it will `exit(1)` because there will be no node in the cluster to cordon or drain. After enough of those, Kubernetes will mark the pod in a `CrashLoopBackOff` state and you will lose all NTH capabilities until the cool-down period expires and Kubernetes starts the pod again. For an NTH pod that I've been running for a week, it's seen 600+ crashes:
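(Not NTH's actual code -- a hedged sketch of the failure mode described above and a gentler alternative: calling `os.Exit(1)` whenever cordon/drain fails turns every event for another cluster's instance into a pod crash, while logging and skipping keeps the pod alive. `cordonAndDrain` stands in for whatever does the real work.)

```go
// Sketch of the two behaviors: exit on any drain failure vs. log and move on.
package sketch

import (
	"log"
	"os"
)

// handleEventThenExit mirrors the failure mode: any cordon/drain error,
// including "node not found in this cluster", kills the whole process.
// Enough of these and Kubernetes puts the pod into CrashLoopBackOff.
func handleEventThenExit(nodeName string, cordonAndDrain func(string) error) {
	if err := cordonAndDrain(nodeName); err != nil {
		log.Printf("failed to cordon/drain %s: %v", nodeName, err)
		os.Exit(1)
	}
}

// handleEventAndSkip logs the error and returns it instead, so an event for
// an instance that isn't in this cluster does not crash the pod.
func handleEventAndSkip(nodeName string, cordonAndDrain func(string) error) error {
	if err := cordonAndDrain(nodeName); err != nil {
		log.Printf("skipping event for %s: %v", nodeName, err)
		return err
	}
	return nil
}
```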