aws: Graceful handling of EC2 detach errors #10740
Conversation
Hi @hwoarang. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
When you see this happen, have all other nodes already been rolled?
@olemarkus I lost you a bit there as I am not sure what you meant :) So we saw this error in two different clusters after a kops update from 1.18 -> 1.19, while running a rolling update. Or to phrase it in a different way, this error suggests that we tried to detach a non-existing EC2 instance from the ASG. Isn't it normally safe to not hard-fail on this and just let things proceed?
This process is a bit complicated. But I was wrong anyways. The detachment happens in the beginning of a roll. If detachment fails and you continue, you can end up with kops thinking it should drain and terminate 4 nodes at a time, while only 1 node has been detached, i.e. you have not surged as much as expected and as a consequence are rolling more nodes at a time than you surged.

Is the number of nodes per ASG very volatile? As far as I can tell, this can only happen if something terminates the instance in the smallish timeframe between kops selecting the candidates for detachment and actually detaching the instance. Does this make more sense?

As CAS can interfere with the roll anyway, I wonder if kops should disable it during rolls. This is also because sometimes CAS can scale up an ASG before the new masters are ready, leading to k8s worker/control plane incompatibilities.
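To make the accounting above concrete, here is a tiny, purely illustrative Go sketch (not kops code; the function and parameter names are made up) of the invariant being described: the number of nodes drained concurrently should be capped by the number of instances that were actually detached.

```go
package example

// drainBatchSize caps concurrent drains at what was actually surged.
// maxSurge is how many instances kops intended to detach; actuallyDetached
// is how many detach calls succeeded. If a detach silently failed and kops
// still drained maxSurge nodes at a time, capacity would go negative.
func drainBatchSize(maxSurge, actuallyDetached int) int {
	if actuallyDetached < maxSurge {
		return actuallyDetached // e.g. maxSurge=4 but only 1 detached: drain 1 at a time
	}
	return maxSurge
}
```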
Yes, I suspected that the detachment happens at the beginning of the roll, but then we saw this:

error detaching instance "i-XXXX", node "ip-10-X-X-X.ec2.internal": error detaching instance "i-XXXX": ValidationError: The instance i-XXXX is not part of Auto Scaling group XXXXX

By looking at the code, this happens when kops prepares the new ASG with the launch template. It essentially discovers the existing instances and ignores the one that is already terminating:

W0205 08:01:32.593377 191 aws_cloud.go:791] ignoring instance as it is terminating: i-XXXX in autoscaling group: XXXX

And then we start with the actual IG update and detachment. And the detachment has failed because, for whatever reason, the EC2 instance is not in the ASG anymore: between the previous warning and the actual detachment, the instance has vanished (it was already terminating). No?

In this PR we do not ignore every detachment failure, just the one that claims the EC2 instance is not part of the ASG. Do you think this is still problematic?
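For illustration, a rough Go sketch of that discovery step, using aws-sdk-go. This is not the actual kops code; the function name, log wording, and ASG name are assumptions. It only shows where the window comes from: an instance skipped here because it is terminating can disappear from the ASG before the later DetachInstances call.

```go
package main

import (
	"fmt"
	"strings"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

// listDetachCandidates lists the instances the ASG currently reports and
// skips the ones that are already on their way out.
func listDetachCandidates(svc *autoscaling.AutoScaling, asgName string) ([]string, error) {
	out, err := svc.DescribeAutoScalingGroups(&autoscaling.DescribeAutoScalingGroupsInput{
		AutoScalingGroupNames: []*string{aws.String(asgName)},
	})
	if err != nil {
		return nil, fmt.Errorf("describing ASG %q: %w", asgName, err)
	}
	if len(out.AutoScalingGroups) == 0 {
		return nil, fmt.Errorf("ASG %q not found", asgName)
	}

	var candidates []string
	for _, in := range out.AutoScalingGroups[0].Instances {
		// Terminating instances are skipped; between this point and the later
		// DetachInstances call, such an instance can leave the ASG entirely.
		if strings.HasPrefix(aws.StringValue(in.LifecycleState), "Terminating") {
			fmt.Printf("ignoring instance as it is terminating: %s in autoscaling group: %s\n",
				aws.StringValue(in.InstanceId), asgName)
			continue
		}
		candidates = append(candidates, aws.StringValue(in.InstanceId))
	}
	return candidates, nil
}

func main() {
	svc := autoscaling.New(session.Must(session.NewSession()))
	ids, err := listDetachCandidates(svc, "nodes.example.k8s.local")
	fmt.Println(ids, err)
}
```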
That would also be possible. I am just not sure if we hit that case. As I said before, the timeframe that I think we are looking at is between discovering existing ASG EC2 instances and doing the actual detachment, which is potentially much longer than the one you suggest. But in any case, what you say sounds possible as well 👍 Btw, the number of nodes in the ASG is not fluctuating a lot in our case.
That would also be a reasonable thing to do perhaps 🤔
Yeah, I think any failure of detachment will result in insufficient surge. So it is better to start the roll over again. This is less than ideal though. I wonder if, instead of just returning the error, kops could try to detach another node instead.
Is CAS able to choose an instance that has already been detached for scale-down? After it's detached, it won't be in the ASG. Having CAS be able to scale down during a rolling update effectively increases the maxUnavailable during CAS's draining and has some probability of decreasing the effective maxSurge (by choosing an instance that is, or is about to be, detached). As long as CAS respects PodDisruptionBudgets, I'm not too worried about not strictly following the surge/unavailable limits for such races.
Not already-detached instances, no. But I wonder if what happens here is:
I am also not worried about CAS due to surging once the roll starts. But it is a known issue that when someone updates the cluster to a new k8s version, CAS can spin up new nodes before the CP is rolled. There are several k8s versions where workers cannot join the cluster with older CP versions.
the official version policy states that kubelet cannot be newer than kube-apiserver. we had an issue tracking this specific situation: #7323 |
It's always possible for an ASG to start a worker node before the control plane has been updated. This can happen once the apply_cluster updates the ASG with the new template. I believe it is usually the case that such too-new nodes will subsequently successfully join the cluster once the control plane has updated.
No, you typically have had to terminate the nodes in the past. Anyways, we digress. @johngmyers do you think my concern about skipping any failed detachment makes sense?
Let me state here that we did not update k8s as part of the kops update. We only went from kops 1.18 -> 1.19 and we kept the k8s version the same (1.18.10). We have also updated the AMI on all nodes (099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20210119.1). As such, in the end all nodes were rolled.
Instances can be spontaneously terminated by AWS. Also, the market price of spot instances can decrease, causing ASGs to become able to fill desired capacity that previously had too low a bid price.

I see the concern that if the surge value is low, for example "1", then if one is unlucky the update can change from a surge to a deficit. Balancing that are the facts that if CAS is scaling down nodes then there is probably extra capacity anyway, and that the detached instances could also get scaled down after the detach is complete. I don't see the value of addressing the deficit only for the case where the scale-down happens before the detach.

I have been thinking that the detach code could be moved further down into the rolling update itself. I will note that since the detached instances are tainted, they are at increased risk of being scaled down by CAS, especially if the update gets stuck on a PDB for a long time.
Any news on this? If I understand correctly, this PR solves a specific issue while we are talking about the upgrade procedure in a more generic way ^^ Should we merge this and continue the conversation about the cases you described in a different PR/issue?
Sorry for diverging a bit too far from the original topic.

The key concern I have with this PR remains that I see a risk with ignoring failed detachments regardless of reason, as it will lead to concurrently rolling more nodes than you surged. You could end up with two nodes being drained, having zero detached nodes, and thus nowhere for the evicted pods to go. Without this PR, you get a stuck roll, which is annoying, but at least erring on the safe side.

My suggestion to solve this is to have kops try to detach another node instead. We should always have the expected number of nodes detached. The problem of an already-detached node terminating is a real problem as well, but out of the scope of this PR/discussion.
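A minimal sketch of that suggestion, under the assumption that there is a list of detach candidates and a per-instance detach helper (both names here are hypothetical, not the real kops functions): keep walking the candidate list until the requested number of instances has actually been detached, so the surge accounting stays correct.

```go
package example

import "fmt"

// detachUpTo tries candidates in order until `want` instances have actually
// been detached. detachOne is a stand-in for the real per-instance detach call.
func detachUpTo(candidates []string, want int, detachOne func(id string) error) ([]string, error) {
	var detached []string
	for _, id := range candidates {
		if len(detached) >= want {
			break
		}
		if err := detachOne(id); err != nil {
			// The instance may have vanished (e.g. it was already terminating)
			// between discovery and detachment; try the next candidate instead
			// of aborting the whole roll.
			continue
		}
		detached = append(detached, id)
	}
	if len(detached) < want {
		return detached, fmt.Errorf("only detached %d of %d requested instances", len(detached), want)
	}
	return detached, nil
}
```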
No problem @olemarkus. I will look into implementing your suggestion.
Force-pushed from 0724471 to 0a49650
I have updated the PR
Thanks. This looks good to me.
/lgtm
/milestone v1.21
/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: olemarkus, tchatzig

The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
What this PR does / why we need it:
Sometimes, we observe the following error during a rolling update:
error detaching instance "i-XXXX", node "ip-10-X-X-X.ec2.internal": error detaching instance "i-XXXX": ValidationError: The instance i-XXXX is not part of Auto Scaling group XXXXX
The sequence of events that leads to this problem is the following:

- A new ASG object is being built from the launch template.
- Existing instances are being added to it.
- An existing instance is being ignored because it is already terminating:

  W0205 08:01:32.593377 191 aws_cloud.go:791] ignoring instance as it is terminating: i-XXXX in autoscaling group: XXXX

- Due to maxSurge, kops then tries to detach the terminating instance from the autoscaling group, which fails.
As such, in case of EC2 ASG detach failures we can simply try to detach the next node instead of aborting the whole update operation.
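For reference, a hedged sketch of how that specific failure can be told apart from other detach errors with aws-sdk-go; the helper name is illustrative and this is not necessarily the exact check the PR adds.

```go
package example

import (
	"strings"

	"github.com/aws/aws-sdk-go/aws/awserr"
)

// isNotPartOfASGError reports whether err is the ValidationError AWS returns
// when DetachInstances is asked to remove an instance the ASG no longer owns,
// as in the error message quoted above.
func isNotPartOfASGError(err error) bool {
	if aerr, ok := err.(awserr.Error); ok {
		return aerr.Code() == "ValidationError" &&
			strings.Contains(aerr.Message(), "is not part of Auto Scaling group")
	}
	return false
}
```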