
aws: Graceful handling of EC2 detach errors #10740

Merged

Conversation

@hwoarang (Contributor) commented Feb 5, 2021

What this PR does / why we need it:

Sometimes, we observe the following error during a rolling update:

error detaching instance "i-XXXX", node "ip-10-X-X-X.ec2.internal": error detaching instance "i-XXXX": ValidationError: The instance i-XXXX is not part of Auto Scaling group XXXXX

The sequence of events that leads to this problem is the following:

  • A new ASG object is being built from the launch template
  • Existing instances are being added to it
  • An existing instance is being ignored because it's already terminating
    W0205 08:01:32.593377 191 aws_cloud.go:791] ignoring instance as it is terminating: i-XXXX in autoscaling group: XXXX
  • Due to maxSurge, kops tries to detach the terminating instance from
    the autoscaling group and fails.

As such, in case of EC2 ASG detach failures we can simply try to detach
the next node instead of aborting the whole update operation.
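
For illustration only (the merged change handles detach errors more generally, as discussed below): a minimal sketch of how this specific error could be recognized, assuming the aws-sdk-go v1 awserr package; the helper name is hypothetical.

```go
package main

import (
	"fmt"
	"strings"

	"github.com/aws/aws-sdk-go/aws/awserr"
)

// isNotPartOfASGError reports whether err is the ValidationError that AWS
// returns when detaching an instance that is no longer in the Auto Scaling
// group. Hypothetical helper, for illustration only.
func isNotPartOfASGError(err error) bool {
	if awsErr, ok := err.(awserr.Error); ok {
		return awsErr.Code() == "ValidationError" &&
			strings.Contains(awsErr.Message(), "is not part of Auto Scaling group")
	}
	return false
}

func main() {
	// Simulate the error from the description above.
	err := awserr.New("ValidationError",
		"The instance i-XXXX is not part of Auto Scaling group XXXXX", nil)
	fmt.Println(isNotPartOfASGError(err)) // prints: true
}
```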

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Feb 5, 2021
@k8s-ci-robot (Contributor)

Hi @hwoarang. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Feb 5, 2021
@k8s-ci-robot k8s-ci-robot added the area/provider/aws Issues or PRs related to aws provider label Feb 5, 2021
@olemarkus (Member)

When you see this happen, have all other nodes already been rolled?
The detached instances must always be the last ones terminating; if not, you will no longer have the expected surge.
I.e. if the detached instance is the last instance in the IG, I agree it is safe. If it is not (perhaps cluster autoscaler decided to terminate it), it is not.

@hwoarang (Contributor, Author) commented Feb 5, 2021

When you see this happen, have all other nodes already been rolled?
The detached instances must always be the last ones terminating; if not, you will no longer have the expected surge.
I.e. if the detached instance is the last instance in the IG, I agree it is safe. If it is not (perhaps cluster autoscaler decided to terminate it), it is not.

@olemarkus I lost you a bit there as I am not sure what you meant :)

So we saw this error in two different clusters after a kops update from 1.18 -> 1.19. When this breakage happened and we ran kops update again, it reported that only one instance needed to be updated. So my theory is that the detached instance was not the last one in the ASG that needed to be rolled. I believe that the instance was terminating for some reason (maybe CA, maybe something else) but it was not detached from the ASG, as I think that terminating an EC2 instance does not remove it from the ASG until it's fully terminated. In the meantime, its ID was discovered due to the usual tags and added to the new ASG, and later on kops tried to detach it in order to bring up a fresh one. Does that sound possible?

Or to phrase it in a different way, this error suggests that we tried to detach a non-existing EC2 instance from the ASG. Isn't it normally safe to not hard-fail on this and just let things proceed?

@olemarkus (Member)

This process is a bit complicated. But I was wrong anyways.

The detachment happens at the beginning of a roll. If detachment fails and you continue, you can end up with kops thinking it should drain and terminate 4 nodes at a time, while only 1 node has been detached. I.e. you have not surged as much as expected and, as a consequence, are rolling more nodes at a time than you surged.

Is the number of nodes per ASG very volatile? As far as I can tell, this can only happen if something terminates the instance in the smallish timeframe between kops selecting the candidates for detachment and actually detaching the instance.

Does this make more sense?

As CAS can interfere with the roll anyway, I wonder if kops should disable it during rolls. This is also because sometimes CAS can scale up an ASG before the new masters are ready, leading to k8s worker/control-plane incompatibilities.

@hwoarang (Contributor, Author) commented Feb 5, 2021

This process is a bit complicated. But I was wrong anyways.

The detachment happens at the beginning of a roll. If detachment fails and you continue, you can end up with kops thinking it should drain and terminate 4 nodes at a time, while only 1 node has been detached. I.e. you have not surged as much as expected and, as a consequence, are rolling more nodes at a time than you surged.

Yes, I suspected that the detachment happens at the beginning of the roll, but then we saw this

W0205 08:01:32.593377 191 aws_cloud.go:791] ignoring instance as it is terminating: i-XXXX in autoscaling group: XXXX

By looking at the code, this happens when kops prepares the new ASG with the launch template. It essentially discovers the instances based on their tags and puts their IDs in the new ASG object. So far so good.

And then we start with the actual IG update and detachment.

And the detachment failed because, for whatever reason, the EC2 instance is not in the ASG anymore: between the previous warning and the actual detachment, the instance vanished (it was already terminating). No?

In this PR we do not ignore every detachment failure, just the one that claims that the EC2 instance is not part of the ASG. Do you think this is still problematic?

Is the number of nodes per ASG very volatile? As far as I can tell, this can only happen if something terminates the instance in the smallish timeframe between kops selecting the candidates for detachment and actually detaching the instance.

Does this make more sense?

That would also be possible. I am just not sure if we hit that case. As I said before, the timeframe that I think we are looking at is between discovering the existing ASG EC2 instances and doing the actual detachment, which is potentially much longer than the one you suggest. But in any case, what you say sounds possible as well 👍

Btw, the number of nodes in the ASG does not fluctuate a lot in our case.

As CAS can interfere with the roll anyway, I wonder if kops should disable it during rolls. This is also because sometimes CAS can scale up an ASG before the new masters are ready, leading to k8s worker/control-plane incompatibilities.

That would also be a reasonable thing to do perhaps 🤔

@olemarkus (Member)

Yeah, I think any failure of detachment will result in insufficient surge. So it is better to start the roll over again. This is less than ideal though.

I wonder if, instead of just returning err in pkg/instancegroups/instancegroups.go, you could advance to the next instance. Add a skippedNodes counter that is incremented if detach returns an error, decrement numSurge, and change u := update[len(update)-numSurge] to u := update[len(update)-numSurge+skippedNodes].
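
A rough, self-contained sketch of that idea (the types and helper below are hypothetical stand-ins, not the actual pkg/instancegroups code, and the index subtracts skippedNodes so that a failed candidate is replaced by the next-earlier instance):

```go
package main

import (
	"errors"
	"fmt"
)

// cloudInstance and detachInstance are hypothetical stand-ins for the kops
// types involved; the real logic lives in pkg/instancegroups/instancegroups.go.
type cloudInstance struct {
	ID       string
	Detached bool
}

// detachInstance fails for instances that are no longer in the ASG,
// mimicking the ValidationError seen during the rolling update.
func detachInstance(u *cloudInstance, gone map[string]bool) error {
	if gone[u.ID] {
		return errors.New("ValidationError: instance is not part of the Auto Scaling group")
	}
	u.Detached = true
	return nil
}

func main() {
	update := []*cloudInstance{{ID: "i-aaa"}, {ID: "i-bbb"}, {ID: "i-ccc"}, {ID: "i-ddd"}}
	gone := map[string]bool{"i-ddd": true} // already terminating, no longer in the ASG
	maxSurge := 2

	// On a detach error, move on to the next candidate instead of returning,
	// so the number of successfully detached instances still reaches maxSurge.
	skippedNodes := 0
	for numSurge := 1; numSurge <= maxSurge; numSurge++ {
		idx := len(update) - numSurge - skippedNodes
		if idx < 0 {
			break // no candidates left to detach
		}
		u := update[idx]
		if u.Detached {
			continue
		}
		if err := detachInstance(u, gone); err != nil {
			fmt.Printf("error detaching %s, trying the next instance: %v\n", u.ID, err)
			skippedNodes++
			numSurge-- // retry this surge slot with the next candidate
			continue
		}
		fmt.Printf("detached %s\n", u.ID)
	}
}
```

With the last candidate already terminating, this detaches the two preceding instances, so the roll still surges by maxSurge instead of aborting.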

@johngmyers (Member)

Is CAS able to choose an instance that has already been detached for scale-down? After it's detached, it won't be in the ASG.

Having CAS be able to scale down during a rolling update effectively increases the maxUnavailable during CAS's draining and has a probability of decreasing the effective maxSurge (by choosing a detached instance, possibly before it was detached). As long as CAS respects PodDisruptionBudgets, I'm not too worried about not strictly following the surge/unavailable limits for such races.

@olemarkus (Member)

Not already-detached instances, no.

But I wonder if what happens here is:

  • fetch instance list
  • CAS terminates the instance at the top of the list
  • kOps tries to detach instance

I am also not worried about CAS due to surging once the roll starts. But it is a known issue that someone updates the cluster to a new k8s version, then CAS spins up new nodes before the CP is rolled. There are several k8s versions where workers cannot join the cluster with older CP versions.

@rifelpet (Member) commented Feb 5, 2021

The official version policy states that kubelet cannot be newer than kube-apiserver. We had an issue tracking this specific situation: #7323

@johngmyers (Member)

It's always possible for an ASG to start a worker node before the control plane has been updated. This can happen once the apply_cluster updates the ASG with the new template. I believe it is usually the case that such too-new nodes will subsequently successfully join the cluster once the control plane has updated.

@olemarkus (Member)

It's always possible for an ASG to start a worker node before the control plane has been updated. This can happen once the apply_cluster updates the ASG with the new template. I believe it is usually the case that such too-new nodes will subsequently successfully join the cluster once the control plane has updated.

No, you typically have had to terminate the nodes in the past.

Anyways, we digress. @johngmyers do you think my concern about skipping any failed detachment makes sense?

@hwoarang (Contributor, Author) commented Feb 6, 2021

Let me state here that we did not update k8s as part of the kops update. We only went from kops 1.18 -> 1.19 and we kept the k8s version the same (1.18.10). We have also updated the AMI on all nodes (099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20210119.1). As such, in the end all nodes were rolled.

@johngmyers (Member)

Instances can be spontaneously terminated by AWS. Also, the market price of spot instances can decrease, causing ASGs to become able to fill desired capacity that previously had too low a bid price.

I see the concern that if the surge value is low, for example "1", then if one is unlucky the update can change from a surge to a deficit. Balancing that are the facts that if CAS is scaling down nodes then there is probably extra capacity anyway and that the detached instances could also get scaled down after the detach is complete. I don't see the value of addressing the deficit only for the case when the scale down happens before detach.

I have been thinking that the detach code could be moved down inside the range update loop. This would allow the code to drain the first instance in parallel with draining the remaining instances. Add a call to Cloud.GetCloudGroups() inside the loop and it could then detect when the detached instances spontaneously terminate. But I'm not sure this is worth it.

I will note that since the detached instances are tainted, they are at increased risk of being scaled down by CAS, especially if the update gets stuck on a PDB for a long time.

@dntosas (Contributor) commented Feb 15, 2021

Any news on this?

If I understand correctly, this PR solves a specific issue while we are talking about the upgrade procedure in a more generic way ^^ Should we merge this and discuss the cases you described in a different PR/issue?

@olemarkus (Member)

Sorry for diverging a bit too far from the original topic.

The key concern I have with this PR remains that I see a risk in ignoring failed detachments regardless of reason, as it will lead to concurrently rolling more nodes than you surge. You could end up with two nodes being drained, having zero detached nodes, and thus nowhere for the evicted pods to go.

Without this PR, you get a stuck roll, which is annoying, but at least erring on the safe side.

My suggestion to solve this is to have kops try to detach another node instead. We should always have the expected number of nodes detached.

The problem of an already-detached node terminating is a real problem as well, but out of the scope of this PR/discussion.

@hwoarang (Contributor, Author) commented Mar 5, 2021

No problem @olemarkus, I will look into implementing your suggestion.

@hwoarang hwoarang force-pushed the ignore-detached-instances-aws branch from 0724471 to 0a49650 Compare March 5, 2021 13:05
@k8s-ci-robot k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Mar 5, 2021
@hwoarang (Contributor, Author) commented Mar 5, 2021

I have updated the PR

@hwoarang hwoarang changed the title aws: Ignore already detached EC2 instance aws: Graceful handling of EC2 detach errors Mar 5, 2021
@olemarkus (Member)

Thanks. This looks good to me.
Will let others have a chance to comment before merging.

/lgtm

/milestone v1.21

@k8s-ci-robot k8s-ci-robot added this to the v1.21 milestone Mar 6, 2021
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 6, 2021
@olemarkus (Member)

/approve

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: olemarkus, tchatzig

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 10, 2021
@k8s-ci-robot k8s-ci-robot merged commit 8a8a4c8 into kubernetes:master Mar 10, 2021
@hwoarang hwoarang deleted the ignore-detached-instances-aws branch September 30, 2021 12:49