
aws: Graceful handling of EC2 detach errors #10740

Merged

Conversation

@hwoarang (Contributor) commented Feb 5, 2021

What this PR does / why we need it:

Sometimes, we observe the following error during a rolling update:

error detaching instance "i-XXXX", node "ip-10-X-X-X.ec2.internal": error detaching instance "i-XXXX": ValidationError: The instance i-XXXX is not part of Auto Scaling group XXXXX

The sequence of events that leads to this problem is the following:

  • A new ASG object is being built from the launch template
  • Existing instances are being added to it
  • An existing instance is being ignored because it's already terminating
    W0205 08:01:32.593377 191 aws_cloud.go:791] ignoring instance as it is terminating: i-XXXX in autoscaling group: XXXX
  • Due to maxSurge, kops tries to detach the terminating instance from
    the autoscaling group and fails.

As such, in case of EC2 ASG detach failures we can simply try to detach
the next node instead of aborting the whole update operation.
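
For illustration only (the merged change handles detach errors more generally, as discussed below): a minimal sketch of how this specific error could be recognized, assuming the aws-sdk-go v1 awserr package; the helper name is hypothetical.

```go
package main

import (
	"fmt"
	"strings"

	"github.com/aws/aws-sdk-go/aws/awserr"
)

// isNotPartOfASGError reports whether err is the ValidationError that AWS
// returns when detaching an instance that is no longer in the Auto Scaling
// group. Hypothetical helper, for illustration only.
func isNotPartOfASGError(err error) bool {
	if awsErr, ok := err.(awserr.Error); ok {
		return awsErr.Code() == "ValidationError" &&
			strings.Contains(awsErr.Message(), "is not part of Auto Scaling group")
	}
	return false
}

func main() {
	// Simulate the error from the description above.
	err := awserr.New("ValidationError",
		"The instance i-XXXX is not part of Auto Scaling group XXXXX", nil)
	fmt.Println(isNotPartOfASGError(err)) // prints: true
}
```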

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Feb 5, 2021
@k8s-ci-robot (Contributor)

Hi @hwoarang. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Feb 5, 2021
@k8s-ci-robot k8s-ci-robot added the area/provider/aws Issues or PRs related to aws provider label Feb 5, 2021
@olemarkus (Member)

When you see this happen, have all other nodes already been rolled?
The detached instances must always be the last ones terminating; if not, you will no longer have the expected surge.
I.e. if the detached instance is the last instance in the IG, I agree it is safe. If it is not (perhaps cluster autoscaler decided to terminate it), it is not.

@hwoarang (Contributor, Author) commented Feb 5, 2021

When you see this happen, have all other nodes already been rolled?
The detached instances must always be the last ones terminating; if not, you will no longer have the expected surge.
I.e. if the detached instance is the last instance in the IG, I agree it is safe. If it is not (perhaps cluster autoscaler decided to terminate it), it is not.

@olemarkus I lost you a bit there as I am not sure what you meant :)

So we saw this error in two different clusters after a kops update from 1.18 -> 1.19. When this breakage happened and we ran kops update again, it reported that only one instance needed to be updated. So my theory is that the detached instance was not the last one in the ASG that needed to be rolled. I believe that the instance was terminating for some reason (maybe CA, maybe something else) but it was not detached from the ASG, as I think that terminating an EC2 instance does not remove it from the ASG until it's fully terminated. In the meantime, its ID was discovered due to the usual tags and added to the new ASG, and later on kops tried to detach it in order to bring up a fresh one. Does that sound possible?

Or to phrase it in a different way, this error suggests that we tried to detach a non-existing EC2 instance from the ASG. Isn't it normally safe to not hard-fail on this and just let things proceed?

@olemarkus (Member)

This process is a bit complicated. But I was wrong anyways.

The detachment happens at the beginning of a roll. If detachment fails and you continue, you can end up with kops thinking it should drain and terminate 4 nodes at a time, while only 1 node has been detached. I.e. you have not surged as much as expected and, as a consequence, are rolling more nodes at a time than you surged.

Is the number of nodes per ASG very volatile? As far as I can tell, this can only happen if something terminates the instance in the smallish timeframe between kops selecting the candidates for detachment and actually detaching the instance.

Does this make more sense?

As CAS can interfere with the roll anyway, I wonder if kops should disable it during rolls. This is also because sometimes CAS can scale up an ASG before the new masters are ready, leading to k8s worker/control-plane incompatibilities.

@hwoarang (Contributor, Author) commented Feb 5, 2021

This process is a bit complicated. But I was wrong anyways.

The detachment happens at the beginning of a roll. If detachment fails and you continue, you can end up with kops thinking it should drain and terminate 4 nodes at a time, while only 1 node has been detached. I.e. you have not surged as much as expected and, as a consequence, are rolling more nodes at a time than you surged.

Yes, I suspected that the detachment happens at the beginning of the roll, but then we saw this

W0205 08:01:32.593377 191 aws_cloud.go:791] ignoring instance as it is terminating: i-XXXX in autoscaling group: XXXX

By looking at the code, this happens when kops prepares the new ASG with the launch template. It essentially discovers the instances based on their tags and puts their IDs in the new ASG object. So far so good.

And then we start with the actual IG update and detachment.

And the detachment failed because, for whatever reason, the EC2 instance is not in the ASG anymore: between the previous warning and the actual detachment, the instance vanished (it was already terminating). No?

In this PR we do not ignore every detachment failure, just the one that claims that the EC2 instance is not part of the ASG. Do you think this is still problematic?

Is the number of nodes per ASG very volatile? As far as I can tell, this can only happen if something terminates the instance in the smallish timeframe between kops selecting the candidates for detachment and actually detaching the instance.

Does this make more sense?

That would also be possible. I am just not sure if we hit that case. As I said before, the timeframe that I think we are looking at is between discovering the existing ASG EC2 instances and doing the actual detachment, which is potentially much longer than the one you suggest. But in any case, what you say sounds possible as well 👍

Btw, the number of nodes in the ASG does not fluctuate a lot in our case.

As CAS can interfere with the roll anyway, I wonder if kops should disable it during rolls. This is also because sometimes CAS can scale up an ASG before the new masters are ready, leading to k8s worker/control-plane incompatibilities.

That would also be a reasonable thing to do perhaps 🤔

@olemarkus (Member)

Yeah, I think any failure of detachment will result in insufficient surge. So it is better to start the roll over again. This is less than ideal though.

I wonder if, instead of just returning err in pkg/instancegroups/instancegroups.go, you could advance to the next instance. Add a skippedNodes counter that is incremented if detach returns an error, decrement numSurge, and change u := update[len(update)-numSurge] to u := update[len(update)-numSurge+skippedNodes].
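
A rough, self-contained sketch of that idea (the types and helper below are hypothetical stand-ins, not the actual pkg/instancegroups code, and the index subtracts skippedNodes so that a failed candidate is replaced by the next-earlier instance):

```go
package main

import (
	"errors"
	"fmt"
)

// cloudInstance and detachInstance are hypothetical stand-ins for the kops
// types involved; the real logic lives in pkg/instancegroups/instancegroups.go.
type cloudInstance struct {
	ID       string
	Detached bool
}

// detachInstance fails for instances that are no longer in the ASG,
// mimicking the ValidationError seen during the rolling update.
func detachInstance(u *cloudInstance, gone map[string]bool) error {
	if gone[u.ID] {
		return errors.New("ValidationError: instance is not part of the Auto Scaling group")
	}
	u.Detached = true
	return nil
}

func main() {
	update := []*cloudInstance{{ID: "i-aaa"}, {ID: "i-bbb"}, {ID: "i-ccc"}, {ID: "i-ddd"}}
	gone := map[string]bool{"i-ddd": true} // already terminating, no longer in the ASG
	maxSurge := 2

	// On a detach error, move on to the next candidate instead of returning,
	// so the number of successfully detached instances still reaches maxSurge.
	skippedNodes := 0
	for numSurge := 1; numSurge <= maxSurge; numSurge++ {
		idx := len(update) - numSurge - skippedNodes
		if idx < 0 {
			break // no candidates left to detach
		}
		u := update[idx]
		if u.Detached {
			continue
		}
		if err := detachInstance(u, gone); err != nil {
			fmt.Printf("error detaching %s, trying the next instance: %v\n", u.ID, err)
			skippedNodes++
			numSurge-- // retry this surge slot with the next candidate
			continue
		}
		fmt.Printf("detached %s\n", u.ID)
	}
}
```

With the last candidate already terminating, this detaches the two preceding instances, so the roll still surges by maxSurge instead of aborting.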

@johngmyers (Member)

Is CAS able to choose an instance that has already been detached for scale-down? After it's detached, it won't be in the ASG.

Having CAS be able to scale down during a rolling update effectively increases the maxUnavailable during CAS's draining and has a probability of decreasing the effective maxSurge (by choosing a detached instance, possibly before it was detached). As long as CAS respects PodDisruptionBudgets, I'm not too worried about not strictly following the surge/unavailable limits for such races.

@olemarkus (Member)

Not already-detached instances, no.

But I wonder if what happens here is:

  • fetch instance list
  • CAS terminates the instance at the top of the list
  • kOps tries to detach instance

I am also not worried about CAS due to surging once the roll starts. But it is a known issue that someone updates the cluster to a new k8s version, then CAS spins up new nodes before the CP is rolled. There are several k8s versions where workers cannot join the cluster with older CP versions.

@rifelpet (Member) commented Feb 5, 2021

The official version policy states that kubelet cannot be newer than kube-apiserver. We had an issue tracking this specific situation: #7323

@johngmyers (Member)

It's always possible for an ASG to start a worker node before the control plane has been updated. This can happen once the apply_cluster updates the ASG with the new template. I believe it is usually the case that such too-new nodes will subsequently successfully join the cluster once the control plane has updated.

@olemarkus (Member)

It's always possible for an ASG to start a worker node before the control plane has been updated. This can happen once the apply_cluster updates the ASG with the new template. I believe it is usually the case that such too-new nodes will subsequently successfully join the cluster once the control plane has updated.

No, you typically have had to terminate the nodes in the past.

Anyways, we digress. @johngmyers do you think my concern about skipping any failed detachment makes sense?

@hwoarang (Contributor, Author) commented Feb 6, 2021

Let me state here that we did not update k8s as part of the kops update. We only went from kops 1.18 -> 1.19 and we kept the k8s version the same (1.18.10). We have also updated the AMI on all nodes (099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20210119.1). As such, in the end all nodes were rolled.

@johngmyers (Member)

Instances can be spontaneously terminated by AWS. Also, the market price of spot instances can decrease, causing ASGs to become able to fill desired capacity that previously had too low a bid price.

I see the concern that if the surge value is low, for example "1", then if one is unlucky the update can change from a surge to a deficit. Balancing that are the facts that if CAS is scaling down nodes then there is probably extra capacity anyway and that the detached instances could also get scaled down after the detach is complete. I don't see the value of addressing the deficit only for the case when the scale down happens before detach.

I have been thinking that the detach code could be moved down inside the range update loop. This would allow the code to drain the first instance in parallel with draining the remaining instances. Add a call to Cloud.GetCloudGroups() inside the loop and it could then detect when the detached instances spontaneously terminate. But I'm not sure this is worth it.

I will note that since the detached instances are tainted, they are at increased risk of being scaled down by CAS, especially if the update gets stuck on a PDB for a long time.

@dntosas (Contributor) commented Feb 15, 2021

Any news on this?

If I understand correctly, this PR solves a specific issue while we are talking about the upgrade procedure in a more generic way ^^ Should we merge this and discuss the cases you described in a different PR/issue?

@olemarkus (Member)

Sorry for diverging a bit too far from the original topic.

The key concern I have with this PR remains that I see a risk in ignoring failed detachments regardless of reason, as it will lead to concurrently rolling more nodes than you surge. You could end up with two nodes being drained, having zero detached nodes, and thus nowhere for the evicted pods to go.

Without this PR, you get a stuck roll, which is annoying, but at least erring on the safe side.

My suggestion to solve this is to have kops try to detach another node instead. We should always have the expected number of nodes detached.

The problem of an already-detached node terminating is a real problem as well, but out of the scope of this PR/discussion.

@hwoarang (Contributor, Author) commented Mar 5, 2021

No problem @olemarkus, I will look into implementing your suggestion.

@hwoarang hwoarang force-pushed the ignore-detached-instances-aws branch from 0724471 to 0a49650 Compare March 5, 2021 13:05
@k8s-ci-robot k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Mar 5, 2021
@hwoarang (Contributor, Author) commented Mar 5, 2021

I have updated the PR

@hwoarang hwoarang changed the title aws: Ignore already detached EC2 instance aws: Graceful handling of EC2 detach errors Mar 5, 2021
@olemarkus (Member)

Thanks. This looks good to me.
Will let others have a chance to comment before merging.

/lgtm

/milestone v1.21

@k8s-ci-robot k8s-ci-robot added this to the v1.21 milestone Mar 6, 2021
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 6, 2021
@olemarkus (Member)

/approve

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: olemarkus, tchatzig

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 10, 2021
@k8s-ci-robot k8s-ci-robot merged commit 8a8a4c8 into kubernetes:master Mar 10, 2021
@hwoarang hwoarang deleted the ignore-detached-instances-aws branch September 30, 2021 12:49