Reaction on "failed to launch a new EC2 instance" event #1263
Comments
Existing functionality includes back-off after a failed node group scale-up attempt. This should cause the other group to be expanded instead, as you describe. If I understand correctly, the problem here is that the request to set the number of instances to 7 succeeds, but AWS doesn't start the instance? In that case, Cluster Autoscaler has no way of knowing the instance won't start until the node fails to register and it considers the scale-up to have timed out (which takes 15-20 minutes), after which it should enter back-off and scale the other group. However, if the target size of the instance group remains 7, it may believe that the scale-up attempt was successful after all (I'll take a look at the code later to verify this). Ideally, the request setting the node group's size above the limit should fail immediately. This can probably be implemented in cloud provider code by checking for this error after the ASG resize.
@aleksandra-malinowska , thanks for your quick response. Your understanding of the issue is correct.
Don't be confused by the Cause ("increasing the capacity from 5 to 6") and the StatusMessage ("You have requested more instances (7)"). This is because I have another running instance of this type in the account that is not part of this particular ASG. A successful request to increase the capacity looks like this:
So, if there is some logic that checks the last scaling activity of the ASG after a scale-out event and, in case the StatusCode is "Failed" and the StartTime is later than the scale-out event, scales the ASG back in and tries an alternative instance group, that should take care of this case.
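For illustration, a minimal sketch of the check described above, assuming the aws-sdk-go (v1) autoscaling client; the helper name, the node group name, and the error handling are illustrative, not actual cluster-autoscaler code:

```go
package main

import (
	"fmt"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

// lastLaunchFailedSince reports whether the most recent scaling activity of
// the given ASG has StatusCode "Failed" and started after the given time,
// i.e. whether the scale-out we just requested could not launch an instance.
func lastLaunchFailedSince(svc *autoscaling.AutoScaling, asgName string, since time.Time) (bool, error) {
	out, err := svc.DescribeScalingActivities(&autoscaling.DescribeScalingActivitiesInput{
		AutoScalingGroupName: aws.String(asgName),
		MaxRecords:           aws.Int64(1), // the API returns the most recent activities first
	})
	if err != nil {
		return false, err
	}
	if len(out.Activities) == 0 {
		return false, nil
	}
	a := out.Activities[0]
	failed := aws.StringValue(a.StatusCode) == "Failed" &&
		a.StartTime != nil && a.StartTime.After(since)
	return failed, nil
}

func main() {
	svc := autoscaling.New(session.Must(session.NewSession()))
	failed, err := lastLaunchFailedSince(svc, "nodes-1a", time.Now().Add(-time.Minute))
	fmt.Println(failed, err)
}
```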
When I previously looked into this, the biggest issue I saw was correlating the SetDesiredCapacity call with the failed activity, as the SetDesiredCapacity API response doesn't return any identifier that can be matched against activities. But it could potentially be safe to always assume that all pending scale-ups of an ASG failed if its activities contain a launch failure. In the worst case there will be more capacity available, which will be scaled down eventually.
A simple solution would be to do this check as part of the AwsNodeGroup.IncreaseSize() call and fail if an error is found. This should work seamlessly with the current logic, but if it means waiting an extra 15 seconds before proceeding with other scaling operations, I'm not sure whether that's an acceptable trade-off for you. Otherwise, we may need to think a bit more about updating the cluster state with this information in a provider-agnostic way.
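A sketch of how such a check might be wired into an IncreaseSize-style call as suggested above; the type, the injected checker function, and the 15-second wait are assumptions for illustration, not the real AwsNodeGroup implementation:

```go
package awsprovider

import (
	"fmt"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

// asgGroup is a stand-in for the provider's node group type.
type asgGroup struct {
	svc  *autoscaling.AutoScaling
	name string
	// launchFailedSince abstracts the activity check sketched earlier.
	launchFailedSince func(asgName string, since time.Time) (bool, error)
}

// IncreaseSize bumps the ASG's desired capacity and then gives AWS a short
// window to report a launch failure, so a stockout/quota error surfaces
// immediately instead of after the node-registration timeout.
func (g *asgGroup) IncreaseSize(target int64) error {
	requestedAt := time.Now()
	_, err := g.svc.SetDesiredCapacity(&autoscaling.SetDesiredCapacityInput{
		AutoScalingGroupName: aws.String(g.name),
		DesiredCapacity:      aws.Int64(target),
		HonorCooldown:        aws.Bool(false),
	})
	if err != nil {
		return err
	}
	// Trade-off discussed above: this blocks the scale-up path for ~15s.
	time.Sleep(15 * time.Second)
	failed, err := g.launchFailedSince(g.name, requestedAt)
	if err != nil {
		return err
	}
	if failed {
		return fmt.Errorf("scale-up of %s to %d failed to launch an instance", g.name, target)
	}
	return nil
}
```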
From the above timestamps the duration looks > 15s :-(
My idea would be to fetch the latest activity per ASG during the sync loop. If there is a failure, mark the ASG as temporarily unhealthy (for how long?) and reset the desired capacity to the current value so the autoscaler is aware that the pending node won't be launched. Looking through the API, I can't find a way to mark an ASG as unhealthy to the autoscaler internally from the cloudprovider, any pointers?
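A rough sketch of that per-refresh approach, assuming a helper like the activity check above; the back-off bookkeeping and field names are hypothetical and this is not how cluster-autoscaler actually tracks node group health:

```go
package awsprovider

import (
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

// backoff tracks ASGs whose latest scaling activity failed.
type backoff struct {
	unhealthyUntil map[string]time.Time
	duration       time.Duration // e.g. 5 * time.Minute; "how long?" is an open question
}

// refreshASG is meant to run once per sync loop for each ASG. If the most
// recent activity failed, it marks the group unhealthy for a while and resets
// the desired capacity to what is actually running, so the autoscaler stops
// waiting for a node that will never appear.
func (b *backoff) refreshASG(svc *autoscaling.AutoScaling, asgName string,
	lastActivityFailed func(string) (bool, error)) error {

	failed, err := lastActivityFailed(asgName)
	if err != nil {
		return err
	}
	if !failed {
		return nil
	}
	b.unhealthyUntil[asgName] = time.Now().Add(b.duration)

	// Reset desired capacity to the number of instances that actually exist.
	out, err := svc.DescribeAutoScalingGroups(&autoscaling.DescribeAutoScalingGroupsInput{
		AutoScalingGroupNames: []*string{aws.String(asgName)},
	})
	if err != nil || len(out.AutoScalingGroups) == 0 {
		return err
	}
	current := int64(len(out.AutoScalingGroups[0].Instances))
	_, err = svc.SetDesiredCapacity(&autoscaling.SetDesiredCapacityInput{
		AutoScalingGroupName: aws.String(asgName),
		DesiredCapacity:      aws.Int64(current),
		HonorCooldown:        aws.Bool(false),
	})
	return err
}

// healthy reports whether a group is currently usable for scale-up.
func (b *backoff) healthy(asgName string) bool {
	return time.Now().After(b.unhealthyUntil[asgName])
}
```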
Thanks for your comments. Yeah, waiting even 10 extra seconds is not an acceptable trade-off, especially given that the situation we want to handle here is rare per se.
@Dregathar @johanneswuerbach @gjtempleton @aermakov-zalando and anyone else interested in this issue: #1464 adds a (hopefully generic enough) mechanism for handling stockout and quota errors.
That looks interesting. I have a spike locally parsing the ASG events log, but haven't had time to figure out how to wire this into the cluster registry. Will try to use this and report back.
Cool, let us know if it can be adapted for AWS.
@johanneswuerbach did you ever get anywhere with your spike?
Not really, feel free to take it over @stefansedich. The main problem is that events are async and not easily correlated with actions on an ASG. Also, there currently seems to be no way to mark ASGs unhealthy outside of calls to the provider interface. My idea was to add another field to the instance groups which allows the provider to mark groups as unhealthy during the sync loop, but I haven't tested it due to lack of time.
Yup @johanneswuerbach, digging around I am running into the same thing. Was your idea to just check the last event during the loop and, if it was a failure, mark the ASG as unhealthy?
Yes, and ideally also mark the ASG size increase as failed, so the autoscaler tries to create a new node in a different AZ and doesn't wait out the full timeout for a node that never comes up.
@johanneswuerbach I have put together a super raw POC using the described strategy, which in my first test appears to do what I want (us-east-2c has no P2 instances available; it will try a scale-up in that ASG, fail, and then successfully create the node in another AZ with capacity): master...stefansedich:aws-unhealthy-asg. I originally tried to use the generic method for handling quota and stockout issues added in #1464. I would love to hear from someone with deeper knowledge of the autoscaler about how we might be able to achieve this and whether there is a better direction to go in. @aleksandra-malinowska, any pointers as to whom I might be able to talk to?
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
@fejta-bot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Hello,
Our application uses different types of instances as k8s nodes: c5 and i3. Pods running on c5 instances (let's call them 'core pods') use EBS volumes, so these pods are bound to a particular AZ. Pods running on i3 instances (let's call them 'compute pods') don't have EBS volumes and can run in any AZ. However, the application performs better if both core pods and compute pods run in the same AZ. The problem with i3 instances is that they may be unavailable in some AZs at some points in time (I personally saw 2 cases when i3.2xlarge instances were not available in a given AZ, and the AWS console recommended creating instances in other AZs). Ideally we want to run compute pods in the same AZ as core pods, but if i3 instances are not available in that AZ, it is fine to run them in another AZ.
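For illustration, one way to express "must run on i3 nodes, prefer the core pods' zone, but tolerate another AZ" is a required node-affinity term plus a preferred pod-affinity term. The labels, instance type, and topology key below are assumptions rather than the reporter's actual manifest, sketched here with the Kubernetes Go API types (the equivalent of a YAML affinity block in the pod template):

```go
package main

import (
	"encoding/json"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// Hypothetical affinity for the compute pods: hard requirement on the
	// i3 instance type, soft preference for the zone where core pods run.
	affinity := corev1.Affinity{
		NodeAffinity: &corev1.NodeAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: &corev1.NodeSelector{
				NodeSelectorTerms: []corev1.NodeSelectorTerm{{
					MatchExpressions: []corev1.NodeSelectorRequirement{{
						Key:      "beta.kubernetes.io/instance-type",
						Operator: corev1.NodeSelectorOpIn,
						Values:   []string{"i3.xlarge"},
					}},
				}},
			},
		},
		PodAffinity: &corev1.PodAffinity{
			PreferredDuringSchedulingIgnoredDuringExecution: []corev1.WeightedPodAffinityTerm{{
				Weight: 100,
				PodAffinityTerm: corev1.PodAffinityTerm{
					LabelSelector: &metav1.LabelSelector{
						MatchLabels: map[string]string{"app": "core"},
					},
					TopologyKey: "failure-domain.beta.kubernetes.io/zone",
				},
			}},
		},
	}
	out, _ := json.MarshalIndent(affinity, "", "  ")
	fmt.Println(string(out)) // prints the JSON equivalent of the affinity block
}
```

Whatever the exact spec, the soft preference only helps if the autoscaler can fall back to another node group when the preferred one cannot launch instances, which is the subject of this issue.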
To configure this scenario and to test how the autoscaler would behave, I've created 2 node groups:
Then I've created a deployment with the template spec containing the following:
and container memory has the following configuration:
The nodes-1a instance group has maxSize = 10; however, the AWS account has a limit of 6 running i3.xlarge instances (I'm not able to reproduce a shortage of instances of a particular type, so using the instance limit fits the purpose of the test).
When the deployment is scaled to 1 replica, cluster autoscaler changes the number of nodes in nodes-1a to 1; when the deployment is scaled to 6 replicas, it changes the number of nodes in nodes-1a to 6. When the deployment is scaled to 7 replicas, the number of nodes in nodes-1a is changed to 7; however, launching a new instance in the autoscaling group fails with the following description: "Launching a new EC2 instance. Status Reason: You have requested more instances (7) than your current instance limit of 6 allows for the specified instance type. Please visit http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit. Launching EC2 instance failed."
At this point the pod is stuck in 'Pending' state indefinitely.
What would be useful in this case is a mechanism to get feedback from the Autoscaling group that launching a new EC2 instance failed, and to try to find an alternative instance group (in our case, nodes-1b) to scale out so the pod can be scheduled on it.
Does this request seem reasonable, or is there a way around the problem using some existing functionality?