Reaction on "failed to launch a new EC2 instance" event #1263

Closed
Dregathar opened this issue Sep 20, 2018 · 18 comments
Labels
area/cluster-autoscaler, area/provider/aws, help wanted, lifecycle/rotten

Comments

@Dregathar

Hello,
Our application uses different types of instances as k8s nodes: c5 and i3. Pods running on c5 instances (let's call them 'core pods') use EBS volumes, so these pods are bound to a particular AZ. Pods running on i3 instances (let's call them 'compute pods') don't use EBS volumes and can run in any AZ. However, the application performs better if the core pods and the compute pods run in the same AZ.

The problem with i3 instances is that they may be unavailable in some AZs at times (I have personally seen two cases where i3.2xlarge instances were not available in a given AZ, and the AWS console recommended launching instances in other AZs). Ideally we want to run the compute pods in the same AZ as the core pods, but if i3 instances are not available in that AZ, it is fine to run them in another AZ.

To configure this scenario and test how the autoscaler behaves, I've created 2 node groups:

nodes-1a
  machineType: i3.xlarge
  maxSize: 10
  minSize: 0
  nodeLabels:
    instance_macro_group: computenodes
    kops.k8s.io/instancegroup: nodes-1a
	
nodes-1b
  machineType: i3.2xlarge
  maxSize: 10
  minSize: 0
  nodeLabels:
    instance_macro_group: computenodes
    kops.k8s.io/instancegroup: nodes-1b

Then I've created a deployment whose pod template spec contains the following:

    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - preference:
              matchExpressions:
              - key: kops.k8s.io/instancegroup
                operator: In
                values:
                - nodes-1a
            weight: 1
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: instance_macro_group
                operator: In
                values:
                - computenodes

and the container resources are configured with the following memory request:

        resources:
          requests:
            memory: 25Gi

The nodes-1a instance group has maxSize = 10; however, the AWS account has a limit of 6 running i3.xlarge instances. (I'm not able to reproduce an actual shortage of a particular instance type, so using the instance limit serves the purpose of the test.)

When the deployment is scaled to 1 replica, cluster autoscaler grows nodes-1a to 1 node, and at 6 replicas it grows nodes-1a to 6 nodes. At 7 replicas, the desired size of nodes-1a is set to 7, but launching a new instance in the Auto Scaling group fails with the following description: "Launching a new EC2 instance. Status Reason: You have requested more instances (7) than your current instance limit of 6 allows for the specified instance type. Please visit http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit. Launching EC2 instance failed."

At this point the pod is stuck in 'Pending' state indefinitely.
What would be useful in this case is a mechanism to get feedback from the Auto Scaling group that launching a new EC2 instance failed, and then to pick an alternative instance group (in our case, nodes-1b) and scale it out so the pod can be scheduled.
Does this request seem reasonable, or is there a way around the problem using existing functionality?

@aleksandra-malinowska
Contributor

Existing functionality includes back-off on a failed node group scale-up attempt. This should cause the other group to be expanded instead, like you describe.

If I understand correctly, the problem here is that the request to set the number of instances to 7 succeeds, but AWS doesn't start the instance? In this case, Cluster Autoscaler has no way of knowing the instance won't start until the node fails to register and the scale-up is considered timed out (which takes 15-20 minutes), after which it should enter back-off and scale the other group. However, if the target size of the instance group remains 7, it may believe that the scale-up attempt was successful after all (I'll take a look at the code later to verify this).

Ideally, a request setting the node group's size above the limit should fail immediately. This could probably be implemented in the cloud provider code by checking for this error after the ASG resize.
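
A rough sketch of what that check could look like against the AWS SDK for Go; checkResizeFailed and its signature are made up for illustration and are not actual cluster-autoscaler code (note also that the failed activity may take a few seconds to become visible after the resize):

// Illustrative only: right after resizing the ASG, look at its most recent
// scaling activity and surface a launch failure as an error instead of
// waiting for the node-registration timeout.
package awssketch

import (
    "fmt"
    "time"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/service/autoscaling"
    "github.com/aws/aws-sdk-go/service/autoscaling/autoscalingiface"
)

func checkResizeFailed(svc autoscalingiface.AutoScalingAPI, asgName string, resizeTime time.Time) error {
    out, err := svc.DescribeScalingActivities(&autoscaling.DescribeScalingActivitiesInput{
        AutoScalingGroupName: aws.String(asgName),
        MaxRecords:           aws.Int64(1),
    })
    if err != nil {
        return err
    }
    for _, a := range out.Activities {
        // Only treat the activity as relevant if it started after our resize.
        if aws.StringValue(a.StatusCode) == "Failed" && a.StartTime != nil && a.StartTime.After(resizeTime) {
            return fmt.Errorf("scale-up of %s failed: %s", asgName, aws.StringValue(a.StatusMessage))
        }
    }
    return nil
}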

@aleksandra-malinowska added the help wanted, area/cluster-autoscaler and area/provider/aws labels on Sep 20, 2018
@Dregathar
Author

@aleksandra-malinowska , thanks for your quick response. Your understanding of the issue is correct.
I looked at how a failed scale-out event can be monitored; for example, this AWS CLI command shows the last scaling activity for a given ASG:

$ aws autoscaling describe-scaling-activities --auto-scaling-group-name "nodes-1a.kops-dev.k8s.local" --max-items 1
{
    "Activities": [
        {
            "Description": "Launching a new EC2 instance.  Status Reason: You have requested more instances (7) than your current instance limit of 6 allows for the specified instance type. Please visit http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit. Launching EC2 instance failed.",
            "AutoScalingGroupName": "nodes-1a.kops-dev.k8s.local",
            "ActivityId": "d25597f4-007b-e8ad-2474-fb39d1b0e21b",
            "Details": "{\"Subnet ID\":\"subnet-0f77a91e1abb847ae\",\"Availability Zone\":\"us-east-1a\"}",
            "StartTime": "2018-09-21T16:39:28.250Z",
            "Progress": 100,
            "EndTime": "2018-09-21T16:39:28Z",
            "Cause": "At 2018-09-21T16:39:26Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 5 to 6.",
            "StatusMessage": "You have requested more instances (7) than your current instance limit of 6 allows for the specified instance type. Please visit http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit. Launching EC2 instance failed.",
            "StatusCode": "Failed"
        }
    ],
    "NextToken": "eyJOZXh0VG9rZW4iOiBudWxsLCAiYm90b190cnVuY2F0ZV9hbW91bnQiOiAxfQ=="
}

Don't be confused by the Cause ("increasing the capacity from 5 to 6") and the StatusMessage ("You have requested more instances (7)"): I have another running instance of this type in this account that is not part of this particular ASG.
Also, the scaling activity with the "Failed" StatusCode appears within a reasonable time (say, less than 15 seconds) after changing the desired number of instances in the ASG.

A successful request to increase the capacity looks like this:

$ aws autoscaling describe-scaling-activities --auto-scaling-group-name "nodes-1a.kops-dev.k8s.local" --max-items 1
{
    "Activities": [
        {
            "Description": "Launching a new EC2 instance: i-0d8d1a89d7120ce1d",
            "AutoScalingGroupName": "nodes-1a.kops-dev.k8s.local",
            "ActivityId": "e54597f3-e9c4-ee1e-f1cb-5724ac96be07",
            "Details": "{\"Subnet ID\":\"subnet-0f77a91e1abb847ae\",\"Availability Zone\":\"us-east-1a\"}",
            "StartTime": "2018-09-21T16:33:16.091Z",
            "Progress": 100,
            "EndTime": "2018-09-21T16:33:49Z",
            "Cause": "At 2018-09-21T16:32:48Z a user request explicitly set group desired capacity changing the desired capacity from 1 to 2.  At 2018-09-21T16:33:14Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 1 to 2.",
            "StatusCode": "Successful"
        }
    ],
    "NextToken": "eyJOZXh0VG9rZW4iOiBudWxsLCAiYm90b190cnVuY2F0ZV9hbW91bnQiOiAxfQ=="
}

So, if there were some logic that checks the last scaling activity of the ASG after a scale-out event and, when the StatusCode is "Failed" and the StartTime is later than the scale-out event, scales the ASG back in and tries an alternative instance group, that should take care of this case.
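
A hedged sketch of that flow against the AWS SDK for Go; the function name, the parameters and the blocking wait are all assumptions for illustration (the wait is exactly the latency trade-off discussed in the comments below):

// Illustrative only: raise the desired capacity, inspect the most recent
// scaling activity, and if it failed after our request, roll the group back
// so an alternative instance group can be tried instead.
package awssketch

import (
    "fmt"
    "time"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/service/autoscaling"
    "github.com/aws/aws-sdk-go/service/autoscaling/autoscalingiface"
)

func scaleOutWithFallback(svc autoscalingiface.AutoScalingAPI, asgName string, oldSize, newSize int64) error {
    requestTime := time.Now()
    if _, err := svc.SetDesiredCapacity(&autoscaling.SetDesiredCapacityInput{
        AutoScalingGroupName: aws.String(asgName),
        DesiredCapacity:      aws.Int64(newSize),
    }); err != nil {
        return err
    }

    // The "Failed" activity was observed above to appear within ~15 seconds,
    // so give it a moment before checking.
    time.Sleep(15 * time.Second)

    out, err := svc.DescribeScalingActivities(&autoscaling.DescribeScalingActivitiesInput{
        AutoScalingGroupName: aws.String(asgName),
        MaxRecords:           aws.Int64(1),
    })
    if err != nil {
        return err
    }
    if len(out.Activities) > 0 {
        a := out.Activities[0]
        if aws.StringValue(a.StatusCode) == "Failed" && a.StartTime != nil && a.StartTime.After(requestTime) {
            // Scale the ASG back in so the autoscaler is not left waiting for
            // a node that will never register, then report the failure so an
            // alternative instance group can be tried.
            _, _ = svc.SetDesiredCapacity(&autoscaling.SetDesiredCapacityInput{
                AutoScalingGroupName: aws.String(asgName),
                DesiredCapacity:      aws.Int64(oldSize),
            })
            return fmt.Errorf("scale-out of %s failed: %s", asgName, aws.StringValue(a.StatusMessage))
        }
    }
    return nil
}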

@johanneswuerbach
Contributor

When I previously looked into this, the biggest issue I saw was correlating the SetDesiredCapacity call with the failed activity, as the set-desired-capacity API response doesn't return any identifier that can be correlated with activities.

But potentially it could be safe to always assume that all pending scale-ups of an ASG failed if its activities contain a launch failure. In the worst case there will be more capacity available, which will be scaled down eventually.

@aleksandra-malinowska
Contributor

A simple solution would be to do this check as part of the AwsNodeGroup.IncreaseSize() call and fail if an error is found. This should work seamlessly with the current logic, but if it means waiting an extra 15 seconds before proceeding with other scaling operations, I'm not sure it's an acceptable trade-off for you. Otherwise, we may need to think a bit more about updating the cluster state with this information in a provider-agnostic way.

@johanneswuerbach
Contributor

From the above timestamps the duration looks > 15s :-(

"StartTime": "2018-09-21T16:33:16.091Z",
"EndTime": "2018-09-21T16:33:49Z",
"Cause": "At 2018-09-21T16:32:48Z

My idea would be to fetch the latest activity per ASG during Refresh() if the desired capacity is greater than the actual capacity (so we limit the number of additional API calls) and check for errors: https://docs.aws.amazon.com/autoscaling/ec2/userguide/CHAP_Troubleshooting.html#RetrievingErrors

If there is a failure, mark the ASG as temporarily unhealthy (for how long?) and reset the desired capacity to the current value so the autoscaler is aware that the pending node won't be launched.

Looking through the API, I can't find a way to mark an ASG as unhealthy to the autoscaler internals from the cloud provider; any pointers?
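
A sketch of that Refresh()-time check; asgState, refreshASG, the field names and the backoff handling are assumptions for illustration, not the actual cluster-autoscaler or AWS provider code:

// Illustrative only: during Refresh(), check the latest scaling activity of
// an ASG, but only when a scale-up is actually pending, and on failure mark
// the group unhealthy and reset its desired capacity.
package awssketch

import (
    "time"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/service/autoscaling"
    "github.com/aws/aws-sdk-go/service/autoscaling/autoscalingiface"
)

type asgState struct {
    Name            string
    DesiredCapacity int64
    RunningCount    int64
    Unhealthy       bool
    UnhealthyUntil  time.Time
}

func refreshASG(svc autoscalingiface.AutoScalingAPI, asg *asgState, backoff time.Duration) error {
    // Only spend the extra API call when a scale-up is pending.
    if asg.DesiredCapacity <= asg.RunningCount {
        return nil
    }
    out, err := svc.DescribeScalingActivities(&autoscaling.DescribeScalingActivitiesInput{
        AutoScalingGroupName: aws.String(asg.Name),
        MaxRecords:           aws.Int64(1),
    })
    if err != nil {
        return err
    }
    if len(out.Activities) > 0 && aws.StringValue(out.Activities[0].StatusCode) == "Failed" {
        // Back off this group for a while and reset the desired capacity so
        // the autoscaler stops waiting for a node that will never come up.
        asg.Unhealthy = true
        asg.UnhealthyUntil = time.Now().Add(backoff)
        _, err = svc.SetDesiredCapacity(&autoscaling.SetDesiredCapacityInput{
            AutoScalingGroupName: aws.String(asg.Name),
            DesiredCapacity:      aws.Int64(asg.RunningCount),
        })
    }
    return err
}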

@Dregathar
Author

Dregathar commented Sep 26, 2018

Thanks for your comments. Yeah, waiting even 10 extra seconds is not an acceptable trade-off, especially given that the situation we want to handle here is rare to begin with.

@aleksandra-malinowska
Contributor

@Dregathar @johanneswuerbach @gjtempleton @aermakov-zalando and anyone else interested in this issue: #1464 adds a (hopefully generic enough) mechanism for handling stockout and quota errors.

@johanneswuerbach
Contributor

That looks interesting. I have a spike locally that parses the ASG events log, but I haven't had time to figure out how to wire it into the cluster registry.

Will try to use this and report back.

@aleksandra-malinowska
Contributor

Will try to use this and report back.

Cool, let us know if it can be adapted for AWS.

@stefansedich
Contributor

@johanneswuerbach did you ever get anywhere with your spike?

@johanneswuerbach
Contributor

johanneswuerbach commented Feb 27, 2019

Not really, feel free to take it over @stefansedich

The main problem is that events are async and not easily correlated with actions on an ASG. Also, there currently seems to be no way to mark ASGs unhealthy outside of calls to the provider interface. My idea was to add another field to the instance groups, which allows the provider to mark groups as unhealthy during the sync loop, but I haven't tested it due to lack of time.
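
A purely speculative sketch of that idea; neither the type nor the method below exists in the real cloudprovider interface, they only illustrate where such a health flag could live:

// Speculative: a health flag on node groups that the cloud provider could
// set during its sync loop and the core autoscaler could read to skip the
// group. None of these names exist in the real cloudprovider API.
package awssketch

import "time"

type nodeGroupHealth struct {
    Unhealthy      bool
    UnhealthyUntil time.Time
    Reason         string // e.g. the StatusMessage of the failed activity
}

// healthReportingNodeGroup is a hypothetical extension of the node group
// abstraction; the core loop would consult Health() after each Refresh().
type healthReportingNodeGroup interface {
    Health() nodeGroupHealth
}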

@stefansedich
Contributor

Yup @johanneswuerbach, digging around I am running into the same thing. Was your idea to just check the last event during the loop and, if it was a failure, mark the ASG as unhealthy?

@johanneswuerbach
Contributor

Yes, and ideally also mark the ASG size increase as failed, so the autoscaler tries to create a new node in a different AZ and doesn't wait out the full node-not-coming-up timeout.

@stefansedich
Contributor

@johanneswuerbach I have put together a super raw POC using the described strategy, which in my first test appears to do what I want (us-east-2c has no P2 instances available; it will try to scale up that ASG, fail, and then successfully create the node in another AZ with capacity).

master...stefansedich:aws-unhealthy-asg

I originally tried to use the generic method for handling quota and stockout issues added in #1464, but I couldn't see how to get it to play nicely with the async nature of the ASG activities. The first cut was a quick spike and is obviously missing useful error handling and tests, and it is probably littered with issues, but I just wanted to get something together to begin a bigger discussion.

I would love to hear from someone with deeper knowledge of the autoscaler about how we might be able to achieve this and whether there is a better direction to take. @aleksandra-malinowska, any pointers as to whom I might be able to talk to?

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on May 29, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Jun 28, 2019
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
