Reaction on "failed to launch a new EC2 instance" event #1263

Closed
Dregathar opened this issue Sep 20, 2018 · 18 comments
Labels
area/cluster-autoscaler, area/provider/aws, help wanted, lifecycle/rotten

Comments

@Dregathar

Hello,
Our application uses different types of instances as k8s nodes: c5 and i3. Pods running on c5 instances (let's call them 'core pods') use EBS volumes, so these pods are bound to a particular AZ. Pods running on i3 instances (let's call them 'compute pods') don't use EBS volumes and can run in any AZ. However, the application performs better if the core pods and the compute pods run in the same AZ.

The problem with i3 instances is that they may be unavailable in some AZs at times (I have personally seen two cases where i3.2xlarge instances were not available in a given AZ, and the AWS console recommended launching instances in other AZs). Ideally we want to run the compute pods in the same AZ as the core pods, but if i3 instances are not available in that AZ, it is fine to run them in another AZ.

To configure this scenario and test how the autoscaler behaves, I've created 2 node groups:

nodes-1a
  machineType: i3.xlarge
  maxSize: 10
  minSize: 0
  nodeLabels:
    instance_macro_group: computenodes
    kops.k8s.io/instancegroup: nodes-1a
	
nodes-1b
  machineType: i3.2xlarge
  maxSize: 10
  minSize: 0
  nodeLabels:
    instance_macro_group: computenodes
    kops.k8s.io/instancegroup: nodes-1b

Then I've created a deployment whose pod template spec contains the following:

    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - preference:
              matchExpressions:
              - key: kops.k8s.io/instancegroup
                operator: In
                values:
                - nodes-1a
            weight: 1
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: instance_macro_group
                operator: In
                values:
                - computenodes

and the container resources are configured with the following memory request:

        resources:
          requests:
            memory: 25Gi

The nodes-1a instance group has maxSize = 10; however, the AWS account has a limit of 6 running i3.xlarge instances. (I'm not able to reproduce an actual shortage of a particular instance type, so using the instance limit serves the purpose of the test.)

When the deployment is scaled to 1 replica, cluster autoscaler grows nodes-1a to 1 node, and at 6 replicas it grows nodes-1a to 6 nodes. At 7 replicas, the desired size of nodes-1a is set to 7, but launching a new instance in the Auto Scaling group fails with the following description: "Launching a new EC2 instance. Status Reason: You have requested more instances (7) than your current instance limit of 6 allows for the specified instance type. Please visit http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit. Launching EC2 instance failed."

At this point the pod is stuck in 'Pending' state indefinitely.
What would be useful in this case is a mechanism to get feedback from the Auto Scaling group that launching a new EC2 instance failed, and then to pick an alternative instance group (in our case, nodes-1b) and scale it out so the pod can be scheduled.
Does this request seem reasonable, or is there a way around the problem using existing functionality?

@aleksandra-malinowska
Contributor

Existing functionality includes back-off on a failed node group scale-up attempt. This should cause the other group to be expanded instead, like you describe.

If I understand correctly, the problem here is that the request to set the number of instances to 7 succeeds, but AWS doesn't start the instance? In this case, Cluster Autoscaler has no way of knowing the instance won't start until the node fails to register and the scale-up is considered timed out (which takes 15-20 minutes), after which it should enter back-off and scale the other group. However, if the target size of the instance group remains 7, it may believe that the scale-up attempt was successful after all (I'll take a look at the code later to verify this).

Ideally, a request setting the node group's size above the limit should fail immediately. This could probably be implemented in the cloud provider code by checking for this error after the ASG resize.
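
A rough sketch of what that check could look like against the AWS SDK for Go; checkResizeFailed and its signature are made up for illustration and are not actual cluster-autoscaler code (note also that the failed activity may take a few seconds to become visible after the resize):

// Illustrative only: right after resizing the ASG, look at its most recent
// scaling activity and surface a launch failure as an error instead of
// waiting for the node-registration timeout.
package awssketch

import (
    "fmt"
    "time"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/service/autoscaling"
    "github.com/aws/aws-sdk-go/service/autoscaling/autoscalingiface"
)

func checkResizeFailed(svc autoscalingiface.AutoScalingAPI, asgName string, resizeTime time.Time) error {
    out, err := svc.DescribeScalingActivities(&autoscaling.DescribeScalingActivitiesInput{
        AutoScalingGroupName: aws.String(asgName),
        MaxRecords:           aws.Int64(1),
    })
    if err != nil {
        return err
    }
    for _, a := range out.Activities {
        // Only treat the activity as relevant if it started after our resize.
        if aws.StringValue(a.StatusCode) == "Failed" && a.StartTime != nil && a.StartTime.After(resizeTime) {
            return fmt.Errorf("scale-up of %s failed: %s", asgName, aws.StringValue(a.StatusMessage))
        }
    }
    return nil
}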

@aleksandra-malinowska added the help wanted, area/cluster-autoscaler and area/provider/aws labels on Sep 20, 2018
@Dregathar
Author

@aleksandra-malinowska , thanks for your quick response. Your understanding of the issue is correct.
I looked at how a failed scale-out event can be monitored; for example, this AWS CLI command shows the last scaling activity for a given ASG:

$ aws autoscaling describe-scaling-activities --auto-scaling-group-name "nodes-1a.kops-dev.k8s.local" --max-items 1
{
    "Activities": [
        {
            "Description": "Launching a new EC2 instance.  Status Reason: You have requested more instances (7) than your current instance limit of 6 allows for the specified instance type. Please visit http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit. Launching EC2 instance failed.",
            "AutoScalingGroupName": "nodes-1a.kops-dev.k8s.local",
            "ActivityId": "d25597f4-007b-e8ad-2474-fb39d1b0e21b",
            "Details": "{\"Subnet ID\":\"subnet-0f77a91e1abb847ae\",\"Availability Zone\":\"us-east-1a\"}",
            "StartTime": "2018-09-21T16:39:28.250Z",
            "Progress": 100,
            "EndTime": "2018-09-21T16:39:28Z",
            "Cause": "At 2018-09-21T16:39:26Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 5 to 6.",
            "StatusMessage": "You have requested more instances (7) than your current instance limit of 6 allows for the specified instance type. Please visit http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit. Launching EC2 instance failed.",
            "StatusCode": "Failed"
        }
    ],
    "NextToken": "eyJOZXh0VG9rZW4iOiBudWxsLCAiYm90b190cnVuY2F0ZV9hbW91bnQiOiAxfQ=="
}

Don't be confused by the Cause ("increasing the capacity from 5 to 6") and the StatusMessage ("You have requested more instances (7)"): I have another running instance of this type in this account that is not part of this particular ASG.
Also, the scaling activity with the "Failed" StatusCode appears within a reasonable time (say, less than 15 seconds) after changing the desired number of instances in the ASG.

A successful request to increase the capacity looks like this:

$ aws autoscaling describe-scaling-activities --auto-scaling-group-name "nodes-1a.kops-dev.k8s.local" --max-items 1
{
    "Activities": [
        {
            "Description": "Launching a new EC2 instance: i-0d8d1a89d7120ce1d",
            "AutoScalingGroupName": "nodes-1a.kops-dev.k8s.local",
            "ActivityId": "e54597f3-e9c4-ee1e-f1cb-5724ac96be07",
            "Details": "{\"Subnet ID\":\"subnet-0f77a91e1abb847ae\",\"Availability Zone\":\"us-east-1a\"}",
            "StartTime": "2018-09-21T16:33:16.091Z",
            "Progress": 100,
            "EndTime": "2018-09-21T16:33:49Z",
            "Cause": "At 2018-09-21T16:32:48Z a user request explicitly set group desired capacity changing the desired capacity from 1 to 2.  At 2018-09-21T16:33:14Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 1 to 2.",
            "StatusCode": "Successful"
        }
    ],
    "NextToken": "eyJOZXh0VG9rZW4iOiBudWxsLCAiYm90b190cnVuY2F0ZV9hbW91bnQiOiAxfQ=="
}

So, if there were some logic that checks the last scaling activity of the ASG after a scale-out event and, when the StatusCode is "Failed" and the StartTime is later than the scale-out event, scales the ASG back in and tries an alternative instance group, that should take care of this case.
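
A hedged sketch of that flow against the AWS SDK for Go; the function name, the parameters and the blocking wait are all assumptions for illustration (the wait is exactly the latency trade-off discussed in the comments below):

// Illustrative only: raise the desired capacity, inspect the most recent
// scaling activity, and if it failed after our request, roll the group back
// so an alternative instance group can be tried instead.
package awssketch

import (
    "fmt"
    "time"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/service/autoscaling"
    "github.com/aws/aws-sdk-go/service/autoscaling/autoscalingiface"
)

func scaleOutWithFallback(svc autoscalingiface.AutoScalingAPI, asgName string, oldSize, newSize int64) error {
    requestTime := time.Now()
    if _, err := svc.SetDesiredCapacity(&autoscaling.SetDesiredCapacityInput{
        AutoScalingGroupName: aws.String(asgName),
        DesiredCapacity:      aws.Int64(newSize),
    }); err != nil {
        return err
    }

    // The "Failed" activity was observed above to appear within ~15 seconds,
    // so give it a moment before checking.
    time.Sleep(15 * time.Second)

    out, err := svc.DescribeScalingActivities(&autoscaling.DescribeScalingActivitiesInput{
        AutoScalingGroupName: aws.String(asgName),
        MaxRecords:           aws.Int64(1),
    })
    if err != nil {
        return err
    }
    if len(out.Activities) > 0 {
        a := out.Activities[0]
        if aws.StringValue(a.StatusCode) == "Failed" && a.StartTime != nil && a.StartTime.After(requestTime) {
            // Scale the ASG back in so the autoscaler is not left waiting for
            // a node that will never register, then report the failure so an
            // alternative instance group can be tried.
            _, _ = svc.SetDesiredCapacity(&autoscaling.SetDesiredCapacityInput{
                AutoScalingGroupName: aws.String(asgName),
                DesiredCapacity:      aws.Int64(oldSize),
            })
            return fmt.Errorf("scale-out of %s failed: %s", asgName, aws.StringValue(a.StatusMessage))
        }
    }
    return nil
}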

@johanneswuerbach
Contributor

When I previously looked into this, the biggest issue I saw was correlating the SetDesiredCapacity call with the failed activity, as the set-desired-capacity API response doesn't return any identifier that can be correlated with activities.

But potentially it could be safe to always assume that all pending scale-ups of an ASG failed if its activities contain a launch failure. In the worst case there will be more capacity available, which will be scaled down eventually.

@aleksandra-malinowska
Contributor

A simple solution would be to do this check as part of the AwsNodeGroup.IncreaseSize() call and fail if an error is found. This should work seamlessly with the current logic, but if it means waiting an extra 15 seconds before proceeding with other scaling operations, I'm not sure it's an acceptable trade-off for you. Otherwise, we may need to think a bit more about updating the cluster state with this information in a provider-agnostic way.

@johanneswuerbach
Contributor

From the above timestamps the duration looks > 15s :-(

"StartTime": "2018-09-21T16:33:16.091Z",
"EndTime": "2018-09-21T16:33:49Z",
"Cause": "At 2018-09-21T16:32:48Z

My idea would be to fetch the latest activity per ASG during Refresh() if the desired capacity is greater than the actual capacity (so we limit the number of additional API calls) and check for errors: https://docs.aws.amazon.com/autoscaling/ec2/userguide/CHAP_Troubleshooting.html#RetrievingErrors

If there is a failure, mark the ASG as temporarily unhealthy (for how long?) and reset the desired capacity to the current value so the autoscaler is aware that the pending node won't be launched.

Looking through the API, I can't find a way to mark an ASG as unhealthy to the autoscaler internals from the cloud provider; any pointers?
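
A sketch of that Refresh()-time check; asgState, refreshASG, the field names and the backoff handling are assumptions for illustration, not the actual cluster-autoscaler or AWS provider code:

// Illustrative only: during Refresh(), check the latest scaling activity of
// an ASG, but only when a scale-up is actually pending, and on failure mark
// the group unhealthy and reset its desired capacity.
package awssketch

import (
    "time"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/service/autoscaling"
    "github.com/aws/aws-sdk-go/service/autoscaling/autoscalingiface"
)

type asgState struct {
    Name            string
    DesiredCapacity int64
    RunningCount    int64
    Unhealthy       bool
    UnhealthyUntil  time.Time
}

func refreshASG(svc autoscalingiface.AutoScalingAPI, asg *asgState, backoff time.Duration) error {
    // Only spend the extra API call when a scale-up is pending.
    if asg.DesiredCapacity <= asg.RunningCount {
        return nil
    }
    out, err := svc.DescribeScalingActivities(&autoscaling.DescribeScalingActivitiesInput{
        AutoScalingGroupName: aws.String(asg.Name),
        MaxRecords:           aws.Int64(1),
    })
    if err != nil {
        return err
    }
    if len(out.Activities) > 0 && aws.StringValue(out.Activities[0].StatusCode) == "Failed" {
        // Back off this group for a while and reset the desired capacity so
        // the autoscaler stops waiting for a node that will never come up.
        asg.Unhealthy = true
        asg.UnhealthyUntil = time.Now().Add(backoff)
        _, err = svc.SetDesiredCapacity(&autoscaling.SetDesiredCapacityInput{
            AutoScalingGroupName: aws.String(asg.Name),
            DesiredCapacity:      aws.Int64(asg.RunningCount),
        })
    }
    return err
}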

@Dregathar
Author

Dregathar commented Sep 26, 2018

Thanks for your comments. Yeah, waiting even 10 extra seconds is not an acceptable trade-off, especially given that the situation we want to handle here is rare to begin with.

@aleksandra-malinowska
Contributor

@Dregathar @johanneswuerbach @gjtempleton @aermakov-zalando and anyone else interested in this issue: #1464 adds a (hopefully generic enough) mechanism for handling stockout and quota errors.

@johanneswuerbach
Contributor

That looks interesting. I have a spike locally that parses the ASG events log, but I haven't had time to figure out how to wire it into the cluster registry.

Will try to use this and report back.

@aleksandra-malinowska
Contributor

Will try to use this and report back.

Cool, let us know if it can be adapted for AWS.

@stefansedich
Contributor

@johanneswuerbach did you ever get anywhere with your spike?

@johanneswuerbach
Contributor

johanneswuerbach commented Feb 27, 2019

Not really, feel free to take it over @stefansedich

The main problem is that events are async and not easily correlated with actions on an ASG. Also, there currently seems to be no way to mark ASGs unhealthy outside of calls to the provider interface. My idea was to add another field to the instance groups, which allows the provider to mark groups as unhealthy during the sync loop, but I haven't tested it due to lack of time.
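
A purely speculative sketch of that idea; neither the type nor the method below exists in the real cloudprovider interface, they only illustrate where such a health flag could live:

// Speculative: a health flag on node groups that the cloud provider could
// set during its sync loop and the core autoscaler could read to skip the
// group. None of these names exist in the real cloudprovider API.
package awssketch

import "time"

type nodeGroupHealth struct {
    Unhealthy      bool
    UnhealthyUntil time.Time
    Reason         string // e.g. the StatusMessage of the failed activity
}

// healthReportingNodeGroup is a hypothetical extension of the node group
// abstraction; the core loop would consult Health() after each Refresh().
type healthReportingNodeGroup interface {
    Health() nodeGroupHealth
}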

@stefansedich
Contributor

Yup @johanneswuerbach, digging around I am running into the same thing. Was your idea to just check the last event during the loop and, if it was a failure, mark the ASG as unhealthy?

@johanneswuerbach
Contributor

Yes, and ideally also mark the ASG size increase as failed, so the autoscaler tries to create a new node in a different AZ and doesn't wait out the full node-not-coming-up timeout.

@stefansedich
Contributor

@johanneswuerbach I have put together a super raw POC using the described strategy, which in my first test appears to do what I want (us-east-2c has no P2 instances available; it will try to scale up that ASG, fail, and then successfully create the node in another AZ with capacity).

master...stefansedich:aws-unhealthy-asg

I originally tried to use the generic method for handling quota and stockout issues added in #1464, but I couldn't see how to get it to play nicely with the async nature of the ASG activities. The first cut was a quick spike and is obviously missing useful error handling and tests, and it is probably littered with issues, but I just wanted to get something together to begin a bigger discussion.

I would love to hear from someone with deeper knowledge of the autoscaler about how we might be able to achieve this and whether there is a better direction to take. @aleksandra-malinowska, any pointers as to whom I might be able to talk to?

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on May 29, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Jun 28, 2019
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
