Stop waiting for upcoming nodes from unhealthy node groups #1980

@mvisonneau mvisonneau commented May 2, 2019

This is an attempt to solve a long-standing issue that prevents CA from failing over to healthy node groups when a scale-up was attempted on an unhealthy one. This is particularly painful when working with spot-based ASGs, which are very prone to this kind of disruption.

References: #1795 and #1133.

If anyone wants to try it out, I published a Docker image: docker.io/mvisonneau/cluster-autoscaler:1.14.2-fix_scale_up_unavail_node_groups

It seems to work perfectly fine for my use case:

Events:
  Type     Reason            Age                    From                Message
  ----     ------            ----                   ----                -------
  Normal   TriggeredScaleUp  52m (x7 over 54m)      cluster-autoscaler  pod triggered scale-up: [{sb1-in-k8s-worker-spot-r5.2xlarge-b 0->1 (max: 20)}]
  Warning  FailedScheduling  4m10s (x195 over 60m)  default-scheduler   0/9 nodes are available: 6 Insufficient cpu, 9 Insufficient memory.
  Normal   TriggeredScaleUp  52s                    cluster-autoscaler  pod triggered scale-up: [{sb1-in-k8s-worker-spot-m5.4xlarge-b 0->1 (max: 20)}]
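
As a rough illustration of the idea behind this PR (not the actual diff), the sketch below stops counting upcoming nodes that belong to unhealthy node groups, so pending pods can trigger a scale-up on another group. The `NodeGroupState` type, the `upcomingFromHealthyGroups` function, and the group names are hypothetical and are not part of the cluster-autoscaler code base.

```go
package main

import "fmt"

// NodeGroupState is a hypothetical, simplified view of a node group.
// It is NOT a type from the cluster-autoscaler code base.
type NodeGroupState struct {
	ID       string
	Healthy  bool // whether the group is currently considered healthy
	Upcoming int  // nodes requested from the cloud provider but not yet registered
}

// upcomingFromHealthyGroups returns the number of upcoming nodes per group,
// skipping unhealthy groups so that their pending capacity is not relied on.
func upcomingFromHealthyGroups(groups []NodeGroupState) map[string]int {
	upcoming := make(map[string]int)
	for _, g := range groups {
		if !g.Healthy || g.Upcoming <= 0 {
			continue
		}
		upcoming[g.ID] = g.Upcoming
	}
	return upcoming
}

func main() {
	// Illustrative data only: one unhealthy spot group and one healthy one.
	groups := []NodeGroupState{
		{ID: "spot-r5.2xlarge-b", Healthy: false, Upcoming: 1},
		{ID: "spot-m5.4xlarge-b", Healthy: true, Upcoming: 2},
	}
	fmt.Println(upcomingFromHealthyGroups(groups))
}
```

With this filtering, only the healthy group's upcoming nodes are treated as capacity that is on its way. How and when a group is marked unhealthy is outside the scope of this sketch.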

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels May 2, 2019
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: aleksandra-malinowska

If they are not already assigned, you can assign the PR to them by writing /assign @aleksandra-malinowska in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@MaciekPytel
Contributor

I don't think this solves the issue. A single failed scale-up won't make a NodeGroup unhealthy; it will likely take multiple retries before it gets there. And just because a NodeGroup is unhealthy doesn't mean a node that is already being created will fail.

There is already plenty of logic to deal with this scenario. I think the problem here is that the AWS cloudprovider implementation doesn't use it. I've put some details in #1996 (comment).

@mvisonneau
Author

Indeed @MaciekPytel, I did eventually end up with this issue :( I'll try to have a closer look following your comments.

@mvisonneau mvisonneau closed this May 8, 2019
@Jeffwan
Contributor

Jeffwan commented May 10, 2019

> There is already plenty of logic to deal with this scenario. I think the problem here is that AWS cloudprovider implementation doesn't use it. I've put some details in #1996 (comment).

Hi @MaciekPytel, where can I find the logic you mentioned? I'd love to add it on the AWS cloudprovider side.

@MaciekPytel
Contributor

The implementation is already in progress in #2008 - let's continue discussion there.

Labels
area/cluster-autoscaler cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
5 participants