Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cluster-autoscaler: Move DisableScaleDownForLoop to the scaleUp loop #3547

Closed
wants to merge 1 commit into from

Conversation

nyodas
Copy link

@nyodas nyodas commented Sep 28, 2020

When running the autoscaler with --max-nodes-total set and that max is reached, downscale is disabled when new unschedulable pods are added.

Preventing the removal of nodes.
If pods are created very frequently by let's say by a cronjob/job, the cluster will never be able to downscale.
Despite the fact that you could have a certain amount of unneeded node that could be removed to make some space for needed nodes.

In order to fix that behavior i moved the downscale lock from the process to the actual place where scale-up happens.

@k8s-ci-robot
Copy link
Contributor

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


  • If you've already signed a CLA, it's possible we don't have your GitHub username or you're using a different email address. Check your existing CLA data and verify that your email is set on your git commits.
  • If you signed the CLA as a corporation, please sign in with your organization's credentials at https://identity.linuxfoundation.org/projects/cncf to be authorized.
  • If you have done the above and are still having issues with the CLA being reported as unsigned, please log a ticket with the Linux Foundation Helpdesk: https://support.linuxfoundation.org/
  • Should you encounter any issues with the Linux Foundation Helpdesk, send a message to the backup e-mail support address at: [email protected]

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Sep 28, 2020
@k8s-ci-robot
Copy link
Contributor

Welcome @nyodas!

It looks like this is your first PR to kubernetes/autoscaler 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/autoscaler has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Sep 28, 2020
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign feiskyer
You can assign the PR to them by writing /assign @feiskyer in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@nyodas
Copy link
Author

nyodas commented Sep 28, 2020

cla signed.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Sep 28, 2020
@@ -432,6 +432,7 @@ func (a *StaticAutoscaler) RunOnce(currentTime time.Time) errors.AutoscalerError
a.lastScaleUpTime = currentTime
// No scale down in this iteration.
scaleDownStatus.Result = status.ScaleDownInCooldown
a.processorCallbacks.DisableScaleDownForLoop()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This call does nothing. The return one line below already means that CA will not run scale-down logic (it will exit the function before getting to it).

The intent of this DisableScaleDownForLoop call is not to prevent scale-down in the loop where scale-up happened (this already happens by returning from function early), but to make CA wait until all pods that can possibly be scheduled are scheduled before attempting any scale-down.

This is (was?) needed because scale-down logic doesn't currently take into account pending pods. Consider an example:

  • node A is not running any pods
  • there are pending pods x and y that could schedule on A, but haven't been scheduled yet

Unfortunately scale-down logic isn't aware of pods x and y and it could delete node A. Next loop a scale-up will be triggered for x and y that would create node A', which would soon be deleted and so forth. This isn't much of a problem if you use default 10 minute scale-down-unneeded-time (as the pods are likely to schedule in those 10 minutes), but it may be a problem for more aggressive scale-down configurations.

I'm certain the above is true in 1.17 and older. I suspect the issue may have been fixed as a side effect of scheduler framework migration in 1.18 (details: filter_out_schedulable applies its 'scheduling' to cluster snapshot, which may implicitly make scale-down logic consider those pods?). If that's true and the underlying problem was fixed maybe we can just remove this DisableScaleDownForLoop completely? However, I'd like to see some convincing analysis + testing result that show that it is indeed safe to do so before accepting the change.


A technical side note: I don't think that hitting max-nodes would block the scale-down by itself - only pods that can schedule on existing nodes should block it. If all the pending pods can't be scheduled and you can't scale-up because of max-nodes you should still be able to scale-down (if no pods can schedule no pods will be removed in filter_out_schedulable and so DisableScaleDownForLoop won't be called).

What I think is happening is that you probably have a lot of pods that really are unschedulable and scheduler tries to schedule those over and over which starves the few pods that are actually schedulable. If new pods keep being created (as seem to be the case in your scenario) there may always be some pods that are technically schedulable and blocking scale-down.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This call does nothing. The return one line below already means that CA will not run scale-down logic (it will exit the function before getting to it).

Indeed i totally missed the return bellow 🤦

Unfortunately scale-down logic isn't aware of pods x and y and it could delete node A. Next loop a scale-up will be triggered for x and y that would create node A'

Should this be the case if we have a way to get the list of pods that received a node recently ?

However, I'd like to see some convincing analysis + testing result that show that it is indeed safe to do so before accepting the change.

I will check that out fully, my understanding was based on the previous implementation of the scheduling.
For example there used to be a way to detect pod that had an upscaled node upcoming.
Will write up my finding if any.

What I think is happening is that you probably have a lot of pods that really are unschedulable and scheduler tries to schedule those over and over which starves the few pods that are actually schedulable. If new pods keep being created (as seem to be the case in your scenario) there may always be some pods that are technically schedulable and blocking scale-down.

We've had a few issue with exactly this. But no nodes were able to actually host those pods.
Which kind of put us in a position were we had a scheduling pressure from existing pending pod and additional new pod not finding a home that would normally trigger a scale up with a cluster full and no scale down in sight.

This can also be found when having a "scale down" only autoscaler. ( Low max nodes)

I will try to get some graph illustrating this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be the case if we have a way to get the list of pods that received a node recently ?

What I mean is a scenario where the pod haven't been scheduled yet. A sufficiently large number of "really unschedulable" pods can starve "previously unschedulable, but now schedulable pods" in scheduler queue. So CA may delete a node that is needed before scheduler puts stuff on it.

For example there used to be a way to detect pod that had an upscaled node upcoming.

There was and there still is: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/core/static_autoscaler.go#L382. However, that logic was always linked to scale-up. Scale-down logic isn't aware of ongoing scale-up and the way to handle that was always to block scale-down until all scale-up activity has finished (I'm not going to claim this is a great way to handle this, but the fact is that right now scale-down implementation is not nearly as robust as scale-up).

We've had a few issue with exactly this. But no nodes were able to actually host those pods.

Interesting. Can you check CA logs for Pod %s.%s marked as unschedulable can be scheduled on node %s" lines? and verify those pods really can't fit on those nodes? Note that this includes a scenario where a pod could fit on upcoming node (ie. one that is already starting up).

I would expect that the scale-down should eventually happen with current logic if the cluster remained static for long enough (I know this doesn't really help your case). If this is not the case and CA is actually wrong when it thinks those pending pods should be schedulable that would be a much more serious issue - CA disagreeing with scheduler can prevent scale-up or lead to incorrect scale-up, etc.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess what I'm saying is that scale-down logic was not implemented to handle cluster with constant churn. The assumption was that scale-down will only happen after all scale-up and scheduling is done and both the code and all tests were written with that implicit assumption. I think that a side effect of migrating to scheduler framework in 1.18 may be that scale-down can now better handle this, but I don't want to remove the protection of DisableScaleDownForLoop without some proof that this is indeed the case.

Copy link
Author

@nyodas nyodas Sep 30, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. Can you check CA logs for Pod %s.%s marked as unschedulable can be scheduled on node %s" lines? and verify those pods really can't fit on those nodes? Note that this includes a scenario where a pod could fit on upcoming node (ie. one that is already starting up).

Couldn't find those logs going back a month unfortunately.

I would expect that the scale-down should eventually happen with current logic if the cluster remained static for long enough (I know this doesn't really help your case)

Tried to reproduce on a smaller less churny cluster, though with a constant stream of pending pods to try to simulate the churn we're usually experiencing.
And it is working as expected.( Without removing the DisableScaleDownForLoop )
There is a DisableScaleDown but after a little while things start to scale down properly.

I will start testing on a bigger cluster to see if the situation over there is still the same.While forcing the max-nodes.

That patch is actually a forward port of a previous custom version pre-scheduler 1.18 that we patched.
So on the flip side of what you said it might actually not be necessary anymore.(this patch)

I will test that further with the latest master on a 3000 nodes to be certain that it work as expected.

CA is actually wrong when it thinks those pending pods should be schedulable that would be a much more serious issue - CA disagreeing with scheduler can prevent scale-up or lead to incorrect scale-up, etc.

Considering that we have a few modification of our own to deal with local-volume and other things.
That could be what actually happen.

I don't want to remove the protection of DisableScaleDownForLoop without some proof that this is indeed the case.

Totally understandable, and very sorry for wasting your time , if i can't find it.
It was however very instructive on how scale up and down are working , so thank you very much for that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem. I've seen the same problem that you have a few times already (with a few different causes), so I know it is a very real issue. Ideally we should just rewrite scale-down logic to be able to deal with all those cases, but I don't think it's likely to happen anytime soon.

When running the autoscaler with `--max-nodes-total` set and that max is reached downscale is disabled when new unschedulable pods are added.
Preventing the removal of nodes.
If pods are created very frequently by let's say a cronjob, the cluster will never be able to downscale.
In order to fix that behavior i moved the downscale lock from the process to the actual place where scale-up happens.
@nyodas nyodas force-pushed the bob/downscale-when-full branch from 0cb27e7 to e87be3e Compare September 30, 2020 13:10
@nyodas
Copy link
Author

nyodas commented Oct 9, 2020

I haven't been able to reproduce.
I'm closing this.
Will reopen if we see something new.

Thanks for all the help and a new understanding of the intricacies of it all. 🙇

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants