cluster-autoscaler: Move DisableScaleDownForLoop to the scaleUp loop #3547
Conversation
@@ -432,6 +432,7 @@ func (a *StaticAutoscaler) RunOnce(currentTime time.Time) errors.AutoscalerError
 		a.lastScaleUpTime = currentTime
 		// No scale down in this iteration.
 		scaleDownStatus.Result = status.ScaleDownInCooldown
+		a.processorCallbacks.DisableScaleDownForLoop()
This call does nothing. The return one line below already means that CA will not run scale-down logic (it will exit the function before getting to it).
The intent of this DisableScaleDownForLoop call is not to prevent scale-down in the loop where scale-up happened (that already happens by returning from the function early), but to make CA wait until all pods that can possibly be scheduled are scheduled before attempting any scale-down.
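To make that point concrete, here is a toy model of the control flow (all names are illustrative, not the actual cluster-autoscaler code): once a scale-up has happened in an iteration, the early return means a DisableScaleDownForLoop call placed just before it can never affect anything.

```go
package main

import "fmt"

// Toy model of the RunOnce control flow discussed above. All names are
// illustrative, not the actual cluster-autoscaler API.
type callbacks struct{ scaleDownForbidden bool }

func (c *callbacks) DisableScaleDownForLoop() { c.scaleDownForbidden = true }

func runOnce(scaleUpHappened bool, cb *callbacks) string {
	if scaleUpHappened {
		// No scale down in this iteration.
		// Calling cb.DisableScaleDownForLoop() here would be redundant:
		// the return below already skips the scale-down code path.
		return "ScaleDownInCooldown"
	}
	if cb.scaleDownForbidden {
		return "ScaleDownForbidden"
	}
	return "running scale-down logic"
}

func main() {
	cb := &callbacks{}
	fmt.Println(runOnce(true, cb))  // scale-up iteration: scale-down code is never reached
	fmt.Println(runOnce(false, cb)) // next iteration: scale-down can run again
}
```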
This is (was?) needed because scale-down logic doesn't currently take into account pending pods. Consider an example:
- node A is not running any pods
- there are pending pods x and y that could schedule on A, but haven't been scheduled yet
Unfortunately scale-down logic isn't aware of pods x and y, and it could delete node A. Next loop a scale-up will be triggered for x and y, which would create node A', which would soon be deleted, and so forth. This isn't much of a problem if you use the default 10-minute scale-down-unneeded-time (as the pods are likely to schedule within those 10 minutes), but it may be a problem for more aggressive scale-down configurations.
I'm certain the above is true in 1.17 and older. I suspect the issue may have been fixed as a side effect of the scheduler framework migration in 1.18 (details: filter_out_schedulable applies its 'scheduling' to the cluster snapshot, which may implicitly make scale-down logic consider those pods?). If that's true and the underlying problem was fixed, maybe we can just remove this DisableScaleDownForLoop completely? However, I'd like to see some convincing analysis + testing results that show that it is indeed safe to do so before accepting the change.
A technical side note: I don't think that hitting max-nodes would block the scale-down by itself - only pods that can schedule on existing nodes should block it. If none of the pending pods can be scheduled and you can't scale up because of max-nodes, you should still be able to scale down (if no pods can schedule, no pods will be removed in filter_out_schedulable and so DisableScaleDownForLoop won't be called).
What I think is happening is that you probably have a lot of pods that really are unschedulable and the scheduler tries to schedule those over and over, which starves the few pods that are actually schedulable. If new pods keep being created (as seems to be the case in your scenario) there may always be some pods that are technically schedulable and blocking scale-down.
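A rough sketch of that relationship, with made-up types and a trivial CPU-only fit check standing in for the real scheduler predicates: DisableScaleDownForLoop fires only when filter_out_schedulable actually filtered something out.

```go
package main

import "fmt"

// Illustrative sketch (not the actual filter_out_schedulable code) of the
// behaviour described above: scale-down is blocked for the loop only when at
// least one pending pod is found to fit on an existing (or upcoming) node.
type pod struct {
	name       string
	cpuRequest int64 // millicores
}

type node struct {
	name    string
	freeCPU int64 // allocatable CPU left, millicores
}

type processorCallbacks struct{ scaleDownForbidden bool }

func (p *processorCallbacks) DisableScaleDownForLoop() { p.scaleDownForbidden = true }

// filterOutSchedulable drops pods that fit on some node in the snapshot and
// returns the ones that are still unschedulable. If anything was filtered
// out, it asks CA to skip scale-down this loop, because the scheduler still
// has catching up to do.
func filterOutSchedulable(pending []pod, snapshot []node, cb *processorCallbacks) []pod {
	var stillUnschedulable []pod
	for _, p := range pending {
		fits := false
		for i := range snapshot {
			if snapshot[i].freeCPU >= p.cpuRequest {
				snapshot[i].freeCPU -= p.cpuRequest // "schedule" the pod into the snapshot
				fits = true
				break
			}
		}
		if !fits {
			stillUnschedulable = append(stillUnschedulable, p)
		}
	}
	if len(stillUnschedulable) < len(pending) {
		cb.DisableScaleDownForLoop()
	}
	return stillUnschedulable
}

func main() {
	cb := &processorCallbacks{}
	pending := []pod{{name: "x", cpuRequest: 500}, {name: "y", cpuRequest: 4000}}
	snapshot := []node{{name: "A", freeCPU: 1000}}
	left := filterOutSchedulable(pending, snapshot, cb)
	// Pod x fits on node A, so scale-down is blocked for this loop;
	// pod y stays unschedulable and would trigger a scale-up instead.
	fmt.Println(len(left), cb.scaleDownForbidden) // 1 true
}
```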
> This call does nothing. The return one line below already means that CA will not run scale-down logic (it will exit the function before getting to it).

Indeed, I totally missed the return below 🤦

> Unfortunately scale-down logic isn't aware of pods x and y, and it could delete node A. Next loop a scale-up will be triggered for x and y, which would create node A'

Should this still be the case if we have a way to get the list of pods that received a node recently?

> However, I'd like to see some convincing analysis + testing results that show that it is indeed safe to do so before accepting the change.

I will check that out fully; my understanding was based on the previous implementation of the scheduling. For example, there used to be a way to detect pods that had an upscaled node upcoming. I will write up my findings, if any.

> What I think is happening is that you probably have a lot of pods that really are unschedulable and the scheduler tries to schedule those over and over, which starves the few pods that are actually schedulable. If new pods keep being created (as seems to be the case in your scenario) there may always be some pods that are technically schedulable and blocking scale-down.

We've had a few issues with exactly this, but no nodes were actually able to host those pods. That put us in a position where we had scheduling pressure from existing pending pods, plus additional new pods not finding a home that would normally trigger a scale-up, with the cluster full and no scale-down in sight. This can also happen with a "scale down"-only autoscaler (low max-nodes). I will try to get some graphs illustrating this.
> Should this still be the case if we have a way to get the list of pods that received a node recently?

What I mean is a scenario where the pods haven't been scheduled yet. A sufficiently large number of "really unschedulable" pods can starve "previously unschedulable, but now schedulable" pods in the scheduler queue. So CA may delete a node that is needed before the scheduler puts anything on it.

> For example, there used to be a way to detect pods that had an upscaled node upcoming.

There was and there still is: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/core/static_autoscaler.go#L382. However, that logic was always linked to scale-up. Scale-down logic isn't aware of ongoing scale-up, and the way to handle that was always to block scale-down until all scale-up activity has finished (I'm not going to claim this is a great way to handle it, but the fact is that right now the scale-down implementation is not nearly as robust as scale-up).
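For readers unfamiliar with that code path, here's a rough, self-contained sketch of the idea (made-up names, not the linked implementation): before treating a pending pod as unschedulable, upcoming nodes from an earlier scale-up are also considered.

```go
package main

import "fmt"

// Rough sketch, with made-up names, of the idea linked above: the scale-up
// path knows about "upcoming" nodes, i.e. nodes requested by an earlier
// scale-up that haven't registered yet, and takes them into account when
// deciding whether a pending pod is truly unschedulable. The scale-down path
// has no equivalent awareness, which is why scale-down is simply blocked
// while scale-up activity is in flight.
type nodeInfo struct {
	name    string
	freeCPU int64 // millicores
}

// upcomingNodeInfos fabricates template-based node infos for nodes that were
// already requested from the cloud provider but are not ready yet.
func upcomingNodeInfos(templateCPU int64, count int) []nodeInfo {
	infos := make([]nodeInfo, 0, count)
	for i := 0; i < count; i++ {
		infos = append(infos, nodeInfo{name: fmt.Sprintf("template-node-%d", i), freeCPU: templateCPU})
	}
	return infos
}

func podFits(cpuRequest int64, ready, upcoming []nodeInfo) bool {
	for _, n := range append(ready, upcoming...) {
		if n.freeCPU >= cpuRequest {
			return true
		}
	}
	return false
}

func main() {
	ready := []nodeInfo{{name: "node-a", freeCPU: 200}}
	upcoming := upcomingNodeInfos(4000, 2) // two nodes already requested, not ready yet
	// The pod doesn't fit on any ready node, but it does fit on an upcoming
	// one, so it shouldn't trigger yet another scale-up.
	fmt.Println(podFits(1000, ready, upcoming)) // true
}
```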
> We've had a few issues with exactly this, but no nodes were actually able to host those pods.

Interesting. Can you check CA logs for "Pod %s.%s marked as unschedulable can be scheduled on node %s" lines and verify those pods really can't fit on those nodes? Note that this includes a scenario where a pod could fit on an upcoming node (i.e. one that is already starting up).

I would expect that the scale-down should eventually happen with the current logic if the cluster remained static for long enough (I know this doesn't really help your case). If this is not the case and CA is actually wrong when it thinks those pending pods should be schedulable, that would be a much more serious issue - CA disagreeing with the scheduler can prevent scale-up or lead to incorrect scale-up, etc.
I guess what I'm saying is that the scale-down logic was not implemented to handle a cluster with constant churn. The assumption was that scale-down will only happen after all scale-up and scheduling are done, and both the code and all tests were written with that implicit assumption. I think that a side effect of migrating to the scheduler framework in 1.18 may be that scale-down can now handle this better, but I don't want to remove the protection of DisableScaleDownForLoop without some proof that this is indeed the case.
> Interesting. Can you check CA logs for "Pod %s.%s marked as unschedulable can be scheduled on node %s" lines and verify those pods really can't fit on those nodes? Note that this includes a scenario where a pod could fit on an upcoming node (i.e. one that is already starting up).

Couldn't find those logs going back a month, unfortunately.

> I would expect that the scale-down should eventually happen with the current logic if the cluster remained static for long enough (I know this doesn't really help your case)

I tried to reproduce on a smaller, less churny cluster, though with a constant stream of pending pods to simulate the churn we're usually experiencing. And it is working as expected (without removing DisableScaleDownForLoop). There is a DisableScaleDown period, but after a little while things start to scale down properly. I will start testing on a bigger cluster, while forcing max-nodes, to see if the situation there is still the same.

That patch is actually a forward port of a previous custom version, pre-scheduler 1.18, that we patched. So, on the flip side of what you said, this patch might actually not be necessary anymore. I will test that further with the latest master on a 3000-node cluster to be certain that it works as expected.

> If this is not the case and CA is actually wrong when it thinks those pending pods should be schedulable, that would be a much more serious issue - CA disagreeing with the scheduler can prevent scale-up or lead to incorrect scale-up, etc.

Considering that we have a few modifications of our own to deal with local-volumes and other things, that could be what is actually happening.

> I don't want to remove the protection of DisableScaleDownForLoop without some proof that this is indeed the case.

Totally understandable, and very sorry for wasting your time if I can't find it. It was, however, very instructive on how scale-up and scale-down work, so thank you very much for that.
No problem. I've seen the same problem that you have a few times already (with a few different causes), so I know it is a very real issue. Ideally we should just rewrite scale-down logic to be able to deal with all those cases, but I don't think it's likely to happen anytime soon.
I haven't been able to reproduce it. Thanks for all the help and the new understanding of the intricacies of it all. 🙇
When running the autoscaler with `--max-nodes-total` set and that max is reached, downscale is disabled whenever new unschedulable pods are added, preventing the removal of nodes. If pods are created very frequently, by let's say a cronjob/job, the cluster will never be able to downscale, despite the fact that there could be a certain number of unneeded nodes that could be removed to make room for needed ones. In order to fix that behavior, I moved the downscale lock from the processor to the place where scale-up actually happens.
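To summarize the intent of the change (a toy comparison only, not the actual diff; see the reviewer's comments above on why the existing protection exists), here is a sketch of the two behaviours with illustrative names:

```go
package main

import "fmt"

// Toy comparison of the two behaviours discussed in this PR; all names are
// illustrative. "Current": any pending pod that looks schedulable blocks
// scale-down for the loop, so a constant stream of such pods can block it
// indefinitely. "Proposed": scale-down is only blocked in iterations where a
// scale-up actually happened, which can no longer occur once
// --max-nodes-total has been reached.
func scaleDownBlocked(lockOnScaleUpOnly, pendingPodFits, scaleUpHappened bool) bool {
	if lockOnScaleUpOnly {
		return scaleUpHappened
	}
	return pendingPodFits || scaleUpHappened
}

func main() {
	// Cluster is at max-nodes-total and a cronjob keeps creating pods, so no
	// scale-up can happen, but some pending pod always "fits" somewhere.
	pendingPodFits, scaleUpHappened := true, false
	fmt.Println("current behaviour blocks scale-down:",
		scaleDownBlocked(false, pendingPodFits, scaleUpHappened)) // true
	fmt.Println("proposed behaviour blocks scale-down:",
		scaleDownBlocked(true, pendingPodFits, scaleUpHappened)) // false
}
```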