MachineDeployment.spec.replicas defaulting should take into account autoscaler min/max size if defined #7293
A workaround for this is to follow the pattern in https://github.com/kubernetes-sigs/cluster-api/issues/link but set the replicas field in the MachineDeployment, where you're setting the autoscaler annotations, to whatever initial value you'd prefer. The issue here is that the cluster topology controller will try to manage the number of replicas if the value is set in the Cluster object. If the annotations etc. are set, the autoscaler will try to do the same, so you end up with two controllers setting the same field, which won't work. If you set the replicas field in the MachineDeployment when you create your cluster, future control will be left up to the autoscaler only.
@killianmuldoon Thanks for replying. Is the above link unavailable? I got a blank page from the issues search. Do you mean that we can set replicas=3 directly on the MachineDeployment when creating a cluster and setting the autoscaler annotations?
Then we can apply both the Cluster and MachineDeployment YAML on the mgmt cluster.
Apologies - the correct link is: #5442 (comment)
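For reference, a minimal sketch of the workaround described above: a MachineDeployment with the replicas field set explicitly alongside the autoscaler annotations. Names and sizes here are illustrative; the annotation keys are the ones the cluster-autoscaler clusterapi provider reads, and the rest of the spec is elided.

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: my-cluster-md-0        # illustrative name
  namespace: default
  annotations:
    # Boundaries for the autoscaler; it will not scale beyond them.
    cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size: "3"
    cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size: "10"
spec:
  clusterName: my-cluster
  # Initial value chosen by the user; after creation the autoscaler
  # manages this field, so the topology controller must not enforce it.
  replicas: 3
  # selector/template omitted for brevity
```

Note that this only works if the replicas value is not also set in Cluster.spec.topology, otherwise two controllers fight over the same field.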
I think almost. The idea was that the annotations should be set via Cluster.spec.topology... Unfortunately that's currently not supported, but there is an issue for it and we're working on it: #7006
The current sequence for working around this issue is to:
As @sbueringer said, there is an open issue for setting the annotations through ClusterClass. I don't think that will solve the issue with setting the initial replicas field, however, as the annotations can only manage min and max size.
I would expect that it's fine if the replica count is 1 initially, as the autoscaler should scale up to the min-size automatically (? just guessing)
@sbueringer I think I also need the feature in #7006 if I create a workload cluster with autoscaler annotations (cluster.k8s.io/cluster-api-autoscaler-node-group-max-size) in the MachineDeployment on day 0. Otherwise, users have to edit the MachineDeployment to add the annotations manually.
Correct. We're working on it and I would currently expect this to land in Cluster API v1.3.0.
If I got it right we can achieve the requested behavior (autoscaler controlling a MD, that should start with 3 replicas) by
@elmiko to confirm if you have some time. WRT the second point: as of today it is not possible when using CC, but this is already being tracked in #7006, and it is somewhat related to the ongoing discussion on label propagation that we are trying to address in the 1.3 release. If all this is confirmed, I would suggest closing this issue as a duplicate and continuing the work in #7006
hmm, this is a good scenario and we need to be careful about it. if the minimum size for the node group is "3" then we need to make sure that the initial replicas is also set to "3" on the MachineDeployment. the autoscaler will not try to bring a node group to its minimum, these values are just boundaries that the autoscaler will not go beyond. if possible, i think we need the cluster class logic to be smart enough that when it sees a min-size annotation it sets the initial replicas accordingly.
That's a bit hard to do in the Cluster topology reconciler, as it usually either always reconciles a field to a specific value or not at all. But always doesn't work, as it would continuously overwrite the autoscaler. Q: Just saw that those labels are deprecated in the autoscaler (https://github.com/kubernetes/autoscaler/blob/e478ee29596f541bef3d783843532d4fe64d9a48/cluster-autoscaler/cloudprovider/clusterapi/clusterapi_utils.go#L31-L32). Is there a replacement?
I don't have context on the reason behind the current behaviour, but IMO this seems like ignoring the min-size boundary
yeah... that does sound complicated. it makes me wonder if we need to add something to that API to allow for the autoscaler? (i'm not super fond of that idea as it's very specific to one cluster addon, but i feel compelled to ask)
unfortunately this is another error on my part, when merging the scale from zero support i should have removed those constants as we use a more dynamic method to determine the annotation values. i will remove them.
this is long standing behavior for the autoscaler, i can offer a link to the autoscaler FAQ, and my thoughts on the topic: i believe that the autoscaler is designed to work in a very simple and predictable manner, if it observes pending pods it will attempt to make more nodes, if nodes are under utilized it will remove them. creating more nodes to reach the minimum or reducing nodes when over the maximum are behaviors that the autoscaler has not historically performed. i'm guessing this would be a large change for the autoscaler users and could also have strange effects in places where people might manually adjust the size of their node groups. i am certainly willing to bring this up with the autoscaling sig, but i have a feeling they would not want to change this behavior.
Ah thx, so the annotations are not deprecated, just the local constants. I'm not sure if it's acceptable, but we could adjust the behavior of the MachineDeployment webhook. As of today it always defaults replicas to 1 if replicas is not set. We could default to the value of the min-size annotation if the annotation is present. It would work like this:
Really not sure whether that's too hacky
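A minimal sketch of that defaulting idea, assuming a hypothetical helper rather than the actual webhook code. The annotation key matches the one consumed by the cluster-autoscaler clusterapi provider; everything else is illustrative:

```go
package main

import (
	"fmt"
	"strconv"
)

// Annotation key read by the cluster-autoscaler clusterapi provider
// for the node-group minimum size.
const minSizeAnnotation = "cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size"

// defaultReplicas sketches the proposed webhook behavior: an explicitly
// set replicas value always wins; otherwise fall back to the min-size
// annotation if present; finally to the current default of 1.
func defaultReplicas(replicas *int32, annotations map[string]string) int32 {
	if replicas != nil {
		return *replicas
	}
	if v, ok := annotations[minSizeAnnotation]; ok {
		if n, err := strconv.ParseInt(v, 10, 32); err == nil {
			return int32(n)
		}
	}
	return 1
}

func main() {
	fmt.Println(defaultReplicas(nil, map[string]string{minSizeAnnotation: "3"})) // prints 3
	fmt.Println(defaultReplicas(nil, nil))                                       // prints 1
}
```

An unparsable annotation value simply falls through to the historical default, so the sketch never rejects an object, it only changes what the default is.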
yeah, and kubernetes/autoscaler#5222 =)
that sounds like it would work, i'm not sure about how hacky it might be. i guess another option might be to create explicit values in the Cluster API for the autoscaler behavior, but i'm not sure that would be any more acceptable than what you have proposed.
@elmiko thanks for the detailed explanation. I think that @sbueringer's idea makes sense; the one downside is that we are embedding in CAPI something autoscaler-specific while instead we should be agnostic, but I think we can make an exception in this case
+1, i think the desire to make this agnostic is good, especially since there might be situations in the future where people want to run different autoscalers (e.g. someone mentioned karpenter with cluster-api).
One way to do this is to use a different additional annotation (something that just expresses "default replicas")
If we're adding an annotation, would it be better to add something even more generic - e.g. one that signals scaling is handled externally. That way the flow feels more natural to me, i.e. replicas are set in the replicas field but scaling is no longer handled by the topology controller. It feels confusing to have two places to set the replica count.
The problem is that if we only set it on create, a subsequent update would remove the field again, which again leads to the default value being set. This would only work if the object is created/modified in exactly this order:
If between 1. and 3. the autoscaler doesn't take ownership, 3. should clear the field, which leads to the MD webhook setting the default again
I think there are a few ways to work around that in the implementation, e.g. the webhook could check for the annotation and not default if it's there. I'm wondering, though, what the cleanest version of the API is for autoscaling, assuming we want one way to support multiple types of external scaling.
i wonder if it makes sense to have some sort of field or annotation that shows the MD is being managed by something else?
/assign |
/triage accepted |
Summary of the comments above:
Use cases:
1. When a new MD is created, it will be created with the min and max size annotations to enable the autoscaler. Today this doesn't always work, as replicas is always defaulted to 1 and 1 might not be in the (min-size, max-size) range (and because of that the autoscaler doesn't act).
2. An existing MD which initially wasn't controlled by the autoscaler should later be controlled by the autoscaler. A MD is created and initially the replicas field is controlled by the Cluster topology reconciler (it is enforcing the replicas value from Cluster). The control of the replicas field of this MD should now be handed over to the autoscaler. Today this doesn't work well, because when the topology reconciler unsets the replicas field it is defaulted to 1. This is disruptive if the MD currently has more than 1 replica (and also 1 might not be in the (min-size, max-size) range).

To summarize, I think we should provide a way to control to which value the replicas field is defaulted. I would propose the following behavior for the defaulting webhook:
(POC: #7990) I think this is a good trade-off between explicitly supporting the autoscaler as well as we can, by steering the replicas value towards the (min-size, max-size) range, while still allowing folks to set a specific default value with the dedicated annotation.
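The "steering" idea in the summary can be sketched as a small clamp: when the webhook has to pick a default, it keeps the value inside the autoscaler's boundaries so the autoscaler will actually act on the MD. This is a conceptual illustration, not the POC's code:

```go
package main

import "fmt"

// steerIntoRange clamps a candidate default into the autoscaler's
// [minSize, maxSize] boundaries: values outside the range would cause
// the autoscaler to ignore the node group entirely.
func steerIntoRange(candidate, minSize, maxSize int32) int32 {
	if candidate < minSize {
		return minSize
	}
	if candidate > maxSize {
		return maxSize
	}
	return candidate
}

func main() {
	fmt.Println(steerIntoRange(1, 3, 10))  // prints 3: the old default of 1 is lifted to min-size
	fmt.Println(steerIntoRange(12, 3, 10)) // prints 10: capped at max-size
	fmt.Println(steerIntoRange(5, 3, 10))  // prints 5: already in range, unchanged
}
```

Note this steering only applies while defaulting; per the discussion above, an explicitly set replicas value is never overwritten, even if it lies outside the range.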
Thanks for the summary! While documenting this we should somehow surface that:
/retitle MachineDeployment.spec.replicas defaulting should take into account autoscaler min/max size if defined
excellent summary and suggestion @sbueringer, i tend to agree with @fabriziopandini and i'm +1 as well. but, i also agree about:
we should be careful about how we release and document this because it will be a different behavior than users currently expect from the autoscaler. but, since this behavior is originating from the cluster-api side of the tooling, i think it will be ok as long as we are clear about it. we should probably add a note to the autoscaler docs as well.
Let's bring this up in the office hours for better visibility
Does the scope of this effort include enforcing cluster-autoscaler min/max when the replicas are updated? For example, if the active CA min value is set to 3, and a user manually sets the replica count to 1, (IMO) CAPI should reject that new, desired configuration. Otherwise we are just adding operational thrashing (cluster-autoscaler will observe that and then "re-enforce" the minimum by scaling back up to 3). @elmiko does that sound right?
No.
I think it would not. If the replicas field is explicitly set to something outside the (min-size, max-size) range, the autoscaler doesn't act (which is expected behavior of the autoscaler)
(from office hours) Regarding the MachinePool annotation:
Essentially the annotation hands off the reconciliation of the replicas field to an external system; MachinePools then only observe those changes. I think this is different from what we propose here. In our case we only change the behavior of the defaulting of the replicas field. The MachineDeployment controller won't be changed and will remain responsible for actually reconciling those replicas.
(from office hours) Regarding: "Should we do the same for MachineSets?" I think yes. If you read my comment above with s/MD/MS/, the reasoning makes sense as well. The only difference is that the Cluster topology reconciler is not relevant. But if, for example, another controller (GitOps, ...) is creating/updating MachineSets, we run into the same issues.
I think we should keep the first iteration focused on MD only, and then open an issue for MS.
I think we have two options:
Not sure if the latter is confusing to users, but I'm fine with both options
in our case, if the min for an MD is 3 and the user sets it to 1, the cluster autoscaler will see that it is below the minimum and will not take any action on that node group. so, if we want to add this extra assurance, then we could have the CAPI controller reject a user's request to set the replicas outside the scaling limits. this does make me pause though because it is not in line with how the cluster autoscaler works currently. what if a user needs to scale an MD down to 1 replica for some unrelated event, they would need to drop the autoscaler annotations, or pause the autoscaler, or perhaps mark the nodes so that the autoscaler ignores them. in theory, i like the idea of the capi controllers helping a user to keep the replicas in check, i'm just worried about changing the expected behavior and ensuring that we broadcast the change.
I agree. If possible, I would like to keep potential changes in that regard out of the current discussion and treat it as a separate issue/discussion.
Thanks for the explanation, I was naively describing cluster-autoscaler behaviors that are not actually how cluster-autoscaler behaves!
What steps did you take and what happened:
[A clear and concise description on how to REPRODUCE the bug.]
Kubernetes version v1.23.8
cluster-api version v1.1.5
capa version v1.2.0
cluster-autoscaler v1.20.0
I deployed an autoscaler deployment against a workload cluster in the mgmt cluster. When I apply a pod with large resource requests in the workload cluster, the pod goes into Pending status and, in this scenario, the autoscaler kicks in. I found that the replicas of the MachineSet changed to 2, but the ready count stayed at 1. I also checked the MachineSet's description: the MachineSet controller repeatedly created and deleted the machine.
What did you expect to happen:
I found a workaround in https://github.com/kubernetes-sigs/cluster-api/issues/5442#issuecomment-1190713981. It suggests simply unsetting the replicas of the MachineDeployments. But if I unset replicas, it will use the default value replicas=1. I want to not only set up a cluster with replicas=3 but also make the autoscaler work. How can I do it?
The cluster YAML looks like this:
Anything else you would like to add:
[Miscellaneous information that will assist in solving the issue.]
The actions of the MachineSet controller: