
MachineDeployment.spec.replicas defaulting should take into account autoscaler min/max size if defined #7293

Closed
zhouke1991 opened this issue Sep 27, 2022 · 37 comments · Fixed by #7990
Labels
triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@zhouke1991

What steps did you take and what happened:
Kubernetes version v1.23.8
cluster-api version v1.1.5
capa version v1.2.0
cluster-autoscaler v1.20.0

I deployed an autoscaler deployment in the mgmt cluster against a workload cluster. When I apply a pod with large resource requests in the workload cluster, the pod goes into Pending status; in this scenario the autoscaler kicks in. I found that the replicas of the MachineSet changed to 2, but the ready count stayed at 1. I also checked the events of the MachineSet: the MachineSet controller was repeatedly creating and deleting the machine.

What did you expect to happen:
I found a workaround in https://github.com/kubernetes-sigs/cluster-api/issues/5442#issuecomment-1190713981, which suggests simply unsetting the replicas of the machineDeployments. But if I unset replicas, the default value replicas=1 is used. I want to both set up a cluster with replicas=3 and have the autoscaler work. How can I do that?

The cluster YAML looks like this:

spec:
  topology:
    workers:
      machineDeployments:
      - class: my-worker
        failureDomain: us-west-2a
        name: md-0
        replicas: 3

Anything else you would like to add:

The actions of the machineset controller:
[screenshot of MachineSet controller events]

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Sep 27, 2022
@killianmuldoon
Contributor

A workaround for this is to follow the pattern in https://github.com/kubernetes-sigs/cluster-api/issues/link but set the replica field in the MachineDeployment, where you're setting the autoscaler annotations, to whatever initial value you'd prefer.

The issue here is that the cluster topology controller will try to manage the number of replicas if the value is set in the Cluster object. If the annotations etc. are set the autoscaler will try to do the same, so you end up with two controllers setting the same field which won't work.

If you set the replicas field in the MachineDeployment when you create your cluster, future control will be left up to the autoscaler only.
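For illustration, a minimal sketch of what that could look like on a standalone MachineDeployment (the name and sizes below are placeholders, not taken from this issue; with a managed topology you would edit the generated MachineDeployment instead):

    apiVersion: cluster.x-k8s.io/v1beta1
    kind: MachineDeployment
    metadata:
      name: md-0
      annotations:
        cluster.k8s.io/cluster-api-autoscaler-node-group-min-size: "3"
        cluster.k8s.io/cluster-api-autoscaler-node-group-max-size: "6"
    spec:
      clusterName: my-cluster
      replicas: 3  # initial value; control then passes to the autoscaler
      # selector and template omitted for brevity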

@zhouke1991
Author

zhouke1991 commented Sep 27, 2022

A workaround for this is to follow the pattern in https://github.com/kubernetes-sigs/cluster-api/issues/link but set the replica field in the MachineDeployment, where you're setting the autoscaler annotations, to whatever initial value you'd prefer.

The issue here is that the cluster topology controller will try to manage the number of replicas if the value is set in the Cluster object. If the annotations etc. are set the autoscaler will try to do the same, so you end up with two controllers setting the same field which won't work.

If you set the replicas field in the MachineDeployment when you create your cluster, future control will be left up to the autoscaler only.

@killianmuldoon Thanks for replying. Is the above link unavailable? I get a blank page in the issues search. Do you mean that we can set replicas=3 directly on the machinedeployment when creating a cluster and setting the autoscaler annotations?

cluster.k8s.io/cluster-api-autoscaler-node-group-max-size: "6"
cluster.k8s.io/cluster-api-autoscaler-node-group-min-size: "3"

Then we can apply both the cluster and machinedeployment YAML on the mgmt cluster.
So this issue can't be avoided if I provide the cluster YAML only?

@killianmuldoon
Contributor

Apologies - the correct link is: #5442 (comment)

@sbueringer
Member

sbueringer commented Sep 27, 2022

I think almost. I think the idea was that the annotations should be set via Cluster.spec.topology...

Unfortunately that's currently not supported, but there is an issue for it and we're working on it: #7006
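Once that lands, the idea would presumably be something like the following (a sketch only; the exact field layout was still under discussion at the time):

    spec:
      topology:
        workers:
          machineDeployments:
          - class: my-worker
            name: md-0
            metadata:
              annotations:
                cluster.k8s.io/cluster-api-autoscaler-node-group-min-size: "3"
                cluster.k8s.io/cluster-api-autoscaler-node-group-max-size: "6"
            # no replicas field, so the autoscaler can own it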

@killianmuldoon
Contributor

The current sequence for working around this issue is to:

  1. Create the cluster object
  2. The Cluster topology controller will create a MachineDeployment with no autoscaler annotations and a default replica number (i.e. 1)
  3. Edit the MachineDeployment to add the annotations and the desired replica count

As @sbueringer said, there is an open issue for setting the annotations through ClusterClass. I don't think that will solve the issue with setting the initial replica field, however, as the annotations can only manage the min and max size.
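Concretely, step 3 above could be done with kubectl (the generated MachineDeployment name is cluster-specific; my-cluster-md-0 below is a placeholder; MachineDeployment exposes the scale subresource, so kubectl scale works here):

    kubectl annotate machinedeployment my-cluster-md-0 \
      cluster.k8s.io/cluster-api-autoscaler-node-group-min-size="3" \
      cluster.k8s.io/cluster-api-autoscaler-node-group-max-size="6"
    kubectl scale machinedeployment my-cluster-md-0 --replicas=3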

@sbueringer
Member

sbueringer commented Sep 27, 2022

I would expect that it's fine if the replica count is 1 initially, as the autoscaler should scale up to the min-size automatically (? just guessing)

@zhouke1991
Author

zhouke1991 commented Sep 28, 2022

I think almost. I think the idea was that the annotations should be set via Cluster.spec.topology...

Unfortunately that's currently not supported, but there is an issue for it and we're working on it: #7006

@sbueringer I think I also need the feature in #7006 if I want to create a workload cluster with the autoscaler annotations (cluster.k8s.io/cluster-api-autoscaler-node-group-max-size) on the machinedeployment on day 0. Otherwise, users have to edit the MachineDeployment to add the annotations manually.

@sbueringer
Member

@zhouke1991

Correct. We're working on it and I would currently expect this to land in Cluster API v1.3.0.

@fabriziopandini
Member

fabriziopandini commented Sep 29, 2022

If I got it right, we can achieve the requested behavior (the autoscaler controlling an MD that should start with 3 replicas) by:

  1. Setting replicas to nil on the MachineDeployments
  2. Adding cluster.k8s.io/cluster-api-autoscaler-node-group-min-size: "3" on the machine deployment

@elmiko to confirm if you have some time

WRT the second point, as of today it is not possible when using CC, but this is already being tracked in #7006, and it is somewhat related to the ongoing discussion on label propagation that we are trying to address in the 1.3 release

If all this is confirmed, I would suggest closing this issue as a duplicate and continuing to work on #7006

@elmiko
Contributor

elmiko commented Sep 29, 2022

If I got it right, we can achieve the requested behavior (the autoscaler controlling an MD that should start with 3 replicas) by:

1. Setting replicas to nil on the MachineDeployments

2. Adding cluster.k8s.io/cluster-api-autoscaler-node-group-min-size: "3" on the machine deployment

hmm, this is a good scenario and one we need to be careful about. if the minimum size for the node group is "3", then we need to make sure the initial replicas is also set to "3" on the MachineDeployment. the autoscaler will not try to bring a node group to its minimum; these values are just boundaries that the autoscaler will not go beyond.

if possible, i think we need the cluster class logic to be smart enough that when it sees a nil for replicas and the autoscaler annotations are set, then it should use the minimum value as the replica count. alternatively we need to be very clear about how the replicas should be set, which seems like it could be confusing.

@sbueringer
Member

if possible, i think we need the cluster class logic to be smart enough that when it sees a nil for replicas and the autoscaler annotations are set, then it should use the minimum value as the replica count. alternatively we need to be very clear about how the replicas should be set, which seems like it could be confusing.

That's a bit hard to do in the Cluster topology reconciler, as it usually either always reconciles a field to a specific value or doesn't touch it at all. But "always" doesn't work here, as it would continuously overwrite the autoscaler.

Q: Just saw that those labels are deprecated in the auto scaler (https://github.com/kubernetes/autoscaler/blob/e478ee29596f541bef3d783843532d4fe64d9a48/cluster-autoscaler/cloudprovider/clusterapi/clusterapi_utils.go#L31-L32). Is there a replacement?

@fabriziopandini
Member

fabriziopandini commented Sep 29, 2022

the autoscaler will not try to bring a node group to its minimum

I don't have context on the reason behind current behaviour, but IMO this seems like ignoring cluster.k8s.io/cluster-api-autoscaler-node-group-min-size: "3". @elmiko Is there room to fix this behavior?

@elmiko
Contributor

elmiko commented Sep 29, 2022

That's a bit hard to do in the Cluster topology reconciler, as it usually either always reconciles a field to a specific value or doesn't touch it at all. But "always" doesn't work here, as it would continuously overwrite the autoscaler.

yeah... that does sound complicated. it makes me wonder if we need to add something to that API to allow for the autoscaler? (i'm not super fond of that idea as it's very specific to one cluster addon, but i feel compelled to ask)

Q: Just saw that those labels are deprecated in the auto scaler (https://github.com/kubernetes/autoscaler/blob/e478ee29596f541bef3d783843532d4fe64d9a48/cluster-autoscaler/cloudprovider/clusterapi/clusterapi_utils.go#L31-L32). Is there a replacement?

unfortunately this is another error on my part, when merging the scale from zero support i should have removed those constants as we use a more dynamic method to determine the annotation values. i will remove them.

I don't have context on the reason behind current behaviour, but IMO this seems like ignoring cluster.k8s.io/cluster-api-autoscaler-node-group-min-size: "3". @elmiko Is there room to fix this behavior?

this is long standing behavior for the autoscaler, i can offer a link to the autoscaler FAQ, and my thoughts on the topic:

i believe that the autoscaler is designed to work in a very simple and predictable manner: if it observes pending pods, it will attempt to make more nodes; if nodes are underutilized, it will remove them. creating more nodes to reach the minimum or reducing nodes when over the maximum are behaviors that the autoscaler has not historically performed. i'm guessing this would be a large change for autoscaler users and could also have strange effects in places where people manually adjust the size of their node groups. i am certainly willing to bring this up with the autoscaling sig, but i have a feeling they would not want to change this behavior.

@sbueringer
Member

sbueringer commented Sep 29, 2022

unfortunately this is another error on my part, when merging the scale from zero support i should have removed those constants as we use a more dynamic method to determine the annotation values. i will remove them.

Ah thx, so the annotations are not deprecated, just the local constants.

I'm not sure if it's acceptable but we could adjust the behavior of the MachineDeployment webhook. As of today it always defaults replicas to 1 if replicas is not set. We could default to the value of the min-size annotation if the annotation is present.

Would work like this:

  • Cluster.spec.topology...replicas is not set, min-size annotation is set (only works after #7006)
  • Cluster topology controller creates the MachineDeployment without replicas set
  • MachineDeployment webhook sets spec.replicas to the value of the min-size annotation, 1 if the annotation is not set
  • cluster-autoscaler takes over

Really not sure if that's not too hacky

@elmiko
Contributor

elmiko commented Sep 29, 2022

Ah thx, so the annotations are not deprecated, just the local constants.

yeah, and kubernetes/autoscaler#5222 =)

Would work like this:

• Cluster.spec.topology...replicas is not set, min-size annotation is set (only works after #7006)
• Cluster topology controller creates the MachineDeployment without replicas set
• MachineDeployment webhook sets spec.replicas to the value of the min-size annotation, 1 if the annotation is not set
• cluster-autoscaler takes over

Really not sure if that's not too hacky

that sounds like it would work, i'm not sure about how hacky it might be.

i guess another option might be to create explicit values in the Cluster API for the autoscaler behavior, but i'm not sure that would be any more acceptable than what you have proposed.

@fabriziopandini
Member

fabriziopandini commented Sep 29, 2022

@elmiko thanks for the detailed explanation

I think that @sbueringer's idea makes sense. The one downside is that we are embedding something autoscaler-specific in CAPI when instead we should be agnostic, but I think we can make an exception in this case.

@vincepri @enxebre opinions?

@elmiko
Contributor

elmiko commented Sep 29, 2022

I think that @sbueringer's idea makes sense. The one downside is that we are embedding something autoscaler-specific in CAPI when instead we should be agnostic, but I think we can make an exception in this case.

+1, i think the desire to make this agnostic is good, especially since there might be situations in the future where people want to run different autoscalers (e.g. someone mentioned karpenter with cluster-api).

@sbueringer
Member

sbueringer commented Sep 30, 2022

One way to do this is to use a different additional annotation (something that just expresses "default replicas")

@killianmuldoon
Contributor

If we're adding an annotation, would it be better to add something even more generic - e.g. external-scaling - allowing users to set replicas in the ClusterClass but only setting the replica count on MD creation?

That way the flow feels more natural to me, i.e. replicas are set in the replica field, but scaling is no longer handled by the topology controller. It feels confusing to have two places to set the replica count.

@sbueringer
Member

sbueringer commented Sep 30, 2022

The problem is that if we only set it on create, a subsequent update would remove the field again, which again leads to the default value being set.

This would only work if the object is created/modified in exactly this order:

  1. Cluster topology controller creates the MD with .spec.replicas being set (and thus owns .spec.replicas)
  2. autoscaler updates the MD and overwrites the .spec.replicas field (and thus takes ownership)
  3. Next Cluster topology controller reconcile updates the MD without setting .spec.replicas (and thus autoscaler keeps ownership)

If the autoscaler doesn't take ownership between 1. and 3., then 3. will clear the field, which leads to the MD webhook setting 1 as the default again.

@killianmuldoon
Contributor

If the autoscaler doesn't take ownership between 1. and 3., then 3. will clear the field, which leads to the MD webhook setting 1 as the default again.

I think there are a few ways to work around that in the implementation, e.g. the webhook could check for the annotation and not default the field if it's there. I'm wondering, though, what the cleanest version of the API for autoscaling is, assuming we want one way to support multiple types of external scaling.

@elmiko
Contributor

elmiko commented Sep 30, 2022

i wonder if it makes sense to have some sort of field or annotation that shows the MD is being managed by something else?

@abhijit-dev82
Contributor

abhijit-dev82 commented Dec 7, 2022

/assign

@fabriziopandini
Member

/triage accepted

@k8s-ci-robot k8s-ci-robot added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Dec 28, 2022
@sbueringer
Member

sbueringer commented Jan 25, 2023

Summary of the comments above:

  • As soon as replicas is set in Cluster...machineDeployments[].replicas, the Cluster topology controller will enforce this value on the MachineDeployment.
  • The autoscaler can only take over control of the replicas field if replicas is not set in the Cluster.
  • When the Cluster topology controller creates a new MachineDeployment, replicas is defaulted to 1.
  • When the replica field is unset in the Cluster, the Cluster topology controller will unset the replica field in the MachineDeployment, and this will default replicas to 1 as well.
  • The autoscaler only takes over control of the replicas field if the current MD replicas value is in the (min-size,max-size) range.

Use cases:
1. A new MD is created and replicas should be managed by the autoscaler

When a new MD is created it will be created with the min and max size annotations to enable the autoscaler.

Today this doesn't always work, as replicas is always defaulted to 1 and 1 might not be in the (min-size,max-size) range (and because of that the autoscaler doesn't act).

2. An existing MD which initially wasn't controlled by the autoscaler should be later controlled by the autoscaler

A MD is created and initially the replica field is controlled by the Cluster topology reconciler (it is enforcing the replicas value from Cluster). The control of the replicas field of this MD should now be handed over to the autoscaler.

Today this doesn't work well, because when the topology reconciler unsets the replicas field it is defaulted to 1. This is disruptive if the MD currently has more than 1 replica (and also 1 might not be in the (min-size,max-size) range).

To summarize, I think we should provide a way to control to which value the replicas field is defaulted.

I would propose the following behavior for the defaulting webhook:

  • If (new) MD replicas is set => keep the current value
  • If machinedeployment.cluster.x-k8s.io/default-replicas is set => use the annotation value
  • If autoscaler min-size and max-size annotations are set:
    • If it's a new MD => use min-size
    • If old MD replicas is < min-size => use min-size
    • If old MD replicas is > max-size => use max-size
    • If old MD replicas is >= min-size && <= max-size => keep the old MD replicas value
  • Otherwise => use 1

(POC: #7990)

I think this is a good trade-off between explicitly supporting the autoscaler as well as we can by steering the replicas value towards the (min-size,max-size) range, while still allowing folks to set a specific default value with the machinedeployment.cluster.x-k8s.io/default-replicas annotation if necessary. This annotation could be used independently of the autoscaler as well.
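For illustration, the proposed defaulting order in rough Go (a sketch of the logic only, not the actual webhook code from #7990; the function and helper names are made up):

    package defaulting

    import "strconv"

    // Annotation names discussed in this thread. The default-replicas annotation
    // is the new proposal; the min/max-size annotations come from the
    // cluster-autoscaler's Cluster API provider.
    const (
        defaultReplicasAnnotation = "machinedeployment.cluster.x-k8s.io/default-replicas"
        minSizeAnnotation         = "cluster.k8s.io/cluster-api-autoscaler-node-group-min-size"
        maxSizeAnnotation         = "cluster.k8s.io/cluster-api-autoscaler-node-group-max-size"
    )

    // defaultReplicas computes the value the webhook would default .spec.replicas
    // to. replicas is nil when the user didn't set the field; oldReplicas is nil
    // when the MachineDeployment is being created.
    func defaultReplicas(replicas, oldReplicas *int32, annotations map[string]string) int32 {
        // Explicitly set => keep the current value.
        if replicas != nil {
            return *replicas
        }
        // default-replicas annotation set => use the annotation value.
        if v, ok := intAnnotation(annotations, defaultReplicasAnnotation); ok {
            return v
        }
        // Autoscaler min/max-size annotations set => steer towards the range.
        minSize, okMin := intAnnotation(annotations, minSizeAnnotation)
        maxSize, okMax := intAnnotation(annotations, maxSizeAnnotation)
        if okMin && okMax {
            switch {
            case oldReplicas == nil: // new MD => start at min-size
                return minSize
            case *oldReplicas < minSize:
                return minSize
            case *oldReplicas > maxSize:
                return maxSize
            default: // already within the range => keep the old value
                return *oldReplicas
            }
        }
        // Otherwise fall back to today's default.
        return 1
    }

    func intAnnotation(annotations map[string]string, key string) (int32, bool) {
        if v, ok := annotations[key]; ok {
            if i, err := strconv.ParseInt(v, 10, 32); err == nil {
                return int32(i), true
            }
        }
        return 0, false
    }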

cc @enxebre @fabriziopandini

@fabriziopandini
Member

fabriziopandini commented Jan 25, 2023

Thanks for the summary!
I'm +1 to the proposed solution.

While documenting this we should somehow surface that:

  • We are providing a smoother UX for clusters using the K8s autoscaler because we are planning to use it for Cluster API scale testing, and the autoscaler currently doesn't manage MDs with a replica number outside the range defined by min/max size; machinedeployment.cluster.x-k8s.io/default-replicas provides a solution for other autoscalers if they have a similar issue.
  • The UX shortcut above (inferring replicas from min/max size) works if the K8s autoscaler is using the default Cluster API node group, which we assume is the most common use case; also in this case, machinedeployment.cluster.x-k8s.io/default-replicas provides a solution for people using custom node group names (or at least this is my suggestion for the first iteration).

/retitle MachineDeployment.spec.replicas defaulting should take into account autoscaler min/max size if defined

@k8s-ci-robot k8s-ci-robot changed the title Autoscaler can't work if set topology.workers.machineDeployments[0].replicas=1 in ClusterClass MachineDeployment.spec.replicas defaulting should take into account autoscaler min/max size if defined Jan 25, 2023
@elmiko
Contributor

elmiko commented Jan 25, 2023

excellent summary and suggestion @sbueringer, i tend to agree with @fabriziopandini and i'm +1 as well. but, i also agree about:

  • If old MD replicas is < min-size => use min-size
  • If old MD replicas is > max-size => use max-size

We are providing a smoother UX for clusters using the K8s autoscaler because we are planning to use it for Cluster API scale testing, and the autoscaler currently doesn't manage MDs with a replica number outside the range defined by min/max size; machinedeployment.cluster.x-k8s.io/default-replicas provides a solution for other autoscalers if they have a similar issue.

we should be careful about how we release and document this because it will be a different behavior than users currently expect from the autoscaler. but, since this behavior is originating from the cluster-api side of the tooling, i think it will be ok as long as we are clear about it. we should probably add a note to the autoscaler docs as well.

@fabriziopandini
Member

Let's bring this up in the office hours for better visibility

@jackfrancis
Contributor

Does the scope of this effort include enforcing cluster-autoscaler min/max when the replicas are updated?

For example, if the active CA min value is set to 3 and a user manually sets the replica count to 1, (IMO) CAPI should reject that new desired configuration. Otherwise we are just adding operational thrashing (cluster-autoscaler will observe that and then "re-enforce" the minimum by scaling back up to 3).

@elmiko does that sound right?

@sbueringer
Member

sbueringer commented Jan 26, 2023

Does the scope of this effort include enforcing cluster-autoscaler min/max when the replicas are updated?

No.

cluster-autoscaler will observe that and then "re-enforce" the minimum by scaling back up to 3

I think it would not. If the replica field is explicitly set to something outside of the (min-size,max-size) range, the autoscaler doesn't act (which is the expected behavior of the autoscaler).

@sbueringer
Member

sbueringer commented Jan 26, 2023

(from office hours) Regarding the MachinePool annotation: cluster.x-k8s.io/replicas-managed-by

It can be applied to MachinePool resources to signify that some external system is managing infrastructure scaling for that pool

Essentially, the annotation hands off the reconciliation of the replicas field to an external system; MachinePools then only observe those changes.

I think this is different from what we propose here. In our case we only change the defaulting behavior of the replicas field. The MachineDeployment controller won't be changed and will remain responsible for actually reconciling those replicas.
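For reference, that annotation looks like this on a MachinePool (the name and annotation value below are illustrative; the point is that its presence signals external replica management):

    apiVersion: cluster.x-k8s.io/v1beta1
    kind: MachinePool
    metadata:
      name: mp-0
      annotations:
        cluster.x-k8s.io/replicas-managed-by: "external-autoscaler"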

@sbueringer
Member

sbueringer commented Jan 26, 2023

(from office hours) Regarding: "Should we do the same for MachineSets?"

I think yes.

If you read my comment above with s/MD/MS/, the reasoning makes sense as well. The only difference is that the Cluster topology reconciler is not relevant. But if, for example, another controller (GitOps, ...) is creating/updating MachineSets, we run into the same issues.

@fabriziopandini
Member

I think we should keep the first iteration focused on MDs only, and then open an issue for MSs.
The only change I would make is to make the annotation name a little bit more generic so that in the future we can use it on both MDs and MSs.

@sbueringer
Member

sbueringer commented Jan 26, 2023

I think we have two options:

  • have a specific annotation each for MD and MS
  • have a generic one which could theoretically apply to every Cluster API object

Not sure if the latter is confusing to users, but I'm fine with both options.

@elmiko
Contributor

elmiko commented Jan 26, 2023

For example, if the active CA min value is set to 3 and a user manually sets the replica count to 1, (IMO) CAPI should reject that new desired configuration. Otherwise we are just adding operational thrashing (cluster-autoscaler will observe that and then "re-enforce" the minimum by scaling back up to 3).

in our case, if the min for an MD is 3 and the user sets it to 1, the cluster autoscaler will see that it is below the minimum and will not take any action on that node group.

so, if we want to add this extra assurance, then we could have the CAPI controller reject a user's request to set the replicas outside the scaling limits. this does make me pause, though, because it is not in line with how the cluster autoscaler works currently. what if a user needs to scale an MD down to 1 replica for some unrelated reason? they would need to drop the autoscaler annotations, pause the autoscaler, or perhaps mark the nodes so that the autoscaler ignores them.

in theory, i like the idea of the capi controllers helping a user keep the replicas in check; i'm just worried about changing the expected behavior and ensuring that we broadcast the change.

@sbueringer
Member

I agree. If possible, I would like to keep potential changes in that regard out of the current discussion and treat it as a separate issue/discussion.

@jackfrancis
Contributor

Thanks for the explanation, I was naively describing cluster-autoscaler behaviors that are not actually how cluster-autoscaler behaves!
