📖 Add cluster autoscaler scale from zero ux proposal #2530
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: michaelgugino. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files.
subscribing to this, thanks @michaelgugino !
> Well defined API.
>
> #### Cons
> Users can't easily update this field if they want to set values themselves for
This is just that kubectl makes it hard to update status? Personally I'd rather tackle that (e.g. kubectl plugin) vs working around it with the override, if so!
If it was just a matter of updating status, it'd be one thing. The other issue is that status is meant to be "recreatable" and if the resources are backed up/restored or migrated from one cluster to another the status will be lost.
> like that type of thing.
>
> ### Spec Field
> We can add fields to the spec of a particular object to set these items.
I note the clever ambiguity in "particular object" - were you thinking of the MachineSet/MachineDeployment? We could also do a "sidecar" object, similar to how we did the infrastructureRef. We could then also put additional autoscaling controls here, e.g.:

```yaml
kind: MachineDeploymentAutoscaling
spec:
  maxReplicas: 10
  minReplicas: 0
  scaleDownDelay: 10m
  scaleUpDelay: 0s
status:
  nodeShape:
    memory: 1Gi
    gpus: 1
  labels:
    foo: bar
```
I'm not sure where we're currently putting minReplicas / maxReplicas? Also it's a little awkward because I would imagine I'd want to set these on the MachineDeployment level, but I'd imagine node-attributes should technically be on the MachineSet level.
One out of the box idea I had that seems related to this one was removing the integration between the cluster autoscaler and the cluster-api bits.
We could come up with a new 'external scale group' object. The autoscaler increments or decrements that object, and something on the cluster-api watches that object and takes the appropriate actions. This would allow people to build their own automation after the recommendations of the autoscaler, but it's not entirely clear what this should look like.
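To make the idea concrete, here is a rough sketch of what such an "external scale group" object could look like. The kind, group/version, and field names below are purely hypothetical and not an agreed API; the point is only to show the indirection between the autoscaler's recommendation and whatever automation acts on it.

```yaml
# Hypothetical resource: nothing here is an agreed Cluster API type.
apiVersion: autoscaling.cluster.x-k8s.io/v1alpha1   # hypothetical group/version
kind: ExternalScaleGroup                             # hypothetical kind
metadata:
  name: worker-us-east-1a
  namespace: default
spec:
  # The cluster autoscaler only increments or decrements this value.
  desiredReplicas: 3
  scaleTargetRef:
    apiVersion: cluster.x-k8s.io/v1alpha3
    kind: MachineSet
    name: worker-us-east-1a
status:
  # A Cluster API controller (or user-supplied automation) watches the spec,
  # reconciles the target toward it, and reports what it actually did here.
  observedReplicas: 2
```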
`minReplicas` / `maxReplicas` are known annotations which apply to any scalable resource, i.e. MachineSet/MachineDeployment, which makes them discoverable for the cluster autoscaler. They could be owned by e.g. a MachineAutoscaler resource (#2369):

```
cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size
cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size
```

```yaml
apiVersion: "cluster.x-k8s.io"
kind: "MachineAutoscaler"
metadata:
  name: "worker-us-east-1a"
  namespace: "cluster-api"
spec:
  minReplicas: 1
  maxReplicas: 12
  scaleTargetRef:
    apiVersion: cluster.x-k8s.io
    kind: MachineSet
    name: worker-us-east-1a
```
> I would imagine I'd want to set these on the MachineDeployment level,

I would expect this info to be accessible through any scalable resource liable to be autoscaled, but decoupled from it. This is purely provider-specific compute info which does not necessarily need a replicas controller on top to be meaningful. Along this line I'd see the current providerMachineTemplates, or something like a machineClass, as good candidates to infer this information as they see fit for each specific provider and expose it in a common way.
> We have to build something that copies these values into the staus. This seems
> like it could be an easily shared piece of library code.
>
> Multiple controllers updating the same field on the status object, need to
I'd like to avoid the potential for conflicting controllers attempting to manage the same field in a resource's Status field.
We also need to keep in mind that the controller for the resource "owns" the status of that resource. Ideally we'd be able to copy these values from a resource that is owned by the controller that is generating them.
There's not really any place to copy the information from. Unless we create some sort of new 'scale record' or some crazy thing like that, and then that's yet-another-crd.
> # Cluster Autoscaler Scale from Zero UX
>
> ## Summary
> Cluster Autoscaler supports scale from zero for cloud-native providers. This is done via compiling CPU, GPU, and RAM info into the cluster autoscaler at build time for each supported cloud.
> This is done via compiling CPU, GPU, and RAM info into the cluster autoscaler at build time for each supported cloud.

I'd say this is actually a provider implementation choice on how they decide to implement the TemplateNodeInfo interface, e.g. gce does it dynamically:
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/gce/gce_cloud_provider.go#L332
https://github.com/kubernetes/autoscaler/blob/589cd2af9c25088ce3408eeb88ba9239201d005d/cluster-autoscaler/cloudprovider/gce/gce_manager.go#L491-L504
To synthesize my comment here (#2530 (comment)): to satisfy 1 and 2, it should live along with the other provider-specific compute info, e.g. providerMachineTemplates or something like a machineClass.

We could always add a Status to the InfraMachineTemplate contract and require infrastructure providers to populate the Status with the appropriate information (possibly opt in, if they want to support autoscaler integration).

I don't know if it needs to be a "require" but it should be a "should". The next question is, where is the appropriate place to specify override values, or otherwise populate values that are not present. If we utilize annotations or even a spec field on the new object @enxebre suggests, we can sync values from the spec into the status of that new scaler object, and also sync the status fields from the infra-template. This would allow people that haven't implemented the logic or updated status fields in the infra-template to take advantage of scale-from-zero by annotation or by updating the spec on the scaler object.

I think that makes sense (with my preference being towards using a Spec field rather than an annotation).
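Roughly, the sync being discussed in the last two comments might look like the following, reusing the MachineAutoscaler kind sketched earlier in the thread. The capacityOverrides and capacity fields are placeholders I've made up for illustration, not an agreed API.

```yaml
# Hypothetical shape: only kind/minReplicas/maxReplicas/scaleTargetRef come
# from the earlier example; the capacity fields are illustrative placeholders.
apiVersion: "cluster.x-k8s.io"
kind: "MachineAutoscaler"
metadata:
  name: "worker-us-east-1a"
spec:
  minReplicas: 0
  maxReplicas: 12
  scaleTargetRef:
    apiVersion: cluster.x-k8s.io
    kind: MachineSet
    name: worker-us-east-1a
  # Optional user-supplied overrides for providers/templates that do not
  # publish this information themselves.
  capacityOverrides:
    cpu: "4"
    memory: 16Gi
status:
  # Normally synced from the InfraMachineTemplate status; the overrides above
  # win when set, so backup/restore or missing provider support does not
  # break scale from zero.
  capacity:
    cpu: "4"
    memory: 16Gi
    gpu: "0"
```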
> ## Summary
> Cluster Autoscaler supports scale from zero for cloud-native providers. This is done via compiling CPU, GPU, and RAM info into the cluster autoscaler at build time for each supported cloud.
>
> We are attempting to add cluster-api support to the cluster-autoscaler to scale machinesets, machinedeployments, and/or other groupings of machines as appropriate. We are trying to keep an abstraction which is isolated from particular cloud implementations. We need a way to inform the cluster-autoscaler of the value of machine resources so it can make proper scheduling decisions when there are not currently any machines in a particular scale set (machineset/machinedeployment, etc).
machinepools?
I think that's covered under 'and/or other groupings of machines,' although, machinepools don't necessarily group machines?
Machinepools is its own type of "machine"; it doesn't group individual machines. I thought it would be good to reference it to have it in mind in the design, since some of the concepts used for machines below might not apply the same way.
> We will need input from the autoscaler team, as well as input from api reviewers to ensure we obtain the best solution.
>
> These ideas require a particular (possibly new) cluster-api cloud provider controller
Suggested change:
```diff
- These ideas require a particular (possibly new) cluster-api cloud provider controller
+ These ideas require a particular (possibly new) cluster-api infrastructure provider controller
```
> #### Cons
> We're essentially creating an API around an annotation, and the api folks don't
> like that type of thing.
Can we elaborate on the downside of doing that? Not sure "the api folks don't like that type of thing" counts :)
I think they don't like to have specific annotations be part of a defined api. We might be able to do just annotations if this was all for a single controller, but with an external controller looking at our object(s), it's unlikely to pass API review.
The main issue with annotations is that they are an informal api that is detached from the actual api, so enforcing forward/backward compatibility, migrations for support, etc. are quite difficult.
For example, the cluster-autoscaler currently uses the deprecated cluster.k8s.io/delete-machine annotation rather than cluster.x-k8s.io/delete-machine, and in order to not break the integration we need to handle the backwards/forwards compatibility both in Cluster API and also in cluster-autoscaler: kubernetes/autoscaler#3161.
If the integration was done using an API field, then conversion webhooks would help ease the migration.
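As a concrete illustration of that compatibility burden, during the transition a Machine effectively needs to carry both keys for the two components to agree. The sketch below is only illustrative; the two annotation keys come from the comment above, while the resource name, apiVersion, and annotation values are assumptions.

```yaml
apiVersion: cluster.x-k8s.io/v1alpha3     # version is illustrative
kind: Machine
metadata:
  name: worker-us-east-1a-xyz             # illustrative name
  annotations:
    # Deprecated key still read by the cluster-autoscaler integration.
    cluster.k8s.io/delete-machine: "yes"
    # Current key expected by Cluster API controllers.
    cluster.x-k8s.io/delete-machine: "yes"
```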
> ### Goals
>
> ### Non-Goals
Perhaps this is only because I haven't been following prior conversations about this (sorry!), but I think I would be better able to understand the proposal if the motivation section included some goals and non-goals.
For example, is it a goal to minimize (or not have any?) impact on any cluster API provider which does not support autoscale from zero? Is it a goal to minimize impact/knowledge burden for cluster API users who are not using this feature?
The goal is to support scale from zero in an automated way. We'll be relying on providers to provide the attributes (CPU, RAM, GPU) for their respective instance types. We also want to build a mechanism to support scale from zero on cloud providers that don't have common billing (such as an openstack cloud or some other abstraction) by providing a field (either spec or annotation on some object) that allows an administrator to define the attributes themselves.
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.

/remove-lifecycle stale
> Clean separation of discovered values and user intent.
>
> #### Cons
> We have to build something that copies these values into the staus. This seems
Suggested change:
```diff
- We have to build something that copies these values into the staus. This seems
+ We have to build something that copies these values into the status. This seems
```
@michaelgugino is this proposal still active/up to date?

I seem to recall we had a discussion where we talked about an infra provider providing the CPU/RAM/GPU information somewhere (InfraMachineTemplate?), having that information percolate up to the MachineSet/MachineDeployment, and then the autoscaler could look at that information to make its scaling decisions. Am I remembering that anywhere near correctly? 😄

There's some oblique mention of a solution in this discussion with @elmiko: kubernetes/autoscaler#3150 (comment).

yes, i believe these details might have come from the discussion about how we have implemented scale from zero for openshift. we use annotations on MachineSets to percolate the details back to the autoscaler. at the minimum we need a way to communicate the cpu/ram/gpu as well as a way for taints to be consumed from the Machine[Set|Deployment], or a similar method.
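For illustration only, the annotation-based approach described above amounts to something like the sketch below on a MachineSet. The annotation keys here are made up for the sketch and are not the actual names used by any implementation.

```yaml
apiVersion: cluster.x-k8s.io/v1alpha3
kind: MachineSet
metadata:
  name: worker-us-east-1a
  annotations:
    # Made-up keys: instance capacity surfaced so the autoscaler can build a
    # template node for this group even when it currently has zero Machines.
    autoscaling.example.io/cpu: "4"
    autoscaling.example.io/memory: "16Gi"
    autoscaling.example.io/gpu: "1"
spec:
  replicas: 0
```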
@elmiko what taints are you referring to here?

given the conversation we had today in the capi meeting and re-reading this proposal again, i like the option of using the status field with an override. we could start with an implementation of the status updates for the core information and then add an override capability. for the override, i don't yet have a strong opinion, but i think if we used a field in the spec (of Machine[Set|Deployment]) then we could also give a good place for the user to add the taints as well. i am going to continue to do some research on the providers and what details we might need to implement. i would love to hear more thoughts and ideas here.
Some thoughts / let me know if this is what is being asked:
this sounds ok to me, but i need to read up on the MachineSpec a little more.
++, i think we are on the same page here. i like this option as well. edit: forgot to add, yes @ncdc i think you've captured the requests accurately.
Maybe we could do this with an OnCreate mutating webhook for node objects. This would require cloud provider info to be available on the node at first join; not sure if that's the case today or not.

How would that work exactly?
Nodes are objects just like anything else. Something creates the node object. If it's the kubelet itself, then this idea could work. If it's something else and the kubelet is given permissions on a preexisting node object to pencil in the details, it won't work. I don't know the specifics off the top of my head of how the node object gets created. If it were to work this way, the webhook looks for the corresponding machine (using the same logic as the nodelink controller) and applies the relevant taints/labels from the machine object. We might be able to do it later in the node's life cycle with UPDATE, but that seems like it might be a lot of noise and add latency to the node object. Health status is on a dedicated lease object, so it shouldn't have a terrible impact, but either way, it's less than ideal to catch all those calls.
Yes, the kubelet is typically responsible for creating a Node to represent itself (although it could be created by someone/something else). I think a webhook could work for a self-managed cluster, where Cluster API is running on the cluster itself. I don't think it would work for a management cluster/workload cluster split, which is quite common for some Cluster API consumers (because the webhook runs in the workload cluster, but it would need access to the management cluster for Machine data).

Having a bootstrap provider apply taints defined in the Machine spec when registering the node would work for both self-managed and management cluster configurations, but it does require every bootstrap provider to implement this functionality. This would be my preferred approach.
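For reference, a minimal sketch of what "the bootstrap provider applies the taints at registration time" could look like with the kubeadm bootstrap provider, which already exposes node-registration taints. The names, apiVersion, and taint values below are illustrative assumptions, not a prescribed configuration.

```yaml
apiVersion: bootstrap.cluster.x-k8s.io/v1alpha3   # version is illustrative
kind: KubeadmConfigTemplate
metadata:
  name: worker-us-east-1a                          # illustrative name
spec:
  template:
    spec:
      joinConfiguration:
        nodeRegistration:
          # kubeadm passes these to the kubelet so the node registers with
          # the taints already in place, leaving no scheduling gap.
          taints:
          - key: example.com/dedicated             # illustrative taint
            value: gpu
            effect: NoSchedule
```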
If you don't care about the taints or labels being applied immediately, we use a controller to do that today: https://github.com/openshift/machine-api-operator/blob/master/pkg/controller/nodelink/nodelink_controller.go#L244 This is probably adapted from the v1alpha1 code from here at some point. This would be a nice stand-in for bootstrap providers that don't expose kubelet options. This usually isn't an issue because the node takes a little time to go ready, so nothing will be scheduled there that shouldn't be (maybe daemonsets? TBD).

Generally speaking, I would expect bootstrap providers to support setting kubelet options on an override basis. It's a never ending road of configuration management, though. As soon as you expose this functionality, it's another place to handle on upgrade. What options did the user set? Any deprecated options? Warn? Fail? Their custom options conflict with the options you set by default? My rule of thumb is don't allow any changes to configuration by the user unless it's absolutely necessary. Taints in the kubelet args would be the best way, but it's also the worst way.
I think it does make sense to apply taints and labels at creation time, to avoid any possible gap where something might be able to "slip in" and get scheduled onto the node that otherwise wouldn't be allowed.
Could you expand a bit on who the actors are in this scenario, what's being overridden, and who is "you" ("the options you set by default")? CAPI's stance on node settings such as taints and labels is that a bootstrap provider such as CAPBK sets them at node registration time, and any further administration of these things must be done by some other actor (a user, an authorized controller, etc.).

The "you" here is the cluster-api project. If you provide these options, users will expect upgrades to still work. I suppose what I'm saying in regard to this is that it's not so much about taints and labels as it is other on-disk configuration, and how this change might fit into that larger picture.
i just wanted to update this pr with some of the ideas we have agreed upon during discussions.

for the cpu, memory, and gpu fields which are needed by the autoscaler to create new instances from zero, we agreed that the information should be stored in the status field of the MachineDeployment and MachineSet resources. they would be populated by each cloud provider as defined by the instances which are available on that platform. these values should be scoped under a specific field in the status.

for the taints, we agreed that this functionality could be added in a follow up. there are some issues around how we persist these taints down to the kubelets and what contracts this imposes on the bootstrap providers. additionally, the ability to define taints on the MachineDeployments and MachineSets is a feature for CAPI and we should separate it from the core functionality of scaling to and from zero.

@ncdc has given me some good ideas about building a poc to demonstrate how this could work, and i am working to get that in place to demonstrate at a future meeting. i'm going to confer with @michaelgugino about updating this proposal.
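A minimal sketch of what that agreed-upon shape could look like on a MachineSet; the nodeInfo grouping field and the values shown are hypothetical placeholders rather than the final API.

```yaml
apiVersion: cluster.x-k8s.io/v1alpha3
kind: MachineSet
metadata:
  name: worker-us-east-1a
spec:
  replicas: 0            # currently scaled down to zero
status:
  # Hypothetical grouping field, populated by the infrastructure provider so
  # the autoscaler can build a template node when no Machines exist yet.
  nodeInfo:
    cpu: "4"
    memory: 16Gi
    gpu: "1"
```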
Do you mean a follow-up posted here, or in a separate proposal? I just want to make sure that we recognize that scaling from zero in AWS can't work correctly without these taints being available for the autoscaler to consider.

i had meant in a follow up proposal, but if taints are a hard blocker for aws we might need to discuss this more. edit: is it possible we could do something on the cluster-autoscaler side that would make it appear as if there are no taints (an empty set) until we could add the extra functionality to properly recognize them?

I assume that's how it would work today. In AWS, if your ASGs lack the taint-related resource tags, the autoscaler assumes that the nodes will bear no taints. It will create instances for nodes to satisfy pods based on other constraints, but could mistakenly create an instance for a pod with no tolerations, and wind up with a node with taints that repel that pod. If you don't taint your nodes, though, of course this doesn't present a problem.
makes sense to me, but i have a question.
does this mean that a user could not add taints after creation?
i guess this is the question, could we move forward with an initial implementation that does not contain taints?

Yes, I think if there are no taints involved, and you either can't create nodes with taints or are warned of the dangers of doing so, then we'd still have something useful to those who don't need that capability.

Issues go stale after 90d of inactivity. If this issue is safe to close now please do so. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.

/remove-lifecycle stale

for those following this effort, i have created #4283 based on a proof of concept i am working on.

@fabriziopandini while i think there are some different ideas covered by this proposal, i am going to try and capture some of these ideas as future work in the other. so, i would be ok closing this pr, unless @michaelgugino has a strong objection.

Hey folks, let's please dedupe in one PR

i talked with @michaelgugino and he has no objection to closing this.
@elmiko: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
What this PR does / why we need it:
Add proposal to document expected scale from zero semantics for integration between cluster autoscaler and cluster-api.
Which issue(s) this PR fixes (optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged):
Fixes #2527