
MachineSet: support for defining specific machines when scaling down #75

Closed
rsdcastro opened this issue Apr 17, 2018 · 18 comments · Fixed by #726
Labels: help wanted (denotes an issue that needs help from a contributor; must meet "help wanted" guidelines) · priority/important-longterm (important over the long term, but may not be staffed and/or may need multiple releases to complete)
Milestone: v1alpha1

Comments

@rsdcastro

This issue tracks the ability to specify which machines should be removed from a MachineSet when it scales down. This is a requirement for supporting Cluster Autoscaler use cases.

An initial strawman using annotations to mark which machines should be removed would work, but there are concerns about the lack of an atomic operation for this kind of change.

cc @krousey @mwielgus @krzysztof-jastrzebski @MaciekPytel

@rsdcastro rsdcastro added this to the cluster-api-beta-implementation milestone Apr 17, 2018
@roberthbailey
Contributor

Also see #45 (comment)

@hardikdr
Member

hardikdr commented Aug 29, 2018

I can see the following annotation-based approach as low-hanging fruit.

  • By default we give every machine priority:3 via an annotation – the MachineDeployment adds this annotation to all machines at creation.
  • When scaling down the MachineDeployment, the autoscaler first lowers the priority of the specific machines it wants removed by updating the annotation to priority:1.
    • It then scales down the MachineDeployment in a separate call. That would avoid the risk of not being atomic.
  • MachineDeployment/MachineSet reconciliation always prefers to delete the machines with the lowest priority first, so the purpose is served (a rough sketch follows this list).
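A minimal Go sketch of the ordering this implies, assuming a hypothetical "priority" annotation key and the default value of 3 described above (none of these names are agreed API):

package deletepolicy

import (
    "sort"
    "strconv"
)

// machine is a stand-in for the Machine API type; only annotations matter here.
type machine struct {
    Name        string
    Annotations map[string]string
}

// priorityOf reads the hypothetical "priority" annotation, falling back to 3,
// the value the MachineDeployment would stamp on every machine at creation.
func priorityOf(m machine) int {
    if v, ok := m.Annotations["priority"]; ok {
        if p, err := strconv.Atoi(v); err == nil {
            return p
        }
    }
    return 3
}

// sortForDeletion orders machines lowest-priority-first, so a scale-down
// removes the machines the autoscaler demoted to priority 1 before any other.
func sortForDeletion(machines []machine) {
    sort.SliceStable(machines, func(i, j int) bool {
        return priorityOf(machines[i]) < priorityOf(machines[j])
    })
}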

For a long-term solution, though, we might want to learn from pod priority and PriorityClass: https://kubernetes.io/docs/concepts/configuration/pod-priority-preemption/#priorityclass

Also fyi: kubernetes/enhancements#609

@ingvagabund
Contributor

We can have multiple delete strategies, analogous to the image pull policy. For example:

---
apiVersion: cluster.k8s.io/v1alpha1
kind: MachineSet
metadata:
  name: <NAME>
  namespace: <NAMESPACE>
  labels:
    ...
spec:
  replicas: 2
  deletePolicy: <POLICY>
  selector:
    ...
  template:
    ...

where the deletePolicy could be:

  • Random
  • Delete the oldest replica
  • Delete a replica with lower priority
  • ...

It can default to Random. The goal is to unblock the integration of the Cluster API into the autoscaler in case we don't reach agreement in a reasonable time.

The code in question: https://github.com/kubernetes-sigs/cluster-api/blob/master/pkg/controller/machineset/controller.go#L325-L329
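As a rough illustration, the controller could dispatch on such a field like this; the policy names follow the list above, and none of this is settled API:

package deletepolicy

import (
    "math/rand"
    "sort"
    "time"
)

// replica is a stand-in for a Machine; only name and age matter here.
type replica struct {
    Name    string
    Created time.Time
}

// pickVictims chooses up to n replicas to delete under the given policy.
func pickVictims(policy string, replicas []replica, n int) []replica {
    rs := append([]replica(nil), replicas...)
    switch policy {
    case "Oldest":
        // Delete the oldest replicas first.
        sort.Slice(rs, func(i, j int) bool { return rs[i].Created.Before(rs[j].Created) })
    default: // "Random" is the default policy.
        rand.Shuffle(len(rs), func(i, j int) { rs[i], rs[j] = rs[j], rs[i] })
    }
    if n > len(rs) {
        n = len(rs)
    }
    return rs[:n]
}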

@roberthbailey
Contributor

The autoscaler would only work with the lowest-priority policy though, right? Otherwise the autoscaler would try to scale down a specific machine (by changing its priority), and when the machine set was resized a random or oldest machine would be deleted instead.

@ingvagabund
Contributor

ingvagabund commented Sep 25, 2018

An initial strawman using annotations to mark which machines should be removed would work, but there are concerns about the lack of an atomic operation for this kind of change.

Is there any solution that provides atomic operation?

The machineset controller should not change the number of replicas, since that is up to machineset consumers. The controller's responsibility is to create new machines or delete existing ones (updating the machineset's status at the same time). Given that, there is no mechanism to protect a machineset from rapidly alternating changes to the replicas field. Even when a specific machine is annotated, the machineset controller still lists all potential machines to delete; the annotation only turns the lottery into an archery contest: hit the bull's eye first. The following is likely to happen:

  1. Annotate machine(s)
  2. Decrease the number of replicas
  3. Start deleting machines; meanwhile someone removes the annotation (e.g. to annotate different machines)
  4. The machineset controller removes machines that were not supposed to be deleted
  5. Oops

So, OK, we can have another routine that listens for machines and, every time a machine gets annotated, removes it and decreases the replicas:

  1. A machine is annotated
  2. The routine removes the machine
  3. Before the routine can decrease the number of replicas: oops, the machineset controller notices a machine was removed -> it recreates a new one, since replicas has not changed
  4. The routine then decreases the number of replicas (assuming there was no error): oops
  5. The machineset controller deletes a machine to compensate for the replicas change. Oh, a different machine gets deleted, since the list of machines to be deleted is not sorted chronologically

The replicas logic and deletion of a specific machine are tightly connected.

@roberthbailey
Contributor

@maisem has been thinking about this as well, so adding him in here.

@ntfrnzn
Contributor

ntfrnzn commented Dec 19, 2018

As per discussion on Slack, the atomicity requirement seems unclear, and maybe we can move forward with a deletePolicy setup that is just:

(a) deletePolicy = simple|newest|oldest
(b) the simple policy attends to a known "delete-me" annotation on the machine, if it is present
(c) the newest|oldest policies preferentially remove machines based on their age (presumably measured from ObjectMeta.CreationTimestamp)

An external oracle (if one exists) can annotate machines based on external knowledge (if it chooses to) and thus affect the simple delete policy.
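A minimal sketch of the simple policy in (b), assuming the hypothetical "delete-me" annotation key (the key name is not settled): annotated machines are picked before anything else, and the rest keep their existing order.

package deletepolicy

// machine is a stand-in for the Machine API type; only annotations matter here.
type machine struct {
    Name        string
    Annotations map[string]string
}

// simplePolicyVictims returns up to n machines to delete: any machine carrying
// the hypothetical "delete-me" annotation comes first, the rest keep their order.
func simplePolicyVictims(machines []machine, n int) []machine {
    var marked, rest []machine
    for _, m := range machines {
        if _, ok := m.Annotations["delete-me"]; ok {
            marked = append(marked, m)
        } else {
            rest = append(rest, m)
        }
    }
    victims := append(marked, rest...)
    if n > len(victims) {
        n = len(victims)
    }
    return victims[:n]
}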

@timothysc timothysc added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Jan 10, 2019
@timothysc timothysc modified the milestones: cluster-api-beta-implementation, v1alpha1 Jan 10, 2019
@timothysc timothysc added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Jan 10, 2019
@ingvagabund
Contributor

ingvagabund commented Jan 14, 2019

The current simple deletion policy [1] breaks machines into 3 classes:

  • must delete: deletion timestamp is non-zero
  • better delete: machines with a reported error
  • could delete: the remaining machines

Our current implementation of the cluster-api based cluster autoscaler [2] targets the must delete class (which gets removed first) via the sigs.k8s.io/cluster-api-delete-machine machine annotation [3].
The process of scaling down then reduces to:

  1. the cluster autoscaler annotates all candidate machines for scale-down
  2. the cluster autoscaler reduces the machine set's number of replicas
  3. the machineset controller starts removing all machines with the sigs.k8s.io/cluster-api-delete-machine annotation before any others

So the process covers case (b) above.

[1] https://github.com/kubernetes-sigs/cluster-api/blob/master/pkg/controller/machineset/delete_policy.go#L34-L42
[2] https://github.com/openshift/kubernetes-autoscaler/tree/master/cluster-autoscaler/cloudprovider/clusterapi
[3] #513
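For reference, a condensed sketch of the shape of the simple policy in [1] with the annotation handling from [3] folded in; the numeric weights are illustrative, only their ordering matters:

package deletepolicy

import "time"

type deletePriority int

const (
    mustDelete   deletePriority = 100 // deletion timestamp set, or delete-machine annotation present
    betterDelete deletePriority = 50  // machine reports an error
    couldDelete  deletePriority = 20  // everything else
)

// status and apiMachine are minimal stand-ins for the Machine API types;
// only the fields the policy inspects are modeled.
type status struct{ ErrorReason *string }

type apiMachine struct {
    DeletionTimestamp *time.Time
    Annotations       map[string]string
    Status            status
}

func simpleDeletePriority(m apiMachine) deletePriority {
    if m.DeletionTimestamp != nil && !m.DeletionTimestamp.IsZero() {
        return mustDelete
    }
    if _, ok := m.Annotations["sigs.k8s.io/cluster-api-delete-machine"]; ok {
        return mustDelete
    }
    if m.Status.ErrorReason != nil {
        return betterDelete
    }
    return couldDelete
}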

@ntfrnzn would that be applicable for your use case?

@MaciekPytel

I'm still not convinced this approach is safe to use with the Cluster Autoscaler. Sure, it will work most of the time, but without an atomic way to delete a specific machine you have all sorts of races that may lead to an additional, random machine being removed. At that point, all the guarantees CA gives about respecting things like PodDisruptionBudgets and various no-scale-down annotations are gone.

For scale-down, CA doesn't work on machine sets; it looks at individual nodes and checks which ones can be deleted. Ultimately what it does is remove a specific machine, NOT change the desired number of machines. If someone else removes the machine while it's being drained, the delete operation must fail.
In an extreme case, CA can even race with itself by scaling up the machineset while it's removing a VM from it.

All those races are the reason for requesting an atomic implementation in the initial issue description.

@ntfrnzn
Contributor

ntfrnzn commented Jan 14, 2019

From my perspective, yes, this is what I pictured for deletePolicy: simple.

In the meantime, I wrote a few lines of code for the time-based policies, using a sigmoid-like function deletePriority(creationTimestamp, now) -> (0.0, 100) that maps a machine into a deletion-priority range (float64); a sketch of the idea is below. I hesitated over where to define the deletePolicy string (in the MachineSet or MachineDeployment types) and haven't gotten back to it.
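Roughly, such a curve could look like this (a sketch only, not the actual code; the half-life constant is made up for illustration):

package deletepolicy

import (
    "math"
    "time"
)

// deletePriority maps a machine's age onto the open interval (0, 100) with a
// logistic curve: a machine exactly one halfLife old scores 50, older machines
// approach 100, and brand-new machines score low but never exactly 0.
func deletePriority(created, now time.Time) float64 {
    const halfLife = 24 * time.Hour // illustrative midpoint only
    age := now.Sub(created)
    x := float64(age-halfLife) / float64(halfLife)
    return 100 / (1 + math.Exp(-x))
}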

More importantly, I'm conflicted over the atomicity requirements; they complicate the types. One implementation would be to add a field like []string machineIdsMustNotExist to the MachineSet, so that the cluster autoscaler can update the MachineSet by changing (a) the size and (b) the to-be-deleted list on a single object atomically; when the resync happens, the controller will remove the named machines and make the set the correct size.
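Sketched as a type change (the field name is taken verbatim from the proposal above and is not part of any released API):

package deletepolicy

// MachineSetSpec sketches the proposed addition; existing fields are elided.
type MachineSetSpec struct {
    Replicas *int32 `json:"replicas,omitempty"`
    // MachineIDsMustNotExist lists machines the controller must delete (and
    // never recreate) while reconciling toward Replicas, making the size
    // change and the victim list a single atomic update.
    MachineIDsMustNotExist []string `json:"machineIdsMustNotExist,omitempty"`
}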

@roberthbailey
Contributor

/assign @maisem

@k8s-ci-robot
Contributor

@roberthbailey: GitHub didn't allow me to assign the following users: maisem.

Note that only kubernetes-sigs members and repo collaborators can be assigned and that issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/assign @maisem

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@hardikdr
Member

hardikdr commented Feb 19, 2019

Ping.
There have been multiple efforts and discussions on this topic. Let's please try to discuss this and settle on an approach in the next meeting.
Achieving atomicity seems really hard; I'm not sure there is a way to do it at all.
I suggest we decide on the simple priority-based approach for now and unblock the autoscaler integration.
Ref:

@maisem @roberthbailey @ingvagabund @ntfrnzn @erstapples @MaciekPytel

@ncdc
Contributor

ncdc commented Mar 5, 2019

This has priority/important-longterm. Do we think we'll be able to resolve any outstanding questions/issues in time for v1alpha1?

@ncdc
Contributor

ncdc commented Mar 6, 2019

We didn't discuss this during today's meeting. I propose that this isn't required for v1alpha1 and that we defer it to a future milestone. WDYT @hardikdr @justinsb @detiber @rsdcastro?

@detiber
Member

detiber commented Mar 6, 2019

Considering there is an open PR that could resolve this (https://github.com/kubernetes-sigs/cluster-api/pull/726/files), I don't see a need to bump it unless we cannot get that PR merged in time.

@ncdc
Contributor

ncdc commented Mar 6, 2019

There is also #513.

@vincepri
Member

/assign
