-
Notifications
You must be signed in to change notification settings - Fork 4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
capi: Replicas can go stale in nodeGroup causing erratic behaviour #3104
Labels
area/provider/cluster-api
Issues or PRs related to Cluster API provider
Comments
k8s-ci-robot
added
the
area/provider/cluster-api
Issues or PRs related to Cluster API provider
label
Apr 30, 2020
enxebre
changed the title
Replicas can go stale in nodeGroup causing erratic behaviour
capi: Replicas can go stale in nodeGroup causing erratic behaviour
Apr 30, 2020
cc @elmiko |
++, thanks Alberto! |
enxebre
added a commit
to enxebre/autoscaler
that referenced
this issue
May 4, 2020
When getting Replicas() the local struct in the scalable resource might be stale. To mitigate possible side effects, we want always get a fresh replicas. This is one in a series of PR to mitigate kubernetes#3104 Once we got all merge we'll put a PR upstream.
enxebre
added a commit
to enxebre/autoscaler
that referenced
this issue
May 4, 2020
When calling deleteNodes() we should fail early if the operation could delete nodes below the nodeGroup minSize(). This is one in a series of PR to mitigate kubernetes#3104 Once we got all merge we'll put a PR upstream.
enxebre
added a commit
to enxebre/autoscaler
that referenced
this issue
May 4, 2020
When calling deleteNodes() we should fail early if the operation could delete nodes below the nodeGroup minSize(). This is one in a series of PR to mitigate kubernetes#3104 Once we got all merge we'll put a PR upstream.
enxebre
added a commit
to enxebre/autoscaler
that referenced
this issue
May 4, 2020
When calling deleteNodes() we should fail early if the operation could delete nodes below the nodeGroup minSize(). This is one in a series of PR to mitigate kubernetes#3104 Once we got all merge we'll put a PR upstream.
enxebre
added a commit
to enxebre/autoscaler
that referenced
this issue
May 4, 2020
When calling deleteNodes() we should fail early if the operation could delete nodes below the nodeGroup minSize(). This is one in a series of PR to mitigate kubernetes#3104 Once we got all merge we'll put a PR upstream.
enxebre
added a commit
to enxebre/autoscaler
that referenced
this issue
May 4, 2020
When calling deleteNodes() we should fail early if the operation could delete nodes below the nodeGroup minSize(). This is one in a series of PR to mitigate kubernetes#3104 Once we got all merge we'll put a PR upstream.
elmiko
added a commit
to elmiko/kubernetes-autoscaler
that referenced
this issue
May 4, 2020
This change adds a mutex to the MachineController structure which is used to gate access to the DeleteNodes function. This is one in a series of PRs to mitigate kubernetes#3104
enxebre
added a commit
to enxebre/autoscaler
that referenced
this issue
May 5, 2020
When getting Replicas() the local struct in the scalable resource might be stale. To mitigate possible side effects, we want always get a fresh replicas. This is one in a series of PR to mitigate kubernetes#3104 Once we got all merge we'll put a PR upstream.
enxebre
added a commit
to enxebre/autoscaler
that referenced
this issue
May 5, 2020
When calling deleteNodes() we should fail early if the operation could delete nodes below the nodeGroup minSize(). This is one in a series of PR to mitigate kubernetes#3104 Once we got all of them merged we'll put a PR upstream.
elmiko
added a commit
to elmiko/kubernetes-autoscaler
that referenced
this issue
May 5, 2020
This change adds a mutex to the MachineController structure which is used to gate access to the DeleteNodes function. This is one in a series of PRs to mitigate kubernetes#3104 Once a solution has been reached, this will be contribued upstream.
elmiko
added a commit
to elmiko/kubernetes-autoscaler
that referenced
this issue
May 5, 2020
This change adds a mutex to the MachineController structure which is used to gate access to the DeleteNodes function. This is one in a series of PRs to mitigate kubernetes#3104 Once a solution has been reached, this will be contribued upstream.
elmiko
added a commit
to elmiko/kubernetes-autoscaler
that referenced
this issue
Jun 2, 2020
This change adds a mutex to the MachineController structure which is used to gate access to the DeleteNodes function. This is one in a series of PRs to mitigate kubernetes#3104
elmiko
pushed a commit
to elmiko/kubernetes-autoscaler
that referenced
this issue
Jun 2, 2020
When getting Replicas() the local struct in the scalable resource might be stale. To mitigate possible side effects, we want always get a fresh replicas. This is one in a series of PR to mitigate kubernetes#3104
elmiko
pushed a commit
to elmiko/kubernetes-autoscaler
that referenced
this issue
Jun 2, 2020
provider When calling deleteNodes() we should fail early if the operation could delete nodes below the nodeGroup minSize(). This is one in a series of PR to mitigate kubernetes#3104
d-mo
pushed a commit
to mistio/autoscaler
that referenced
this issue
Jun 18, 2020
This change adds a mutex to the MachineController structure which is used to gate access to the DeleteNodes function. This is one in a series of PRs to mitigate kubernetes#3104
d-mo
pushed a commit
to mistio/autoscaler
that referenced
this issue
Jun 18, 2020
When getting Replicas() the local struct in the scalable resource might be stale. To mitigate possible side effects, we want always get a fresh replicas. This is one in a series of PR to mitigate kubernetes#3104
d-mo
pushed a commit
to mistio/autoscaler
that referenced
this issue
Jun 18, 2020
provider When calling deleteNodes() we should fail early if the operation could delete nodes below the nodeGroup minSize(). This is one in a series of PR to mitigate kubernetes#3104
detiber
pushed a commit
to detiber/autoscaler
that referenced
this issue
Jul 23, 2020
This change adds a mutex to the MachineController structure which is used to gate access to the DeleteNodes function. This is one in a series of PRs to mitigate kubernetes#3104
detiber
pushed a commit
to detiber/autoscaler
that referenced
this issue
Jul 23, 2020
When getting Replicas() the local struct in the scalable resource might be stale. To mitigate possible side effects, we want always get a fresh replicas. This is one in a series of PR to mitigate kubernetes#3104
detiber
pushed a commit
to detiber/autoscaler
that referenced
this issue
Jul 23, 2020
provider When calling deleteNodes() we should fail early if the operation could delete nodes below the nodeGroup minSize(). This is one in a series of PR to mitigate kubernetes#3104
detiber
pushed a commit
to detiber/autoscaler
that referenced
this issue
Aug 5, 2020
This change adds a mutex to the MachineController structure which is used to gate access to the DeleteNodes function. This is one in a series of PRs to mitigate kubernetes#3104
detiber
pushed a commit
to detiber/autoscaler
that referenced
this issue
Aug 5, 2020
When getting Replicas() the local struct in the scalable resource might be stale. To mitigate possible side effects, we want always get a fresh replicas. This is one in a series of PR to mitigate kubernetes#3104
detiber
pushed a commit
to detiber/autoscaler
that referenced
this issue
Aug 5, 2020
provider When calling deleteNodes() we should fail early if the operation could delete nodes below the nodeGroup minSize(). This is one in a series of PR to mitigate kubernetes#3104
benmoss
pushed a commit
to benmoss/autoscaler
that referenced
this issue
Sep 28, 2020
This change adds a mutex to the MachineController structure which is used to gate access to the DeleteNodes function. This is one in a series of PRs to mitigate kubernetes#3104
ghost
pushed a commit
to Capillary/autoscaler
that referenced
this issue
Jan 28, 2021
This change adds a mutex to the MachineController structure which is used to gate access to the DeleteNodes function. This is one in a series of PRs to mitigate kubernetes#3104
ghost
pushed a commit
to Capillary/autoscaler
that referenced
this issue
Jan 28, 2021
When getting Replicas() the local struct in the scalable resource might be stale. To mitigate possible side effects, we want always get a fresh replicas. This is one in a series of PR to mitigate kubernetes#3104
ghost
pushed a commit
to Capillary/autoscaler
that referenced
this issue
Jan 28, 2021
provider When calling deleteNodes() we should fail early if the operation could delete nodes below the nodeGroup minSize(). This is one in a series of PR to mitigate kubernetes#3104
colin-welch
pushed a commit
to Paperspace/autoscaler
that referenced
this issue
Mar 5, 2021
This change adds a mutex to the MachineController structure which is used to gate access to the DeleteNodes function. This is one in a series of PRs to mitigate kubernetes#3104
aksentyev
pushed a commit
to aksentyev/autoscaler
that referenced
this issue
Apr 9, 2021
This change adds a mutex to the MachineController structure which is used to gate access to the DeleteNodes function. This is one in a series of PRs to mitigate kubernetes#3104
aksentyev
pushed a commit
to aksentyev/autoscaler
that referenced
this issue
Apr 9, 2021
When getting Replicas() the local struct in the scalable resource might be stale. To mitigate possible side effects, we want always get a fresh replicas. This is one in a series of PR to mitigate kubernetes#3104
aksentyev
pushed a commit
to aksentyev/autoscaler
that referenced
this issue
Apr 9, 2021
provider When calling deleteNodes() we should fail early if the operation could delete nodes below the nodeGroup minSize(). This is one in a series of PR to mitigate kubernetes#3104
piotrnosek
pushed a commit
to piotrnosek/autoscaler
that referenced
this issue
Nov 30, 2021
This change adds a mutex to the MachineController structure which is used to gate access to the DeleteNodes function. This is one in a series of PRs to mitigate kubernetes#3104
piotrnosek
pushed a commit
to piotrnosek/autoscaler
that referenced
this issue
Nov 30, 2021
When getting Replicas() the local struct in the scalable resource might be stale. To mitigate possible side effects, we want always get a fresh replicas. This is one in a series of PR to mitigate kubernetes#3104
piotrnosek
pushed a commit
to piotrnosek/autoscaler
that referenced
this issue
Nov 30, 2021
provider When calling deleteNodes() we should fail early if the operation could delete nodes below the nodeGroup minSize(). This is one in a series of PR to mitigate kubernetes#3104
enxebre
added a commit
to enxebre/autoscaler
that referenced
this issue
Jul 13, 2022
This ensured that access to replicas during scale down operations were never stale by accessing the API server kubernetes#3104. This honoured that behaviour while moving to unstructured client kubernetes#3312. This regressed that behaviour while trying to reduce the API server load kubernetes#4443. This put back the never stale replicas behaviour at the cost of loading back the API server kubernetes#4634. This PR tries to satisfy both non stale replicas during scale down and prevent the API server from being overloaded. To achieve that it lets targetSize which is called on every autoscaling cluster state loop from come from cache. Also note that the scale down implementation has changed https://github.com/kubernetes/autoscaler/commits/master/cluster-autoscaler/core/scaledown.
enxebre
added a commit
to enxebre/autoscaler
that referenced
this issue
Jul 13, 2022
This ensured that access to replicas during scale down operations were never stale by accessing the API server kubernetes#3104. This honoured that behaviour while moving to unstructured client kubernetes#3312. This regressed that behaviour while trying to reduce the API server load kubernetes#4443. This put back the never stale replicas behaviour at the cost of loading back the API server kubernetes#4634. Currently on e.g a 48 minutes cluster it does 1.4k get request to the scale subresource. This PR tries to satisfy both non stale replicas during scale down and prevent the API server from being overloaded. To achieve that it lets targetSize which is called on every autoscaling cluster state loop from come from cache. Also note that the scale down implementation has changed https://github.com/kubernetes/autoscaler/commits/master/cluster-autoscaler/core/scaledown.
enxebre
added a commit
to enxebre/autoscaler
that referenced
this issue
Jul 13, 2022
This ensured that access to replicas during scale down operations were never stale by accessing the API server kubernetes#3104. This honoured that behaviour while moving to unstructured client kubernetes#3312. This regressed that behaviour while trying to reduce the API server load kubernetes#4443. This put back the never stale replicas behaviour at the cost of loading back the API server kubernetes#4634. Currently on e.g a 48 minutes cluster it does 1.4k get request to the scale subresource. This PR tries to satisfy both non stale replicas during scale down and prevent the API server from being overloaded. To achieve that it lets targetSize which is called on every autoscaling cluster state loop from come from cache. Also note that the scale down implementation has changed https://github.com/kubernetes/autoscaler/commits/master/cluster-autoscaler/core/scaledown.
navinjoy
pushed a commit
to navinjoy/autoscaler
that referenced
this issue
Oct 26, 2022
This ensured that access to replicas during scale down operations were never stale by accessing the API server kubernetes#3104. This honoured that behaviour while moving to unstructured client kubernetes#3312. This regressed that behaviour while trying to reduce the API server load kubernetes#4443. This put back the never stale replicas behaviour at the cost of loading back the API server kubernetes#4634. Currently on e.g a 48 minutes cluster it does 1.4k get request to the scale subresource. This PR tries to satisfy both non stale replicas during scale down and prevent the API server from being overloaded. To achieve that it lets targetSize which is called on every autoscaling cluster state loop from come from cache. Also note that the scale down implementation has changed https://github.com/kubernetes/autoscaler/commits/master/cluster-autoscaler/core/scaledown.
tim-smart
pushed a commit
to arisechurch/autoscaler
that referenced
this issue
Nov 22, 2022
This change adds a mutex to the MachineController structure which is used to gate access to the DeleteNodes function. This is one in a series of PRs to mitigate kubernetes#3104
tim-smart
pushed a commit
to arisechurch/autoscaler
that referenced
this issue
Nov 22, 2022
When getting Replicas() the local struct in the scalable resource might be stale. To mitigate possible side effects, we want always get a fresh replicas. This is one in a series of PR to mitigate kubernetes#3104
tim-smart
pushed a commit
to arisechurch/autoscaler
that referenced
this issue
Nov 22, 2022
provider When calling deleteNodes() we should fail early if the operation could delete nodes below the nodeGroup minSize(). This is one in a series of PR to mitigate kubernetes#3104
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
DeleteNodes() is sometimes called in subroutines https://github.com/kubernetes/autoscaler/blob/cluster-autoscaler-1.18.1/cluster-autoscaler/core/scale_down.go#L1067-L1090
But our implementation does not account for that properly https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/clusterapi/clusterapi_nodegroup.go#L84-L149. And subroutines might compete to call setSize() with different values and getting stale replicas().
Also when we call
setSize()
we leave stale the replicas number in the machineSet for the scalableResource struct used byreplicas()
https://github.com/kubernetes/autoscaler/blob/cluster-autoscaler-1.18.1/cluster-autoscaler/cloudprovider/clusterapi/clusterapi_machineset.go#L63-L85.We need to:
/area provider/cluster-api
The text was updated successfully, but these errors were encountered: