Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

capi: Replicas can go stale in nodeGroup causing erratic behaviour #3104

Closed
enxebre opened this issue Apr 30, 2020 · 2 comments · Fixed by #3177
Closed

capi: Replicas can go stale in nodeGroup causing erratic behaviour #3104

enxebre opened this issue Apr 30, 2020 · 2 comments · Fixed by #3177
Labels
area/provider/cluster-api Issues or PRs related to Cluster API provider

Comments

@enxebre
Copy link
Member

enxebre commented Apr 30, 2020

DeleteNodes() is sometimes called in subroutines https://github.com/kubernetes/autoscaler/blob/cluster-autoscaler-1.18.1/cluster-autoscaler/core/scale_down.go#L1067-L1090

But our implementation does not account for that properly https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/clusterapi/clusterapi_nodegroup.go#L84-L149. And subroutines might compete to call setSize() with different values and getting stale replicas().

Also when we call setSize() we leave stale the replicas number in the machineSet for the scalableResource struct used by replicas() https://github.com/kubernetes/autoscaler/blob/cluster-autoscaler-1.18.1/cluster-autoscaler/cloudprovider/clusterapi/clusterapi_machineset.go#L63-L85.

We need to:

/area provider/cluster-api

@k8s-ci-robot k8s-ci-robot added the area/provider/cluster-api Issues or PRs related to Cluster API provider label Apr 30, 2020
@enxebre enxebre changed the title Replicas can go stale in nodeGroup causing erratic behaviour capi: Replicas can go stale in nodeGroup causing erratic behaviour Apr 30, 2020
@enxebre
Copy link
Member Author

enxebre commented Apr 30, 2020

cc @elmiko

@elmiko
Copy link
Contributor

elmiko commented Apr 30, 2020

++, thanks Alberto!

enxebre added a commit to enxebre/autoscaler that referenced this issue May 4, 2020
When getting Replicas() the local struct in the scalable resource might be stale. To mitigate possible side effects, we want always get a fresh replicas.

This is one in a series of PR to mitigate kubernetes#3104
Once we got all merge we'll put a PR upstream.
enxebre added a commit to enxebre/autoscaler that referenced this issue May 4, 2020
When calling deleteNodes() we should fail early if the operation could delete nodes below the nodeGroup minSize().

This is one in a series of PR to mitigate kubernetes#3104
Once we got all merge we'll put a PR upstream.
enxebre added a commit to enxebre/autoscaler that referenced this issue May 4, 2020
When calling deleteNodes() we should fail early if the operation could delete nodes below the nodeGroup minSize().

This is one in a series of PR to mitigate kubernetes#3104
Once we got all merge we'll put a PR upstream.
enxebre added a commit to enxebre/autoscaler that referenced this issue May 4, 2020
When calling deleteNodes() we should fail early if the operation could delete nodes below the nodeGroup minSize().

This is one in a series of PR to mitigate kubernetes#3104
Once we got all merge we'll put a PR upstream.
enxebre added a commit to enxebre/autoscaler that referenced this issue May 4, 2020
When calling deleteNodes() we should fail early if the operation could delete nodes below the nodeGroup minSize().

This is one in a series of PR to mitigate kubernetes#3104
Once we got all merge we'll put a PR upstream.
enxebre added a commit to enxebre/autoscaler that referenced this issue May 4, 2020
When calling deleteNodes() we should fail early if the operation could delete nodes below the nodeGroup minSize().

This is one in a series of PR to mitigate kubernetes#3104
Once we got all merge we'll put a PR upstream.
elmiko added a commit to elmiko/kubernetes-autoscaler that referenced this issue May 4, 2020
This change adds a mutex to the MachineController structure which is
used to gate access to the DeleteNodes function.

This is one in a series of PRs to mitigate kubernetes#3104
enxebre added a commit to enxebre/autoscaler that referenced this issue May 5, 2020
When getting Replicas() the local struct in the scalable resource might be stale. To mitigate possible side effects, we want always get a fresh replicas.

This is one in a series of PR to mitigate kubernetes#3104
Once we got all merge we'll put a PR upstream.
enxebre added a commit to enxebre/autoscaler that referenced this issue May 5, 2020
When calling deleteNodes() we should fail early if the operation could delete nodes below the nodeGroup minSize().

This is one in a series of PR to mitigate kubernetes#3104
Once we got all of them merged we'll put a PR upstream.
elmiko added a commit to elmiko/kubernetes-autoscaler that referenced this issue May 5, 2020
This change adds a mutex to the MachineController structure which is
used to gate access to the DeleteNodes function.

This is one in a series of PRs to mitigate kubernetes#3104
Once a solution has been reached, this will be contribued upstream.
elmiko added a commit to elmiko/kubernetes-autoscaler that referenced this issue May 5, 2020
This change adds a mutex to the MachineController structure which is
used to gate access to the DeleteNodes function.

This is one in a series of PRs to mitigate kubernetes#3104
Once a solution has been reached, this will be contribued upstream.
elmiko added a commit to elmiko/kubernetes-autoscaler that referenced this issue Jun 2, 2020
This change adds a mutex to the MachineController structure which is
used to gate access to the DeleteNodes function.

This is one in a series of PRs to mitigate kubernetes#3104
elmiko pushed a commit to elmiko/kubernetes-autoscaler that referenced this issue Jun 2, 2020
When getting Replicas() the local struct in the scalable resource might be stale. To mitigate possible side effects, we want always get a fresh replicas.

This is one in a series of PR to mitigate kubernetes#3104
elmiko pushed a commit to elmiko/kubernetes-autoscaler that referenced this issue Jun 2, 2020
provider

When calling deleteNodes() we should fail early if the operation could delete nodes below the nodeGroup minSize().

This is one in a series of PR to mitigate kubernetes#3104
d-mo pushed a commit to mistio/autoscaler that referenced this issue Jun 18, 2020
This change adds a mutex to the MachineController structure which is
used to gate access to the DeleteNodes function.

This is one in a series of PRs to mitigate kubernetes#3104
d-mo pushed a commit to mistio/autoscaler that referenced this issue Jun 18, 2020
When getting Replicas() the local struct in the scalable resource might be stale. To mitigate possible side effects, we want always get a fresh replicas.

This is one in a series of PR to mitigate kubernetes#3104
d-mo pushed a commit to mistio/autoscaler that referenced this issue Jun 18, 2020
provider

When calling deleteNodes() we should fail early if the operation could delete nodes below the nodeGroup minSize().

This is one in a series of PR to mitigate kubernetes#3104
detiber pushed a commit to detiber/autoscaler that referenced this issue Jul 23, 2020
This change adds a mutex to the MachineController structure which is
used to gate access to the DeleteNodes function.

This is one in a series of PRs to mitigate kubernetes#3104
detiber pushed a commit to detiber/autoscaler that referenced this issue Jul 23, 2020
When getting Replicas() the local struct in the scalable resource might be stale. To mitigate possible side effects, we want always get a fresh replicas.

This is one in a series of PR to mitigate kubernetes#3104
detiber pushed a commit to detiber/autoscaler that referenced this issue Jul 23, 2020
provider

When calling deleteNodes() we should fail early if the operation could delete nodes below the nodeGroup minSize().

This is one in a series of PR to mitigate kubernetes#3104
detiber pushed a commit to detiber/autoscaler that referenced this issue Aug 5, 2020
This change adds a mutex to the MachineController structure which is
used to gate access to the DeleteNodes function.

This is one in a series of PRs to mitigate kubernetes#3104
detiber pushed a commit to detiber/autoscaler that referenced this issue Aug 5, 2020
When getting Replicas() the local struct in the scalable resource might be stale. To mitigate possible side effects, we want always get a fresh replicas.

This is one in a series of PR to mitigate kubernetes#3104
detiber pushed a commit to detiber/autoscaler that referenced this issue Aug 5, 2020
provider

When calling deleteNodes() we should fail early if the operation could delete nodes below the nodeGroup minSize().

This is one in a series of PR to mitigate kubernetes#3104
benmoss pushed a commit to benmoss/autoscaler that referenced this issue Sep 28, 2020
This change adds a mutex to the MachineController structure which is
used to gate access to the DeleteNodes function.

This is one in a series of PRs to mitigate kubernetes#3104
ghost pushed a commit to Capillary/autoscaler that referenced this issue Jan 28, 2021
This change adds a mutex to the MachineController structure which is
used to gate access to the DeleteNodes function.

This is one in a series of PRs to mitigate kubernetes#3104
ghost pushed a commit to Capillary/autoscaler that referenced this issue Jan 28, 2021
When getting Replicas() the local struct in the scalable resource might be stale. To mitigate possible side effects, we want always get a fresh replicas.

This is one in a series of PR to mitigate kubernetes#3104
ghost pushed a commit to Capillary/autoscaler that referenced this issue Jan 28, 2021
provider

When calling deleteNodes() we should fail early if the operation could delete nodes below the nodeGroup minSize().

This is one in a series of PR to mitigate kubernetes#3104
colin-welch pushed a commit to Paperspace/autoscaler that referenced this issue Mar 5, 2021
This change adds a mutex to the MachineController structure which is
used to gate access to the DeleteNodes function.

This is one in a series of PRs to mitigate kubernetes#3104
aksentyev pushed a commit to aksentyev/autoscaler that referenced this issue Apr 9, 2021
This change adds a mutex to the MachineController structure which is
used to gate access to the DeleteNodes function.

This is one in a series of PRs to mitigate kubernetes#3104
aksentyev pushed a commit to aksentyev/autoscaler that referenced this issue Apr 9, 2021
When getting Replicas() the local struct in the scalable resource might be stale. To mitigate possible side effects, we want always get a fresh replicas.

This is one in a series of PR to mitigate kubernetes#3104
aksentyev pushed a commit to aksentyev/autoscaler that referenced this issue Apr 9, 2021
provider

When calling deleteNodes() we should fail early if the operation could delete nodes below the nodeGroup minSize().

This is one in a series of PR to mitigate kubernetes#3104
piotrnosek pushed a commit to piotrnosek/autoscaler that referenced this issue Nov 30, 2021
This change adds a mutex to the MachineController structure which is
used to gate access to the DeleteNodes function.

This is one in a series of PRs to mitigate kubernetes#3104
piotrnosek pushed a commit to piotrnosek/autoscaler that referenced this issue Nov 30, 2021
When getting Replicas() the local struct in the scalable resource might be stale. To mitigate possible side effects, we want always get a fresh replicas.

This is one in a series of PR to mitigate kubernetes#3104
piotrnosek pushed a commit to piotrnosek/autoscaler that referenced this issue Nov 30, 2021
provider

When calling deleteNodes() we should fail early if the operation could delete nodes below the nodeGroup minSize().

This is one in a series of PR to mitigate kubernetes#3104
enxebre added a commit to enxebre/autoscaler that referenced this issue Jul 13, 2022
This ensured that access to replicas during scale down operations were never stale by accessing the API server kubernetes#3104.
This honoured that behaviour while moving to unstructured client kubernetes#3312.
This regressed that behaviour while trying to reduce the API server load kubernetes#4443.
This put back the never stale replicas behaviour at the cost of loading back the API server kubernetes#4634.

This PR tries to satisfy both non stale replicas during scale down and prevent the API server from being overloaded. To achieve that it lets targetSize which is called on every autoscaling cluster state loop from come from cache.

Also note that the scale down implementation has changed https://github.com/kubernetes/autoscaler/commits/master/cluster-autoscaler/core/scaledown.
enxebre added a commit to enxebre/autoscaler that referenced this issue Jul 13, 2022
This ensured that access to replicas during scale down operations were never stale by accessing the API server kubernetes#3104.
This honoured that behaviour while moving to unstructured client kubernetes#3312.
This regressed that behaviour while trying to reduce the API server load kubernetes#4443.
This put back the never stale replicas behaviour at the cost of loading back the API server kubernetes#4634.

Currently on e.g a 48 minutes cluster it does 1.4k get request to the scale subresource.
This PR tries to satisfy both non stale replicas during scale down and prevent the API server from being overloaded. To achieve that it lets targetSize which is called on every autoscaling cluster state loop from come from cache.

Also note that the scale down implementation has changed https://github.com/kubernetes/autoscaler/commits/master/cluster-autoscaler/core/scaledown.
enxebre added a commit to enxebre/autoscaler that referenced this issue Jul 13, 2022
This ensured that access to replicas during scale down operations were never stale by accessing the API server kubernetes#3104.
This honoured that behaviour while moving to unstructured client kubernetes#3312.
This regressed that behaviour while trying to reduce the API server load kubernetes#4443.
This put back the never stale replicas behaviour at the cost of loading back the API server kubernetes#4634.

Currently on e.g a 48 minutes cluster it does 1.4k get request to the scale subresource.
This PR tries to satisfy both non stale replicas during scale down and prevent the API server from being overloaded. To achieve that it lets targetSize which is called on every autoscaling cluster state loop from come from cache.

Also note that the scale down implementation has changed https://github.com/kubernetes/autoscaler/commits/master/cluster-autoscaler/core/scaledown.
navinjoy pushed a commit to navinjoy/autoscaler that referenced this issue Oct 26, 2022
This ensured that access to replicas during scale down operations were never stale by accessing the API server kubernetes#3104.
This honoured that behaviour while moving to unstructured client kubernetes#3312.
This regressed that behaviour while trying to reduce the API server load kubernetes#4443.
This put back the never stale replicas behaviour at the cost of loading back the API server kubernetes#4634.

Currently on e.g a 48 minutes cluster it does 1.4k get request to the scale subresource.
This PR tries to satisfy both non stale replicas during scale down and prevent the API server from being overloaded. To achieve that it lets targetSize which is called on every autoscaling cluster state loop from come from cache.

Also note that the scale down implementation has changed https://github.com/kubernetes/autoscaler/commits/master/cluster-autoscaler/core/scaledown.
tim-smart pushed a commit to arisechurch/autoscaler that referenced this issue Nov 22, 2022
This change adds a mutex to the MachineController structure which is
used to gate access to the DeleteNodes function.

This is one in a series of PRs to mitigate kubernetes#3104
tim-smart pushed a commit to arisechurch/autoscaler that referenced this issue Nov 22, 2022
When getting Replicas() the local struct in the scalable resource might be stale. To mitigate possible side effects, we want always get a fresh replicas.

This is one in a series of PR to mitigate kubernetes#3104
tim-smart pushed a commit to arisechurch/autoscaler that referenced this issue Nov 22, 2022
provider

When calling deleteNodes() we should fail early if the operation could delete nodes below the nodeGroup minSize().

This is one in a series of PR to mitigate kubernetes#3104
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/provider/cluster-api Issues or PRs related to Cluster API provider
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants