Excessive client-side throttling in the clusterapi provider for cluster-autoscaler #6333

Closed
kdw174 opened this issue Dec 1, 2023 · 9 comments · Fixed by #6416
Labels
kind/bug: Categorizes issue or PR as related to a bug.

Comments

@kdw174
Contributor

kdw174 commented Dec 1, 2023

Which component are you using?:
cluster-autoscaler

What version of the component are you using?:
1.27.1 and 1.26.2

What environment is this in?:
clusterapi

What did you expect to happen?:
cluster-autoscaler to meet the SLO described here

What happened instead?:
We are seeing excessive client-side throttling (200-300 times a minute) in our clusters that have a higher number of machinedeployments and nodes (~50 machinedeployments and 100-200 nodes).

Waited for 194.425534ms due to client-side throttling, not priority and fairness, request: GET:https://{apiserver}:443/apis/cluster.x-k8s.io/v1beta1/namespaces/{namespace}/machinedeployments/{machinedeployment}/scale

This is causing the main loop to take close to 2 minutes in some cases.

2023-12-01 12:55:21.692 | I1201 12:55:21.692565       1 metrics.go:409] Function main took 1m46.646684648s to complete

When we need to scale up multiple machinedeployments at once, this can severely delay a pod getting scheduled. We have seen pods pending for over an hour. Let's say we have 10 machinedeployments that all need to scale up at once because of a rolling deployment on dedicated nodes or a scaling event. It will take a minimum of 10*1m46s to scale up replicas in each machinedeployment from the initial list. If additional machinedeployments need to be scaled during that period, it will take even longer.

How to reproduce it (as minimally and precisely as possible):
Use cluster-autoscaler with the clusterapi provider. Set up a cluster with a higher number of nodes (100-200) and machinedeployments (~50). You should see heavy client-side throttling in the logs and the main function taking over a minute to complete.

If you scale up pod replicas so multiple (10) machinedeployments scale up at once, it will take many minutes for the last pod to get scheduled on a node.

Anything else we need to know?:
The clusterapi provider creates additional kube client configs that use the default QPS and burst configuration in the rest client. Currently, there is no way to override those values to either increase the rate limit or remove it altogether and rely on API Priority and Fairness (APF).

It looks like the recent PR #6274 attempted to remove the rate limit on the cluster-autoscaler kube client, but I'm not sure that will work as expected. By not setting QPS, it would take the default value of 5.0. I think QPS would need to be set to -1.0 by default if the goal is to remove the client-side rate limit by default.
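
For reference, here is a minimal sketch (using client-go's rest package directly, not the autoscaler's actual wiring) of why the QPS value matters: a zero QPS falls back to the default of 5, while a negative QPS skips creating the client-side token-bucket rate limiter entirely, leaving only server-side APF.

package example

import (
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
)

// newUnthrottledClientset builds an in-cluster client with client-side
// rate limiting disabled. With a negative QPS, client-go does not create a
// token-bucket rate limiter, so requests are only subject to server-side
// API Priority and Fairness.
func newUnthrottledClientset() (*kubernetes.Clientset, error) {
    cfg, err := rest.InClusterConfig()
    if err != nil {
        return nil, err
    }
    cfg.QPS = -1  // 0 would mean "use the default of 5.0"
    cfg.Burst = 0 // ignored when no rate limiter is created
    return kubernetes.NewForConfig(cfg)
}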

The proposal would be to add the kube-client-qps and kube-client-burst flags back, but set the kube-client-qps default value to -1.0. Those can then be passed down to the clusterapi provider kube client configs to allow tuning there.
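
A rough sketch of how those proposed flags could be wired (flag names taken from the proposal above; the burst default and the wiring itself are illustrative and may differ from the actual implementation):

package example

import (
    "flag"

    "k8s.io/client-go/rest"
)

// Proposed flags; the -1.0 QPS default disables client-side rate limiting
// unless a user opts back in.
var (
    kubeClientQPS   = flag.Float64("kube-client-qps", -1.0, "QPS for kube clients; negative disables client-side rate limiting")
    kubeClientBurst = flag.Int("kube-client-burst", 0, "burst for kube clients; unused when QPS is negative")
)

// applyClientRateLimits would be applied to every rest.Config the
// clusterapi provider builds, so the same tuning reaches those clients too.
func applyClientRateLimits(cfg *rest.Config) {
    cfg.QPS = float32(*kubeClientQPS)
    cfg.Burst = *kubeClientBurst
}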

@kdw174 added the kind/bug label Dec 1, 2023
@elmiko
Contributor

elmiko commented Dec 18, 2023

thanks for raising this @kdw174, i guess we'll need a way for the user to override the client settings for the non-management client.

do you think a good starting point would be to have the other clients use the same settings as the primary client?

further, if we added environment variables as a way to override these values, would that be a sufficient first step?
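
As a rough illustration of that env-var idea (the variable names below are hypothetical, not existing settings in any release):

package example

import (
    "os"
    "strconv"

    "k8s.io/client-go/rest"
)

// applyEnvOverrides reads hypothetical CAPI_KUBE_CLIENT_QPS and
// CAPI_KUBE_CLIENT_BURST environment variables (illustrative names only)
// and applies them to a rest.Config before the client is built.
func applyEnvOverrides(cfg *rest.Config) {
    if v := os.Getenv("CAPI_KUBE_CLIENT_QPS"); v != "" {
        if qps, err := strconv.ParseFloat(v, 32); err == nil {
            cfg.QPS = float32(qps)
        }
    }
    if v := os.Getenv("CAPI_KUBE_CLIENT_BURST"); v != "" {
        if burst, err := strconv.Atoi(v); err == nil {
            cfg.Burst = burst
        }
    }
}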

@kdw174
Contributor Author

kdw174 commented Jan 3, 2024

Thanks for the review @elmiko. What's the process for getting this backported to older versions? Do I need to open up PRs against the cluster-autoscaler-release branches or is there more to it?

@elmiko
Contributor

elmiko commented Jan 3, 2024

@kdw174 good question, in the past there have been issues created by the person doing a release to collect backport PRs, but i have to imagine you could open those PRs against the historical versions.

perhaps @x13n @towca or @feiskyer could advise

@feiskyer
Member

feiskyer commented Jan 5, 2024

yes, please open cherry-pick PRs to the release branches.

@kdw174
Contributor Author

kdw174 commented Jan 10, 2024

Thanks, opened PRs for 1.27-1.29.

@AmanPathak-DevOps

AmanPathak-DevOps commented Oct 17, 2024

Getting the same thing on an AWS EKS cluster for version 1.30.
Is it not updated in the 1.30 version, or do we have to add our own?
Appreciate the help. Thanks.

spec:
  containers:
  - name: cluster-autoscaler
    command:
    - ./cluster-autoscaler
    - --kube-api-qps=20
    - --kube-api-burst=40

@elmiko
Contributor

elmiko commented Oct 17, 2024

@AmanPathak-DevOps it does seem like this code (or the current version of it) is in the 1.30 release, see https://github.com/kubernetes/autoscaler/blob/cluster-autoscaler-release-1.30/cluster-autoscaler/main.go#L460-L462

perhaps you need different values for qps/burst?

or, it could be a bug if they are not respected.

@AmanPathak-DevOps

Hey @elmiko! Thanks for getting back to me. I'm encountering some issues while managing the EKS cluster. I've deployed CA on my EKS cluster, but there are instances when CA restarts because its node goes into an Unknown state. Consequently, new worker nodes can't be created while CA is Pending, and the other nodegroups can't run the CA pod because they lack capacity. What approach would you recommend to resolve this? I've tried a manual approach where I increase the min size of a nodegroup.

@elmiko
Contributor

elmiko commented Oct 22, 2024

it sounds like perhaps a PodDisruptionBudget on the CA might help to prevent it from being restarted and not getting rescheduled?
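
A minimal sketch of that suggestion via client-go (the name, namespace, and labels are assumptions about a typical cluster-autoscaler install; adjust to your deployment):

package example

import (
    "context"

    policyv1 "k8s.io/api/policy/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/util/intstr"
    "k8s.io/client-go/kubernetes"
)

// createAutoscalerPDB creates a PodDisruptionBudget that keeps at least one
// cluster-autoscaler pod available during voluntary disruptions such as
// node drains. The selector labels here are assumed, not standard.
func createAutoscalerPDB(ctx context.Context, cs kubernetes.Interface) error {
    minAvailable := intstr.FromInt(1)
    pdb := &policyv1.PodDisruptionBudget{
        ObjectMeta: metav1.ObjectMeta{
            Name:      "cluster-autoscaler",
            Namespace: "kube-system",
        },
        Spec: policyv1.PodDisruptionBudgetSpec{
            MinAvailable: &minAvailable,
            Selector: &metav1.LabelSelector{
                MatchLabels: map[string]string{"app": "cluster-autoscaler"},
            },
        },
    }
    _, err := cs.PolicyV1().PodDisruptionBudgets("kube-system").Create(ctx, pdb, metav1.CreateOptions{})
    return err
}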
