Request throttling for AWS #31858
Comments
Hi, could someone look into this? This issue completely blocks us from using Kubernetes.
Hi,
@anioool we do have a global back-off, but this only prevents k8s from totally DoS-ing your AWS account; it doesn't really keep k8s working smoothly when it is hitting a limit. We need to figure out where these calls are coming from, and then figure out a way to make them less expensive. Do you have any ideas what is causing it? Are you rapidly attaching and detaching volumes? I suspect 500 volumes is just more than we've tried before - I'll poke around and see whether I can see what is happening.
Hi @justinsb, I want to thank you for your reply. Let me describe how our process works. The process is split into two phases, called base and replication. In the first stage (base) we automatically create approximately 500 volumes from previously prepared snapshots. Deployments for these volumes are created in one batch. The second phase is replication: creating new volumes from snapshots and creating new deployments. Detaching may occur during this second stage. There is a chance we will scale up to 5,000 volumes. I checked your code, and it is true that you have a global back-off, but per my observation it is not very usable: the controller manager makes up to 10 AWS DescribeVolumes requests per minute, and when throttling appears the back-off sets exactly the same delay for all of them. That is one reason why we end up DoS-ing the EC2 service on our AWS account. The second problem is that the controller manager makes a separate call for each volume, while it could use batches. During the first phase, 95% of the calls on our account were DescribeVolumes and only 5% were DescribeInstances. Let me know if you need any additional information.
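For illustration, here is a minimal Go sketch of per-request exponential back-off with full jitter, the usual remedy when many callers would otherwise all retry after an identical delay, as described above. The base and maximum delays are assumptions for the example, not values taken from the controller manager.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// backoffWithJitter returns the delay before retry number `attempt` (0-indexed).
// base and max are illustrative assumptions, not Kubernetes' actual values.
func backoffWithJitter(attempt int, base, max time.Duration) time.Duration {
	d := base << uint(attempt) // exponential growth: base * 2^attempt
	if d > max || d <= 0 {     // cap the delay; guard against shift overflow
		d = max
	}
	// Full jitter: pick a random delay in [0, d) so concurrent callers
	// spread out instead of all retrying after the same fixed delay.
	return time.Duration(rand.Int63n(int64(d)))
}

func main() {
	for attempt := 0; attempt < 5; attempt++ {
		fmt.Printf("attempt %d: wait %v\n",
			attempt, backoffWithJitter(attempt, 500*time.Millisecond, 30*time.Second))
	}
}
```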
Hi @justinsb, any news on this topic?
This needs to be triaged as a release-blocker or not for 1.5 |
@justinsb all issues must be labeled either release blocker or non release blocking by end of day 18 November 2016 PST. (or please move it to 1.6) cc @kubernetes/sig-aws |
We're seeing a similar issue, but with DescribeInstances. Taking a quick look at the code, the main place we call DescribeInstances is when checking that volumes are attached. We recently added a few volumes to the cluster, but only a handful, fewer than 10 per cluster. All our controller managers are now spamming the AWS API and are all rate limited. E.g.:
Any idea what could be causing the requests @justinsb ? We're on 1.4.7 |
We have seen similar issues between 10-20 EBS volumes between a few clusters on both 1.4.6 and 1.4.7 |
Unless someone beats me to it, I am going to open an issue on the DescribeInstances calls this morning.
@sebbonnet what version are you on? |
@chrislovecnm (I work with @sebbonnet): 1.4.7
cc @kubernetes/sig-aws-misc @kubernetes/sig-storage-misc |
This is affecting us as well. |
I think we have a 1.3.x spam problem at scale, which is this issue. We also have another issue, #39526, that occurs in 1.4.x+ even not at scale.
We are affected by this issue as well. We have around 30 EBS volumes used by petsets. When the controller-manager hits AWS API limits, things start to go crazy: our pods with volumes start moving around constantly.
Correct me if I am wrong. After looking at the code, it looks like, given 10 EBS volumes and one node, Kubernetes will make 11 AWS API calls: one call to describe each individual volume separately, plus another call to describe the instance. This does not sound very efficient. What if it instead made one call to describe a list of volumes? Right now the issue gets worse the more volumes you have. /cc @justinsb
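To make the proposal concrete, here is a hedged sketch (not the controller-manager's actual code) of batching volume lookups with the AWS SDK for Go, so that many volume IDs go into one DescribeVolumes call instead of one call each. The chunk size of 100 is an assumption for the example, not a documented API limit.

```go
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

// describeVolumesBatched looks up many volumes with as few API calls as
// possible, chunking the ID list rather than calling once per volume.
func describeVolumesBatched(svc *ec2.EC2, volumeIDs []string) ([]*ec2.Volume, error) {
	const chunkSize = 100 // assumed batch size, not a documented limit
	var volumes []*ec2.Volume
	for start := 0; start < len(volumeIDs); start += chunkSize {
		end := start + chunkSize
		if end > len(volumeIDs) {
			end = len(volumeIDs)
		}
		out, err := svc.DescribeVolumes(&ec2.DescribeVolumesInput{
			VolumeIds: aws.StringSlice(volumeIDs[start:end]),
		})
		if err != nil {
			return nil, err
		}
		volumes = append(volumes, out.Volumes...)
	}
	return volumes, nil
}

func main() {
	sess := session.Must(session.NewSession())
	svc := ec2.New(sess)
	vols, err := describeVolumesBatched(svc, []string{"vol-0123456789abcdef0"})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("described %d volumes\n", len(vols))
}
```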
It will do DescribeVolume polling during attach and detach operations only, I believe. But that should be self-correcting, because we will throttle if there are too many operations in progress. Can you link to the code or provide a log if you are seeing something different? The problem, I believe, is the per-node call, because we make it for every node regardless of current volume operations, currently every 10 seconds. That is fixed in #39564
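As an aside, a common complement to back-off is a client-side rate limiter in front of a periodic poll like this, so the steady-state request rate stays under the account quota. Below is a minimal sketch using golang.org/x/time/rate; the 2-requests-per-second limit, the 10-second poll interval, and the describeInstances placeholder are illustrative assumptions, not the values or code Kubernetes actually uses.

```go
package main

import (
	"context"
	"log"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	// Allow at most ~2 API calls per second across all poll loops that
	// share this limiter (the limit is an assumption for illustration).
	limiter := rate.NewLimiter(rate.Limit(2), 1)

	ticker := time.NewTicker(10 * time.Second) // assumed poll interval
	defer ticker.Stop()

	for range ticker.C {
		// Block until the limiter grants a token; this smooths bursts
		// from many concurrent reconcilers into a bounded request rate.
		if err := limiter.Wait(context.Background()); err != nil {
			log.Fatal(err)
		}
		// A hypothetical describeInstances() call would go here.
		log.Println("polling instance attachment state")
	}
}
```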
@justinsb that's good to know. We're running v1.4.7 and are constantly being hit by API rate limits. I guess the question is: how does the volume reconciler behave when the detachRequestedTime timer does not get reset? When it's unable to get volume state from AWS because of rate limits, the detach timer never gets reset. What happens then?
@justinsb can this be closed? |
This is too late for v1.6, moving to v1.7 milestone. If this is incorrect please correct. /cc @kubernetes/release-team |
@kubernetes/sig-aws-misc do you want this in v1.7? Is this the right SIG to own this issue?
After implementing bulk volume polling (#41306), this bug should be a thing of the past. IMO we can close this bug.
@gnufied do we have an issue open about Route53 DNS? Also, I do not know if we understand our limits yet. Maybe those are separate issues. Storage is much better.
@justinsb Can this issue be closed? |
I think we should close this one and open a focused one for the remaining issues in 1.7, e.g. Route53.
Hi,
We are facing a huge problem with request throttling on AWS, especially for Describe methods. Our cluster has more than five hundred volumes attached, and depending on load we attach/detach volumes as well. In the controller manager logs I see a lot of delay entries like this:
As you can see, the CM postpones requests; however, requests from the same minute are postponed by exactly the same amount of time.
From the AWS account perspective, we are unable to run anything else unrelated to the Kubernetes cluster, because we constantly receive RequestLimitExceeded errors.
Is there a possibility to let the CM batch its Describe calls, instead of making a separate call per volume, per instance, and so on?
Our version:
Client Version: version.Info{Major:"1", Minor:"3", GitVersion:"v1.3.6", GitCommit:"ae4550cc9c89a593bcda6678df201db1b208133b", GitTreeState:"clean", BuildDate:"2016-08-26T18:13:23Z", GoVersion:"go1.6.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"3", GitVersion:"v1.3.6+coreos.0", GitCommit:"f6f0055b8e503cbe5fb7b6f1a2ee37d0f160c1cd", GitTreeState:"clean", BuildDate:"2016-08-29T17:01:01Z", GoVersion:"go1.6.2", Compiler:"gc", Platform:"linux/amd64"}
Thanks in advance.
Adam