Performance regression with the AWS provider in 0.5.9 #869
Comments
Hi @lbernail, thanks for your report. That looks like a serious problem. For a quick fix you could also try whether using the registry cache helps. Caching in the provider would be fine but would add another layer of caching, which could become very confusing if done wrong. We could think about fixing the code in #742 to avoid making a call to AWS.
From looking at the code, it seems that using the txt-registry cache doesn't help. Regarding a fix: it looks like we could add this here and pass the list of records to each call.
Yes, the cache won't work because we are calling the provider's Records method and not the one from the registry. Doing 2 calls would be a lot better, but the second one still feels unnecessary (in our case we have zones with thousands of records, and we run a slightly modified version which sleeps in the ListResourceRecordSetsPages callback to avoid triggering AWS rate limits; a full zone sync takes 10+ seconds).
It's not ideal but much better than the current behaviour and fairly safe to implement. We could improve it with proper caching later. What do you think?
Definitely a lot better. Another idea: what about passing the recordsCache to the provider's ApplyChanges call? We could just ignore it in other providers (I assume some could make use of it too).
I think this is affecting us as well. However, we didn't experience this until upgrading. I'm not sure why any of the domains needed to be updated anyway, given there were no changes other than the version upgrade. Update: downgrading works around it for us.
@lbernail I worked on a rather quick fix to reduce the number of requests from O(#records) to O(#zones), which should already help: #876. Using the txt registry cache could lead to confusion and would only work for people using the txt registry. It probably makes sense to somehow also cache in the individual providers as they see fit, so that subsequent calls can be served from that cache.
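The O(#records) to O(#zones) idea can be sketched abstractly: resolve each affected zone once per sync and reuse that listing for every change in that zone. All names below (zoneClient, applyChanges, Record) are illustrative stand-ins, not external-dns's actual types:

```go
package main

import "fmt"

// Record is a toy stand-in for a DNS record in some hosted zone.
type Record struct {
	Zone, Name string
}

// zoneClient simulates a remote DNS API; it counts calls so we can
// see how many requests a sync triggers.
type zoneClient struct {
	zones map[string][]Record
	calls int
}

func (c *zoneClient) listZone(zone string) []Record {
	c.calls++
	return c.zones[zone]
}

// applyChanges lists the current records once per affected zone
// (O(#zones)) instead of once per changed record (O(#records)).
func applyChanges(c *zoneClient, changes []Record) map[string][]Record {
	current := make(map[string][]Record)
	for _, ch := range changes {
		if _, ok := current[ch.Zone]; !ok {
			current[ch.Zone] = c.listZone(ch.Zone)
		}
	}
	return current
}

func main() {
	c := &zoneClient{zones: map[string][]Record{
		"example.org.": {{"example.org.", "a"}, {"example.org.", "b"}},
	}}
	changes := []Record{
		{"example.org.", "a"},
		{"example.org.", "b"},
		{"example.org.", "c"},
	}
	applyChanges(c, changes)
	fmt.Println("API calls:", c.calls) // 1 call for 3 changes
}
```

With many changes concentrated in a few zones, the request count no longer grows with the number of changed records.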
@nesl247 It's another bug where the planning logic thinks it needs to update the records when in fact no update is needed. It persists from interval to interval and results in a lot of updates, which then trigger the calls to AWS. We tried to fix it here, but I'm not sure we want to accept the PR this way. There might be a simpler way to implement it. Alternatively, we could use the "old" way of detecting changes (basically reverting parts of #650) or drop the
@linki: thanks a lot, this is definitely a lot better. I don't think it is enough for us because of the size of our zone on AWS (4x10+s is very long compared to <1ms with the registry cache). I agree about the risk of confusion with the registry cache, but I also wonder whether having two different caches might make things complicated. Today the registry cache is very efficient: we only resync it once an hour, so calls to Records() take <1ms except when a full sync is performed.
Can we update the release notes to warn people about this performance regression in v0.5.10? I upgraded to v0.5.10 yesterday and wasted a bunch of time before concluding that this new version of external-dns was buggy and then eventually finding this issue, so it would be nice to save others some time by putting a big fat warning that Route53 will throttle them with v0.5.10 :P |
@linki @njuettner What about modifying the release notes for v0.5.10 and/or publishing v0.5.9 as v0.5.11 to make sure that we make it safe for the user? This problem has been around too long and caused too many issues to ignore.
@jhohertz thanks for the investigation. @linki is not available for the next 3 months and @njuettner will be back next week.
I did the code review and both seem OK to me.
Thanks @Raffo |
Hello
We tried to update our deployment to 0.5.9, but we noticed a significant performance regression due to #742.
After this PR, for each change we call Records() instead of calling it once per plan (https://github.com/kubernetes-incubator/external-dns/blob/v0.5.9/provider/aws.go#L367), which retrieves all records for the zone. A simple fix could be to store the result of the Records() call in the AWSProvider. I can PR this if you think it makes sense.
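The fix proposed here, storing the Records() result on the provider, can be sketched as a cache that is invalidated whenever ApplyChanges mutates the zone. The provider struct below is a toy stand-in with hypothetical fields, not the real AWSProvider:

```go
package main

import "fmt"

// provider caches the last full zone listing so repeated Records()
// calls within one reconciliation don't hit the remote API.
type provider struct {
	cache   []string
	fetches int
}

// fetchAll simulates the expensive remote listing call.
func (p *provider) fetchAll() []string {
	p.fetches++
	return []string{"a.example.org", "b.example.org"}
}

// Records serves from the cache when possible.
func (p *provider) Records() []string {
	if p.cache == nil {
		p.cache = p.fetchAll()
	}
	return p.cache
}

// ApplyChanges mutates the zone, so the cache must be dropped.
func (p *provider) ApplyChanges(changes []string) {
	p.cache = nil // next Records() call re-fetches
}

func main() {
	p := &provider{}
	p.Records()
	p.Records() // served from cache
	p.ApplyChanges([]string{"c.example.org"})
	p.Records() // re-fetches after invalidation
	fmt.Println("fetches:", p.fetches) // 2
}
```

Invalidating on write keeps the cache consistent with the provider's own changes, though out-of-band edits to the zone would still go unnoticed until the next invalidation.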
In addition, the call to ListResourceRecordSetsPages in the Records() function is paginated, but the AWS Go SDK does not take rate limits into account. My understanding is that each page is 100 records, which means on large zones we can quickly hit the rate limit of 5 requests per second (from https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/DNSLimitations.html#limits-api-requests). We could retry and back off on rate-limit errors, but given how low the limit is and how simple the call pattern is, we could also simply sleep for 200ms (or 250ms to be safe, or even make it configurable) at the end of the callback.
Laurent