
[SDK-1975] Use exponential backoff rather than rate limit headers #538

Merged: 6 commits from retry into master on Oct 14, 2020

Conversation

@adamjmcgrath (Contributor) commented Sep 29, 2020

Changes

Currently, RetryRestClient uses the X-RateLimit-Reset header to calculate the retry duration, e.g. if you hit a per-minute rate limit, the backoff for 10 retries will be: [60s, 60s, ..., 60s, 60s]

But X-RateLimit-Reset only tells you when the bucket of available requests will be full again - not when there might be a request available in the bucket to use. So 60s will likely be too long to wait, and can cause things like Webtasks to fail.

Instead, we're switching to a generic exponential backoff with jitter, starting at ~1s, e.g. for 10 retries: [1-2s, 2-4s, 4-8s, 8-16s, 16-32s, 32-64s, 64-128s, 128-256s, 256-512s, 512-1024s]

Each delay is calculated by (1 + Math.random()) * 1s * Math.pow(2, attemptNumber) - happy to tweak these numbers if we want.

The SDK already uses retry, so we can leverage that library's default behaviour.
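For illustration, here's a minimal sketch of how the library could be configured to produce the series above - this isn't the actual RetryRestClient wiring, and `requestWithRetry` / `makeRequest` are made-up names:

```js
// Minimal sketch only. With these options, retry's delay formula is
// (1 + Math.random()) * minTimeout * Math.pow(factor, attempt),
// i.e. roughly [1-2s, 2-4s, 4-8s, ...] as described above.
const retry = require('retry');

function requestWithRetry(makeRequest, done) {
  const operation = retry.operation({
    retries: 10, // maximum number of retries
    factor: 2, // double the delay on each attempt
    minTimeout: 1000, // first delay is ~1-2s
    randomize: true // multiply each delay by (1 + Math.random()) for jitter
  });

  operation.attempt(() => {
    makeRequest((err, res) => {
      // Only retry on 429s; operation.retry(err) schedules the next attempt
      // and returns true while retries remain.
      if (err && err.statusCode === 429 && operation.retry(err)) return;
      done(err, res);
    });
  });
}
```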

References

https://auth0team.atlassian.net/browse/ESD-8390
https://auth0.com/docs/policies/rate-limit-policy#handle-rates-limitations-in-code
https://auth0.com/docs/policies/rate-limit-policy/management-api-endpoint-rate-limits
https://github.com/tim-kos/node-retry

Testing

- [x] This change adds unit test coverage
- [ ] This change adds integration test coverage

@joshcanhelp (Contributor) left a comment


Nice and simple, love it!

It might be nice to allow all the retry options to be passed in as an object. Not necessary, but it could make it easier to adjust based on what the time/rate limits are.
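Something like this hypothetical shape, just to illustrate - these option names aren't the SDK's actual config:

```js
// Hypothetical shape only - "pass the retry options in as an object".
const ManagementClient = require('auth0').ManagementClient;

const management = new ManagementClient({
  domain: '{YOUR_TENANT}.auth0.com',
  token: '{YOUR_API_V2_TOKEN}',
  retry: {
    enabled: true, // turn retries on/off
    maxRetries: 10 // cap the number of attempts (names illustrative)
  }
});
```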

One thing that came to mind ... we're removing the reliance on the headers, which means we're ignoring info we have that might make us skip the retry altogether. The Management API limit might be hit within, say, 10 seconds and not have anything available for another 50. So we might make 4 additional attempts when we already know that we'll hit a 429 for 50 more seconds.

It might be that this kind of logic would complicate things and we'd, at best, only save ~4 API calls. Just wanted to bring it up.

src/RetryRestClient.js: two review threads (outdated, resolved)
@adamjmcgrath (Contributor, Author) commented Sep 29, 2020

Thanks @joshcanhelp!

The Management API limit might be hit within, say, 10 seconds and not have anything available for another 50. So we might make 4 additional attempts when we already know that we'll hit a 429 for 50 more seconds.

Looking at that Management API rate limit doc, it says:

for the given bucket, there is a maximum request limit of x per minute, and for each minute that elapses, permissions for y requests are added back. In other words, for each 60/y seconds, one additional request is added to the bucket.

Which tells me that requests are added back to the bucket at a rate of one request every 60/per_minute_rate_limit seconds. So when you say "hit within, say, 10 seconds and not have anything available for another 50", I would expect only an endpoint limited at '1 per minute' to behave like that - and I don't see any endpoints like that (except possibly the signing key rotation one).
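To make that concrete (illustration only, not SDK code):

```js
// With a per-minute limit of y, one request is added back to the bucket
// every 60 / y seconds (the "rolling" refill described in the doc).
function secondsUntilNextRequest(perMinuteLimit) {
  return 60 / perMinuteLimit;
}

secondsUntilNextRequest(50); // 1.2s - comparable to the ~1-2s first backoff
secondsUntilNextRequest(10); // 6s
secondsUntilNextRequest(1);  // 60s - the only case needing a full minute's wait
```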

Although, looking at that doc, I wonder if I should make the minTimeout 500ms rather than 1000ms.

@adamjmcgrath (Contributor, Author)

It might be nice to allow all the retry options to be passed in as an object.

Good idea 👍

@joshcanhelp (Contributor)

@adamjmcgrath - Ah, I see, it's a rolling check ... makes more sense than "sorry, this API won't work for another minute" 😆 thank god I'm not in charge of setting those limits!

In that case, with access to all the vars, this looks great! I'll bump the version in our Rule utility library and update the docs once it's ready.

@jimmyjames (Contributor)

@adamjmcgrath looks like the codecov task is failing due to a decrease in coverage. Is that something that should be addressed in this PR, or do we need to re-evaluate our coverage requirements?

@davidpatrick will be good for you to review this just to ensure there's no risk here that I can't see. It looks good to me 👍

@adamjmcgrath (Contributor, Author)

@adamjmcgrath looks like the codecov task is failing due to a decrease in coverage.

Thanks @jimmyjames - it's the project coverage threshold; we should probably ignore it. I deleted a bunch of code that had 100% coverage, so the overall level has decreased. The only way to make that check green would be to go and find some unrelated uncovered code and write tests for it (which I'm happy to do, just not in this PR).

@davidpatrick will be good for you to review this just to ensure there's no risk here that I can't see. It looks good to me 👍

I'm going to get someone from #iam-core-foundations to take a look as well, I just haven't gotten around to pinging them yet

@adamjmcgrath (Contributor, Author)

@jimmyjames @davidpatrick #iam-core-foundation has no issues, so if one of you could approve, I'll go ahead and merge.

@davidpatrick davidpatrick added this to the v2Next milestone Oct 14, 2020
@adamjmcgrath adamjmcgrath merged commit cf2a1e5 into master Oct 14, 2020
@adamjmcgrath adamjmcgrath deleted the retry branch October 14, 2020 16:08
@davidpatrick davidpatrick mentioned this pull request Oct 22, 2020