Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Internal API for retrying HTTP requests #518

Merged
merged 8 commits into from
May 22, 2019
Merged

Conversation

hiranya911
Copy link
Contributor

The existing HttpClient retries all HTTP requests once that are failing due to connection timeout and reset errors. We would like to further extend this retries support to meet the following new requirements:

  1. Ability to retry requests on specific HTTP error codes (e.g. 503)
  2. Ability to delay subsequent retries according to the Retry-After header sent by the server
  3. Ability to use exponential backoff
  4. Ability to configure retries for individual services in the SDK

In order to meet all the above requirements, this PR introduces a new internal interface called RetryConfig. The HttpClient optionally takes a RetryConfig as a constructor parameter, and retries failing requests based on that.

For the moment, we use the following as the default RetryConfig in order to retain the existing behavior:

const DEFAULT_RETRY_CONFIG: RetryConfig = {
  maxRetries: 1,
  ioErrorCodes: ['ECONNRESET', 'ETIMEDOUT'],
  maxDelayInMillis: 60 * 1000,
};

We can change this to something more sophisticated in the future (in other Admin SDKs we retry up to 4 times, and also retry on 500 and 503 HTTP errors).

@@ -166,8 +166,66 @@ export class HttpError extends Error {
}
}

/**
* Specifies how failing HTTP requests should be retried.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please explain what the different fields mean in this interface (an example would be helpful). For example maxDelayInMillis is per retry and not in total per request, etc. backoff factor in seconds, etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

return this.sendWithRetry(config, attempts + 1);
})
.catch((err: LowLevelError) => {
const [delayMillis, canRetry] = this.getRetryDelayMillis(retryAttempts, err);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The private functions added do not provide enough descriptions on what is being computed or a summary of the underlying logic making it harder to deduce the underlying behavior. It helps to either add more comments here or descriptions of the private functions below.

Copy link
Contributor Author

@hiranya911 hiranya911 May 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think getRetryDelayMillis() is the confusing one, although I feel that the variable names explain what's going on. In any case, I've added a jsdoc comment to the method to further explain it.

}).should.eventually.be.rejectedWith(err).and.have.property('code', 'app/network-error');
});

it('should not retry when for error codes that are not configured', () => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should not retry when error codes are not configured

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


private backOffDelayMillis(retryAttempts: number): number {
if (retryAttempts === 0) {
return 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you not wait on the first retry?
You will end up with a wait pattern: first fail, 0, 2000, 4000
Instead of: first fail, 1000, 2000, 4000
I think the latter has a better likelihood of succeeding with less time overall.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to retain the existing behavior, and also to align with how retries are implemented in other languages/libraries. Basically we only delay retries in the event of consecutive errors. Here's a similar implementation from Python's urllib3 package: https://github.com/urllib3/urllib3/blob/master/src/urllib3/util/retry.py#L222

In general, the existing strategy of immediately retrying on the first error has worked well for us so far. This is particularly useful in environments like GCF where low-level transient errors seem to be common.

expect(resp.data).to.deep.equal(respData);
expect(resp.isJson()).to.be.true;
expect(delayStub.callCount).to.equal(1);
expect(delayStub.args[0][0]).to.be.gt(27 * 1000).and.to.be.lte(30 * 1000);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How was 27 * 1000 determined?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Managed to make the assertion exact by using sinon fake timers.

@hiranya911
Copy link
Contributor Author

Thanks @bojeil-google. Made most of the suggested changes. Ready for another look.

/** Maximum number of times to retry a given request. */
maxRetries: number;

/** HTTP status codes taht should be retried. */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

ioErrorCodes?: string[];

/**
* The multiplier for exponential back off. The retry delay is calculated using the formula `(2^n) * backOffFactor`,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clarify that it's in seconds
The retry delay is calculated using the formula (2^n) * backOffFactor seconds,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants