S3 upload retrying #674

Closed
rajyan opened this issue May 12, 2023 · 6 comments · Fixed by #675

@rajyan
Contributor

rajyan commented May 12, 2023

We're currently using CarrierWave with fog-aws to upload a large number of files to S3 every day.
However, we intermittently encounter errors such as 500, 503, and Excon::Error::Socket.
We've found similar issues related to S3 retries on GitHub, such as #264 and carrierwaveuploader/carrierwave#2610.

After investigating the issue, we discovered that the fog storage upload request is configured as an idempotent request and is already retried four times by Excon, as explained in this comment. I initially thought that Excon was not retrying on 500 or 503 errors, but that turned out to be wrong; see the conversation below. Additionally, the default retry_interval of 0 seconds may be too short.

Therefore, we changed our Excon configuration for fog storage to the following:

{
  retry_limit: 5,
  retry_interval: 5,
  retry_errors: [
    Excon::Error::Timeout,
    Excon::Error::Socket,
    Excon::Error::HTTPStatus,
    Excon::Error::InternalServerError,
    Excon::Error::ServiceUnavailable,
  ]
}

This change greatly improved our S3 retry behavior, and we haven't encountered an S3 error for a month with this configuration.

Do you think it would be reasonable to change the default configuration for fog storage to this?
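
(Editor's note: a minimal sketch of how such a configuration could be wired up, assuming CarrierWave's fog integration as used by the reporter. The credentials, region, and bucket name are placeholders, not taken from this issue; fog forwards :connection_options to Excon when building the HTTP connection.)

# A minimal sketch (not from this issue): passing the Excon retry options
# above through CarrierWave's fog_credentials. Credentials, region, and
# bucket name below are placeholders.
CarrierWave.configure do |config|
  config.fog_provider = "fog/aws"
  config.fog_credentials = {
    provider:              "AWS",
    aws_access_key_id:     ENV["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key: ENV["AWS_SECRET_ACCESS_KEY"],
    region:                "us-east-1",
    connection_options: {
      retry_limit:    5,
      retry_interval: 5,
      retry_errors: [
        Excon::Error::Timeout,
        Excon::Error::Socket,
        Excon::Error::HTTPStatus, # already covers 500/503, as discussed below
      ],
    },
  }
  config.fog_directory = "my-bucket"
end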

@geemus
Member

geemus commented May 12, 2023

Thanks for the detailed report.

I think the server errors (i.e. 500/503) are just children of the HTTPStatus errors, so they would already be retried, see: https://github.com/excon/excon/blob/1e0c0ee3e975b2470262c6c882dccc752382e03c/lib/excon/error.rb#L83.

Bumping the interval and retries is probably fine, though a nicer solution might be something like exponential backoff (for some reason I kind of thought that was what we already did, but I must be misremembering, as I can't find anything at present that looks like that).

Going from 4 to 5 retries seems fine, but bumping the retry interval from 0 to 5 seconds seems like a much bigger jump. Could we start with something smaller, maybe 1 or 2? As I said above, I kind of feel like backoff might be ideal, but that is more involved in terms of implementation. What do you think?
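
(Editor's note: the class hierarchy mentioned above can be checked in irb with the excon gem installed; the default values shown may differ between Excon versions.)

# Quick check of the hierarchy described above: the 500/503 error classes
# descend from Excon::Error::HTTPStatus, which is already in Excon's
# default retry_errors list. Output may vary by Excon version.
require "excon"

puts Excon::Error::InternalServerError <= Excon::Error::HTTPStatus # => true
puts Excon::Error::ServiceUnavailable  <= Excon::Error::HTTPStatus # => true
p Excon.defaults[:retry_errors]
# => [Excon::Error::Timeout, Excon::Error::Socket, Excon::Error::HTTPStatus]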

@rajyan
Contributor Author

rajyan commented May 12, 2023

I think the server errors (i.e. 500/503) are just children of the HTTPStatus errors, so they would already be retried,

Oh, thanks for pointing that out. You are right, I had missed it.
Then the only relevant changes are to retry_limit and retry_interval.

though a nicer solution might be something like exponential backoff

I totally agree.

Going from 4 to 5 retries seems fine, but bumping the retry interval from 0 to 5 seconds seems like a much bigger jump.

I agree with you.

I have looked into retry mechanisms in other libraries for reference.

https://github.com/boto/botocore/blob/d95d441ef9b81ac4d0163fa97e736edf63a8b4b3/botocore/data/_retry.json#L219-L240
https://github.com/boto/botocore/blob/d95d441ef9b81ac4d0163fa97e736edf63a8b4b3/botocore/retryhandler.py#L43

  • base: rand (random value between 0 and 1)
  • growth_factor: 2
  • max_retry: 5

=> roughly 0.5s, 1s, 2s, 4s, 8s on average (the random base adds jitter to each delay)

https://github.com/aws/aws-sdk-js-v3/blob/864b34a52aea8dbd52a63239027b4c203c75863c/packages/util-retry/src/StandardRetryStrategy.ts#L1

  • DEFAULT_MAX_ATTEMPTS=3
  • DEFAULT_RETRY_DELAY_BASE=100ms

=> roughly 0.1s, 0.2s, 0.4s between retries

Could we start with something smaller, maybe 1 or 2?

I think 1s might be enough.
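
(Editor's note: for illustration, a small Ruby sketch of the botocore-style computation referenced above, mirroring delay_exponential with a random base and growth factor 2; this is not part of fog or Excon.)

# Illustration only: botocore-style exponential backoff with a random base
# (base = rand, growth_factor = 2). Not fog or Excon code.
def delay_exponential(attempt, growth_factor: 2)
  rand * (growth_factor ** (attempt - 1))
end

(1..5).each do |attempt|
  printf("retry %d: sleep %.2fs\n", attempt, delay_exponential(attempt))
end
# on average this works out to roughly 0.5s, 1s, 2s, 4s, 8s across five retries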

@rajyan
Contributor Author

rajyan commented May 12, 2023

{
  retry_limit: 5,
  retry_interval: 1,
}

I'll go ahead and try this config first!
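
(Editor's note: a standalone illustration of where these two options take effect. Excon only retries requests flagged as idempotent, which, per the discussion above, is how fog issues the S3 upload request; the URL below is a placeholder.)

# Standalone Excon illustration (placeholder URL): retry_limit and
# retry_interval only apply to requests marked as idempotent, which is how
# fog sends its S3 upload requests.
require "excon"

connection = Excon.new("https://example.com",
                       retry_limit:    5,
                       retry_interval: 1)
connection.get(idempotent: true) # attempted up to 5 times, sleeping 1s between attempts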

@geemus
Member

geemus commented May 15, 2023

Sounds good, just let me know if you need anything and what you find. Thanks!

rajyan pushed a commit to rajyan/fog-aws that referenced this issue May 15, 2023
@rajyan
Contributor Author

rajyan commented May 15, 2023

This configuration seems to be working well too.
As a note, before these changes we were seeing about 5 errors/day (i.e. Excon errors, meaning the requests had already been retried under the default settings) out of roughly 1 million requests.
After raising retry_limit and retry_interval, we haven't seen any errors for several months.

@geemus
Member

geemus commented May 18, 2023

Sounds good, and thanks for sharing the context (I'm certainly not doing a million requests a day myself, so I had largely missed this error vector).
