S3 upload retrying #674

Closed
rajyan opened this issue May 12, 2023 · 6 comments · Fixed by #675

@rajyan
Contributor

rajyan commented May 12, 2023

We're currently using CarrierWave with fog-aws to upload a large number of files to S3 every day.
However, we intermittently encounter errors such as 500, 503, and Excon::Error::Socket.
We've found similar issues related to S3 retries on GitHub, such as #264 and carrierwaveuploader/carrierwave#2610.

After investigating the issue, we discovered that the fog storage upload request is configured as an idempotent request and is already retried four times by Excon, as explained in this comment. I initially thought that Excon was not retrying on 500 or 503 errors, but that turned out to be wrong; see the conversation below. Additionally, the default retry_interval of 0 seconds may be too short.

Therefore, we changed our Excon configuration for fog storage to the following:

{
  retry_limit: 5,
  retry_interval: 5,
  retry_errors: [
    Excon::Error::Timeout,
    Excon::Error::Socket,
    Excon::Error::HTTPStatus,
    Excon::Error::InternalServerError,
    Excon::Error::ServiceUnavailable,
  ]
}

This change greatly improved our S3 retry behavior, and we haven't encountered an S3 error for a month with this configuration.

Do you think it would be reasonable to change the default configuration for fog storage to this?
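
(Editor's note: a minimal sketch of how such a configuration could be wired up, assuming CarrierWave's fog integration as used by the reporter. The credentials, region, and bucket name are placeholders, not taken from this issue; fog forwards :connection_options to Excon when building the HTTP connection.)

# A minimal sketch (not from this issue): passing the Excon retry options
# above through CarrierWave's fog_credentials. Credentials, region, and
# bucket name below are placeholders.
CarrierWave.configure do |config|
  config.fog_provider = "fog/aws"
  config.fog_credentials = {
    provider:              "AWS",
    aws_access_key_id:     ENV["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key: ENV["AWS_SECRET_ACCESS_KEY"],
    region:                "us-east-1",
    connection_options: {
      retry_limit:    5,
      retry_interval: 5,
      retry_errors: [
        Excon::Error::Timeout,
        Excon::Error::Socket,
        Excon::Error::HTTPStatus, # already covers 500/503, as discussed below
      ],
    },
  }
  config.fog_directory = "my-bucket"
end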

@geemus
Member

geemus commented May 12, 2023

Thanks for the detailed report.

I think the server errors (i.e. 500/503) are just children of the HTTPStatus errors, so they would already be retried, see: https://github.com/excon/excon/blob/1e0c0ee3e975b2470262c6c882dccc752382e03c/lib/excon/error.rb#L83.

Bumping the interval and retries is probably fine, though a nicer solution might be something like exponential backoff (for some reason I kind of thought that was what we already did, but I must be misremembering, as I can't find anything at present that looks like that).

Going from 4 to 5 retries seems fine, but bumping the retry interval from 0 to 5 seconds seems like a much bigger jump. Could we start with something smaller, maybe 1 or 2? As I said above, I kind of feel like backoff might be ideal, but that is more involved in terms of implementation. What do you think?
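
(Editor's note: the class hierarchy mentioned above can be checked in irb with the excon gem installed; the default values shown may differ between Excon versions.)

# Quick check of the hierarchy described above: the 500/503 error classes
# descend from Excon::Error::HTTPStatus, which is already in Excon's
# default retry_errors list. Output may vary by Excon version.
require "excon"

puts Excon::Error::InternalServerError <= Excon::Error::HTTPStatus # => true
puts Excon::Error::ServiceUnavailable  <= Excon::Error::HTTPStatus # => true
p Excon.defaults[:retry_errors]
# => [Excon::Error::Timeout, Excon::Error::Socket, Excon::Error::HTTPStatus]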

@rajyan
Contributor Author

rajyan commented May 12, 2023

I think the server errors (i.e. 500/503) are just children of the HTTPStatus errors, so they would already be retried,

Oh, thanks for pointing that out. You are right, I had missed it.
Then the only relevant changes are to retry_limit and retry_interval.

though a nicer solution might be something like exponential backoff

I totally agree.

Going from 4 to 5 retries seems fine, but bumping the retry interval from 0 to 5 seconds seems like a much bigger jump.

I agree with you.

I have looked into retry mechanisms in other libraries for reference.

https://github.com/boto/botocore/blob/d95d441ef9b81ac4d0163fa97e736edf63a8b4b3/botocore/data/_retry.json#L219-L240
https://github.com/boto/botocore/blob/d95d441ef9b81ac4d0163fa97e736edf63a8b4b3/botocore/retryhandler.py#L43

  • base: rand (random value between 0 and 1)
  • growth_factor: 2
  • max_retry: 5

=> roughly 0.5s, 1s, 2s, 4s, 8s on average (the random base adds jitter to each delay)

https://github.com/aws/aws-sdk-js-v3/blob/864b34a52aea8dbd52a63239027b4c203c75863c/packages/util-retry/src/StandardRetryStrategy.ts#L1

  • DEFAULT_MAX_ATTEMPTS=3
  • DEFAULT_RETRY_DELAY_BASE=100ms

=> roughly 0.1s, 0.2s, 0.4s between retries

Could we start with something smaller, maybe 1 or 2?

I think 1s might be enough.
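
(Editor's note: for illustration, a small Ruby sketch of the botocore-style computation referenced above, mirroring delay_exponential with a random base and growth factor 2; this is not part of fog or Excon.)

# Illustration only: botocore-style exponential backoff with a random base
# (base = rand, growth_factor = 2). Not fog or Excon code.
def delay_exponential(attempt, growth_factor: 2)
  rand * (growth_factor ** (attempt - 1))
end

(1..5).each do |attempt|
  printf("retry %d: sleep %.2fs\n", attempt, delay_exponential(attempt))
end
# on average this works out to roughly 0.5s, 1s, 2s, 4s, 8s across five retries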

@rajyan
Contributor Author

rajyan commented May 12, 2023

{
  retry_limit: 5,
  retry_interval: 1,
}

I'll go ahead and try this config first!
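
(Editor's note: a standalone illustration of where these two options take effect. Excon only retries requests flagged as idempotent, which, per the discussion above, is how fog issues the S3 upload request; the URL below is a placeholder.)

# Standalone Excon illustration (placeholder URL): retry_limit and
# retry_interval only apply to requests marked as idempotent, which is how
# fog sends its S3 upload requests.
require "excon"

connection = Excon.new("https://example.com",
                       retry_limit:    5,
                       retry_interval: 1)
connection.get(idempotent: true) # attempted up to 5 times, sleeping 1s between attempts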

@geemus
Member

geemus commented May 15, 2023

Sounds good, just let me know if you need anything and what you find. Thanks!

rajyan pushed a commit to rajyan/fog-aws that referenced this issue May 15, 2023
@rajyan
Contributor Author

rajyan commented May 15, 2023

This configuration seems to be working well too.
As a note, before these changes we were seeing about 5 errors/day (i.e. Excon errors, meaning the requests had already been retried under the default settings) out of roughly 1 million requests.
After raising retry_limit and retry_interval, we haven't seen any errors for several months.

@geemus
Member

geemus commented May 18, 2023

Sounds good, and thanks for sharing the context (I'm certainly not doing a million requests a day myself, so I had largely missed this error vector).
