
Have s3_get_file use ETag to verify data not corrupted #216

Open
nickrobinson251 opened this issue Sep 29, 2021 · 6 comments

@nickrobinson251
Contributor

nickrobinson251 commented Sep 29, 2021

We have some code that asynchronously downloads files from S3 using s3_get_file, and often when fetching a large batch of files we find that 1 or 2 get truncated (sometimes containing no data, but more often containing some subset of the data, leaving the file corrupt).

In case it's relevant, we see this most often when fetching 100s or 1000s of small (<100 MB) JSON files. Although that could just be because this happens ~0.1% of the time, so we're more likely to notice it when fetching 1000 files at once than when fetching 10 files at once. In any case, it doesn't seem to be connected to large file sizes, since we're seeing it on files ranging from 10 kB to 10 MB.

The code I've seen this happen with is written with asyncmap, so it looks roughly like:

asyncmap(s3paths) do s3path
    local_filename = func(s3path)  # map the S3 path to a local destination path
    s3_get_file(s3path.bucket, s3path.key, local_filename)
    return local_filename
end

(But I'm unsure whether or not this being async affects the probability of data being truncated.)

Could s3_get_file verify that the hash of the downloaded file is as expected, and retry if not?


S3 objects have an ETag as part of their metadata (e.g. it is returned as part of s3_list_objects), which we could use to check that the downloaded data matches what's on S3. This ETag seems to be, in most cases, an MD5 digest of the data, which I take to make it suitable for checking that the data is not corrupted.

https://docs.aws.amazon.com/AmazonS3/latest/API/RESTCommonResponseHeaders.html

Copying from those AWS docs on ETag:

The entity tag represents a specific version of the object. The ETag reflects changes only to the contents of an object, not its metadata. The ETag may or may not be an MD5 digest of the object data. Whether or not it is depends on how the object was created and how it is encrypted as described below:

  • Objects created through the AWS Management Console or by the PUT Object, POST Object, or Copy operation:
    • Objects encrypted by SSE-S3 or plaintext have ETags that are an MD5 digest of their data.
    • Objects encrypted by SSE-C or SSE-KMS have ETags that are not an MD5 digest of their object data.
  • Objects created by either the Multipart Upload or Part Copy operation have ETags that are not MD5 digests, regardless of the method of encryption.

Type: String
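
To illustrate the idea, here is a minimal sketch of such a check, assuming a single-part object stored with SSE-S3 or plaintext (so its ETag is the MD5 of its contents, per the docs above). The helper name verify_etag is hypothetical, the use of the MD5.jl package is my assumption, and the expected ETag would come from e.g. s3_list_objects or s3_get_meta.

using MD5  # provides md5(::IO); this package choice is an assumption

# Hypothetical helper: check a downloaded file against the ETag reported by S3.
# Only valid for single-part, SSE-S3/plaintext objects, whose ETag is an MD5 digest.
function verify_etag(local_filename::AbstractString, expected_etag::AbstractString)
    digest = bytes2hex(open(md5, local_filename))  # MD5 of the file contents, as hex
    return digest == strip(expected_etag, '"')     # S3 wraps the ETag in double quotes
end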

@ericphanson
Member

ericphanson commented Sep 29, 2021

written with asyncmap

are you using the downloads backend btw? I would definitely recommend that,

AWS.DEFAULT_BACKEND[] = AWS.DownloadsBackend()

otherwise you will likely run into HTTP.jl bugs (the HTTP.jl client is not safe for concurrent use). It would also be interesting to know whether or not this is related to your issue.

Note: the DownloadsBackend should be pretty safe for concurrent use with @async (and asyncmap), but is not threadsafe (and neither is the HTTP.jl client)! See JuliaLang/Downloads.jl#110.

@nickrobinson251
Contributor Author

are you using the downloads backend btw? ... the HTTP.jl client is not safe for concurrent use

No, but I suspect we can switch over to use Downloads.jl.

Does AWS.jl have any docs about the "backends"? (I wonder how this interacts with other packages which use AWS.jl?)

Thanks for the suggestion!

@ericphanson
Member

ericphanson commented Sep 29, 2021

Does AWS.jl have any docs about the "backends"?

No, that's an open issue still: JuliaCloud/AWS.jl#426

I wonder how this interact with other packages which use AWS.jl

AWS.DEFAULT_BACKEND is a global setting; setting it to DownloadsBackend() swaps out the default HTTP.jl-based client for a Downloads.jl-based client, which affects everything using AWS.jl. For a more local change, you can pass the backend per-request: the autogenerated AWS API functions accept a params argument, to which you can add a "backend" => AWS.DownloadsBackend() entry, for example. However, AWSS3.jl's methods aren't written to pass general params along. Maybe every AWSS3 method should have a backend keyword argument that it uses whenever it's interacting with the AWS.jl API.
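
A per-request call through the autogenerated API might look roughly like the sketch below; the exact positional-params form and the bucket/key names are my assumptions here, not something from this thread.

using AWS
using AWS: @service
@service S3

# Pass the backend for this request only, via the autogenerated params dict,
# instead of setting AWS.DEFAULT_BACKEND[] globally.
S3.get_object("my-bucket", "path/to/file.json",
              Dict("backend" => AWS.DownloadsBackend()))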

So as you can tell, there isn't a very smooth path here yet. (From my point of view, that's mostly because finding out how broken the HTTP.jl client is came as a surprise, and as soon as I had a workaround with the DownloadsBackend I switched back to doing the thing I was actually trying to do, i.e. download some data concurrently for use in training a model, leaving unfortunately quite a bit of "future work" in terms of ergonomics and polish. Thankfully @christopher-dG fixed a bunch of my bugs, got tests passing, and made it actually mergeable into AWS.jl -- though I was already using it at that point 😈.)

(That being said, the DownloadsBackend passes all of AWS.jl's tests, and is tested in CI along with the HTTPBackend, and should work for any service, so changing it globally "should" be OK. Of course, bugs can and will exist).

@oxinabox
Contributor

Regardless of the backend, and other things to decrease corruption:
Verifying that the data is not corrupted is good.
We should do that.
We should probably expose a few settings like retry, warn, error, nocheck. I don't know what the right default is.

@ericphanson
Member

ericphanson commented Sep 29, 2021

Verifying that the data is not corrupted is good.
We should do that.

💯

We should probably expose a few settings like retry, warn, error, nocheck. I don't know what the right default is.

Maybe retry? And then error if we don't get it after a few retries. That seems to be what we do for most things that could be a network error.

@oxinabox
Contributor

Yeah, I think it might be worth implementing retry_forever, which re-calls this code with the same arguments (and eventually stack-overflows),
and retry_once, which calls the same function but with the setting changed to error.
And retry_once would be the default.
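
A rough sketch of the shape being discussed. All names here are hypothetical, not existing AWSS3.jl API: the wrapper, the on_mismatch keyword, and the verify_etag helper from the sketch earlier in the thread; it also assumes s3_get_meta returns the "ETag" header.

using AWSS3

# Hypothetical wrapper around s3_get_file that verifies the ETag and reacts
# according to a setting: :nocheck, :warn, :error, :retry_once, or :retry_forever.
function s3_get_file_checked(bucket, key, filename; on_mismatch=:retry_once)
    s3_get_file(bucket, key, filename)
    on_mismatch === :nocheck && return filename
    expected = s3_get_meta(bucket, key)["ETag"]
    verify_etag(filename, expected) && return filename
    if on_mismatch === :warn
        @warn "ETag mismatch" bucket key
    elseif on_mismatch === :error
        error("ETag mismatch for s3://$bucket/$key")
    elseif on_mismatch === :retry_once
        # retry once, erroring if the second attempt also mismatches
        return s3_get_file_checked(bucket, key, filename; on_mismatch=:error)
    elseif on_mismatch === :retry_forever
        # re-call with the same arguments (and eventually stack-overflow)
        return s3_get_file_checked(bucket, key, filename; on_mismatch=:retry_forever)
    end
    return filename
end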
