
Have s3_get_file use ETag to verify data not corrupted #216

Open
nickrobinson251 opened this issue Sep 29, 2021 · 6 comments

@nickrobinson251
Contributor

nickrobinson251 commented Sep 29, 2021

We have some code that asynchronously downloads files from S3 using s3_get_file, and often when fetching a large batch of files we find that 1 or 2 get truncated (sometimes containing no data, but more often containing some subset of the data, leaving the file corrupt).

In case it's relevant, we see this most often when fetching 100s or 1000s of small (<100 MB) JSON files. Although that could just be because this happens ~0.1% of the time, so we're more likely to notice it when fetching 1000 files at once than when fetching 10 files at once. In any case, it doesn't seem to be connected to large file sizes, since we're seeing it on files ranging from 10 kB to 10 MB.

The code I've seen this happen with is written with asyncmap, so it looks roughly like:

asyncmap(s3paths) do s3path
    local_filename = func(s3path)  # map the S3 path to a local destination path
    s3_get_file(s3path.bucket, s3path.key, local_filename)
    return local_filename
end

(But I'm unsure whether or not this being async affects the probability of data being truncated.)

Could s3_get_file verify that the hash of the downloaded file is as expected, and retry if not?


S3 objects have an ETag as part of their metadata (e.g. it is returned as part of s3_list_objects), which we could use to check that the downloaded data matches what's on S3. This ETag seems to be, in most cases, an MD5 digest of the data, which I take to make it suitable for checking that the data is not corrupted.

https://docs.aws.amazon.com/AmazonS3/latest/API/RESTCommonResponseHeaders.html

Copying from those AWS docs on ETag:

The entity tag represents a specific version of the object. The ETag reflects changes only to the contents of an object, not its metadata. The ETag may or may not be an MD5 digest of the object data. Whether or not it is depends on how the object was created and how it is encrypted as described below:

  • Objects created through the AWS Management Console or by the PUT Object, POST Object, or Copy operation:
    • Objects encrypted by SSE-S3 or plaintext have ETags that are an MD5 digest of their data.
    • Objects encrypted by SSE-C or SSE-KMS have ETags that are not an MD5 digest of their object data.
  • Objects created by either the Multipart Upload or Part Copy operation have ETags that are not MD5 digests, regardless of the method of encryption.

Type: String
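
To illustrate the idea, here is a minimal sketch of such a check, assuming a single-part object stored with SSE-S3 or plaintext (so its ETag is the MD5 of its contents, per the docs above). The helper name verify_etag is hypothetical, the use of the MD5.jl package is my assumption, and the expected ETag would come from e.g. s3_list_objects or s3_get_meta.

using MD5  # provides md5(::IO); this package choice is an assumption

# Hypothetical helper: check a downloaded file against the ETag reported by S3.
# Only valid for single-part, SSE-S3/plaintext objects, whose ETag is an MD5 digest.
function verify_etag(local_filename::AbstractString, expected_etag::AbstractString)
    digest = bytes2hex(open(md5, local_filename))  # MD5 of the file contents, as hex
    return digest == strip(expected_etag, '"')     # S3 wraps the ETag in double quotes
end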

@ericphanson
Member

ericphanson commented Sep 29, 2021

written with asyncmap

are you using the downloads backend btw? I would definitely recommend that,

AWS.DEFAULT_BACKEND[] = AWS.DownloadsBackend()

otherwise you will likely run into HTTP.jl bugs (the HTTP.jl client is not safe for concurrent use). It would also be interesting to know whether or not this is related to your issue.

Note: the DownloadsBackend should be pretty safe for concurrent use with @async (and asyncmap), but is not threadsafe (and neither is the HTTP.jl client)! See JuliaLang/Downloads.jl#110.

@nickrobinson251
Contributor Author

are you using the downloads backend btw? ... the HTTP.jl client is not safe for concurrent use

No, but I suspect we can switch over to use Downloads.jl.

Does AWS.jl have any docs about the "backends"? (I wonder how this interacts with other packages which use AWS.jl?)

Thanks for the suggestion!

@ericphanson
Member

ericphanson commented Sep 29, 2021

Does AWS.jl have any docs about the "backends"?

No, that's an open issue still: JuliaCloud/AWS.jl#426

I wonder how this interact with other packages which use AWS.jl

AWS.DEFAULT_BACKEND is a global setting; setting it to DownloadsBackend() swaps out the default HTTP.jl-based client for a Downloads.jl-based client, which affects everything using AWS.jl. For a more local change, you can pass the backend per-request: the autogenerated AWS API functions accept a params argument, to which you can add a "backend" => AWS.DownloadsBackend() entry, for example. However, AWSS3.jl's methods aren't written to pass general params along. Maybe every AWSS3 method should have a backend keyword argument that it uses whenever it's interacting with the AWS.jl API.
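
A per-request call through the autogenerated API might look roughly like the sketch below; the exact positional-params form and the bucket/key names are my assumptions here, not something from this thread.

using AWS
using AWS: @service
@service S3

# Pass the backend for this request only, via the autogenerated params dict,
# instead of setting AWS.DEFAULT_BACKEND[] globally.
S3.get_object("my-bucket", "path/to/file.json",
              Dict("backend" => AWS.DownloadsBackend()))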

So as you can tell, there isn't a very smooth path here yet. (From my point of view, that's mostly because finding out how broken the HTTP.jl client is came as a surprise, and as soon as I had a workaround with the DownloadsBackend I switched back to doing the thing I was actually trying to do, i.e. download some data concurrently for use in training a model, leaving unfortunately quite a bit of "future work" in terms of ergonomics and polish. Thankfully @christopher-dG fixed a bunch of my bugs, got tests passing, and made it actually mergeable into AWS.jl -- though I was already using it at that point 😈.)

(That being said, the DownloadsBackend passes all of AWS.jl's tests, and is tested in CI along with the HTTPBackend, and should work for any service, so changing it globally "should" be OK. Of course, bugs can and will exist).

@oxinabox
Contributor

Regardless of the backend, and other things to decrease corruption:
Verifying that the data is not corrupted is good.
We should do that.
We should probably expose a few settings like retry, warn, error, nocheck. I don't know what the right default is.

@ericphanson
Member

ericphanson commented Sep 29, 2021

Verifying that the data is not corrupted is good.
We should do that.

💯

We should probably expose a few settings like retry, warn, error, nocheck. I don't know what the right default is.

Maybe retry? And then error if we don't get it after a few retries. That seems to be what we do for most things that could be a network error.

@oxinabox
Contributor

Yeah, I think it might be worth implementing retry_forever, which re-calls this code with the same arguments (and eventually stack-overflows),
and retry_once, which calls the same function but with the setting changed to error.
And retry_once would be the default.
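
A rough sketch of the shape being discussed. All names here are hypothetical, not existing AWSS3.jl API: the wrapper, the on_mismatch keyword, and the verify_etag helper from the sketch earlier in the thread; it also assumes s3_get_meta returns the "ETag" header.

using AWSS3

# Hypothetical wrapper around s3_get_file that verifies the ETag and reacts
# according to a setting: :nocheck, :warn, :error, :retry_once, or :retry_forever.
function s3_get_file_checked(bucket, key, filename; on_mismatch=:retry_once)
    s3_get_file(bucket, key, filename)
    on_mismatch === :nocheck && return filename
    expected = s3_get_meta(bucket, key)["ETag"]
    verify_etag(filename, expected) && return filename
    if on_mismatch === :warn
        @warn "ETag mismatch" bucket key
    elseif on_mismatch === :error
        error("ETag mismatch for s3://$bucket/$key")
    elseif on_mismatch === :retry_once
        # retry once, erroring if the second attempt also mismatches
        return s3_get_file_checked(bucket, key, filename; on_mismatch=:error)
    elseif on_mismatch === :retry_forever
        # re-call with the same arguments (and eventually stack-overflow)
        return s3_get_file_checked(bucket, key, filename; on_mismatch=:retry_forever)
    end
    return filename
end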
