Have `s3_get_file` use `ETag` to verify data not corrupted #216
Comments
Are you using the Downloads backend, btw? I would definitely recommend that: `AWS.DEFAULT_BACKEND[] = AWS.DownloadsBackend()`. Otherwise you will likely run into HTTP.jl bugs (because the HTTP.jl client is not safe for concurrent use). Would be interesting to know if this is related to your issues or not as well. Note: the DownloadsBackend should be pretty safe for concurrent use.
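For reference, switching the backend globally is just that one assignment, e.g.:

```julia
using AWS

# Use the Downloads.jl-based backend instead of the default HTTP.jl client
# (the setting recommended above).
AWS.DEFAULT_BACKEND[] = AWS.DownloadsBackend()
```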
No, but I suspect we can switch over to use Downloads.jl. Does AWS.jl have any docs about the "backends"? (I wonder how this interacts with other packages which use AWS.jl?) Thanks for the suggestion!
No, that's an open issue still: JuliaCloud/AWS.jl#426
So as you can tell, there isn't a very smooth path here yet. (From my point of view, that's mostly because finding out how broken the HTTP.jl client is was a surprise, and as soon as I had a workaround with the DownloadsBackend I switched back to doing the thing I was actually trying to do, i.e. download some data concurrently for use in training a model, leaving unfortunately quite a bit of "future work" in terms of ergonomics and polish. Thankfully @christopher-dG fixed a bunch of my bugs and got tests passing, making it actually mergeable into AWS.jl, though I was already using it at that point 😈.) (That being said, the DownloadsBackend passes all of AWS.jl's tests, is tested in CI along with the HTTPBackend, and should work for any service, so changing it globally "should" be OK. Of course, bugs can and will exist.)
Regardless of back-end, it seems worth verifying the downloaded data, and doing other things to decrease corruption.
💯
Maybe retry? And then error if we don't get it in a few retries. That seems to be what we do for most things that could be a network error.
Yeah, I think it might be worth implementing that.
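A rough sketch of that retry-then-error idea, using Base's `retry` (here `verified_get` is a hypothetical function that downloads the object and throws if the data fails an integrity check):

```julia
# Retry the hypothetical `verified_get(bucket, key, path)` up to 3 times with
# exponential backoff; if every attempt throws, the last exception propagates.
download_with_retries = retry(verified_get; delays=ExponentialBackOff(n=3))
download_with_retries(bucket, key, path)
```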
We have some code which async-downloads files from S3 using `s3_get_file`, and often when fetching a large batch of files we find that 1 or 2 get truncated (sometimes containing no data, but more often containing some subset of the data, making the file corrupt).

In case it's relevant, we see this most often when fetching 100s or 1000s of small (<100 MB) JSON files. Although that could just be because this happens ~0.1% of the time, and so we're more likely to notice it when fetching 1000 files at once than when fetching 10 files at once. But anyway, it doesn't seem to be connected to large file sizes (since we're seeing it on files ranging from 10 kB to 10 MB).
The code I've seen this happen with is written with `asyncmap`, so it looks roughly like the sketch below (but I'm unsure whether or not this being async affects the probability of data being truncated).
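A minimal sketch of that pattern, with `bucket`, `keys`, and `local_dir` as placeholder names (not the actual code):

```julia
using AWSS3

# Concurrently download each object to a local file.
asyncmap(keys) do key
    s3_get_file(bucket, key, joinpath(local_dir, basename(key)))
end
```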
Could `s3_get_file` verify that the hash of the downloaded file is as expected, and retry if not?

S3 objects have an `ETag` as part of their metadata (e.g. it is returned as part of `s3_list_objects`), which we could use to check that the downloaded data matches what's on S3. This `ETag` seems to be essentially an MD5 hash (at least for objects not uploaded via multipart upload), which I take to make it suitable for checking that the data is not corrupted.

https://docs.aws.amazon.com/AmazonS3/latest/API/RESTCommonResponseHeaders.html
The linked AWS docs describe the `ETag` header in more detail.
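For what it's worth, a rough sketch of what such a check could look like on top of the existing AWSS3.jl functions. This assumes `s3_get_meta` returns the response headers with an `"ETag"` entry and uses the MD5.jl package for hashing; note the ETag is only a plain MD5 digest for objects that were not uploaded via multipart upload (and not encrypted with SSE-KMS):

```julia
using AWSS3
using MD5  # assumed here for the `md5` digest; any MD5 implementation would do

# Download the object, then compare its ETag against the MD5 of the local
# file, erroring on a mismatch. Could be combined with the retry pattern
# discussed in the comments above.
function s3_get_file_checked(bucket, key, path)
    s3_get_file(bucket, key, path)
    expected = strip(s3_get_meta(bucket, key)["ETag"], '"')
    actual = bytes2hex(md5(read(path)))
    actual == expected || error("ETag mismatch for $key: expected $expected, got $actual")
    return path
end
```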