S3 uploads Reader vs. ReadSeeker (Chunked Upload?) #142
related: #81 (comment)
It's likely you will still need to pass an io.ReadSeeker.
Yes. You should actually be making this decision at the 5 MB mark, since that is the minimum allowed part size for a multipart upload. This is a feature we typically add on top of the core API in our SDKs, known as the S3 transfer manager, or upload manager. We plan on adding this to the Go SDK as well. Once this functionality is added, the SDK will be able to accept non-seekable streams for this operation and decide, based on the buffered input length, whether a multipart upload is possible (i.e., whether the payload is > 5 MB). Multipart uploads are generally preferred over simple PutObjects for a few reasons, chief among them the ability to retry individual parts and to upload parts in parallel.
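To make that decision point concrete, here is a minimal sketch (not the SDK's actual implementation) of how an upload manager could buffer just past the 5 MB minimum from a non-seekable stream and then choose between a single PutObject and a multipart upload. The helper name and callback shapes are illustrative assumptions.

```go
package s3upload

import (
	"bytes"
	"io"
)

const minPartSize = 5 * 1024 * 1024 // S3's minimum multipart part size

// decideUpload buffers up to minPartSize+1 bytes from a non-seekable
// stream to determine whether a single PutObject is sufficient.
// putObject and multipart stand in for whatever performs the actual calls.
func decideUpload(body io.Reader, putObject, multipart func(io.Reader) error) error {
	buf := make([]byte, minPartSize+1)
	n, err := io.ReadFull(body, buf)
	switch err {
	case io.EOF, io.ErrUnexpectedEOF:
		// The entire payload fit within the buffer: a simple PutObject will do.
		return putObject(bytes.NewReader(buf[:n]))
	case nil:
		// More data remains: stitch the buffered prefix back onto the
		// stream and hand it to the multipart path.
		return multipart(io.MultiReader(bytes.NewReader(buf[:n]), body))
	default:
		return err
	}
}
```

Buffering one byte past the minimum is enough to know whether the stream has more data without committing to a full second part.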
To go back to the chunked upload point: once you're using multipart uploads, you don't need to worry about chunked signing, since you're doing the chunking locally as part of the UploadPart() operation. Once you have an individual part to upload, you likely have that data in memory and can seek on it as desired. Basically, the plan going forward is to implement a managed uploader, which should alleviate most of the issues around signing and seekable streams while also making the operation much more robust and performant.
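As a rough illustration of the part loop described above (again a sketch, not the SDK's code): each part is buffered into memory and wrapped in a bytes.Reader, so the part itself is seekable and can be retried. The uploadPart callback is a placeholder for whatever actually calls the S3 UploadPart operation.

```go
package s3upload

import (
	"bytes"
	"io"
)

// uploadInParts reads fixed-size chunks from a non-seekable stream.
// Each buffered chunk is wrapped in a bytes.Reader, so the part itself
// is seekable and can be re-signed and retried on failure.
func uploadInParts(body io.Reader, partSize int, uploadPart func(partNum int, part io.ReadSeeker) error) error {
	buf := make([]byte, partSize)
	for partNum := 1; ; partNum++ {
		n, err := io.ReadFull(body, buf)
		if n > 0 {
			if uerr := uploadPart(partNum, bytes.NewReader(buf[:n])); uerr != nil {
				return uerr
			}
		}
		if err == io.EOF || err == io.ErrUnexpectedEOF {
			return nil // stream exhausted
		}
		if err != nil {
			return err
		}
	}
}
```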
Also, if you want to help out and do end up writing something similar to what was described above, we'd be happy to look at it if you sent it back to us as a PR. You can look at some of the other SDKs for reference implementations. The Java and JS implementations are linked above, but we have others, and I can track them down if you're interested.
Thanks Loren. The 100 MB figure comes from the AWS documentation, which recommends considering multipart uploads once an object reaches about 100 MB.
It's the only recommendation I had found before. I did some experiments with multipart uploads last month, but using https://github.com/kr/s3 rather than this library. In my tests, I actually didn't see any issues from AWS when uploading a single-part file < 5 MB. However, I was buffering 5 MB parts, not for V4 signing, but because I didn't have the Content-Length at the time.

5 MB buffers are much more than I would like (5 MB * 200 concurrent uploads, where uploads can take a few minutes, = 1 GB).

I don't understand how this upload manager will make non-seekable streams possible, unless the intention is to buffer 5 MB parts to do the V4 signing calculation? A ReadSeeker could avoid that overhead by reading through the data twice instead, right?

To receive the benefits of parallel uploads and recovery, I would need to write the request.Body to disk and then read it back again to get a ReadSeeker. Writing a temp file to disk is something I've been trying to avoid so far, but perhaps I need to re-evaluate that decision (at least for larger files).
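For reference, the temp-file approach mentioned at the end of that comment might look roughly like this; the helper name and cleanup strategy are just one possible arrangement, not anything the SDK provides.

```go
package s3upload

import (
	"io"
	"io/ioutil"
	"os"
)

// spoolToTempFile copies a non-seekable stream to a temporary file and
// returns it positioned at the start, ready to be used as an io.ReadSeeker.
// The caller is responsible for closing and removing the file.
func spoolToTempFile(body io.Reader) (*os.File, error) {
	f, err := ioutil.TempFile("", "s3-upload-")
	if err != nil {
		return nil, err
	}
	if _, err := io.Copy(f, body); err != nil {
		f.Close()
		os.Remove(f.Name())
		return nil, err
	}
	// Rewind so the file can be read (and re-read on retry) from the beginning.
	if _, err := f.Seek(0, 0); err != nil {
		f.Close()
		os.Remove(f.Name())
		return nil, err
	}
	return f, nil
}
```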
I agree that chunked uploads are complex, but a chunk size of 8 KB or even 64 KB is much more appealing than a 5 MB part size. If chunked uploads are added to the upload manager later, how would that be done? Would there be a chunked variant of PutObject, or of the aws request's Send?
Buffer sizes would be configurable as they are in other SDKs.
Exactly. Note that buffering in this solution is not done for signing, but for the general problem of not knowing exactly how much data needs to be sent in the multipart upload (the "content length" problem), or whether you can even send a multipart upload in the first place, and of course for retries. You need to pick a buffer size and buffer into memory or disk either way if you want any of these benefits. The fact that you now have a complete buffered object just makes signing that much easier.

Chunked signing could reduce memory overhead somewhat by streaming as you send the multiple parts, but you still need to do at least 5 MB of buffering to determine whether you are sending a single part or not, and you would lose the ability to retry, which is an extremely commonly requested feature, especially as your files get larger. If memory use is still a concern, an option to buffer to disk would be a much more manageable implementation than chunked signing, and that could be supported by the SDK. With the prevalence of SSDs these days, the cost would not be terribly significant (your bottleneck would still be the network).
You could buffer into memory as well.
Good point with regard to SSDs. You've given me a fair bit to chew on. Thanks.
Develop was merged into master today and with it comes a few changes. One of the changes was an
That seems fair. Thanks.
This new package supports managed uploads to S3 through the s3manager.Upload() function, which intelligently switches to a multipart upload when the payload exceeds a certain part size. In this case, the upload is performed concurrently across a configurable number of goroutines. The Upload() operation also allows non-seekable Readers to be passed in, effectively allowing streaming uploads of unbounded payloads to S3.

Sample usage:

```go
func main() {
	fd, err := os.Open("path/to/file")
	if err != nil {
		panic(err)
	}

	// Create an S3 client and upload the file.
	svc := s3.New(nil)
	resp, err := s3manager.Upload(svc, &s3manager.UploadInput{
		Bucket: aws.String("bucket"),
		Key:    aws.String("path/to/file"),
		Body:   fd,
	}, nil)
	if err != nil {
		panic(err)
	}

	fmt.Println(awsutil.StringValue(resp))
}
```

References #142, #207
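For the streaming case discussed earlier in this thread, usage might look roughly like the following. The handler, bucket, and key are placeholders, the call mirrors the pre-1.0 sample above rather than any later SDK release, and the import paths are an assumption about the repository layout at the time.

```go
package server

import (
	"net/http"

	// NOTE: import paths are assumptions for the pre-1.0 SDK layout; adjust to your SDK version.
	"github.com/awslabs/aws-sdk-go/aws"
	"github.com/awslabs/aws-sdk-go/service/s3"
	"github.com/awslabs/aws-sdk-go/service/s3/s3manager"
)

// uploadHandler streams an incoming request body straight to S3.
// r.Body is a non-seekable io.ReadCloser, which Upload() accepts.
func uploadHandler(w http.ResponseWriter, r *http.Request) {
	svc := s3.New(nil)
	_, err := s3manager.Upload(svc, &s3manager.UploadInput{
		Bucket: aws.String("bucket"),
		Key:    aws.String("uploads/object"),
		Body:   r.Body,
	}, nil)
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	w.WriteHeader(http.StatusCreated)
}
```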
@nathany you can check out the s3manager package.
Glad to hear that's landed, but sorry, I don't have time at the moment. I ended up writing my own little S3 library that uses v2 signing and is specific to the purpose I need (which is perhaps a bit unusual).
@lsegal sorry for commenting on a closed issue, but it wasn't totally clear to me what exactly we are losing by wrapping a plain reader to satisfy the ReadSeeker requirement. Is there anything else?
I'm giving up on using this SDK. It wasn't designed to stream multipart requests from client -> server -> S3 and requires buffering an entire file on the server in order to send it to S3, which makes no sense.
Ok, I went back and implemented it using the s3manager. For downloading I had to use the regular S3 API, as the managed download path didn't fit my streaming case. Thanks to @jasdel for commenting back!
Glad that solution worked for you @c4milo! let us know if you have any additional questions, feedback, or issues. Agreed, for the download streaming case the vanilla One thing that might be helpful in your use case is for your service to vend temporary pre-signed PutObject URLs to the clients. and the client uploads the content your S3 bucket it self. In addition for download you could use the similar pattern but redirect the client to a pre-signed S3 download URL. If your use case doesn't require modification of the content of the stream and have control of how the clients perform the upload this pattern would help reduce the load on your service. |
On the develop branch both PutObjectInput and UploadPartInput require an io.ReadSeeker. Unfortunately, I don't always have a ReadSeeker available, for example when streaming the Request Body up to S3, which is a ReadCloser.

I saw there was some previous work on this in #87 and d3ffc81, but I don't see something like the aws.NonSeekable(reader) that @lsegal suggested. And actually, I don't want to perform a ReadAll just to calculate a signature, as the files may be quite large.

Since this is using V4 signing, it seems like Chunked Upload is necessary to make signing work without needing a Seeker.

Since I have the Content-Length, it seems like I should be able to decide between PutObject and MultipartUpload based on size (e.g. if > 100 MB, use multipart). Either should work.

Happy to contribute where I can. Is this something that would be possible, considered, and desired?