
S3 uploads Reader vs. ReadSeeker (Chunked Upload?) #142

Closed
nathany opened this issue Mar 18, 2015 · 17 comments
Labels
guidance Question that needs advice or information.

Comments

@nathany

nathany commented Mar 18, 2015

On the develop branch both PutObjectInput and UploadPartInput require an io.ReadSeeker. Unfortunately I don't always have a ReadSeeker available, for example, when streaming the Request Body up to S3, which is a ReadCloser.

I saw there was some previous work on this in #87 and d3ffc81, but I don't see something like the aws.NonSeekable(reader) that @lsegal suggested. And actually, I don't want to perform a ReadAll just to calculate a signature, as the files may be quite large.

Since this is using V4 signing, it seems like Chunked Upload is necessary to make signing work without needing a Seeker.

Since I have the Content-Length, it seems like I should be able to decide between PutObject and MultipartUpload based on size (e.g., if > 100 MB, use multipart). Either should work.

Happy to contribute where I can. Is this something that would be possible and desirable?

@nathany changed the title from "S3 uploads Read vs. ReadSeeker (Chunked Upload?)" to "S3 uploads Reader vs. ReadSeeker (Chunked Upload?)" on Mar 18, 2015
@nathany
Author

nathany commented Mar 18, 2015

related: #81 (comment)

@lsegal
Contributor

lsegal commented Mar 18, 2015

The NonSeekable work has not yet been started. This is something we're going to start looking at once internals are refactored. We will also be able to look at other improvements for S3 that will make code more performant in these cases.

It's likely you will still need to pass an io.ReadSeeker to the low-level API calls, but there will be a wrapper for this, likely something more akin to a ReadSeekCloser. This is because we will probably not be implementing chunked upload immediately if we can avoid it; chunked uploading adds a significant amount of complexity that in practice is not often necessary (see below for why).

Since I have the Content-Length, it seems like I should be able to decide between PutObject and MultipartUpload based on size

Yes. You should actually be making this decision at the 5 MB mark, since this is the minimum allowed part size for a multipart upload. This is a feature we typically add on top of the core API in our SDKs, known as the S3 transfer manager, or upload manager. We plan on adding this to the Go SDK as well. Once this functionality is added, the SDK will be able to accept non-seekable streams for this operation and decide based on the buffered input length whether a multipart upload is possible (i.e., if the payload is > 5 MB).

Multipart uploads are generally preferred over simple PutObjects for a few reasons:

  1. Each individual part can be retried and recovered independently, so a bad connection 75% of the way through your upload won't force you to resend the first 75% of the file, just the current part.
  2. More importantly, each individual part can be sent in parallel (using a pool), which allows you to utilize full throughput and drastically improves upload performance in our tests across SDKs.

To go back to the chunked upload point: once you're using multipart uploads, you don't need to worry about chunked signing, since you're doing the chunking as part of the UploadPart() operation locally. Once you have an individual part to upload, you likely have that data in memory and can seek on it as desired.

Basically, the plan going forward will be to implement a managed uploader. That should alleviate most of the issues around signing, seekable streams, and other issues while also making the operation much more robust and performant.
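
To sketch what that buffer-each-part approach looks like against the low-level API (a rough illustration only, not the transfer manager: the bucket, key, and uploadID would come from an earlier CreateMultipartUpload call, and the helper name and import paths are assumptions):

	package example

	import (
		"bytes"
		"io"

		"github.com/aws/aws-sdk-go/aws"
		"github.com/aws/aws-sdk-go/service/s3"
	)

	// uploadParts reads 5 MB chunks from a non-seekable stream and uploads each
	// chunk as its own part. Wrapping each chunk in a bytes.Reader gives the
	// signer the io.ReadSeeker it needs and lets a failed part be retried
	// without rereading the rest of the stream.
	func uploadParts(svc *s3.S3, bucket, key string, uploadID *string, body io.Reader) ([]*s3.CompletedPart, error) {
		buf := make([]byte, 5<<20) // 5 MB minimum part size
		var parts []*s3.CompletedPart
		for num := int64(1); ; num++ {
			n, err := io.ReadFull(body, buf)
			if n > 0 {
				out, uerr := svc.UploadPart(&s3.UploadPartInput{
					Bucket:     aws.String(bucket),
					Key:        aws.String(key),
					UploadId:   uploadID,
					PartNumber: aws.Int64(num),
					Body:       bytes.NewReader(buf[:n]),
				})
				if uerr != nil {
					return nil, uerr
				}
				parts = append(parts, &s3.CompletedPart{
					ETag:       out.ETag,
					PartNumber: aws.Int64(num),
				})
			}
			if err == io.EOF || err == io.ErrUnexpectedEOF {
				return parts, nil // stream exhausted; CompleteMultipartUpload comes next
			}
			if err != nil {
				return nil, err
			}
		}
	}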

@lsegal
Contributor

lsegal commented Mar 18, 2015

Also, if you want to help out and do end up writing something similar to what was described above, we'd be happy to look at it if you sent it back to us as a PR. You can look at some of the other SDKs for reference implementations. The Java and JS implementations are linked to above, but we have others, and I can track them down if you're interested.

@nathany
Author

nathany commented Mar 18, 2015

Thanks Loren.

The 100 MB figure comes from this statement in the AWS documentation:

We encourage Amazon S3 customers to use Multipart Upload for objects greater than 100 MB.

It's the only recommendation I had found before.

I did some experiments with Multipart uploads last month, but using https://github.com/kr/s3 rather than this library. In my tests, I actually didn't see any issues from AWS when uploading a single-part file < 5 MB.

However, I was buffering 5 MB parts, not for V4 signing, but because I didn't have the Content-Length at the time. 5 MB buffers are much more than I would like (5 MB * 200 concurrent uploads, where uploads can take a few minutes, adds up to 1 GB).

I don't understand how this upload manager will make non-seekable streams possible. Unless the intention is to buffer 5 MB parts to do the V4 signing calculation? A ReadSeeker could avoid that overhead by reading through the data twice instead, right?

To receive the benefits of parallel uploads and recovery, I would need to write the request.Body to disk and then read it back again to get a ReadSeeker. Writing a temp file to disk is something I've been trying to avoid so far, but perhaps I need to re-evaluate that decision (at least for larger files).

@nathany
Author

nathany commented Mar 18, 2015

I agree that chunked uploads are complex, but a chunk size of 8 KB or even 64 KB is much more appealing than a 5 MB part size.

If chunked uploads are added to the upload manager later, how would that be done? Would there be a chunked variant of PutObject? Or of the aws request Send?

@lsegal
Contributor

lsegal commented Mar 19, 2015

5 MB buffers are much more than I would like (5 MB * 200 concurrent uploads, where uploads can take a few minutes, adds up to 1 GB)

Buffer sizes would be configurable as they are in other SDKs.

I don't understand how this upload manager will make non-seekable streams possible. Unless the intention is to buffer 5 MB parts to do the V4 signing calculation?

Exactly. Note that buffering in this solution is not done for signing, but for the general problem of not knowing exactly how much data needs to be sent in the multipart upload (the "content length" problem), or whether you're even able to send a multipart upload in the first place, and of course for retries. You need to pick a buffer size and buffer into memory/disk either way if you want any of these benefits. The fact that you then have a complete buffered object just makes signing that much easier.

Chunked signing could reduce memory overhead somewhat by streaming as you send the multiple parts, but you still need to do at least 5 MB of buffering to determine whether you are sending a single part or not, and you would lose the ability to retry, which is an extremely commonly requested feature, especially as your files get larger. If memory use is still a concern, an option to buffer to disk would be a much more manageable implementation than chunked signing, and that could be supported by the SDK. With the prevalence of SSDs these days, the cost would not be terribly significant (your bottleneck would still be the network).

To receive the benefits of parallel uploads and recovery, I would need to write the request.Body to disk and then read it back again to get a ReadSeeker.

You could buffer into memory as well.
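
To make that buffering decision concrete, here's a minimal sketch (the sniffSize helper and the io.ReadFull approach are purely illustrative, not how the SDK implements it):

	package example

	import (
		"bytes"
		"io"
	)

	// minPartSize is the 5 MB multipart minimum discussed above.
	const minPartSize = 5 << 20

	// sniffSize buffers up to 5 MB from a non-seekable stream. It returns the
	// buffered prefix as a seekable reader and reports whether more data
	// remains, i.e. whether a multipart upload is needed at all.
	func sniffSize(r io.Reader) (prefix *bytes.Reader, multipart bool, err error) {
		buf := make([]byte, minPartSize)
		n, err := io.ReadFull(r, buf)
		switch err {
		case io.EOF, io.ErrUnexpectedEOF:
			// The whole payload fit in one buffer: a single PutObject will do.
			return bytes.NewReader(buf[:n]), false, nil
		case nil:
			// More data remains: the caller should start a multipart upload,
			// sending this prefix as part 1 and buffering further parts.
			return bytes.NewReader(buf[:n]), true, nil
		default:
			return nil, false, err
		}
	}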

@nathany
Author

nathany commented Mar 19, 2015

Good point with regards to SSDs. You've given me a fair bit to chew on. Thanks.

@lsegal added the guidance and develop labels on Mar 23, 2015
@lsegal
Contributor

lsegal commented Mar 26, 2015

Develop was merged into master today, and with it come a few changes. One of them is an aws.ReadSeekCloser(io.Reader) function, which is similar to NopCloser() except that it passes operations (Read, Seek, and Close) through to the underlying type if the underlying type supports them. This should allow non-seekable types to be passed in, but note that they won't sign correctly. There probably needs to be more investigation into how that should be handled (and whether we should explicitly disallow it), but based on the conversation above, it seems as though all payload operations will require seeking at the client level going forward. If that decision is fair, I think this issue can be closed.
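
Usage looks roughly like this (a sketch assuming the current import paths; the signing caveat above still applies if the wrapped reader can't actually seek):

	package example

	import (
		"io"

		"github.com/aws/aws-sdk-go/aws"
		"github.com/aws/aws-sdk-go/service/s3"
	)

	// putStream wraps a plain io.Reader so it satisfies the io.ReadSeeker that
	// PutObjectInput.Body expects. ReadSeekCloser only forwards Seek and Close
	// if the underlying value supports them, so a non-seekable body gets past
	// the type system but cannot be signed the normal way.
	func putStream(svc *s3.S3, bucket, key string, r io.Reader) error {
		_, err := svc.PutObject(&s3.PutObjectInput{
			Bucket: aws.String(bucket),
			Key:    aws.String(key),
			Body:   aws.ReadSeekCloser(r),
		})
		return err
	}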

@lsegal added the response-requested label and removed the develop label on Mar 26, 2015
@nathany
Author

nathany commented Mar 26, 2015

That seems fair. Thanks.

@nathany closed this as completed on Mar 26, 2015
lsegal added a commit that referenced this issue Apr 30, 2015
This new package supports managed uploads to S3 through the
s3manager.Upload() function, which intelligently switches to a
multipart upload when the payload exceeds a certain part size. In
that case, the upload is performed concurrently across a
configurable number of goroutines. The Upload() operation also
allows non-seekable Readers to be passed in, effectively allowing
streaming uploads of unbounded payloads to S3.

Sample usage:

	func main() {
		fd, err := os.Open("path/to/file")
		if err != nil {
			panic(err)
		}

		// Create an S3 service object and upload the file.
		svc := s3.New(nil)
		resp, err := s3manager.Upload(svc, &s3manager.UploadInput{
			Bucket: aws.String("bucket"),
			Key:    aws.String("path/to/file"),
			Body:   fd,
		}, nil)
		if err != nil {
			panic(err)
		}

		fmt.Println(awsutil.StringValue(resp))
	}

References #142, #207
@lsegal
Contributor

lsegal commented Apr 30, 2015

@nathany you can check out the s3/s3manager package that landed above for streaming upload support with bare io.Readers.

@nathany
Author

nathany commented May 5, 2015

Glad to hear that's landed, but sorry, I don't have time at the moment.

I ended up writing my own little S3 library that uses v2 signing and is specific to the purpose I need (which is perhaps a bit unusual).

@c4milo

c4milo commented Oct 31, 2016

@lsegal sorry for commenting on a closed issue, but it wasn't totally clear to me what exactly we are losing by using the aws.ReadSeekCloser(io.Reader) trick. As far as I understand, we lose the following:

  • Request signing
  • Retries

Is there anything else?

@c4milo

c4milo commented Nov 1, 2016

Wrapping with aws.ReadSeekCloser(io.Reader) doesn't seem to work:

failed uploading part 1 to S3: XAmzContentSHA256Mismatch: The provided 'x-amz-content-sha256' header does not match what was computed.
    status code: 400, request id: 7F90A244BC4B3C98

@c4milo

c4milo commented Nov 1, 2016

I'm giving up on using this SDK. It wasn't designed to stream multipart requests from client -> server -> S3, and requiring an entire file to be buffered on the server before it can be sent to S3 makes no sense.

@jasdel
Contributor

jasdel commented Nov 1, 2016

@c4milo sorry to hear that. In #915 there is a discussion of how the s3manager.Uploader can simplify this process for you without requiring the full file's contents to be in memory.
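
Something along these lines is the idea (a sketch only; the bucket name, part size, and concurrency values here are placeholders):

	package main

	import (
		"log"
		"net/http"

		"github.com/aws/aws-sdk-go/aws"
		"github.com/aws/aws-sdk-go/aws/session"
		"github.com/aws/aws-sdk-go/service/s3/s3manager"
	)

	func main() {
		sess := session.Must(session.NewSession())
		uploader := s3manager.NewUploader(sess, func(u *s3manager.Uploader) {
			u.PartSize = 5 << 20 // 5 MB parts, the minimum
			u.Concurrency = 3    // parts uploaded in parallel per request
		})

		http.HandleFunc("/upload", func(w http.ResponseWriter, r *http.Request) {
			// The request body is streamed to S3 in parts; only roughly
			// PartSize * Concurrency bytes are buffered at a time, not the file.
			_, err := uploader.Upload(&s3manager.UploadInput{
				Bucket: aws.String("my-bucket"),
				Key:    aws.String(r.URL.Path),
				Body:   r.Body,
			})
			if err != nil {
				http.Error(w, err.Error(), http.StatusBadGateway)
				return
			}
			w.WriteHeader(http.StatusOK)
		})

		log.Fatal(http.ListenAndServe(":8080", nil))
	}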

@c4milo

c4milo commented Nov 1, 2016

OK, I went back and implemented it using s3manager, and it works as expected for uploading. Memory usage is very stable too, unlike https://github.com/rlmcpherson/s3gof3r, which seemed to keep allocating memory every time I did an upload.

For downloading I had to use the regular S3 API, since http.ResponseWriter does not implement io.WriterAt. But that is not a problem for my use case.

Thanks to @jasdel for commenting back!

@jasdel
Contributor

jasdel commented Nov 1, 2016

Glad that solution worked for you, @c4milo! Let us know if you have any additional questions, feedback, or issues.

Agreed, for the streaming download case vanilla GetObject is the best tool for streaming the object out to another destination.
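
For example, something like this sketch works for piping an object straight to the response (error handling after the headers are written is simplified):

	package example

	import (
		"io"
		"net/http"

		"github.com/aws/aws-sdk-go/aws"
		"github.com/aws/aws-sdk-go/service/s3"
	)

	// streamObject copies the S3 object body directly to the ResponseWriter,
	// so nothing is buffered in full and io.WriterAt is never needed.
	func streamObject(svc *s3.S3, w http.ResponseWriter, bucket, key string) {
		out, err := svc.GetObject(&s3.GetObjectInput{
			Bucket: aws.String(bucket),
			Key:    aws.String(key),
		})
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadGateway)
			return
		}
		defer out.Body.Close()

		if out.ContentType != nil {
			w.Header().Set("Content-Type", *out.ContentType)
		}
		io.Copy(w, out.Body) // once copying starts, errors can only end the stream
	}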

One thing that might be helpful in your use case is for your service to vend temporary pre-signed PutObject URLs to the clients, and have the client upload the content to your S3 bucket itself. For download you could use a similar pattern and redirect the client to a pre-signed S3 download URL. If your use case doesn't require modifying the content of the stream and you have control over how the clients perform the upload, this pattern would help reduce the load on your service.
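
A rough sketch of the pre-signed upload URL idea, in case it helps (the 15 minute expiry is just an example):

	package example

	import (
		"time"

		"github.com/aws/aws-sdk-go/aws"
		"github.com/aws/aws-sdk-go/service/s3"
	)

	// presignPut builds a PutObject request without sending it and returns a
	// temporary URL the client can PUT the payload to directly, bypassing the
	// service entirely. GetObjectRequest can be presigned the same way for
	// downloads.
	func presignPut(svc *s3.S3, bucket, key string) (string, error) {
		req, _ := svc.PutObjectRequest(&s3.PutObjectInput{
			Bucket: aws.String(bucket),
			Key:    aws.String(key),
		})
		return req.Presign(15 * time.Minute)
	}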

@diehlaws removed the response-requested label on Jan 30, 2019