
Consider converting io.ReadSeeker arguments in S3 package to io.Reader #272

Closed
DavidHuie opened this issue Jun 7, 2015 · 11 comments
Labels
guidance Question that needs advice or information.

Comments

@DavidHuie

Forcing users to use io.ReadSeeker is not a good idea because most reads in Go are done through the io.Reader interface. If someone is already using an io.Reader, they would have to read the entire buffer into memory in order to make an io.ReadSeeker. I noticed that other S3 libraries, such as https://github.com/mitchellh/goamz, use io.Reader interfaces.

Please consider this because I cannot use this package in my own applications due to these memory concerns.

@marcosnils
Contributor

@DavidHuie I had the same concern while trying to upload some content to S3. If you look into the source code, the library uses a ReadSeeker to calculate the Content-MD5 of the body it is about to send (`_, err = r.Body.Seek(0, 0)`). It doesn't matter if the interface is changed to an io.Reader, because either way the library will still read the whole content to do this calculation. It's cleaner from an API perspective to use ReadSeeker, as resetting the buffer is provided by the Seeker API.

@lsegal
Contributor

lsegal commented Jun 7, 2015

If someone is already using an io.Reader, they would have to read the entire buffer into memory in order to make an io.ReadSeeker.

This is not exactly true. That said, you're probably looking for the s3manager.Uploader abstraction. You can read more about it in our Getting Started Guide. We will be doing a better job of advertising these features as we bootstrap our documentation.

@lsegal
Contributor

lsegal commented Jun 7, 2015

If you look into the source code the library uses a ReadSeeker to calculate the Content-MD5

@marcosnils actually just a minor pedantic correction, but Content-MD5 is not calculated for PutObject. That said, we do calculate checksums in the payload signature logic: https://github.com/aws/aws-sdk-go/blob/master/internal/signer/v4/v4.go#L311 -- so, basically the same thing. We also need a seekable body so that we can rewind it in case of retries, so there are multiple reasons why a seekable reader is required, and you correctly touched on one of them.

@DavidHuie
Author

The Uploader abstraction doesn't quite fit my use case because I'm managing my own uploads via the multipart API (the server is proxying a large upload for a client). Would it be possible to create a similar abstraction for uploading a multipart part?

@lsegal
Contributor

lsegal commented Jun 7, 2015

@DavidHuie the Uploader performs multi-part uploads.

@lsegal lsegal added the guidance Question that needs advice or information. label Jun 7, 2015
@marcosnils
Contributor

@lsegal totally right. I knew some sort of hash was calculated because I had looked at the code before; I just got confused between MD5 and the checksum.

Thanks for the clarification.

@DavidHuie
Author

Correct, but in my use case the server never has the entire payload -- it's on the client. The server, which would use this library, would be proxying for the multipart API.

@lsegal
Contributor

lsegal commented Jun 7, 2015

@DavidHuie if you're calling UploadPart directly then you must provide a seekable stream. This is required for SHA-256 checksumming in the signature layer as well as for retry support, which is integral to the SDK's feature set. There are a couple of ways to go about this: if you are concerned about the memory footprint of each part, you may consider caching each part uploaded from the client on disk. If your application server is backed by some amount of SSD filesystem access, the I/O cost should be fairly low.

@lsegal
Contributor

lsegal commented Jun 7, 2015

You might also want to check #142 which has some more information about this specific issue.

@DavidHuie
Author

@lsegal The caching-on-disk solution is probably good enough for my purposes. Thanks for the quick response.

@lsegal
Contributor

lsegal commented Jun 7, 2015

No problem!
