s3manager should be able to upload files without buffering the contents #232

Closed
cespare opened this issue May 13, 2015 · 5 comments · Fixed by #250
Labels
feature-request A feature should be added or improved.

Comments

@cespare
Contributor

cespare commented May 13, 2015

It's nice that s3manager can upload arbitrary io.Readers, but the common case is probably *os.File, and in that case buffering entire chunks leads to unnecessary memory usage and garbage. (Even with perfect GC you'd expect s3manager to require Concurrency × PartSize bytes for the live chunks.)

I wrote my own S3 uploader that only handles files; it uses an io.SectionReader for each chunk and avoids the buffering. In my test, with concurrency 10 and a part size of 10MB, this change was the difference between steady-state memory usage of ~700MB and ~200MB.

In s3manager, we could check whether the provided io.Reader implements ReadAt and Seek, and if it does, do the no-copy logic I described. I can make a PR but I wanted to ask whether this seems like an idea you would want to incorporate first.

I put a public version of my uploader here in case that's helpful.
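(A minimal sketch of the no-copy path described above, assuming a hypothetical uploadPart helper; this is illustrative only, not the actual s3manager or the linked uploader code.)

```go
// Illustrative sketch only: when the body implements io.ReaderAt, each part
// can be an io.SectionReader over the file instead of a copied byte buffer.
package uploader

import (
	"io"
	"os"
)

// uploadPart is a hypothetical stand-in for sending one part to S3.
func uploadPart(partNum int64, body io.ReadSeeker) error {
	return nil
}

// uploadFile splits f into partSize-byte sections and uploads each one.
// *os.File implements io.ReaderAt, so no per-part buffers are allocated.
func uploadFile(f *os.File, partSize int64) error {
	info, err := f.Stat()
	if err != nil {
		return err
	}
	size := info.Size()

	for part, off := int64(1), int64(0); off < size; part, off = part+1, off+partSize {
		n := partSize
		if rem := size - off; rem < n {
			n = rem
		}
		// Each io.SectionReader reads directly from the file at its offset;
		// in a real uploader these parts would be sent concurrently.
		if err := uploadPart(part, io.NewSectionReader(f, off, n)); err != nil {
			return err
		}
	}
	return nil
}
```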

@lsegal
Contributor

lsegal commented May 13, 2015

@cespare I would be open to this implementation. This actually came up in discussion today. If we can cast to an io.Seeker, we should be able to use a seeking strategy when assigning parts to be sent.

@cespare
Contributor Author

cespare commented May 13, 2015

I have a related question, too. (Maybe this should be a separate issue.) It seems that S3 won't accept multipart uploads of more than 10k parts, as I discovered when uploading some large files. So in my own uploader, I check the total number of parts and, if it's >10k, bump the part size up so that there are exactly 10k parts.

In s3manager, the size won't be known for arbitrary io.Readers, so I think uploads will fail if they're too large (larger than 50GB at the default PartSize of 5MB). This raises two questions:

  • After generating 10k parts, should s3manager immediately abort and report the failure to the user rather than trying to complete an invalid upload?
  • In the new proposed code path for Seekers, should we do the PartSize adjustment I did in my code? (A different option would be immediately reporting some "part size too small" error after discovering the size of the file.)
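(A rough sketch of the PartSize adjustment described above, with hypothetical names; illustrative only, not the eventual s3manager code.)

```go
// Illustrative sketch only: grow the part size when the configured size
// would produce more than the 10k parts S3 allows.
package uploader

const maxParts = 10000 // S3's multipart part-count limit

// adjustPartSize returns a part size that keeps the number of parts needed
// for totalSize bytes at or below maxParts.
func adjustPartSize(totalSize, partSize int64) int64 {
	if (totalSize+partSize-1)/partSize > maxParts {
		// Round up so maxParts parts always cover totalSize bytes.
		partSize = (totalSize + maxParts - 1) / maxParts
	}
	return partSize
}
```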

@lsegal
Contributor

lsegal commented May 13, 2015

After generating 10k parts, should s3manager immediately abort and report the failure to the user rather than trying to complete an invalid upload?

A better idea would be to require the user to pass in an "estimated size" for arbitrary streams so we can scale the part size appropriately. The Ruby SDK does this, for example.

Failing the upload at 10k parts would be unhelpful, as the user would have no way to complete the upload and they would get stuck with having to restart from scratch. If we can fail early, it would save quite a bit of time.

In the new proposed code path for Seekers, should we do the PartSize adjustment I did in my code? (A different option would be immediately reporting some "part size too small" error after discovering the size of the file.)

Yes. We do this in Ruby and JS. This is an outstanding known issue in the s3manager implementation (mostly because we don't yet support Seekable streams).

@cespare
Contributor Author

cespare commented May 13, 2015

Failing the upload at 10k parts would be unhelpful, as the user would have no way to complete the upload and they would get stuck with having to restart from scratch. If we can fail early, it would save quite a bit of time.

Well, failing and aborting at that point would be more helpful than allowing the user to upload another 50k chunks :) And this still might be a case we want to handle if the user doesn't provide an estimated size (or if the estimated size was very wrong).

@lsegal
Contributor

lsegal commented May 13, 2015

Well, failing and aborting at that point would be more helpful than allowing the user to upload another 50k chunks

You are correct here; we should check at the start in addition to failing at 10k.

And this still might be a case we want to handle if the user doesn't provide an estimated size

The idea would be that this would be a required field if the size is not computable, so it would not be possible to omit this.

lsegal added a commit that referenced this issue May 28, 2015
Improves memory consumption on io.ReaderAt streams (files, in-memory
buffers) by avoiding creation of individual byte buffers.

Increasing the concurrency threshold for seekable streams will no longer
impact memory usage.

Fixes #232
lsegal added a commit that referenced this issue May 29, 2015
The upload manager will now:

* error if the configured PartSize is smaller than MinUploadPartSize,
* adjust the part size if the upload will exceed MaxUploadParts,
* error if the manager cannot determine the total upload size and it
  has uploaded more than MaxUploadParts parts.

References #232
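(For reference, a minimal usage sketch against the later, settled aws-sdk-go v1 API; the constructor and package layout differed at the time of this issue, and the bucket, key, and file name below are placeholders. Passing an *os.File as Body lets the uploader take the io.ReaderAt path from the commits above, so raising Concurrency no longer multiplies per-part buffer memory.)

```go
// Minimal usage sketch against the later aws-sdk-go v1 API; bucket, key, and
// file name are placeholders. Because *os.File implements io.ReaderAt, the
// uploader can read parts directly from the file instead of buffering them.
package main

import (
	"log"
	"os"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3/s3manager"
)

func main() {
	f, err := os.Open("large-file.bin")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	sess := session.Must(session.NewSession())
	uploader := s3manager.NewUploader(sess, func(u *s3manager.Uploader) {
		u.PartSize = 10 * 1024 * 1024 // 10MB parts
		u.Concurrency = 10
	})

	if _, err := uploader.Upload(&s3manager.UploadInput{
		Bucket: aws.String("my-bucket"),
		Key:    aws.String("my-key"),
		Body:   f, // *os.File: ReaderAt/Seeker path, no per-part buffers
	}); err != nil {
		log.Fatal(err)
	}
}
```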
diehlaws added the feature-request label and removed the enhancement label on Jan 4, 2019