s3manager should be able to upload files without buffering the contents #232

Closed
cespare opened this issue May 13, 2015 · 5 comments · Fixed by #250
Labels
feature-request A feature should be added or improved.

Comments

@cespare
Contributor

cespare commented May 13, 2015

It's nice that s3manager can upload arbitrary io.Readers, but the common case is probably *os.File, and in that case buffering entire chunks leads to unnecessary memory usage and garbage. (Even with perfect GC you'd expect s3manager to require Concurrency × PartSize bytes for the live chunks.)

I wrote my own S3 uploader that only handles files; it uses an io.SectionReader for each chunk and avoids the buffering. In my test, with concurrency 10 and a part size of 10MB, this change was the difference between steady-state memory usage of ~700MB and ~200MB.

In s3manager, we could check whether the provided io.Reader implements ReadAt and Seek, and if it does, do the no-copy logic I described. I can make a PR but I wanted to ask whether this seems like an idea you would want to incorporate first.

I put a public version of my uploader here in case that's helpful.
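(A minimal sketch of the no-copy path described above, assuming a hypothetical uploadPart helper; this is illustrative only, not the actual s3manager or the linked uploader code.)

```go
// Illustrative sketch only: when the body implements io.ReaderAt, each part
// can be an io.SectionReader over the file instead of a copied byte buffer.
package uploader

import (
	"io"
	"os"
)

// uploadPart is a hypothetical stand-in for sending one part to S3.
func uploadPart(partNum int64, body io.ReadSeeker) error {
	return nil
}

// uploadFile splits f into partSize-byte sections and uploads each one.
// *os.File implements io.ReaderAt, so no per-part buffers are allocated.
func uploadFile(f *os.File, partSize int64) error {
	info, err := f.Stat()
	if err != nil {
		return err
	}
	size := info.Size()

	for part, off := int64(1), int64(0); off < size; part, off = part+1, off+partSize {
		n := partSize
		if rem := size - off; rem < n {
			n = rem
		}
		// Each io.SectionReader reads directly from the file at its offset;
		// in a real uploader these parts would be sent concurrently.
		if err := uploadPart(part, io.NewSectionReader(f, off, n)); err != nil {
			return err
		}
	}
	return nil
}
```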

@lsegal
Contributor

lsegal commented May 13, 2015

@cespare I would be open to this implementation. This actually came up in discussion today. If we can cast to an io.Seeker, we should be able to use a seeking strategy when assigning parts to be sent.

@cespare
Contributor Author

cespare commented May 13, 2015

I have a related question, too. (Maybe this should be a separate issue.) It seems that S3 won't accept multipart uploads of more than 10k parts, as I discovered when uploading some large files. So in my own uploader, I check the total number of parts and, if it's >10k, bump the part size up so that there are exactly 10k parts.

In s3manager, the size won't be known for arbitrary io.Readers, so I think uploads will fail if they're too large (larger than 50GB at the default PartSize of 5MB). This raises two questions:

  • After generating 10k parts, should s3manager immediately abort and report the failure to the user rather than trying to complete an invalid upload?
  • In the new proposed code path for Seekers, should we do the PartSize adjustment I did in my code? (A different option would be immediately reporting some "part size too small" error after discovering the size of the file.)
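(A rough sketch of the PartSize adjustment described above, with hypothetical names; illustrative only, not the eventual s3manager code.)

```go
// Illustrative sketch only: grow the part size when the configured size
// would produce more than the 10k parts S3 allows.
package uploader

const maxParts = 10000 // S3's multipart part-count limit

// adjustPartSize returns a part size that keeps the number of parts needed
// for totalSize bytes at or below maxParts.
func adjustPartSize(totalSize, partSize int64) int64 {
	if (totalSize+partSize-1)/partSize > maxParts {
		// Round up so maxParts parts always cover totalSize bytes.
		partSize = (totalSize + maxParts - 1) / maxParts
	}
	return partSize
}
```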

@lsegal
Contributor

lsegal commented May 13, 2015

After generating 10k parts, should s3manager immediately abort and report the failure to the user rather than trying to complete an invalid upload?

A better idea would be to require the user to pass in an "estimated size" for arbitrary streams so we can scale the part size appropriately. The Ruby SDK does this, for example.

Failing the upload at 10k parts would be unhelpful, as the user would have no way to complete the upload and they would get stuck with having to restart from scratch. If we can fail early, it would save quite a bit of time.

In the new proposed code path for Seekers, should we do the PartSize adjustment I did in my code? (A different option would be immediately reporting some "part size too small" error after discovering the size of the file.)

Yes. We do this in Ruby and JS. This is an outstanding known issue in the s3manager implementation (mostly because we don't yet support Seekable streams).

@cespare
Contributor Author

cespare commented May 13, 2015

Failing the upload at 10k parts would be unhelpful, as the user would have no way to complete the upload and they would get stuck with having to restart from scratch. If we can fail early, it would save quite a bit of time.

Well, failing and aborting at that point would be more helpful than allowing the user to upload another 50k chunks :) And this still might be a case we want to handle if the user doesn't provide an estimated size (or if the estimated size was very wrong).

@lsegal
Contributor

lsegal commented May 13, 2015

Well, failing and aborting at that point would be more helpful than allowing the user to upload another 50k chunks

You are correct here; we should check at the start in addition to failing at 10k.

And this still might be a case we want to handle if the user doesn't provide an estimated size

The idea would be that this would be a required field if the size is not computable, so it would not be possible to omit this.

lsegal added a commit that referenced this issue May 28, 2015
Improves memory consumption on io.ReaderAt streams (files, in-memory
buffers) by avoiding creation of individual byte buffers.

Increasing the concurrency threshold for seekable streams will no longer
impact memory usage.

Fixes #232
lsegal added a commit that referenced this issue May 29, 2015
The upload manager will now:

* error if the configured PartSize is smaller than MinUploadPartSize,
* adjust the part size if the upload will exceed MaxUploadParts,
* error if the manager cannot determine the total upload size and it
  has uploaded more than MaxUploadParts parts.

References #232
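(For reference, a minimal usage sketch against the later, settled aws-sdk-go v1 API; the constructor and package layout differed at the time of this issue, and the bucket, key, and file name below are placeholders. Passing an *os.File as Body lets the uploader take the io.ReaderAt path from the commits above, so raising Concurrency no longer multiplies per-part buffer memory.)

```go
// Minimal usage sketch against the later aws-sdk-go v1 API; bucket, key, and
// file name are placeholders. Because *os.File implements io.ReaderAt, the
// uploader can read parts directly from the file instead of buffering them.
package main

import (
	"log"
	"os"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3/s3manager"
)

func main() {
	f, err := os.Open("large-file.bin")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	sess := session.Must(session.NewSession())
	uploader := s3manager.NewUploader(sess, func(u *s3manager.Uploader) {
		u.PartSize = 10 * 1024 * 1024 // 10MB parts
		u.Concurrency = 10
	})

	if _, err := uploader.Upload(&s3manager.UploadInput{
		Bucket: aws.String("my-bucket"),
		Key:    aws.String("my-key"),
		Body:   f, // *os.File: ReaderAt/Seeker path, no per-part buffers
	}); err != nil {
		log.Fatal(err)
	}
}
```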
diehlaws added the feature-request label and removed the enhancement label on Jan 4, 2019