s3manager should be able to upload files without buffering the contents #232
@cespare I would be open to this implementation. This came up in discussion today, actually. If we can cast to an `io.ReaderAt` we could avoid the per-part buffering.
I have a related question, too. (Maybe this should be a separate issue.) It seems that S3 won't accept multipart uploads of more than 10k parts, as I discovered when uploading some large files. So in my own uploader, I check the total number of parts, and if it's more than 10k I bump the part size up so that there are exactly 10k parts. In s3manager, for arbitrary `io.Reader` streams the total size isn't known up front, so the part size can't be adjusted this way.
A better idea would be to require the user to pass in an "estimated size" for arbitrary streams so we can scale the part size appropriately. The Ruby SDK does this, for example. Failing the upload at 10k parts would be unhelpful, as the user would have no way to complete the upload and they would get stuck with having to restart from scratch. If we can fail early, it would save quite a bit of time.
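For illustration, here is a rough Go sketch of the part-size scaling described above; the constants and the `adjustPartSize` helper are invented for the example and are not part of the s3manager API.

```go
// Hypothetical sketch: scale the part size from a caller-supplied size
// estimate so an upload never needs more than S3's 10,000-part limit.
package main

import "fmt"

const (
	maxUploadParts = 10000
	minPartSize    = 5 * 1024 * 1024 // S3's minimum part size: 5 MiB
)

// adjustPartSize returns a part size large enough that totalSize fits in
// at most maxUploadParts parts, never dropping below minPartSize.
func adjustPartSize(totalSize, requestedPartSize int64) int64 {
	partSize := requestedPartSize
	if partSize < minPartSize {
		partSize = minPartSize
	}
	if totalSize/partSize >= maxUploadParts {
		// Round up so the final part count stays at or under the limit.
		partSize = (totalSize + maxUploadParts - 1) / maxUploadParts
	}
	return partSize
}

func main() {
	// A 200 GiB upload with 10 MiB parts would need ~20,480 parts, so the
	// part size is bumped to keep the count at 10,000 or fewer.
	fmt.Println(adjustPartSize(200<<30, 10<<20))
}
```

With an estimated size supplied up front, this check can run before the first part is sent, instead of discovering the problem only after 10k parts have already been uploaded.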
Yes. We do this in Ruby and JS. This is an outstanding known issue in the s3manager implementation (mostly because we don't yet support seekable streams).
Well, failing and aborting at that point would be more helpful than allowing the user to upload another 50k chunks :) And this still might be a case we want to handle if the user doesn't provide an estimated size (or if the estimated size was very wrong).
You are correct here: we should check at the start in addition to failing at 10k.
The idea is that this would be a required field when the size is not computable, so it would not be possible to omit it.
Improves memory consumption on io.ReaderAt streams (files, in-memory buffers) by avoiding the creation of individual byte buffers per part. Increasing the concurrency threshold for seekable streams will no longer impact memory usage. Fixes #232
The upload manager will now:

* error if the configured PartSize is smaller than MinUploadPartSize,
* adjust the part size if the upload will exceed MaxUploadParts,
* error if the manager cannot determine the total upload size and it has uploaded more than MaxUploadParts parts.

References #232
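For context, here is a small usage sketch assuming the aws-sdk-go v1 s3manager API; the bucket, key, and file names are placeholders. It configures an explicit part size and concurrency and passes a seekable `*os.File` as the body.

```go
// Example: configure the uploader so the checks described above apply.
// PartSize must be at least s3manager.MinUploadPartSize, and very large
// uploads will have their part size adjusted to stay under MaxUploadParts.
package main

import (
	"log"
	"os"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3/s3manager"
)

func main() {
	sess := session.Must(session.NewSession())

	uploader := s3manager.NewUploader(sess, func(u *s3manager.Uploader) {
		u.PartSize = 10 * 1024 * 1024 // must be >= s3manager.MinUploadPartSize
		u.Concurrency = 10
	})

	f, err := os.Open("large-file.bin") // placeholder path
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// *os.File satisfies io.ReaderAt and io.Seeker, so the uploader can
	// read each part directly from the file instead of buffering it.
	_, err = uploader.Upload(&s3manager.UploadInput{
		Bucket: aws.String("my-bucket"), // placeholder bucket
		Key:    aws.String("large-file.bin"),
		Body:   f,
	})
	if err != nil {
		log.Fatal(err)
	}
}
```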
It's nice that s3manager can upload arbitrary `io.Reader`s, but the common case is probably `*os.File`, and in this case the behavior of buffering entire chunks leads to unnecessary memory usage and garbage. (Even with perfect GC you'd expect s3manager to require `Concurrency × PartSize` for the live chunks.)

I wrote my own S3 uploader that's only for files; it uses an `io.SectionReader` for each chunk and avoids the buffering. In my test, using concurrency 10 and part size 10MB, this change was the difference between steady-state memory usage of ~700MB and ~200MB.

In s3manager, we could check whether the provided `io.Reader` implements `ReadAt` and `Seek`, and if it does, use the no-copy logic I described. I can make a PR, but I wanted to ask first whether this seems like an idea you would want to incorporate.

I put a public version of my uploader here in case that's helpful.
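As a rough sketch of that no-copy idea (not the actual s3manager code; the interface and helper below are invented for the example), the body can be type-asserted to something that supports `ReadAt` and `Seek`, and each part can then be served by an `io.SectionReader` over the underlying file:

```go
// Minimal sketch of the no-copy approach: carve each part out of a
// seekable, ReadAt-capable body with io.SectionReader instead of copying
// the part into a fresh byte buffer.
package main

import (
	"fmt"
	"io"
	"os"
)

// readerAtSeeker is the combination that lets us slice the input without
// buffering it.
type readerAtSeeker interface {
	io.ReaderAt
	io.Seeker
}

// partReaders returns one reader per part. Each io.SectionReader reads
// directly from the underlying source, so no part is held in memory.
func partReaders(r readerAtSeeker, partSize int64) ([]io.Reader, error) {
	total, err := r.Seek(0, io.SeekEnd) // find the total size
	if err != nil {
		return nil, err
	}
	var parts []io.Reader
	for off := int64(0); off < total; off += partSize {
		n := partSize
		if rem := total - off; rem < n {
			n = rem
		}
		parts = append(parts, io.NewSectionReader(r, off, n))
	}
	return parts, nil
}

func main() {
	f, err := os.Open("large-file.bin") // placeholder path
	if err != nil {
		panic(err)
	}
	defer f.Close()

	parts, err := partReaders(f, 10*1024*1024)
	if err != nil {
		panic(err)
	}
	fmt.Printf("upload would use %d parts\n", len(parts))
}
```

Because each `io.SectionReader` reads straight from the file at upload time, steady-state memory no longer grows with `Concurrency × PartSize` for part buffers.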