S3 uploads do not appear to run in parallel #8343
What exactly is the problem you are trying to solve? We deliberately designed ocis to do uploads onto a quarantine area on the local disk to boost upload performance. As a second step, we do all kinds of asynchronous postprocessing on that file: virus scanning, content indexing, OCR on images and so on. After the postprocessing, the files are finally uploaded to S3.
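To make the described pipeline concrete, here is a purely illustrative Go sketch, not actual ocis code and with all names invented: the upload is staged to local disk, postprocessed, and only after a "postprocessing finished" event is the blob pushed to S3 by a separate worker.

```go
// Illustrative only: models the described flow, not the real ocis implementation.
package main

import (
	"context"
	"fmt"
	"sync"
)

// upload models a file that was staged to a quarantine area on local disk.
type upload struct {
	id        string
	localPath string
}

// postprocess stands in for virus scanning, content indexing, OCR, ...
func postprocess(u upload) error {
	fmt.Println("postprocessing", u.id)
	return nil
}

// pushToS3 stands in for the final blobstore upload.
func pushToS3(ctx context.Context, u upload) error {
	fmt.Println("uploading", u.localPath, "to S3")
	return nil
}

func main() {
	ctx := context.Background()
	finished := make(chan upload, 16) // "postprocessing finished" events

	var wg sync.WaitGroup
	wg.Add(1)
	go func() { // worker that moves finished blobs to S3
		defer wg.Done()
		for u := range finished {
			if err := pushToS3(ctx, u); err != nil {
				fmt.Println("upload failed:", err)
			}
		}
	}()

	// Synchronous part of an upload: data lands on local disk, then is postprocessed.
	u := upload{id: "file-1", localPath: "/var/lib/ocis/uploads/file-1"}
	if err := postprocess(u); err == nil {
		finished <- u
	}
	close(finished)
	wg.Wait()
}
```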
Hi Michael, thank you for the response. Please see the graphs attached. Loading 10 streams in parallel into S3 maxes out the throughput at a much higher rate than oCIS does at the S3 upload stage. Does oCIS parallelize the S3 uploads on the S3 backend, or does it do them sequentially? Is there a way to parallelize? The graph below shows max load with everything else being the same: OS, network, storage.
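For clarity on what the comparison measures, here is a minimal sketch of what "10 parallel streams" amounts to: ten goroutines each putting a 16 MiB object into the bucket at the same time. It uses minio-go since that is what ocis uses; the endpoint, credentials and bucket name are placeholders, and this is an approximation of the benchmark, not its actual code.

```go
// Sketch only: approximates 10 parallel upload streams of 16 MiB objects.
package main

import (
	"bytes"
	"context"
	"crypto/rand"
	"fmt"
	"log"
	"sync"

	"github.com/minio/minio-go/v7"
	"github.com/minio/minio-go/v7/pkg/credentials"
)

func main() {
	// Placeholder endpoint and credentials.
	client, err := minio.New("s3.example.com", &minio.Options{
		Creds:  credentials.NewStaticV4("ACCESS_KEY", "SECRET_KEY", ""),
		Secure: true,
	})
	if err != nil {
		log.Fatal(err)
	}

	const streams = 10                  // matches -t 10
	const objectSize = 16 * 1024 * 1024 // matches -z 16M

	var wg sync.WaitGroup
	for i := 0; i < streams; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			buf := make([]byte, objectSize)
			rand.Read(buf)
			_, err := client.PutObject(context.Background(), "bench-bucket",
				fmt.Sprintf("object-%d", i), bytes.NewReader(buf), objectSize,
				minio.PutObjectOptions{ContentType: "application/octet-stream"})
			if err != nil {
				log.Println("upload failed:", err)
			}
		}(i)
	}
	wg.Wait()
}
```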
I still don’t understand the problem you are trying to solve 🤔
Notes from today's discussion - just for the record:
Once a user has uploaded multiple large files and the files have completed postprocessing, the upload to S3 appears to be sequential rather than utilising multipart uploading, which significantly delays the availability of the files. This was implemented in OC10 as listed above. Can we investigate whether this is possible to implement in oCIS?
ocis will copy the blob into S3 whenever the postprocessing event is received. Depending on the concurrency, multiple goroutines may be running in parallel, each trying to upload a blob. For every upload, the […] I can see three put multipart implementations that might be called from […]
ocis will pass an […]. We could make […]. 👀 it seems we are not using parallel uploads because we set SendContentMd5: true.
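As a way to check this, here is a minimal sketch (placeholder endpoint, credentials, bucket and file path) that uploads one large object with HTTP tracing enabled, so one can watch whether the multipart part uploads overlap. The comments reflect my reading of minio-go v7 and should be verified against the version ocis vendors.

```go
// Sketch only: upload one large object and trace the requests to observe the
// multipart behaviour. Endpoint, credentials and bucket are placeholders.
package main

import (
	"context"
	"log"
	"os"

	"github.com/minio/minio-go/v7"
	"github.com/minio/minio-go/v7/pkg/credentials"
)

func main() {
	client, err := minio.New("s3.example.com", &minio.Options{
		Creds:  credentials.NewStaticV4("ACCESS_KEY", "SECRET_KEY", ""),
		Secure: true,
	})
	if err != nil {
		log.Fatal(err)
	}
	client.TraceOn(os.Stderr) // log every request; look for overlapping part uploads

	f, err := os.Open("/tmp/large-blob")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	fi, err := f.Stat()
	if err != nil {
		log.Fatal(err)
	}

	_, err = client.PutObject(context.Background(), "test-bucket", "large-blob", f, fi.Size(),
		minio.PutObjectOptions{
			ContentType: "application/octet-stream",
			// With SendContentMd5 set, my reading of minio-go is that the
			// ReadAt-based parallel part uploader is skipped; leaving it unset
			// (plus an io.ReaderAt source such as *os.File) lets parts be
			// uploaded from NumThreads goroutines.
			NumThreads: 4,
			PartSize:   16 * 1024 * 1024,
		})
	if err != nil {
		log.Fatal(err)
	}
}
```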
ok so this diff will use concurrent uploads:

```diff
diff --git a/pkg/storage/fs/s3ng/blobstore/blobstore.go b/pkg/storage/fs/s3ng/blobstore/blobstore.go
index 9c744e754..4fa391ebb 100644
--- a/pkg/storage/fs/s3ng/blobstore/blobstore.go
+++ b/pkg/storage/fs/s3ng/blobstore/blobstore.go
@@ -71,7 +71,7 @@ func (bs *Blobstore) Upload(node *node.Node, source string) error {
 	}
 	defer reader.Close()
-	_, err = bs.client.PutObject(context.Background(), bs.bucket, bs.path(node), reader, node.Blobsize, minio.PutObjectOptions{ContentType: "application/octet-stream", SendContentMd5: true})
+	_, err = bs.client.PutObject(context.Background(), bs.bucket, bs.path(node), reader, node.Blobsize, minio.PutObjectOptions{ContentType: "application/octet-stream"})
 	if err != nil {
 		return errors.Wrapf(err, "could not store object '%s' into bucket '%s'", bs.path(node), bs.bucket)
```

Let me see what […]
IIRC we needed that parameter for a specific S3 implementation and just hardcoded it. So we need to make it configurable. 🥳 Probably together with the […]
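For illustration, a hypothetical sketch of how such knobs could be exposed and forwarded to minio-go. All field, env var and option names here are invented for this sketch and are not necessarily the ones ocis ends up using; the values are simply passed through to PutObjectOptions.

```go
// Hypothetical sketch: configurable upload options for the s3ng blobstore.
// Names are invented for illustration only.
package blobstore

import (
	"context"
	"os"

	"github.com/minio/minio-go/v7"
	"github.com/pkg/errors"
)

// Options mirrors what a config section for the s3ng blobstore could expose.
type Options struct {
	SendContentMd5 bool   `yaml:"send_content_md5" env:"STORAGE_USERS_S3NG_SEND_CONTENT_MD5"`
	NumThreads     uint   `yaml:"upload_num_threads" env:"STORAGE_USERS_S3NG_UPLOAD_NUM_THREADS"`
	PartSize       uint64 `yaml:"upload_part_size" env:"STORAGE_USERS_S3NG_UPLOAD_PART_SIZE"`
}

// Blobstore is reduced to what this sketch needs.
type Blobstore struct {
	client *minio.Client
	bucket string
	opts   Options
}

// Upload streams the blob at source into the bucket under key, forwarding the
// configured knobs to minio-go.
func (bs *Blobstore) Upload(key, source string, size int64) error {
	reader, err := os.Open(source)
	if err != nil {
		return errors.Wrap(err, "can not open source file to upload")
	}
	defer reader.Close()

	_, err = bs.client.PutObject(context.Background(), bs.bucket, key, reader, size,
		minio.PutObjectOptions{
			ContentType:    "application/octet-stream",
			SendContentMd5: bs.opts.SendContentMd5, // keep available for backends that require Content-MD5
			NumThreads:     bs.opts.NumThreads,
			PartSize:       bs.opts.PartSize,
		})
	return errors.Wrapf(err, "could not store object '%s' into bucket '%s'", key, bs.bucket)
}
```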
Adding this to the project board so someone can prioritize. |
The mentioned param is needed to fulfil the bucket policy. It was brought up by @wkloucek due to an incident.
yup: […]
Needs dependency bump and re-testing. |
needs […]
Describe the bug
Using s3ng, uploads commence after exiting quarantine, but do not appear to parallelize, resulting in far slower throughput than expected.
Steps to reproduce
s3-benchmark -a -s -b -t 10 -z 16M -u
Expected behavior
Performance from oCIS comparable to the parallel benchmark
Actual behavior
oCIS is about 6 times slower
Setup
Note that we are measuring peak sustained throughput, not the timing of the individual stages (quarantine, postprocessing, etc.). This has been tested both as a binary and in a K8s setup, with similar results.
STORAGE_USERS_OCIS_ASYNC_UPLOADS is set to true