Add support for single-step write to S3, instead of multi-part #1219

Merged May 1, 2019 (2 commits)

@ihnorton ihnorton commented Apr 23, 2019

This PR adds support for writing to an S3-compatible store with the PutObject method rather than the multi-part upload API. This is primarily for compatibility with GCS, where the S3 compatibility mode does not support multi-part uploads. The write path is controlled by the new config parameter s3.use_multipart_upload, which defaults to true.
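
To make the two write paths concrete, here is a minimal sketch of how a flag like s3.use_multipart_upload could select between them. It is not TileDB's implementation: `plan_upload`, the default 5 MiB part size, and the returned tuples are illustrative assumptions. The request names are the standard S3 API operations; nothing is actually sent.

```python
from typing import List, Tuple

# Hypothetical part size, for illustration only (5 MiB).
PART_SIZE = 5 * 1024 * 1024


def plan_upload(num_bytes: int,
                use_multipart_upload: bool = True,
                part_size: int = PART_SIZE) -> List[Tuple[str, int]]:
    """Return the (request, payload_bytes) sequence an upload would issue."""
    if num_bytes < 0 or part_size <= 0:
        raise ValueError("num_bytes must be >= 0 and part_size must be > 0")

    if not use_multipart_upload:
        # Single-step write: the whole object goes out in one PutObject request.
        return [("PutObject", num_bytes)]

    # Multi-part write: initiate, send parts of at most part_size, then complete.
    plan = [("CreateMultipartUpload", 0)]
    offset = 0
    while offset < num_bytes:
        chunk = min(part_size, num_bytes - offset)
        plan.append(("UploadPart", chunk))
        offset += chunk
    plan.append(("CompleteMultipartUpload", 0))
    return plan
```

With the flag disabled the plan collapses to a single request, which is the only shape a store without multi-part support can accept. The cost is that the whole object has to be buffered before it is sent.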

@ihnorton ihnorton force-pushed the ihn/s3_no_multipart branch from ef018b6 to 623e679 Compare April 30, 2019 11:40
ihnorton commented Apr 30, 2019

This is ready for review.

I've tested it with arrays up to 1GB on GCS, and it seems to work fine. One configuration override is currently required: to use a buffer larger than the ~40MB available with the config defaults, I set these options:

# set the maximum upload buffer to 50GB
config["vfs.s3.multipart_part_size"] = int(5e10)
config["vfs.s3.max_parallel_ops"] = 1

This could be an additional, separate config option, or we could consider:

  • leaving it as-is while the feature shakes out, rather than committing to a new config option now
  • renaming multipart_part_size to (something like) max_upload_size and documenting it as either per-chunk-max or direct-write-max, depending on whether use_multipart_upload is enabled.
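
The override can be read as simple arithmetic, assuming the upload buffer is sized as multipart_part_size times max_parallel_ops. That relationship is inferred from the two options being set together, not stated in this thread. The default values below (5 MiB parts, 8 parallel operations) are assumptions chosen to reproduce the ~40MB figure, and `upload_buffer_bytes` and `fits_direct_write` are hypothetical helpers operating on a plain dict rather than a real TileDB config object.

```python
def upload_buffer_bytes(config: dict) -> int:
    """Assumed buffer size: part size multiplied by the number of parallel ops."""
    return (int(config["vfs.s3.multipart_part_size"])
            * int(config["vfs.s3.max_parallel_ops"]))


def fits_direct_write(num_bytes: int, config: dict) -> bool:
    """A direct (single-step) write must fit entirely in the upload buffer."""
    return num_bytes <= upload_buffer_bytes(config)


# Assumed defaults (not taken from this thread): 5 MiB parts, 8 parallel ops.
assumed_defaults = {
    "vfs.s3.multipart_part_size": 5 * 1024 * 1024,
    "vfs.s3.max_parallel_ops": 8,
}

# The override from the comment above: one 50GB buffer, no parallelism.
override = {
    "vfs.s3.multipart_part_size": int(5e10),
    "vfs.s3.max_parallel_ops": 1,
}

one_gb = 10**9
print(upload_buffer_bytes(assumed_defaults))        # 41943040 bytes, roughly 40MB
print(fits_direct_write(one_gb, assumed_defaults))  # False
print(fits_direct_write(one_gb, override))          # True
```

Under these assumptions a 1GB array cannot be written in one step with the defaults, which is why the override is needed, and why a single option with a mode-dependent meaning (per-chunk-max or direct-write-max) is one of the choices on the table.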

Noting the limits of various services here:

@tdenniston tdenniston left a comment

LGTM -- it might be worth adding a GCS section to https://docs.tiledb.io/en/stable/tutorials/working-with-s3.html to document this. I think sticking with the existing config param is fine for now.

tiledb/sm/storage_manager/config.cc (review thread: outdated, resolved)
@tdenniston tdenniston added the s3 label Apr 30, 2019
@tdenniston tdenniston added this to the 1.6.0 milestone Apr 30, 2019
@ihnorton ihnorton force-pushed the ihn/s3_no_multipart branch from 623e679 to fa62416 Compare April 30, 2019 22:14
@ihnorton
Will rebase on #1238 if that passes CI.

* Primarily for compatibility with GCS, where the S3 compatibility mode
does not support S3 multi-part uploads. Controlled by the new
config parameter `s3.use_multipart_upload`, default `true`.

* Add test for S3 direct write

* Add test for S3 direct write buffer size mismatch

* Add note to HISTORY
@ihnorton ihnorton force-pushed the ihn/s3_no_multipart branch from fa62416 to a949646 Compare May 1, 2019 04:10
@ihnorton ihnorton merged commit 284838f into dev May 1, 2019
@ihnorton ihnorton deleted the ihn/s3_no_multipart branch May 1, 2019 14:04