Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add MultiPartStore (#4961) (#4608) #4971

Merged
merged 7 commits into from
Oct 25, 2023
Merged

Conversation

tustvold
Copy link
Contributor

@tustvold tustvold commented Oct 21, 2023

Which issue does this PR close?

Closes #4961
Closes #4608
Closes #4965

Rationale for this change

Some use-cases wish to interact with the multipart upload machinery directly, whilst we can't support this for all stores, we can provide it for many. This therefore follows the example of #4876 and extracts a trait that can then be implemented by the various stores that have such a notion

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the object-store Object Store Interface label Oct 21, 2023
path: &Path,
id: &MultipartId,
part_idx: usize,
data: Bytes,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

having Bytes here instead of Vec<u8> is great, as it means we don't need to own the data. Can we also make that change to PutPart's put_part() method?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could do, but I would rather do that as a separate PR as it would be a breaking change, and given PutPart is largely intended to be used with WriteMultipart I'm less sure of the utility

&self,
path: &Path,
id: &MultipartId,
part_idx: usize,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do I need a part IDX here if I need to provide a sorted list to complete_multipart anyways?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So that you can upload chunks in parallel and out of order

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I have to order them at the upload time or completion time?

Copy link
Contributor Author

@tustvold tustvold Oct 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At completion time you must provide them in the order you want them to be (although AWS doesn't actually care)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So part_idx goes from 0..N (N being the number of parts, are holes allowed?) and the list passed to the complete call must have the same order as if I would order the results of PUT calls by the part_id arg?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will try to document this better

object_store/src/multipart.rs Show resolved Hide resolved
Comment on lines 286 to 287
/// Most stores require that all parts apart from the final are at least 5 MiB. Additionally
/// some stores require all parts excluding the last to be the same size, e.g. [R2]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could the store at least tell the user which size it expects? Currently this relies on expert knowledge of the user.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WriteMultipart currently always uses a fixed size of 10MiB, I will update the docs to reflect this

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't see any tests for this new API -- can we please add coverage somehow?

data: Bytes,
) -> Result<PartId>;

/// Completes a multipart upload
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also document here what the expectations are for a successful completion? For example, add the note about "parts is the order you want the parts to appear in the final output"

Also can we say anything about what happens to PartId that are not used (as in put_part is called and returns a PartId but then that PartId is not supplied here)?

@alamb
Copy link
Contributor

alamb commented Oct 24, 2023

(BTW thank you very much for doing this @tustvold -- it is great to see the object_store crate's API evolve and that the underlying design accommodates these new features without major issue)

@tustvold
Copy link
Contributor Author

I've tried to add some docs clarifying the API contract

/// The `i`'th value of `parts` must be a [`PartId`] returned by a call to [`Self::put_part`]
/// with a `part_idx` of `i`, and the same `path` and `id` as provided to this method. Calling
/// this method with out of sequence or repeated [`PartId`], or [`PartId`] returned for other
/// values of `path` or `id`, will result in implementation-defined behaviour
Copy link
Contributor Author

@tustvold tustvold Oct 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For an example of implementation defined behaviour, Azure will produce a file with parts in the order passed to this function, regardless of if they are out of sequence, and if they were uploaded with different values of MultipartId.

Other stores do various combinations of erroring, or producing files with parts in different orders. Provided people stick to the API contract though, the behaviour is consistent

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the documentation. My only remaining question is about testing this API -- as it stands now I think it could be removed without anything breaking -- perhaps we could add a doc test somewhere that shows how it is used that would also serve to make it more discoverable?

Finally, if possible, it would be awesome to add a doc link from put_multipart to MultiPartStore to help people who need more power to find it more easily

@tustvold
Copy link
Contributor Author

I've added an integration test, a doc test would be tricky as it requires a running service

@@ -263,7 +263,6 @@ pub use client::{
#[cfg(feature = "cloud")]
mod config;

#[cfg(feature = "cloud")]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is removed in part because it makes the docs sad otherwise, but also because the contained interfaces don't require any additional dependencies and are useful even when not using any of the first-party backends

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @tustvold -- this looks great.

@alamb alamb merged commit e78d140 into apache:master Oct 25, 2023
13 checks passed
@alamb alamb mentioned this pull request Oct 31, 2023
2 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
object-store Object Store Interface
Projects
None yet
4 participants