object_store: multipart ranges for HTTP #4612

Closed
clbarnes opened this issue Aug 1, 2023 · 8 comments
Labels
enhancement Any new improvement worthy of an entry in the changelog

Comments

@clbarnes
Contributor

clbarnes commented Aug 1, 2023

HTTP range requests support multiple byte ranges. Could object_store?

object_store::GetOptions::range could take, instead of a core::ops::Range, something like this: https://github.com/clbarnes/byteranges-rs/blob/0a953e7c580e96b65fe28e61ed460d6e221dcd8d/src/request.rs#L53-L136 . Downstream users might need to add an extra .into() call as the first hop in converting a Range into an HttpRange (per #4611) and then into a RangeHeader.
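For illustration, here is a minimal sketch of how several half-open Rust ranges could be turned into a single multi-range `Range` header value. The function name `range_header_value` is hypothetical; it is not part of object_store or of the linked crate, and it only covers the simple `first-last` form of byte ranges.

```rust
use std::ops::Range;

/// Format half-open byte ranges as the value of an HTTP `Range` header.
///
/// HTTP byte ranges are inclusive at both ends, so `0..100` becomes `0-99`.
/// Returns `None` if there are no ranges or if any range is empty.
/// (Hypothetical helper for illustration only.)
fn range_header_value(ranges: &[Range<u64>]) -> Option<String> {
    if ranges.is_empty() || ranges.iter().any(|r| r.start >= r.end) {
        return None;
    }
    let specs: Vec<String> = ranges
        .iter()
        .map(|r| format!("{}-{}", r.start, r.end - 1))
        .collect();
    Some(format!("bytes={}", specs.join(",")))
}
```

Calling it with `&[0..100, 200..300]` produces `bytes=0-99,200-299`, which is the form a multi-range request would send.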

What is actually returned depends heavily on the server. I think that means the only safe way to handle the response is to keep some representation of the whole file and fill in only the parts that the server sends back (which may be a single part, multiple parts, or the whole file). I have started work on a crate to deal with this: https://github.com/clbarnes/byteranges-rs/blob/main/src/response.rs which creates a Read + Seek type backed by a rope containing a mixture of "filler" ranges and the actual response ranges.
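The three response shapes mentioned above can be told apart from the status code and the `Content-Type` header. The sketch below assumes standard HTTP semantics (200 for the whole resource, 206 for partial content, `multipart/byteranges` for several parts); the enum `RangeResponseKind` and the function `classify` are hypothetical names, not taken from object_store or the linked crate.

```rust
/// The shapes a server may give in reply to a range request.
/// (Hypothetical type for illustration only.)
#[derive(Debug, PartialEq)]
enum RangeResponseKind {
    /// 200 OK: the server ignored the `Range` header and sent everything.
    WholeResource,
    /// 206 without a multipart content type: one contiguous part.
    SinglePart,
    /// 206 with `Content-Type: multipart/byteranges; boundary=...`.
    MultiPart { boundary: String },
    /// 416 Range Not Satisfiable.
    Unsatisfiable,
    /// Anything else, including a multipart reply with no boundary.
    Other,
}

fn classify(status: u16, content_type: Option<&str>) -> RangeResponseKind {
    match status {
        200 => RangeResponseKind::WholeResource,
        416 => RangeResponseKind::Unsatisfiable,
        206 => {
            // Split "type/subtype; key=value; ..." into its pieces.
            let mut params = content_type.unwrap_or("").split(';').map(str::trim);
            let is_multipart = params
                .next()
                .map_or(false, |m| m.eq_ignore_ascii_case("multipart/byteranges"));
            if !is_multipart {
                return RangeResponseKind::SinglePart;
            }
            // Look for the boundary parameter, stripping optional quotes.
            let boundary = params.find_map(|p| {
                let (key, value) = p.split_once('=')?;
                key.trim()
                    .eq_ignore_ascii_case("boundary")
                    .then(|| value.trim().trim_matches('"').to_string())
            });
            match boundary {
                Some(boundary) => RangeResponseKind::MultiPart { boundary },
                None => RangeResponseKind::Other,
            }
        }
        _ => RangeResponseKind::Other,
    }
}
```

Even in the `SinglePart` and `MultiPart` cases, a client would still need to read each part's `Content-Range` to learn which bytes it actually received.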

clbarnes added the enhancement label Aug 1, 2023
@tustvold
Contributor

tustvold commented Aug 2, 2023

I'm not aware of any cloud providers that support this. Could you perhaps expand upon your use-case?

@clbarnes
Contributor Author

clbarnes commented Aug 2, 2023

As part of the zarr project, we plan to store large tensors on a variety of backends (local, HTTP, object store), chunked into many separate files or objects. Under the sharding specification, each chunk (= shard) can contain many sub-chunks, which are independently encoded and then concatenated. We'd want to read a footer to find the byte addresses of the sub-chunks (see #4611), and then read (possibly multiple) byte ranges from the shard.

@tustvold
Contributor

tustvold commented Aug 2, 2023

This is similar to what is done by the parquet readers, and get_ranges is specifically optimised for this use-case. It will perform multiple fetch requests in parallel for cloud stores, coalescing adjacent byte ranges. Stores that can support this natively, like LocalFilesystem and Memory, provide custom implementations.
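To make the idea of coalescing adjacent byte ranges concrete, here is a minimal standalone sketch. It is not object_store's implementation; the function name `coalesce` and the `max_gap` parameter are assumptions chosen for illustration. Ranges whose gap is at most `max_gap` bytes are merged so that they can be fetched in one request.

```rust
use std::ops::Range;

/// Merge byte ranges that overlap or lie within `max_gap` bytes of each other.
/// Empty ranges are dropped; the result is sorted by start offset.
/// (Hypothetical helper for illustration only.)
fn coalesce(ranges: &[Range<u64>], max_gap: u64) -> Vec<Range<u64>> {
    let mut sorted: Vec<Range<u64>> = ranges
        .iter()
        .filter(|r| r.start < r.end)
        .cloned()
        .collect();
    sorted.sort_by_key(|r| r.start);

    let mut merged: Vec<Range<u64>> = Vec::new();
    for r in sorted {
        if let Some(last) = merged.last_mut() {
            // Close enough to the previous range: extend it instead of
            // starting a new one.
            if r.start <= last.end.saturating_add(max_gap) {
                last.end = last.end.max(r.end);
                continue;
            }
        }
        merged.push(r);
    }
    merged
}
```

After fetching the merged ranges, a caller would slice the originally requested ranges back out of the larger buffers. A bigger `max_gap` trades some wasted bytes for fewer requests.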

@clbarnes
Contributor Author

clbarnes commented Aug 2, 2023

Ok, so the functionality is effectively supported, just at a different level to what I was suggesting. Presumably if a given cloud provider did support multipart/byteranges responses, it could also be implemented that way for that store?

Thank you!

@tustvold
Contributor

I'm going to close this as I believe the requested functionality exists. Feel free to reopen if I am mistaken.

@JackKelly

FWIW, I've just submitted a feature request to Google Cloud Storage to ask them to support multi-part byte ranges. Please upvote the feature request if you think this feature could be useful!

@JackKelly

Oooh, cool, the Google Cloud folks replied to say:

the product engineering team is aware of this and currently in the process of evaluating it. While we cannot provide an estimated time of implementation or guarantee the fulfillment of the issue, please be assured that your input is highly valued

@clbarnes
Contributor Author

Multipart ranges would definitely be useful in some situations, but we would probably still want an escape hatch for fetching ranges in parallel, even on backends that do support multipart (e.g. when reading two large ranges rather than hundreds of tiny ones). The HTTP spec is very broad as to what servers are allowed to send back: the response doesn't need to contain the same number of ranges, in the same order, or even the exact ranges you asked for. I suspect the last is true of single ranges too, but it would be pretty psychopathic for a server to return anything besides the requested range or the full file.

I have an implementation of a synchronous Read/Seek-based sparse representation of a file made up of real data and (zero-cost) filler bytes. So you'd request ranges A, B, and C, then the server would return ranges X and Y (which may or may not map cleanly onto your request), then you build a local representation of the entire resource using X and Y, then read A, B, and C out of that. I suppose we might prefer something implementing bytes::Buf, but the data structure is probably the right idea to deal with the multipart response.
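The request-A/B/C, receive-X/Y flow described above can be sketched roughly as follows. This is a simplified illustration, not the linked crate's implementation: the `SparseFile` type and its methods are hypothetical, it stores parts in a plain `Vec` rather than a rope, and instead of exposing filler bytes it returns `None` for any range that no single received part fully covers.

```rust
use std::ops::Range;

/// A sparse view of a remote resource: only the parts the server actually
/// returned are stored; everything else is treated as missing.
/// (Hypothetical type for illustration only.)
struct SparseFile {
    /// (start offset, bytes) for each received part, sorted by offset.
    parts: Vec<(u64, Vec<u8>)>,
}

impl SparseFile {
    /// Build the view from the parts the server sent back (X and Y).
    fn new(mut parts: Vec<(u64, Vec<u8>)>) -> Self {
        parts.sort_by_key(|(start, _)| *start);
        Self { parts }
    }

    /// Read a requested range (A, B or C) out of the received parts.
    /// Returns `None` unless one received part covers the whole range.
    fn read(&self, range: Range<u64>) -> Option<&[u8]> {
        if range.start > range.end {
            return None;
        }
        self.parts.iter().find_map(|(start, data)| {
            let end = *start + data.len() as u64;
            if *start <= range.start && range.end <= end {
                let lo = (range.start - *start) as usize;
                let hi = (range.end - *start) as usize;
                Some(&data[lo..hi])
            } else {
                None
            }
        })
    }
}
```

With this shape, it does not matter whether the server returned exactly the requested ranges, a superset, or the whole file as one part starting at offset 0: the caller always reads the ranges it originally asked for.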


3 participants