object_store: multipart ranges for HTTP #4612

clbarnes · 2023-08-01T10:23:33Z

HTTP range requests support multiple byte ranges. Could object_store?

object_store::GetOptions::range could take, instead of a core::ops::Range, something like this: https://github.com/clbarnes/byteranges-rs/blob/0a953e7c580e96b65fe28e61ed460d6e221dcd8d/src/request.rs#L53-L136 . Downstream users might need to add an extra .into() call to cover the first hop, converting a Range into an HttpRange (per #4611 ), before that is converted into a RangeHeader.
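For illustration, a minimal sketch of how several half-open Rust ranges could be turned into a single HTTP Range header value. The function name range_header_value is hypothetical and is not part of object_store or byteranges-rs; the only assumption taken from HTTP itself is that byte ranges in the header are inclusive at both ends.

```rust
use std::ops::Range;

/// Format half-open byte ranges as an HTTP `Range` header value.
/// HTTP byte ranges are inclusive at both ends, so `0..100` becomes `0-99`.
/// Returns `None` if there are no ranges or if any range is empty.
/// (Hypothetical helper for illustration; not an object_store API.)
fn range_header_value(ranges: &[Range<u64>]) -> Option<String> {
    if ranges.is_empty() || ranges.iter().any(|r| r.start >= r.end) {
        return None;
    }
    let specs: Vec<String> = ranges
        .iter()
        .map(|r| format!("{}-{}", r.start, r.end - 1))
        .collect();
    Some(format!("bytes={}", specs.join(",")))
}
```

Calling range_header_value(&[0..100, 200..300]) yields the value bytes=0-99,200-299, which a server is free to answer with a multipart/byteranges body, a single range, or the whole resource.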

What is actually returned depends very much on the server. I think this means the only safe way to deal with the response is to have some representation of the whole file, which you then flesh out only with the bits that the server sends back (which may be a single part, multiple parts, or the whole file). I have started work on a crate to deal with this: https://github.com/clbarnes/byteranges-rs/blob/main/src/response.rs . It creates a Read/Seekable based on a rope containing a mixture of "filler" ranges and the actual response ranges.

tustvold · 2023-08-02T08:51:16Z

I'm not aware of any cloud providers that support this. Could you perhaps expand upon your use case?

clbarnes · 2023-08-02T09:22:39Z

As part of the zarr project, we plan to store large tensors on a variety of backends (local/ HTTP/ object store), which are chunked into many separate files/ objects. As part of the sharding specification, each chunk (=shard) could contain many sub-chunks which are independently encoded and then concatenated. We'd want to read a footer to find the byte addresses of sub-chunks (see #4611 ), and then read (possibly multiple) byte ranges from the shard.
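A rough sketch of the "read a footer, then derive byte ranges" step. The footer layout used here (a flat list of offset/length pairs, each a little-endian u64) is purely an assumption for illustration; the real index format is defined by the zarr sharding specification and is not reproduced here. The name footer_to_ranges is hypothetical.

```rust
use std::convert::TryInto;
use std::ops::Range;

/// Decode a footer made of `(offset, length)` pairs, each a little-endian
/// u64, into half-open byte ranges within the shard.
/// ASSUMPTION: this layout is hypothetical and chosen only for illustration.
/// Returns `None` if the footer length is not a multiple of 16 bytes or if
/// `offset + length` overflows.
fn footer_to_ranges(footer: &[u8]) -> Option<Vec<Range<u64>>> {
    if footer.len() % 16 != 0 {
        return None;
    }
    let mut ranges = Vec::with_capacity(footer.len() / 16);
    for entry in footer.chunks_exact(16) {
        let offset = u64::from_le_bytes(entry[..8].try_into().ok()?);
        let length = u64::from_le_bytes(entry[8..].try_into().ok()?);
        ranges.push(offset..offset.checked_add(length)?);
    }
    Some(ranges)
}
```

The resulting ranges are what would then be passed to a multi-range read of the shard.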

tustvold · 2023-08-02T09:46:09Z

This is similar to what is done by the parquet readers, and get_ranges is specifically optimised for this use-case. It will perform multiple fetch requests in parallel for cloud stores, coalescing adjacent byte ranges. Stores that can support this natively, like LocalFileSystem and InMemory, provide custom implementations.
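To make "coalescing adjacent byte ranges" concrete, here is a minimal sketch of the general idea: sort the requested ranges and merge any that touch, overlap, or sit within a small gap of each other, so that fewer requests are issued. This is not object_store's actual implementation; the function name coalesce, the max_gap parameter, and the use of u64 offsets are assumptions made for the example.

```rust
use std::ops::Range;

/// Merge byte ranges that overlap or lie within `max_gap` bytes of each
/// other. The result is sorted by start offset.
/// (Illustrative sketch only; not the object_store implementation.)
fn coalesce(ranges: &[Range<u64>], max_gap: u64) -> Vec<Range<u64>> {
    let mut sorted: Vec<Range<u64>> = ranges.to_vec();
    sorted.sort_by_key(|r| r.start);

    let mut merged: Vec<Range<u64>> = Vec::new();
    for r in sorted {
        if let Some(last) = merged.last_mut() {
            // Close enough to the previous range: extend it instead of
            // starting a new fetch.
            if r.start <= last.end.saturating_add(max_gap) {
                last.end = last.end.max(r.end);
                continue;
            }
        }
        merged.push(r);
    }
    merged
}
```

Each merged range would then be fetched with one request (possibly in parallel with the others), and the originally requested ranges sliced back out of the fetched bytes. A larger max_gap trades some wasted bandwidth for fewer round trips.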

clbarnes · 2023-08-02T09:53:03Z

Ok, so the functionality is effectively supported, just at a different level to what I was suggesting. Presumably if a given cloud provider did support multipart/byteranges responses, it could also be implemented that way for that store?

Thank you!

tustvold · 2023-08-11T15:02:23Z

I'm going to close this as I believe the requested functionality exists. Feel free to reopen if I am mistaken.

JackKelly · 2023-12-13T16:17:00Z

FWIW, I've just submitted a feature request to Google Cloud Storage to ask them to support multi-part byte ranges. Please upvote the feature request if you think this feature could be useful!

JackKelly · 2023-12-14T19:17:18Z

Oooh, cool, the Google Cloud folks replied to say:

the product engineering team is aware of this and currently in the process of evaluating it. While we cannot provide an estimated time of implementation or guarantee the fulfillment of the issue, please be assured that your input is highly valued

clbarnes · 2023-12-14T23:03:30Z

Multipart ranges would definitely be useful in some situations, but we probably still want an escape hatch for fetching ranges in parallel, and not only for backends that don't support multipart (e.g. when reading two large ranges rather than hundreds of tiny ones). The HTTP spec is very broad as to what servers are allowed to send back: it doesn't need to be the same number of ranges, in the same order, or even the exact ranges you asked for. I suspect the last point is true of single ranges too, but it would be pretty psychopathic for a server to return anything besides the requested range or the full file.

I have an implementation of a synchronous Read/Seek-based sparse representation of a file made up of real data and (zero-cost) filler bytes. So you'd request ranges A, B, and C, then the server would return ranges X and Y (which may or may not map cleanly onto your request), then you build a local representation of the entire resource using X and Y, then read A, B, and C out of that. I suppose we might prefer something implementing bytes::Buf, but the data structure is probably the right idea to deal with the multipart response.
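The request/response mismatch described above can be sketched with a much simpler structure than the rope in the linked crate. The sketch below stores only the parts the server returned, keyed by start offset, and serves a requested range when a single stored part covers it. The SparseFile type and its methods are hypothetical names for this example; unlike the Read/Seek implementation described above, it returns None for uncovered bytes instead of filler, and it does not stitch together neighbouring parts.

```rust
use std::collections::BTreeMap;
use std::ops::Range;

/// Sparse view of a remote resource: only the byte runs the server actually
/// returned are stored, keyed by their starting offset.
/// (Hypothetical type for illustration; not the byteranges-rs data structure.)
#[derive(Default)]
struct SparseFile {
    parts: BTreeMap<u64, Vec<u8>>,
}

impl SparseFile {
    /// Record one returned part, e.g. one body of a multipart response.
    /// Assumes parts do not overlap each other.
    fn insert(&mut self, offset: u64, data: Vec<u8>) {
        self.parts.insert(offset, data);
    }

    /// Return the requested range if a single stored part covers it fully,
    /// or `None` if any of it was not sent by the server.
    fn read(&self, range: Range<u64>) -> Option<&[u8]> {
        // The last part that starts at or before the requested start.
        let (&offset, data) = self.parts.range(..=range.start).next_back()?;
        let start = (range.start - offset) as usize;
        let end = range.end.checked_sub(offset)? as usize;
        data.get(start..end)
    }
}
```

In the terms used above: insert the returned ranges X and Y, then call read for each of A, B, and C; a None result signals that a follow-up request is needed for that range.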

clbarnes added the enhancement (Any new improvement worthy of an entry in the changelog) label Aug 1, 2023

clbarnes mentioned this issue Aug 2, 2023

ZEP0002 Review zarr-developers/zarr-specs#254

Closed

tustvold closed this as completed Aug 11, 2023

JackKelly mentioned this issue Dec 15, 2023

Cache chunks in RAM JackKelly/light-speed-io#9

Open

clbarnes mentioned this issue Dec 15, 2023

object_store: suffix requests #5206

Closed
