object_store: multipart ranges for HTTP #4612
I'm not aware of any cloud providers that support this; could you perhaps expand upon your use case?
As part of the zarr project, we plan to store large tensors on a variety of backends (local, HTTP, object store), which are chunked into many separate files/objects. As part of the sharding specification, each chunk (= shard) could contain many sub-chunks which are independently encoded and then concatenated. We'd want to read a footer to find the byte addresses of the sub-chunks (see #4611), and then read (possibly multiple) byte ranges from the shard.
This is similar to what is done by the parquet readers, and `get_ranges` is specifically optimised for this use-case. It will perform multiple fetch requests in parallel for cloud stores, coalescing adjacent byte ranges. Stores that can support this natively, like LocalFilesystem and Memory, provide custom implementations.
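The coalescing behaviour mentioned above can be sketched roughly as follows. This is an illustrative re-implementation, not object_store's actual code; `max_gap` is a made-up tuning knob (the real implementation has its own internal heuristics):

```rust
use std::ops::Range;

/// Merge byte ranges whose gap is at most `max_gap`, so nearby reads
/// can be served by a single fetch. `max_gap` is a hypothetical knob,
/// not an object_store parameter.
fn coalesce_ranges(mut ranges: Vec<Range<usize>>, max_gap: usize) -> Vec<Range<usize>> {
    ranges.sort_by_key(|r| r.start);
    let mut out: Vec<Range<usize>> = Vec::new();
    for r in ranges {
        match out.last_mut() {
            // Extend the previous range if the gap is small enough.
            Some(last) if r.start <= last.end + max_gap => {
                last.end = last.end.max(r.end);
            }
            _ => out.push(r),
        }
    }
    out
}

fn main() {
    let ranges = vec![0..10, 12..20, 100..110];
    // 0..10 and 12..20 merge (gap of 2 <= 4); 100..110 stays separate.
    println!("{:?}", coalesce_ranges(ranges, 4));
}
```

Each merged range then becomes one fetch, issued in parallel with the others.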
Ok, so the functionality is effectively supported, just at a different level to what I was suggesting. Presumably if a given cloud provider did support this, it could be used under the hood. Thank you!
I'm going to close this as I believe the requested functionality exists; feel free to reopen if I am mistaken.
FWIW, I've just submitted a feature request to Google Cloud Storage to ask them to support multi-part byte ranges. Please upvote the feature request if you think this feature could be useful! |
Oooh, cool, the Google Cloud folks replied to say:
Multipart ranges would definitely be useful in some situations, but we probably still want an escape hatch to allow getting ranges in parallel, even beyond backends which don't support multipart (e.g. reading two large ranges rather than hundreds of tiny ones).

The HTTP spec is very broad as to what servers are allowed to send back: it doesn't need to be the same number of ranges, in the same order, or even the exact ranges you asked for. I suspect that last point is true of single ranges too, but it would be pretty psychopathic for a server to return anything besides the requested range or the full file.

I have an implementation of a synchronous Read/Seek-based sparse representation of a file made up of real data and (zero-cost) filler bytes. So you'd request ranges A, B, and C; the server would return ranges X and Y (which may or may not map cleanly onto your request); you build a local representation of the entire resource using X and Y; then you read A, B, and C out of that. I suppose we might prefer something implementing
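The sparse representation described here can be sketched with the standard library alone. This is a simplified illustration of the idea, assuming the struct and segment layout shown, not the byteranges-rs implementation:

```rust
use std::io::{Error, ErrorKind, Read, Result, Seek, SeekFrom};

/// Sketch of a sparse file: real byte segments returned by the server,
/// with zero filler everywhere else. Not the byteranges-rs code.
struct SparseFile {
    len: u64,
    /// (offset, data) segments, kept sorted by offset.
    segments: Vec<(u64, Vec<u8>)>,
    pos: u64,
}

impl SparseFile {
    fn new(len: u64, mut segments: Vec<(u64, Vec<u8>)>) -> Self {
        segments.sort_by_key(|(off, _)| *off);
        SparseFile { len, segments, pos: 0 }
    }

    /// Byte at absolute offset: real data if a segment covers it, else 0.
    fn byte_at(&self, off: u64) -> u8 {
        for (start, data) in &self.segments {
            if off >= *start && off < start + data.len() as u64 {
                return data[(off - start) as usize];
            }
        }
        0
    }
}

impl Read for SparseFile {
    fn read(&mut self, buf: &mut [u8]) -> Result<usize> {
        let remaining = self.len.saturating_sub(self.pos);
        let n = (buf.len() as u64).min(remaining) as usize;
        for i in 0..n {
            buf[i] = self.byte_at(self.pos + i as u64);
        }
        self.pos += n as u64;
        Ok(n)
    }
}

impl Seek for SparseFile {
    fn seek(&mut self, from: SeekFrom) -> Result<u64> {
        let new = match from {
            SeekFrom::Start(o) => o as i64,
            SeekFrom::End(o) => self.len as i64 + o,
            SeekFrom::Current(o) => self.pos as i64 + o,
        };
        if new < 0 {
            return Err(Error::new(ErrorKind::InvalidInput, "negative seek"));
        }
        self.pos = new as u64;
        Ok(self.pos)
    }
}

fn main() -> Result<()> {
    // Server returned one range at offset 4; the rest is filler zeros.
    let mut f = SparseFile::new(10, vec![(4, b"data".to_vec())]);
    f.seek(SeekFrom::Start(4))?;
    let mut buf = [0u8; 4];
    f.read(&mut buf)?;
    println!("{}", String::from_utf8_lossy(&buf));
    Ok(())
}
```

A rope-based implementation would avoid the per-byte segment scan, but the interface is the same: seek to any requested range and read, regardless of how the server chose to slice its response.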
HTTP range requests support multiple byte ranges. Could object_store?
`object_store::GetOptions::range` could take, instead of a `core::ops::Range`, something like this: https://github.com/clbarnes/byteranges-rs/blob/0a953e7c580e96b65fe28e61ed460d6e221dcd8d/src/request.rs#L53-L136. Downstream users might need to add an extra `.into::<Something>()` to make the first hop of getting a `Range` into an `HttpRange` (per #4611) into a `RangeHeader`.

What is actually returned depends very much on the server, which, I think, means the only safe way to deal with the response is to have some representation of the whole file which you then flesh out with the bits the server sends back (which may be a single part, multiple parts, or the whole file). I have started work on a crate to deal with this: https://github.com/clbarnes/byteranges-rs/blob/main/src/response.rs, which creates a `Read`/`Seek`-able based on a rope containing a mixture of "filler" ranges and the actual response ranges.
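On the request side, multiple ranges are expressed in a single `Range` header. A hedged sketch of turning Rust half-open ranges into that header value (the helper name is made up; this is not object_store or byteranges-rs code):

```rust
use std::ops::Range;

/// Format half-open Rust ranges as an HTTP `Range` header value.
/// RFC 9110 byte ranges use inclusive positions: `bytes=0-99,200-299`.
/// Illustrative helper only, not part of any existing API.
fn range_header(ranges: &[Range<u64>]) -> Option<String> {
    if ranges.is_empty() || ranges.iter().any(|r| r.end <= r.start) {
        return None; // empty or invalid ranges cannot be expressed
    }
    let specs: Vec<String> = ranges
        .iter()
        .map(|r| format!("{}-{}", r.start, r.end - 1))
        .collect();
    Some(format!("bytes={}", specs.join(",")))
}

fn main() {
    // Two sub-chunk reads expressed as one multipart range request.
    println!("{:?}", range_header(&[0..100, 200..300]));
}
```

The response side is the hard part, since a server honouring this header replies with `multipart/byteranges` and may reorder or merge the parts, which is exactly what the sparse-representation approach above is meant to absorb.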