Support sharding for remote datasets #6823

Closed
fm3 opened this issue Feb 7, 2023 · 3 comments · Fixed by #6920

Comments

@fm3
Member

fm3 commented Feb 7, 2023

Both Neuroglancer Precomputed and Zarr (v3) support sharded data formats. When such datasets are served from a file server, range requests should be used to fetch only the individual chunks that are needed from the shards.

This needs to

  • understand the sharding metadata of the datasets to figure out byte offsets and lengths of the chunks to request
  • adapt the NIO file systems to support such partial reads
  • adapt the data reading code to use those partial reads (both in local and remote case)
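
As a rough illustration of the first point, here is a minimal sketch of turning a shard index into a byte range for one chunk. It assumes a simplified, hypothetical index layout: one pair of little-endian unsigned 64-bit integers (offset, length) per chunk, with all bits set marking a missing chunk. The real Neuroglancer Precomputed and Zarr v3 layouts differ in their details and must be taken from their specifications. The names `ShardIndexSketch`, `ChunkRange` and `lookup` are made up for this example.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Optional;

final class ShardIndexSketch {
    /** Byte range of one chunk inside a shard file (hypothetical helper type). */
    record ChunkRange(long offset, long length) {}

    private static final int ENTRY_BYTES = 16; // assumed: two 8-byte integers per chunk

    /**
     * Looks up the byte range of the chunk at chunkIndex.
     * Assumed layout: (offset, length) pairs as little-endian 64-bit values,
     * with both values set to all ones (-1 as a signed long) for a missing chunk.
     */
    static Optional<ChunkRange> lookup(byte[] indexBytes, int chunkIndex) {
        ByteBuffer buf = ByteBuffer.wrap(indexBytes).order(ByteOrder.LITTLE_ENDIAN);
        long offset = buf.getLong(chunkIndex * ENTRY_BYTES);
        long length = buf.getLong(chunkIndex * ENTRY_BYTES + 8);
        if (offset == -1L && length == -1L) {
            return Optional.empty(); // chunk is not stored in this shard
        }
        return Optional.of(new ChunkRange(offset, length));
    }
}
```

The resulting offset and length are what a partial read or an HTTP range request would then ask for.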

Possibly related:
https://stackoverflow.com/questions/35745403/java-how-to-read-part-of-file-from-specified-position-of-bytes
My assumption is that in the NIO implementation this will use SeekableByteChannel, which is implemented by the different NIO file system implementations. The HttpsFileSystem currently reads the whole file before returning a SeekableByteChannel and needs to be adapted to use HTTP range requests. I don’t know how the S3 and GCS file system adapters behave.
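
For reference, a partial read through the standard NIO API could look roughly like the sketch below. It only uses `Files.newByteChannel` and `SeekableByteChannel` from the JDK. `PartialReadSketch` and `readRange` are names invented for this example. Whether the read is actually partial depends on the file system provider behind the `Path`: a provider that downloads the whole file first will still return correct bytes, just without saving any transfer.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class PartialReadSketch {
    /** Reads up to length bytes starting at offset; returns fewer bytes if EOF is hit. */
    static byte[] readRange(Path path, long offset, int length) throws IOException {
        try (SeekableByteChannel channel = Files.newByteChannel(path, StandardOpenOption.READ)) {
            channel.position(offset); // seek to the start of the chunk
            ByteBuffer buffer = ByteBuffer.allocate(length);
            // A single read may return fewer bytes than requested, so loop until full or EOF.
            while (buffer.hasRemaining()) {
                if (channel.read(buffer) < 0) {
                    break;
                }
            }
            byte[] result = new byte[buffer.position()];
            buffer.flip();
            buffer.get(result);
            return result;
        }
    }
}
```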

@frcroth
Member

frcroth commented Feb 7, 2023

S3FileSystem seems to download the entire file and then return a SeekableByteChannel on that file, so that wouldn't work for sharding.

@fm3
Member Author

fm3 commented Feb 7, 2023

Yeah, I hope we can change that. Maybe the S3 client used by the s3fs has a specialized kind of GetObjectRequest we could use.

@fm3
Member Author

fm3 commented Feb 9, 2023

I think a good first step would be to adapt the HTTPS file system, which would also show how well the NIO API works with range requests. Feel free to replace the HTTP library used there if needed (I think the one I plugged in there has been discontinued in the meantime).
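
To make the range-request idea concrete, here is a minimal sketch using the JDK's built-in `java.net.http.HttpClient` (Java 11+). This is just one possible replacement library, not necessarily the one to pick. `HttpRangeSketch`, `rangeHeader`, `buildRequest` and `fetchRange` are names invented for this example. It assumes the server honours the `Range` header and answers with `206 Partial Content`. A server that ignores the header would answer `200` with the full body, which the sketch treats as an error.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

final class HttpRangeSketch {
    /** Builds a Range header value; HTTP byte ranges are inclusive on both ends. */
    static String rangeHeader(long offset, long length) {
        if (offset < 0 || length <= 0) {
            throw new IllegalArgumentException("offset must be >= 0 and length > 0");
        }
        return "bytes=" + offset + "-" + (offset + length - 1);
    }

    /** Builds a GET request for length bytes starting at offset. */
    static HttpRequest buildRequest(URI uri, long offset, long length) {
        return HttpRequest.newBuilder(uri)
                .header("Range", rangeHeader(offset, length))
                .GET()
                .build();
    }

    /** Fetches one byte range; fails unless the server answers 206 Partial Content. */
    static byte[] fetchRange(HttpClient client, URI uri, long offset, long length)
            throws IOException, InterruptedException {
        HttpResponse<byte[]> response =
                client.send(buildRequest(uri, offset, length), HttpResponse.BodyHandlers.ofByteArray());
        if (response.statusCode() != 206) {
            throw new IOException("Expected 206 Partial Content, got " + response.statusCode());
        }
        return response.body();
    }
}
```

A `SeekableByteChannel` for the HTTPS file system could call something like `fetchRange` from its `read` method, using the channel's current position as the offset, instead of downloading the whole file up front.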
