More sharding work #2
...and fix the test failure that happened as a result.
Tested using the 9846151.zarr sample from https://idr.github.io/ome-ngff-samples/ (dimensions 2947 x 5192 x 1402 x 3 x 1) also used for the NGFF challenge.
The conversion command was executed with zstd compression (to speed up conversion times and reduce size) for several variations of the shard size:
| --shard | Outer chunk shape | Inner chunk shape | Conversion time | Dataset size | Number of chunks |
|---|---|---|---|---|---|
| (none) | [1,1,1,1024,1024] | | 222m29.226s | 76G | 121975 |
| SUPERCHUNK | [1,2,2,2048,2048] | [1,1,1,1024,1024] | 753m55.036s | 76G | 16825 |
| 1,1,1,4096,4096 | [1,1,1,4096,4096] | [1,1,1,1024,1024] | 686m33.655s | 76G | 92572 |
| 1,1,4,4096,4096 | [1,1,4,4096,4096] | [1,1,1,1024,1024] | 1966m36.416s | 76G | 67348 |
Overall, the numbers are consistent with the expectations from the shard dimensions. It is clear that, as things stand, sharding adds a conversion overhead. I'll try to reproduce the last measurement, as the time feels like an outlier.
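As an illustrative sanity check (my own arithmetic, not from the thread): for the full-resolution array of shape [1,3,1402,5192,2947], the baseline chunking [1,1,1,1024,1024] implies

$$1 \times 3 \times 1402 \times \lceil 5192/1024 \rceil \times \lceil 2947/1024 \rceil = 3 \cdot 1402 \cdot 6 \cdot 3 = 75708$$

chunk files, while `SUPERCHUNK` shards of [1,2,2,2048,2048] imply

$$1 \times \lceil 3/2 \rceil \times \lceil 1402/2 \rceil \times \lceil 5192/2048 \rceil \times \lceil 2947/2048 \rceil = 2 \cdot 701 \cdot 3 \cdot 2 = 8412$$

shard files; the totals in the table additionally count the sub-resolution arrays.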
Thinking of possible next features up for discussion:
- chunk dimensions: for the dataset above, it would be really interesting to test a 3D rechunking (with and without shards)
- performance: I assume any form of parallelization would need to happen at the outer chunk level (see the sketch after this list)
- metadata update: to be compliant with the proposed RFC-2 also used by https://github.com/ome/ome2024-ngff-challenge
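A minimal sketch of what shard-level parallelism could look like (my own illustration, assuming hypothetical `readRegion`/`writeShard` helpers; this is not the zarr2zarr implementation). Each shard maps to exactly one task, so no two workers ever write the same output file:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ShardParallelism {

  public static void main(String[] args) throws Exception {
    long[] arrayShape = {1, 3, 1402, 5192, 2947};  // TCZYX, from the sample above
    long[] shardShape = {1, 2, 2, 2048, 2048};     // outer chunk shape

    // number of shards along each dimension, rounding up at the edges
    int[] grid = new int[arrayShape.length];
    int total = 1;
    for (int d = 0; d < arrayShape.length; d++) {
      grid[d] = (int) ((arrayShape[d] + shardShape[d] - 1) / shardShape[d]);
      total *= grid[d];
    }

    ExecutorService pool =
        Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
    List<Callable<Void>> tasks = new ArrayList<>();
    for (int flat = 0; flat < total; flat++) {
      // unflatten the linear index into a 5D shard index
      int[] shardIndex = new int[grid.length];
      int rem = flat;
      for (int d = grid.length - 1; d >= 0; d--) {
        shardIndex[d] = rem % grid[d];
        rem /= grid[d];
      }
      tasks.add(() -> {
        // hypothetical: read the input region covered by this shard,
        // encode its inner chunks, and write the shard in one pass
        // writeShard(shardIndex, readRegion(shardIndex));
        return null;
      });
    }
    pool.invokeAll(tasks);  // blocks until all shards are processed
    pool.shutdown();
  }
}
```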
I was able to reproduce the conversion times in #2 (review), i.e. ~1900m for `--shard 1,1,4,4096,4096`. I also ran a conversion with `--shard 1,2,2,2048,2048` for comparison.
I had not fully appreciated that `SUPERCHUNK` scales the shard shape with the shape of each resolution:

```
[sbesson@pilot-zarr3-dev zarr2zarr]$ cat 9846151_superchunk_zstd.zarr/0/0/zarr.json
{"zarr_format":3,"node_type":"array","shape":[1,3,1402,5192,2947],"data_type":"uint16","chunk_grid":{"name":"regular","configuration":{"chunk_shape":[1,2,2,2048,2048]}},"chunk_key_encoding":{"name":"default","configuration":{"separator":"/"}},"fill_value":255,"codecs":[{"name":"sharding_indexed","configuration":{"chunk_shape":[1,1,1,1024,1024],"codecs":[{"name":"bytes","configuration":{"endian":"little"}}],"index_codecs":[{"name":"bytes","configuration":{"endian":"little"}},{"name":"crc32c"}],"index_location":"end"}},{"name":"zstd","configuration":{"level":5,"checksum":true}}],"dimension_names":null,"attributes":{}}
[sbesson@pilot-zarr3-dev zarr2zarr]$ cat 9846151_superchunk_zstd.zarr/0/3/zarr.json
{"zarr_format":3,"node_type":"array","shape":[1,3,1402,649,368],"data_type":"uint16","chunk_grid":{"name":"regular","configuration":{"chunk_shape":[1,2,2,1298,736]}},"chunk_key_encoding":{"name":"default","configuration":{"separator":"/"}},"fill_value":255,"codecs":[{"name":"sharding_indexed","configuration":{"chunk_shape":[1,1,1,649,368],"codecs":[{"name":"bytes","configuration":{"endian":"little"}}],"index_codecs":[{"name":"bytes","configuration":{"endian":"little"}},{"name":"crc32c"}],"index_location":"end"}},{"name":"zstd","configuration":{"level":5,"checksum":true}}],"dimension_names":null,"attributes":{}}
```

while the custom sharding is a fixed value which will default to simple chunks if it exceeds the shape dimensions:

```
[sbesson@pilot-zarr3-dev zarr2zarr]$ cat 9846151_2x2x2048x2048_zstd.zarr/0/0/zarr.json
{"zarr_format":3,"node_type":"array","shape":[1,3,1402,5192,2947],"data_type":"uint16","chunk_grid":{"name":"regular","configuration":{"chunk_shape":[1,2,2,2048,2048]}},"chunk_key_encoding":{"name":"default","configuration":{"separator":"/"}},"fill_value":255,"codecs":[{"name":"sharding_indexed","configuration":{"chunk_shape":[1,1,1,1024,1024],"codecs":[{"name":"bytes","configuration":{"endian":"little"}}],"index_codecs":[{"name":"bytes","configuration":{"endian":"little"}},{"name":"crc32c"}],"index_location":"end"}},{"name":"zstd","configuration":{"level":5,"checksum":true}}],"dimension_names":null,"attributes":{}}
[sbesson@pilot-zarr3-dev zarr2zarr]$ cat 9846151_2x2x2048x2048_zstd.zarr/0/3/zarr.json
{"zarr_format":3,"node_type":"array","shape":[1,3,1402,649,368],"data_type":"uint16","chunk_grid":{"name":"regular","configuration":{"chunk_shape":[1,1,1,649,368]}},"chunk_key_encoding":{"name":"default","configuration":{"separator":"/"}},"fill_value":255,"codecs":[{"name":"bytes","configuration":{"endian":"little"}},{"name":"zstd","configuration":{"level":5,"checksum":true}}],"dimension_names":null,"attributes":{}}
```

Note that in the first scenario there is still an issue, as the shard/outer chunk shape for the 4th resolution ([1,2,2,1298,736]) exceeds the array shape ([1,3,1402,649,368]).

Semi-related: I tried to include the latest zarr-java 0.0.4 release, which adds some sharding validation, on top of this branch. Several sharding unit tests are now failing during conversion with exceptions.
I believe this is related to the scenario above as the specified shard exceeds the array dimensions for a resolution lower than the largest one. We probably need to decide what the behavior of the library should be in these scenarios. I see two options:
Discussed earlier with @melissalinkert. The zarr-java 0.0.4 issue has been raised upstream to confirm whether this was the accepted behavior.
Merging this PR as it gives us a first API to generate Zarr v3 datasets with arbitrary shard sizes. I'll also file some issues for the various points that came up as part of the code/functional review.
Follow-up to #1 and zarr-developers/zarr-java#5.
This fixes the check for valid shard/chunk size pairs, adds support for sharding in more than 2 dimensions, and allows custom shard sizes to be specified as a comma-separated array, e.g. `--shard 1,1,2,1024,1024`.
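For reference, the Zarr v3 sharding codec requires every inner chunk dimension to evenly divide the corresponding shard dimension. A minimal sketch of such a validity check (my illustration, not necessarily the exact logic added in this PR):

```java
class ShardValidation {

  // a shard/chunk size pair is valid when each inner chunk dimension
  // evenly divides the corresponding outer (shard) dimension
  static boolean isValidPair(long[] shardShape, long[] chunkShape) {
    if (shardShape.length != chunkShape.length) {
      return false;
    }
    for (int d = 0; d < shardShape.length; d++) {
      if (shardShape[d] % chunkShape[d] != 0) {
        // e.g. a 1024-wide shard with a 768-wide chunk is rejected
        return false;
      }
    }
    return true;
  }
}
```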