Fix preset sharding options and add tests #1
This wasn't necessary when building locally, but hopefully fixes CI builds.
Tested against a 384-well plate (https://ome.github.io/ome-ngff-validator/?source=https://uk1s3.embassy.ebi.ac.uk/idr/zarr/v0.4/idr0128E/9701.zarr) with images of size 2048x2048. Using the different sharding option
The SUPERCHUNK configuration raised warnings related to incompatible sizes, which is surprising as the 2048x2048 array should be divided into four 1024x1024 inner chunks and I would expect this configuration to be identical to the
Tested the conversion on a 3D image (https://ome.github.io/ome-ngff-validator/?source=https://uk1s3.embassy.ebi.ac.uk/idr/zarr/v0.4/idr0048A/9846151.zarr/)
[sbesson@pilot-zarr3-dev idr0048]$ time ~/zarr2zarr-0.0.1-SNAPSHOT/bin/zarr2zarr 9846151.zarr/ 9846151_v3_default.zarr/
21:19:11.481 [main] INFO com.glencoesoftware.zarr.Convert -- opened 9846151.zarr/0
21:19:11.498 [main] INFO com.glencoesoftware.zarr.Convert -- got 2 series attributes
21:19:14.222 [main] INFO com.glencoesoftware.zarr.Convert -- found 6 resolutions
21:19:14.258 [main] INFO com.glencoesoftware.zarr.Convert -- opened array 9846151.zarr/0/0
23:17:45.866 [main] INFO com.glencoesoftware.zarr.Convert -- opened array 9846151.zarr/0/1
23:50:01.207 [main] INFO com.glencoesoftware.zarr.Convert -- opened array 9846151.zarr/0/2
23:58:55.521 [main] INFO com.glencoesoftware.zarr.Convert -- opened array 9846151.zarr/0/3
00:02:02.136 [main] INFO com.glencoesoftware.zarr.Convert -- opened array 9846151.zarr/0/4
00:04:15.280 [main] INFO com.glencoesoftware.zarr.Convert -- opened array 9846151.zarr/0/5
real 166m17.857s
user 21m3.538s
sys 5m7.594s
[sbesson@pilot-zarr3-dev idr0048]$ time ~/zarr2zarr-0.0.1-SNAPSHOT/bin/zarr2zarr 9846151.zarr/ 9846151_v3_single.zarr/ --shard single
04:30:52.405 [main] INFO com.glencoesoftware.zarr.Convert -- opened 9846151.zarr/0
04:30:52.414 [main] INFO com.glencoesoftware.zarr.Convert -- got 2 series attributes
04:30:54.712 [main] INFO com.glencoesoftware.zarr.Convert -- found 6 resolutions
04:30:54.834 [main] INFO com.glencoesoftware.zarr.Convert -- opened array 9846151.zarr/0/0
04:30:54.835 [main] WARN com.glencoesoftware.zarr.Convert -- Skipping sharding due to incompatible sizes
06:29:44.030 [main] INFO com.glencoesoftware.zarr.Convert -- opened array 9846151.zarr/0/1
06:29:44.030 [main] WARN com.glencoesoftware.zarr.Convert -- Skipping sharding due to incompatible sizes
07:01:48.022 [main] INFO com.glencoesoftware.zarr.Convert -- opened array 9846151.zarr/0/2
07:01:48.022 [main] WARN com.glencoesoftware.zarr.Convert -- Skipping sharding due to incompatible sizes
07:11:04.974 [main] INFO com.glencoesoftware.zarr.Convert -- opened array 9846151.zarr/0/3
07:14:04.809 [main] INFO com.glencoesoftware.zarr.Convert -- opened array 9846151.zarr/0/4
07:16:17.589 [main] INFO com.glencoesoftware.zarr.Convert -- opened array 9846151.zarr/0/5
real 166m58.189s
user 20m39.577s
sys 5m3.107s
[sbesson@pilot-zarr3-dev idr0048]$ time ~/zarr2zarr-0.0.1-SNAPSHOT/bin/zarr2zarr 9846151.zarr/ 9846151_v3_superchunk.zarr/ --shard superchunk
07:44:39.263 [main] INFO com.glencoesoftware.zarr.Convert -- opened 9846151.zarr/0
07:44:39.270 [main] INFO com.glencoesoftware.zarr.Convert -- got 2 series attributes
07:44:41.304 [main] INFO com.glencoesoftware.zarr.Convert -- found 6 resolutions
07:44:41.379 [main] INFO com.glencoesoftware.zarr.Convert -- opened array 9846151.zarr/0/0
07:44:41.380 [main] WARN com.glencoesoftware.zarr.Convert -- Skipping sharding due to incompatible sizes
09:31:50.220 [main] INFO com.glencoesoftware.zarr.Convert -- opened array 9846151.zarr/0/1
09:31:50.221 [main] WARN com.glencoesoftware.zarr.Convert -- Skipping sharding due to incompatible sizes
09:56:35.121 [main] INFO com.glencoesoftware.zarr.Convert -- opened array 9846151.zarr/0/2
09:56:35.121 [main] WARN com.glencoesoftware.zarr.Convert -- Skipping sharding due to incompatible sizes
10:02:39.938 [main] INFO com.glencoesoftware.zarr.Convert -- opened array 9846151.zarr/0/3
10:04:34.103 [main] INFO com.glencoesoftware.zarr.Convert -- opened array 9846151.zarr/0/4
10:05:33.205 [main] INFO com.glencoesoftware.zarr.Convert -- opened array 9846151.zarr/0/5
real 141m30.396s
user 19m27.082s
sys 4m37.969s
[sbesson@pilot-zarr3-dev idr0048]$ ls
9846151.zarr 9846151_v3_default.zarr 9846151_v3_single.zarr 9846151_v3_superchunk.zarr
[sbesson@pilot-zarr3-dev idr0048]$ du -csh *
133G 9846151.zarr
212G 9846151_v3_default.zarr
212G 9846151_v3_single.zarr
212G 9846151_v3_superchunk.zarr
768G total
[sbesson@pilot-zarr3-dev idr0048]$ find 9846151.zarr -type f | wc
121987 121987 3562972
[sbesson@pilot-zarr3-dev idr0048]$ find 9846151_v3_default.zarr/ -type f | wc
121984 121984 5148693
[sbesson@pilot-zarr3-dev idr0048]$ find 9846151_v3_single.zarr/ -type f | wc
121984 121984 5026709
[sbesson@pilot-zarr3-dev idr0048]$ find 9846151_v3_superchunk.zarr/ -type f | wc
121984 121984 5514645
The difference in size made me realize that the default compression is none, so I ran another round of conversion against the sample plate with sharding, blosc compression, and the increased verbosity added in the last commit:
[sbesson@pilot-zarr3-dev idr0128]$ time ~/zarr2zarr-0.0.1-SNAPSHOT/bin/zarr2zarr 9701.zarr/ 9701_v3_single_blosc.zarr --shard single --compression blosc --debug
...
real 12m19.188s
user 5m44.453s
sys 0m37.196s
[sbesson@pilot-zarr3-dev idr0128]$ du -csh *
8.0G 9701.zarr
8.0G 9701_v3_chunk.zarr
8.0G 9701_v3_default.zarr
8.0G 9701_v3_single.zarr
5.5G 9701_v3_single_blosc.zarr
8.0G 9701_v3_superchunk.zarr
46G total
Possibly the next thing to look into is why sharding is rejected for the 3D image. But happy for this to be merged.
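This fixes the three "preset" --shard options (SINGLE, CHUNK, and SUPERCHUNK) based on comments in zarr-developers/zarr-java#5.
Using --shard SINGLE attempts to create a v3 dataset with a single shard covering the entire array. --shard CHUNK creates one shard per chunk, and --shard SUPERCHUNK attempts to create shards with 2x2 chunks per shard.
The --shard option has no effect on writing v2 data, so should only be used in the context of converting v2 to v3.
As the chunkAndShardCompatible method suggests, as far as I can tell the chunk, shard, and array shape all need to divide evenly into each other or an exception will be thrown. Those cases should be caught with a warning now, but use caution if testing with v2 input data that has array shapes that are not an exact multiple of the chunk size.
There is a placeholder here for allowing custom shard sizes to be specified; I plan to implement that in a separate PR.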
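For illustration, the divisibility rule described above can be sketched roughly as follows. This is a minimal sketch and not the code in this PR: the class ShardPresetSketch, the Preset enum, and the methods shardShape and compatible are all hypothetical names, and treating SUPERCHUNK as doubling the chunk size along the last two dimensions is an assumption about what "2x2 chunks per shard" means.

```java
import java.util.Arrays;

// Hypothetical sketch; none of these names come from zarr2zarr or zarr-java.
class ShardPresetSketch {

  enum Preset { SINGLE, CHUNK, SUPERCHUNK }

  // Derive a shard shape from the array and chunk shapes for a given preset.
  static int[] shardShape(Preset preset, int[] arrayShape, int[] chunkShape) {
    int rank = arrayShape.length;
    int[] shard = new int[rank];
    for (int i = 0; i < rank; i++) {
      switch (preset) {
        case SINGLE:
          // One shard covering the entire array.
          shard[i] = arrayShape[i];
          break;
        case CHUNK:
          // One shard per chunk.
          shard[i] = chunkShape[i];
          break;
        case SUPERCHUNK:
          // Assumption: 2x2 chunks per shard along the last two dimensions only.
          shard[i] = (i >= rank - 2) ? chunkShape[i] * 2 : chunkShape[i];
          break;
        default:
          throw new IllegalArgumentException("Unknown preset: " + preset);
      }
    }
    return shard;
  }

  // True only if the chunk divides the shard and the shard divides the array
  // evenly in every dimension.
  static boolean compatible(int[] arrayShape, int[] chunkShape, int[] shardShape) {
    for (int i = 0; i < arrayShape.length; i++) {
      if (shardShape[i] % chunkShape[i] != 0 || arrayShape[i] % shardShape[i] != 0) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    // Shapes taken from the review comment: 2048x2048 array, 1024x1024 chunks.
    int[] array = {2048, 2048};
    int[] chunk = {1024, 1024};
    for (Preset p : Preset.values()) {
      int[] shard = shardShape(p, array, chunk);
      System.out.println(p + " -> " + Arrays.toString(shard)
          + " compatible=" + compatible(array, chunk, shard));
    }
  }
}
```

Under these assumptions, SINGLE and SUPERCHUNK produce the same 2048x2048 shard for a 2048x2048 array with 1024x1024 chunks, while an array whose shape is not an exact multiple of the chunk size fails the check for every preset.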
method suggests, as far as I can tell the chunk, shard, and array shape all need to divide evenly into each other or an exception will be thrown. Those cases should be caught with a warning now, but use caution if testing with v2 input data that has array shapes that are not an exact multiple of the chunk size.There is a placeholder here for allowing custom shard sizes to be specified; I plan to implement that in a separate PR.