allow creating references for empty archival datasets #260
another example fwiw is this seaice product, it started in 1978 and initially only had data every 2 days, but became daily later ... the old binary files just skipped a day but the recent-ish netcdf rewrite filled the blanks with these empty files, they only have the dims, attributes, and the dummy crs variable (which now I think about it will probably be fine for virtual-ref).
that needs earthdata creds, so zipped and attached: NSIDC0051_SEAICE_PS_S25km_19781027_v2.0.nc.zip. But I wanted to put it out there because these empty files do occur for various reasons.
thanks for confirming that these are not just broken files. In that case I wonder how to best support these: in theory, writing |
We should also check if size-0 Zarr arrays are possible.
also fwiw as a todo for me, the GDAL autotest suite has some metadata-only examples, and I wanted to explore how xarray treats related, as opposed to zarr python itself, in case there was some misalignment in how GDAL should behave too: https://github.com/OSGeo/gdal/tree/master/autotest/gdrivers/data/zarr/array_attrs.zarr (xarray is fine with the empty-but-for-scalar-var netcdf, but not with the empty GDAL zarr)
There's multiple issues with that (I think): Once those are fixed, |
I've repurposed this PR to instead allow reading and writing variables without chunks (detected by no chunks and This appears to work properly, but I need help with the typing of |
I think you might need to cast. I don't think numpy's handling of generics like this fully works yet. See also the PR I recently merged that fixed some similar typing errors.
So how do these get represented in the manifest? Size-0 numpy arrays? If so, have you tried concatenating them or doing any other operations to make sure that ManifestArray doesn't break?
after some more investigation, I believe we won't be able to use VirtualiZarr/virtualizarr/manifests/manifest.py Lines 112 to 114 in e6407e0
entries={} to the suggestion below.
Another option would be to construct paths / offsets / lengths with the actual shape, but the values in paths would be the missing value marker (the |
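To make the "actual shape, missing-value paths" option concrete, here is a minimal sketch using plain numpy. It is not the VirtualiZarr implementation: the helper names `empty_chunk_arrays` and `to_dict` are hypothetical, and using the empty string as the missing-value marker is an assumption made for illustration.

```python
import numpy as np

# Assumption: an empty path marks a chunk that has no backing bytes.
MISSING = ""


def empty_chunk_arrays(grid_shape):
    """Build paths / offsets / lengths with the real chunk-grid shape, all entries missing."""
    paths = np.full(grid_shape, MISSING, dtype="<U64")
    offsets = np.zeros(grid_shape, dtype=np.uint64)
    lengths = np.zeros(grid_shape, dtype=np.uint64)
    return paths, offsets, lengths


def to_dict(paths, offsets, lengths):
    """Convert the arrays to a {chunk_key: entry} mapping, skipping missing chunks."""
    entries = {}
    for idx in np.ndindex(paths.shape):
        path = str(paths[idx])
        if path == MISSING:
            continue
        key = ".".join(str(i) for i in idx)
        entries[key] = {
            "path": path,
            "offset": int(offsets[idx]),
            "length": int(lengths[idx]),
        }
    return entries


paths, offsets, lengths = empty_chunk_arrays((2, 3))
empty_entries = to_dict(paths, offsets, lengths)
```

With this layout the arrays keep the chunk-grid shape even though the dict view is empty, which is what shape-based operations need.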
I've done both (looks like |
assert all(
    len_chunk <= len_arr
    for len_arr, len_chunk in zip(expanded.shape, expanded.chunks)
)
assert expanded.manifest.dict() == {}
Really nice test!
So basically the concatenation works okay because under the hood the manifest still contains numpy arrays of the correct shape, they just have |
yes, that's it
Great. That's presumably less efficient than not storing them explicitly, but it should be robust.
probably faster, too, because we don't need to special-case empty chunk manifests (the memory footprint would be somewhat higher, I guess).
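A small numpy-only sketch of why the full-shape representation concatenates cleanly while a size-0 array would not. The file names are made up, and treating the empty string as the missing marker is an assumption, not something taken from the VirtualiZarr source.

```python
import numpy as np

# Path grid of an uninitialized variable: one row of two chunks, all missing.
empty_grid = np.full((1, 2), "", dtype="<U32")

# Path grid of a variable that actually has chunks (hypothetical file names).
filled_grid = np.array([["a.nc", "b.nc"]], dtype="<U32")

# Full-shape representation: the empty variable still occupies its row of the grid.
combined = np.concatenate([empty_grid, filled_grid], axis=0)

# Size-0 representation: the empty variable contributes nothing,
# so its position in the chunk grid is lost after concatenation.
size0_grid = np.empty((0, 2), dtype="<U32")
collapsed = np.concatenate([size0_grid, filled_grid], axis=0)
```

Here `combined` has one row per input variable, whereas `collapsed` has only the row from `filled_grid`, so a size-0 manifest would need special handling in every shape-changing operation.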
This is great. Ready to merge?
I guess maybe a note in the |
Thanks @keewis!
Uninitialized variables don't have chunks, but can have a fill value set:
xarray opens this file just fine, but kerchunk won't return chunks for it. virtualizarr uses the fill value to construct the variable instead, which doesn't work because the variable is not actually 0D. This usually indicates an issue with the files, so I chose to raise an error (it doesn't have to, though, so I'm not sure this is the best way forward).
docs/releases.rst