
allow creating references for empty archival datasets #260

Merged (22 commits) on Oct 18, 2024

Conversation

@keewis (Contributor) commented Oct 17, 2024

Uninitialized variables don't have chunks, but can have a fill value set:

import h5netcdf

with h5netcdf.File("test.nc", mode="w") as f:
    f.dimensions = {"x": 100, "y": 200}
    f.create_variable("var", dimensions=("x", "y"), dtype="int64", fillvalue=100, chunks=(50, 100))

xarray opens this file just fine, but kerchunk won't return any chunks for it. virtualizarr then falls back to constructing the variable from the fill value alone, which fails because the variable is not actually 0-dimensional. This usually indicates an issue with the files, so I chose to raise an error (it doesn't have to, though, so I'm not sure this is the best way forward).
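To make the mismatch concrete: for the example file above, the zarr metadata implies a 2x2 chunk grid, yet kerchunk's references contain no chunk keys at all for the uninitialized variable. A stdlib-only sketch (the helper name is hypothetical):

```python
import math
from itertools import product

def expected_chunk_keys(shape, chunks):
    # chunk grid per axis: ceil(100 / 50) x ceil(200 / 100) -> (2, 2)
    grid = tuple(math.ceil(s / c) for s, c in zip(shape, chunks))
    return {".".join(map(str, idx)) for idx in product(*(range(n) for n in grid))}

keys = expected_chunk_keys((100, 200), (50, 100))
assert keys == {"0.0", "0.1", "1.0", "1.1"}

# kerchunk's references for the uninitialized variable contain none of
# these keys, so there are no byte ranges to build a manifest from
refs = {}
assert keys - set(refs) == keys
```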

  • Tests added
  • Tests passing
  • Full type hint coverage
  • Changes are documented in docs/releases.rst

@mdsumner (Contributor)

Another example, fwiw, is this sea-ice product: it started in 1978 and initially only had data every two days, becoming daily later. The old binary files just skipped a day, but the recent-ish netCDF rewrite filled the blanks with these empty files; they only have the dims, attributes, and the dummy crs variable (which, now that I think about it, will probably be fine for virtual-ref).

https://n5eil01u.ecs.nsidc.org/PM/NSIDC-0051.002/1978.10.27/NSIDC0051_SEAICE_PS_S25km_19781027_v2.0.nc

That needs Earthdata credentials, so it is zipped and attached:

NSIDC0051_SEAICE_PS_S25km_19781027_v2.0.nc.zip

But, I wanted to put it out there because these empty files do occur for various reasons.

@keewis (Contributor, author) commented Oct 17, 2024

> I wanted to put it out there because these empty files do occur for various reasons.

thanks for confirming that these are not just broken files.

In that case I wonder how to best support these: in theory, writing .zarray / .zattrs should be enough, since zarr will also use the fill_value to fill in missing chunks. In other words, how do we generate a ManifestArray that does not contain chunks (i.e., a size-0 ManifestArray, but with the shape / chunk sizes from .zarray)?
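One conceivable shape for such a constructor is to keep size-0 entry arrays but carry the chunk grid shape explicitly. This is a hypothetical sketch (class and field names are mine, not virtualizarr's API):

```python
from dataclasses import dataclass, field

import numpy as np

@dataclass
class EmptyManifestSketch:
    """Hypothetical sketch: no chunk entries at all, but the chunk grid
    shape derived from .zarray is kept so shape arithmetic still works."""

    chunk_grid_shape: tuple
    # size-0 arrays: there are no chunks on disk to refer to
    paths: np.ndarray = field(default_factory=lambda: np.array([], dtype="U1"))
    offsets: np.ndarray = field(default_factory=lambda: np.array([], dtype=np.uint64))
    lengths: np.ndarray = field(default_factory=lambda: np.array([], dtype=np.uint64))

m = EmptyManifestSketch(chunk_grid_shape=(2, 2))
assert m.paths.size == 0
assert m.chunk_grid_shape == (2, 2)
```

The thread below ends up preferring a different representation (full-shape arrays with a missing-path marker), which avoids special-casing at the cost of some memory.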

@TomNicholas (Member) commented Oct 17, 2024

> size-0 ManifestArray

ManifestArrays are just 3 numpy arrays in a trenchcoat, and you can have length-0 numpy arrays, right? So this should be possible, but it might need some special-casing.

We should also check if size-0 Zarr arrays are possible.
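The numpy half of that is easy to confirm (the path string below is a made-up example):

```python
import numpy as np

# length-0 numpy arrays are legal and behave under concatenation
a = np.empty((0,), dtype="U32")
b = np.array(["s3://bucket/chunk.0.0"], dtype="U32")  # hypothetical entry

assert a.size == 0
assert np.concatenate([a, b]).size == 1

# checking the zarr side would need zarr installed, e.g. an array with a
# zero-length axis: zarr.zeros((0, 10), chunks=(1, 10))
```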

@mdsumner (Contributor) commented Oct 17, 2024

Also, fwiw, as a todo for me: the GDAL autotest suite has some metadata-only examples, and I wanted to explore how xarray treats them, as opposed to how zarr-python itself does, in case there is some misalignment in how GDAL should behave too:

https://github.com/OSGeo/gdal/tree/master/autotest/gdrivers/data/zarr/array_attrs.zarr

(xarray is fine with the netCDF file that is empty but for a scalar variable, but not with the empty GDAL zarr)

@keewis (Contributor, author) commented Oct 17, 2024

> https://github.com/OSGeo/gdal/tree/master/autotest/gdrivers/data/zarr/array_attrs.zarr

There are multiple issues with that (I think): .zarray is not valid JSON (it uses single quotes instead of double quotes for the dtype), and !b1 is unknown to numpy.dtype (remove the exclamation mark, maybe?).

Once those are fixed, xarray uses zarr.open_group to open files, which complains about the directory being an array, not a group. To fix that, you'd have to move .zarray and .zattrs into a subdirectory (var) and create a .zgroup file one level up. And finally, xarray needs the _ARRAY_DIMENSIONS attribute to be set and contain dimension names.
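Putting those fixes together, a corrected store layout might look like the sketch below, written with only the stdlib. The group/variable names and the shape are assumptions for illustration; `|b1` is the byte-order-agnostic spelling of the boolean dtype that numpy accepts:

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory

with TemporaryDirectory() as tmp:
    store = Path(tmp) / "array_attrs.zarr"
    (store / "var").mkdir(parents=True)

    # 1. a .zgroup one level up, so zarr.open_group sees a group, not an array
    (store / ".zgroup").write_text(json.dumps({"zarr_format": 2}))

    # 2. valid JSON (double quotes) with a dtype numpy understands ("|b1", not "!b1")
    zarray = {
        "zarr_format": 2,
        "shape": [2],
        "chunks": [2],
        "dtype": "|b1",
        "compressor": None,
        "fill_value": False,
        "filters": None,
        "order": "C",
    }
    (store / "var" / ".zarray").write_text(json.dumps(zarray))

    # 3. _ARRAY_DIMENSIONS so xarray can name the dimensions
    (store / "var" / ".zattrs").write_text(json.dumps({"_ARRAY_DIMENSIONS": ["x"]}))

    assert json.loads((store / "var" / ".zarray").read_text())["dtype"] == "|b1"
```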

@keewis changed the title from "improve the error message for empty archival datasets" to "allow creating references for empty archival datasets" on Oct 17, 2024
@keewis (Contributor, author) commented Oct 17, 2024

I've repurposed this PR to instead allow reading and writing variables without chunks (detected by no chunks and ndim > 0).

This appears to work properly, but I need help with the typing of ChunkManifest.empty: why does mypy think np.array((), dtype=np.dtypes.StringDType) is of type ndarray[Any, dtype[Any]]? Do I need to use cast for that?
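The cast workaround looks roughly like this sketch. `dtype=object` stands in here for `np.dtypes.StringDType` (which requires numpy >= 2.0); the variable name is hypothetical:

```python
from typing import cast

import numpy as np

# mypy infers the bare np.array(...) call as ndarray[Any, dtype[Any]],
# so a cast pins the intended type without changing runtime behaviour.
# dtype=object stands in for np.dtypes.StringDType (numpy >= 2.0) here.
paths = cast(np.ndarray, np.array((), dtype=object))

assert paths.size == 0
```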

@TomNicholas (Member)

I think you might need to cast. I don't think numpy's handling of generics like this fully works yet. See also the PR I recently merged that fixed some similar typing errors.

@TomNicholas added the "references generation" label (Reading byte ranges from archival files) on Oct 17, 2024
@TomNicholas (Member) left a comment

So how do these get represented in the manifest? Size-0 numpy arrays? If so have you tried concatenating them or doing any other operations to make sure that ManifestArray doesn't break?

(resolved review thread on virtualizarr/readers/kerchunk.py)
@keewis (Contributor, author) commented Oct 18, 2024

After some more investigation, I believe we won't be able to use entries={} (nor size-0 arrays) as-is: we need to somehow pass the chunk grid (which exists; there's just no data we can refer to) to ChunkManifest. One option would be to follow this comment:

    # TODO should we actually optionally pass chunk grid shape in,
    # in case there are not enough chunks to give correct idea of full shape?
    shape = get_chunk_grid_shape(entries.keys())

and explicitly pass the chunk grid shape. This would allow us to translate entries={} to the suggestion below.

Another option would be to construct paths / offsets / lengths with the actual shape, but the values in paths would be the missing value marker (the na_object parameter to np.dtypes.StringDType) – then the concatenation would immediately work, but manifest.dict() would have to skip over any entry where the path is missing.
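A sketch of that second option, with full-shape arrays and a missing-path marker, plus a dict() that skips missing entries (function and key names are hypothetical, not virtualizarr's actual implementation):

```python
import numpy as np

def manifest_dict(paths, offsets, lengths):
    """Hypothetical sketch of ChunkManifest.dict(): emit one entry per
    chunk, skipping chunks whose path is the missing marker ("")."""
    out = {}
    for idx in np.ndindex(paths.shape):
        path = paths[idx]
        if path == "":  # missing chunk -> zarr falls back to fill_value
            continue
        key = ".".join(map(str, idx))
        out[key] = {
            "path": str(path),
            "offset": int(offsets[idx]),
            "length": int(lengths[idx]),
        }
    return out

# an "empty" variable: the chunk grid exists, but every chunk is missing
grid = (2, 2)
paths = np.full(grid, "", dtype="U64")
offsets = np.zeros(grid, dtype=np.uint64)
lengths = np.zeros(grid, dtype=np.uint64)

assert manifest_dict(paths, offsets, lengths) == {}
```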

@keewis (Contributor, author) commented Oct 18, 2024

I've done both (it looks like path == "" already meant a missing chunk, so I'm just using that); tell me what you think.

Comment on lines +232 to +236:

    assert all(
        len_chunk <= len_arr
        for len_arr, len_chunk in zip(expanded.shape, expanded.chunks)
    )
    assert expanded.manifest.dict() == {}
@TomNicholas (Member)
Really nice test!

@TomNicholas (Member)

So basically the concatenation works okay because under the hood the manifest still contains numpy arrays of the correct shape, they just have path='' for all elements?

@keewis (Contributor, author) commented Oct 18, 2024

yes, that's it
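That claim is easy to check with plain numpy (a minimal sketch, not the actual ManifestArray concatenation code):

```python
import numpy as np

# two "empty" manifests with full chunk-grid shape (2, 2); all paths are ""
a = np.full((2, 2), "", dtype="U64")
b = np.full((2, 2), "", dtype="U64")

# concatenation along the chunk grid just works, because the arrays
# really do have the right shape; the entries are merely all-missing
combined = np.concatenate([a, b], axis=0)
assert combined.shape == (4, 2)
assert (combined == "").all()
```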

@TomNicholas (Member) commented Oct 18, 2024

Great. That's presumably less efficient than not storing them explicitly, but it should be robust.

@keewis (Contributor, author) commented Oct 18, 2024

probably faster, too, because we don't need to special-case empty chunk manifests (the memory footprint would be somewhat higher, I guess).

@TomNicholas (Member) left a comment

This is great. Ready to merge?

@TomNicholas (Member)

I guess maybe a note in the docs/releases.rst changelog.

@TomNicholas merged commit 7053bc0 into zarr-developers:main on Oct 18, 2024
10 checks passed
@TomNicholas (Member)

Thanks @keewis!

@keewis deleted the error-mismatching-shape branch on October 18, 2024 at 15:30
@keewis mentioned this pull request on Oct 18, 2024