
KeyError: '.zarray' with HDF5 data #159

Closed
scottyhq opened this issue Jun 25, 2024 · 6 comments · Fixed by #165
Labels
bug (Something isn't working), references generation (Reading byte ranges from archival files)

Comments

@scottyhq
Contributor

I'm getting a KeyError Traceback for several different HDF5 files.

For example, trying to use this test file from kerchunk:
https://github.com/fsspec/kerchunk/blob/ae692fead51a216691e4db9a67c99194c5ba8e14/kerchunk/tests/test_hdf.py#L307

# (or local path)
url = 'https://github.com/fsspec/kerchunk/raw/ae692fead51a216691e4db9a67c99194c5ba8e14/kerchunk/tests/NEONDSTowerTemperatureData.hdf5'
kerchunk.hdf.SingleHdf5ToZarr(url).translate()

# side note: also not sure why `reader_options={}` is required to read from the URL
vds = open_virtual_dataset(url, reader_options={})
File ~/GitHub/VirtualiZarr/virtualizarr/xarray.py:114, in open_virtual_dataset(filepath, filetype, drop_variables, loadable_variables, indexes, virtual_array_class, reader_options)
    106 else:
    107     # this is the only place we actually always need to use kerchunk directly
    108     # TODO avoid even reading byte ranges for variables that will be dropped later anyway?
    109     vds_refs = kerchunk.read_kerchunk_references_from_file(
    110         filepath=filepath,
    111         filetype=filetype,
    112         reader_options=reader_options,
    113     )
--> 114     virtual_vars = virtual_vars_from_kerchunk_refs(
    115         vds_refs,
    116         drop_variables=drop_variables + loadable_variables,
    117         virtual_array_class=virtual_array_class,
    118     )
    119     ds_attrs = kerchunk.fully_decode_arr_refs(vds_refs["refs"]).get(".zattrs", {})
    120     coord_names = ds_attrs.pop("coordinates", [])

File ~/GitHub/VirtualiZarr/virtualizarr/xarray.py:247, in virtual_vars_from_kerchunk_refs(refs, drop_variables, virtual_array_class)
    241     drop_variables = []
    242 var_names_to_keep = [
    243     var_name for var_name in var_names if var_name not in drop_variables
    244 ]
    246 vars = {
--> 247     var_name: variable_from_kerchunk_refs(refs, var_name, virtual_array_class)
    248     for var_name in var_names_to_keep
    249 }
    250 return vars

File ~/GitHub/VirtualiZarr/virtualizarr/xarray.py:293, in variable_from_kerchunk_refs(refs, var_name, virtual_array_class)
    290 """Create a single xarray Variable by reading specific keys of a kerchunk references dict."""
    292 arr_refs = kerchunk.extract_array_refs(refs, var_name)
--> 293 chunk_dict, zarray, zattrs = kerchunk.parse_array_refs(arr_refs)
    295 manifest = ChunkManifest._from_kerchunk_chunk_dict(chunk_dict)
    297 # we want to remove the _ARRAY_DIMENSIONS from the final variables' .attrs

File ~/GitHub/VirtualiZarr/virtualizarr/kerchunk.py:186, in parse_array_refs(arr_refs)
    183 def parse_array_refs(
    184     arr_refs: KerchunkArrRefs,
    185 ) -> tuple[dict, ZArray, ZAttrs]:
--> 186     zarray = ZArray.from_kerchunk_refs(arr_refs.pop(".zarray"))
    187     zattrs = arr_refs.pop(".zattrs", {})
    188     chunk_dict = arr_refs

KeyError: '.zarray'
@TomNicholas TomNicholas added the bug Something isn't working label Jun 25, 2024
@TomNicholas
Member

This file has nested groups, which we don't support yet. See #84.

The error is because the current code for parsing kerchunk references essentially assumes that each group is an array (i.e. that there is only one group - the root). It errors when it doesn't find the array metadata.
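To make the failure mode concrete, here is a minimal sketch (with made-up reference keys, not the real file's byte ranges) of how kerchunk flattens the Zarr hierarchy into one dict. With nested groups, each key gains an extra path component, so splitting keys on their first component no longer yields a `.zarray` entry for that component, and the downstream `arr_refs.pop(".zarray")` raises the `KeyError`:

```python
# Hypothetical flat (root-only) refs: array metadata sits at "<var>/.zarray".
flat_refs = {
    "temperature/.zarray": '{"shape": [10], "chunks": [10]}',
    "temperature/0": ["NEONDSTowerTemperatureData.hdf5", 4016, 80],
}

# Hypothetical refs for a file with a nested group "site1": the extra path
# component means the top-level name "site1" has no ".zarray" of its own.
nested_refs = {
    "site1/.zgroup": '{"zarr_format": 2}',
    "site1/temperature/.zarray": '{"shape": [10], "chunks": [10]}',
    "site1/temperature/0": ["NEONDSTowerTemperatureData.hdf5", 4016, 80],
}

def array_refs_for(refs, name):
    """Mimic extracting the refs that belong to one top-level name."""
    return {
        key.removeprefix(name + "/"): val
        for key, val in refs.items()
        if key.startswith(name + "/")
    }

print(".zarray" in array_refs_for(flat_refs, "temperature"))  # True
print(".zarray" in array_refs_for(nested_refs, "site1"))      # False -> KeyError downstream
```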

@TomNicholas TomNicholas added the references generation Reading byte ranges from archival files label Jun 25, 2024
@scottyhq
Contributor Author

It's surprisingly hard to find HDF5 files out in the wild that don't have groups! If you know of any for testing, let me know @TomNicholas; otherwise it seems like a group kwarg, open_virtual_dataset(..., group='xyz'), would be a good solution. Often the groups are unrelated enough that you really only want to work with one of them anyway.

@TomNicholas
Member

TomNicholas commented Jun 25, 2024 via email

@scottyhq
Contributor Author

> Are you looking for HDF5 files that are not netCDF4 files?

Correct, I was. From the netCDF Docs: "Some HDF5 features not supported in netCDF-4 format include non-hierarchical group structures, HDF5 reference types, multiple links to a data object, user-defined atomic data types, stored property lists, more permissive rules for data object names, the HDF5 date/time type, and attributes associated with user-defined types"

Although, taking a step back, I don't think it's worth seeking out examples of HDF5 files that are not netCDF4 :) I'd guess many of the .h5 files out there actually conform to the netCDF4 subset and could just as easily have been named .nc! Feel free to close this one as a duplicate of #84.
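For anyone hunting for affected files: a quick way to check whether an HDF5 file contains any non-root groups (which is exactly what trips up the current parser) is to walk it with h5py. This sketch assumes h5py is installed and builds a throwaway in-memory file; the group and dataset names are made up:

```python
import io
import h5py

# Build a small in-memory HDF5 file containing one nested group.
buf = io.BytesIO()
with h5py.File(buf, "w") as f:
    f.create_group("site1").create_dataset("temperature", data=[1.0, 2.0])

def has_nested_groups(f: h5py.File) -> bool:
    """Return True if the file has any groups below the root."""
    found = []
    f.visititems(
        lambda name, obj: found.append(name) if isinstance(obj, h5py.Group) else None
    )
    return bool(found)

with h5py.File(buf, "r") as f:
    print(has_nested_groups(f))  # True
```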

@forrestfwilliams

Just wanted to +1 this issue. I ran into it today while also trying to use VirtualiZarr on a netCDF file with multiple groups.

@TomNicholas
Member

@forrestfwilliams I'm happy to review a PR! Adding a group kwarg to open_virtual_dataset should be pretty simple: you would read all the references using kerchunk as it does now, then select out only the part of the nested references dict that corresponds to that group, and use virtual_vars_from_kerchunk_refs on just that.
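The group-selection step described above could be sketched as a small filter over the flat kerchunk refs dict. This is a hypothetical helper (select_group_refs is not an existing VirtualiZarr or kerchunk function, and the refs shown are made up): keep only keys under the requested group and strip the prefix so the result looks like refs for a root-level dataset:

```python
def select_group_refs(refs: dict, group: str) -> dict:
    """Keep only the refs under `group`, with the group prefix stripped."""
    prefix = group.rstrip("/") + "/"
    return {
        key.removeprefix(prefix): val
        for key, val in refs.items()
        if key.startswith(prefix)
    }

# Hypothetical refs for a file with two groups.
refs = {
    ".zgroup": '{"zarr_format": 2}',
    "site1/.zgroup": '{"zarr_format": 2}',
    "site1/temperature/.zarray": '{"shape": [10], "chunks": [10]}',
    "site1/temperature/0": ["data.h5", 4016, 80],
    "site2/pressure/.zarray": '{"shape": [5], "chunks": [5]}',
}

print(sorted(select_group_refs(refs, "site1")))
# ['.zgroup', 'temperature/.zarray', 'temperature/0']
```

After this filtering, the selected subset has the flat, root-only shape the existing parsing code already expects.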
