
open_virtual_dataset fails when there is a subgroup #336

Closed
rabernat opened this issue Dec 7, 2024 · 5 comments · Fixed by #338
Labels: bug (Something isn't working), Kerchunk (Relating to the kerchunk library / specification itself), references generation (Reading byte ranges from archival files)

Comments


rabernat commented Dec 7, 2024

(This issue is inspired by the NAS GES DISC GPM_3IMERGHH_07 dataset, which has this same structure. cc @abarciauskas-bgse)

Consider a NetCDF dataset that has a valid NetCDF group at one level of the hierarchy and then a sub-group beneath that. We can make one like this:

import xarray as xr
ds = xr.DataArray([1, 2, 3], name="foo").to_dataset()
ds.to_netcdf("test.nc")
# store the same data in a sub group
ds.to_netcdf("test.nc", group="subgroup", mode="a")

Xarray can open either group fine.

xr.open_dataset("test.nc")
xr.open_dataset("test.nc", group="subgroup")

For the root group, it just ignores the sub group.

However, VirtualiZarr doesn't like it:

from virtualizarr import open_virtual_dataset
open_virtual_dataset("test.nc", group='')
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[7], line 1
----> 1 dsv = open_virtual_dataset("test.nc", group='')

File /srv/conda/envs/notebook/lib/python3.12/site-packages/virtualizarr/backend.py:200, in open_virtual_dataset(filepath, filetype, group, drop_variables, loadable_variables, decode_times, cftime_variables, indexes, virtual_array_class, virtual_backend_kwargs, reader_options, backend)
    197 if backend_cls is None:
    198     raise NotImplementedError(f"Unsupported file type: {filetype.name}")
--> 200 vds = backend_cls.open_virtual_dataset(
    201     filepath,
    202     group=group,
    203     drop_variables=drop_variables,
    204     loadable_variables=loadable_variables,
    205     decode_times=decode_times,
    206     indexes=indexes,
    207     virtual_backend_kwargs=virtual_backend_kwargs,
    208     reader_options=reader_options,
    209 )
    211 return vds

File /srv/conda/envs/notebook/lib/python3.12/site-packages/virtualizarr/readers/hdf5.py:48, in HDF5VirtualBackend.open_virtual_dataset(filepath, group, drop_variables, loadable_variables, decode_times, indexes, virtual_backend_kwargs, reader_options)
     42 refs = SingleHdf5ToZarr(
     43     filepath, inline_threshold=0, **reader_options
     44 ).translate()
     46 refs = extract_group(refs, group)
---> 48 virtual_vars, attrs, coord_names = virtual_vars_and_metadata_from_kerchunk_refs(
     49     refs,
     50     loadable_variables,
     51     drop_variables,
     52     fs_root=Path.cwd().as_uri(),
     53 )
     55 loadable_vars, indexes = open_loadable_vars_and_indexes(
     56     filepath,
     57     loadable_variables=loadable_variables,
   (...)
     62     decode_times=decode_times,
     63 )
     65 return construct_virtual_dataset(
     66     virtual_vars=virtual_vars,
     67     loadable_vars=loadable_vars,
   (...)
     70     attrs=attrs,
     71 )

File /srv/conda/envs/notebook/lib/python3.12/site-packages/virtualizarr/translators/kerchunk.py:34, in virtual_vars_and_metadata_from_kerchunk_refs(vds_refs, loadable_variables, drop_variables, virtual_array_class, fs_root)
     17 def virtual_vars_and_metadata_from_kerchunk_refs(
     18     vds_refs: KerchunkStoreRefs,
     19     loadable_variables,
   (...)
     22     fs_root: str | None = None,
     23 ) -> tuple[Mapping[str, Variable], dict[str, Any], list[str]]:
     24     """
     25     Parses all useful information from a set kerchunk references (for a single group).
     26 
   (...)
     31         Required if any paths are relative in order to turn them into absolute paths (which virtualizarr requires).
     32     """
---> 34     virtual_vars = virtual_vars_from_kerchunk_refs(
     35         vds_refs,
     36         drop_variables=drop_variables + loadable_variables,
     37         virtual_array_class=virtual_array_class,
     38         fs_root=fs_root,
     39     )
     40     ds_attrs = fully_decode_arr_refs(vds_refs["refs"]).get(".zattrs", {})
     41     coord_names = ds_attrs.pop("coordinates", [])

File /srv/conda/envs/notebook/lib/python3.12/site-packages/virtualizarr/translators/kerchunk.py:110, in virtual_vars_from_kerchunk_refs(refs, drop_variables, virtual_array_class, fs_root)
    104     drop_variables = []
    105 var_names_to_keep = [
    106     var_name for var_name in var_names if var_name not in drop_variables
    107 ]
    109 vars = {
--> 110     var_name: variable_from_kerchunk_refs(
    111         refs, var_name, virtual_array_class, fs_root=fs_root
    112     )
    113     for var_name in var_names_to_keep
    114 }
    115 return vars

File /srv/conda/envs/notebook/lib/python3.12/site-packages/virtualizarr/translators/kerchunk.py:164, in variable_from_kerchunk_refs(refs, var_name, virtual_array_class, fs_root)
    161 """Create a single xarray Variable by reading specific keys of a kerchunk references dict."""
    163 arr_refs = extract_array_refs(refs, var_name)
--> 164 chunk_dict, zarray, zattrs = parse_array_refs(arr_refs)
    165 # we want to remove the _ARRAY_DIMENSIONS from the final variables' .attrs
    166 dims = zattrs.pop("_ARRAY_DIMENSIONS")

File /srv/conda/envs/notebook/lib/python3.12/site-packages/virtualizarr/translators/kerchunk.py:258, in parse_array_refs(arr_refs)
    255 def parse_array_refs(
    256     arr_refs: KerchunkArrRefs,
    257 ) -> tuple[dict, ZArray, ZAttrs]:
--> 258     zarray = ZArray.from_kerchunk_refs(arr_refs.pop(".zarray"))
    259     zattrs = arr_refs.pop(".zattrs", {})
    260     chunk_dict = arr_refs

KeyError: '.zarray'

It looks like VirtualiZarr is assuming that all child nodes in the hierarchy are arrays, not groups.
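One way to picture the failure mode: kerchunk returns a single flat dict of references, in which a subgroup shows up as extra slash-prefixed keys, and a translator that treats every top-level name as an array will then look up a `.zarray` key that doesn't exist for the group. A minimal sketch of telling the two apart (the refs dict below is illustrative, not kerchunk's exact output, and the helper names are hypothetical, not VirtualiZarr's API):

```python
# Illustrative flat kerchunk-style refs for the file above: the root group
# holds the array "foo" plus the group "subgroup", whose contents are
# nested under a "subgroup/" key prefix.
refs = {
    ".zgroup": '{"zarr_format": 2}',
    "foo/.zarray": "{...}",
    "foo/0": ["test.nc", 4000, 12],
    "subgroup/.zgroup": '{"zarr_format": 2}',
    "subgroup/foo/.zarray": "{...}",
    "subgroup/foo/0": ["test.nc", 8000, 12],
}

def root_level_names(refs):
    """Collect the first path component of every non-metadata key."""
    return {key.split("/")[0] for key in refs if "/" in key}

def is_array(refs, name):
    """A child node is an array iff it has a .zarray key; groups have .zgroup."""
    return f"{name}/.zarray" in refs

names = root_level_names(refs)
arrays = {name for name in names if is_array(refs, name)}
groups = names - arrays
```

Here `arrays` is `{"foo"}` and `groups` is `{"subgroup"}`; a translator that skips (or recurses into) the `groups` set instead of treating everything as an array would avoid the `KeyError: '.zarray'` above.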


I'm also curious why we are not going through the new non-kerchunk backend for HDF5/netCDF4 files instead of the kerchunk backend. How do you turn that on? (I'm on 1.2.0.)

Related to #84

@TomNicholas added the "bug" and "references generation" labels Dec 7, 2024

TomNicholas commented Dec 7, 2024

Hmm, a very similar error was raised in #159, and I thought it was fixed in #165. Tagging @scottyhq in case he has any insights.

There might be something about this file that causes kerchunk to return references in a slightly different format than for the other files we have tested (you can see it's failing at the point of trying to translate kerchunk's in-memory references format to virtualizarr's ManifestArrays).

I'm also curious why we are not going through the new non-kerchunk backend for HDF5/netcdf4 files (#87) instead of the kerchunk backend. How do you turn that on? (I'm on 1.2.0.)

We decided it wasn't mature enough to switch to as the default yet. But you can turn it on by importing this class and passing it in:

from virtualizarr.readers.hdf import HDFVirtualBackend

vds = open_virtual_dataset('file.nc', backend=HDFVirtualBackend)

@TomNicholas

Also is there a reason you're doing open_virtual_dataset("test.nc", group='') with group='' instead of group='subgroup'?


rabernat commented Dec 8, 2024

Because I want to make a virtual dataset for the root group, not the subgroup. If I leave out the group argument entirely, it complains about "multiple groups found."

@TomNicholas

Because I want to make a virtual dataset for the root group, not the subgroup. If I leave out the group argument entirely, it complains about "multiple groups found."

Right. Now that you say it, it seems obvious that the behaviour should instead match xarray's, where the default is to return the contents of the root group. I'm looking at what went wrong here now.
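For reference, the xarray-consistent default discussed here can be sketched as a tiny resolver (`resolve_group` is a hypothetical helper for illustration, not VirtualiZarr's actual API):

```python
def resolve_group(group, available):
    """Hypothetical sketch: when no group is requested, fall back to the
    root group ("") like xr.open_dataset does, instead of raising
    "multiple groups found"."""
    if group is None:
        return ""  # default to the root group, matching xarray
    if group not in available:
        raise KeyError(f"group {group!r} not found; available: {sorted(available)}")
    return group

# the groups present in the test file from this issue
groups = {"", "subgroup"}
```

With this behaviour, `resolve_group(None, groups)` returns `""` (the root group) and an explicit `group="subgroup"` still selects the subgroup.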

@TomNicholas mentioned this issue Dec 9, 2024

TomNicholas commented Dec 9, 2024

@rabernat #338 should fix this, and change the "multiple groups found" behaviour to be more consistent with xarray.

@TomNicholas added the "Kerchunk" label Dec 9, 2024