
open_virtual_dataset fails when there is a subgroup #336

Closed
rabernat opened this issue Dec 7, 2024 · 5 comments · Fixed by #338
Labels: bug (Something isn't working), Kerchunk (Relating to the kerchunk library / specification itself), references generation (Reading byte ranges from archival files)

Comments


rabernat commented Dec 7, 2024

(This issue is inspired by the NAS GES DISC GPM_3IMERGHH_07 dataset, which has this same structure. cc @abarciauskas-bgse)

Consider a NetCDF dataset that has a valid NetCDF group at one level of the hierarchy and then a sub-group beneath that. We can make one like this:

import xarray as xr
ds = xr.DataArray([1, 2, 3], name="foo").to_dataset()
ds.to_netcdf("test.nc")
# store the same data in a sub group
ds.to_netcdf("test.nc", group="subgroup", mode="a")

Xarray can open either group fine.

xr.open_dataset("test.nc")
xr.open_dataset("test.nc", group="subgroup")

For the root group, it just ignores the sub group.

However, VirtualiZarr doesn't like it:

from virtualizarr import open_virtual_dataset
open_virtual_dataset("test.nc", group='')
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[7], line 1
----> 1 dsv = open_virtual_dataset("test.nc", group='')

File /srv/conda/envs/notebook/lib/python3.12/site-packages/virtualizarr/backend.py:200, in open_virtual_dataset(filepath, filetype, group, drop_variables, loadable_variables, decode_times, cftime_variables, indexes, virtual_array_class, virtual_backend_kwargs, reader_options, backend)
    197 if backend_cls is None:
    198     raise NotImplementedError(f"Unsupported file type: {filetype.name}")
--> 200 vds = backend_cls.open_virtual_dataset(
    201     filepath,
    202     group=group,
    203     drop_variables=drop_variables,
    204     loadable_variables=loadable_variables,
    205     decode_times=decode_times,
    206     indexes=indexes,
    207     virtual_backend_kwargs=virtual_backend_kwargs,
    208     reader_options=reader_options,
    209 )
    211 return vds

File /srv/conda/envs/notebook/lib/python3.12/site-packages/virtualizarr/readers/hdf5.py:48, in HDF5VirtualBackend.open_virtual_dataset(filepath, group, drop_variables, loadable_variables, decode_times, indexes, virtual_backend_kwargs, reader_options)
     42 refs = SingleHdf5ToZarr(
     43     filepath, inline_threshold=0, **reader_options
     44 ).translate()
     46 refs = extract_group(refs, group)
---> 48 virtual_vars, attrs, coord_names = virtual_vars_and_metadata_from_kerchunk_refs(
     49     refs,
     50     loadable_variables,
     51     drop_variables,
     52     fs_root=Path.cwd().as_uri(),
     53 )
     55 loadable_vars, indexes = open_loadable_vars_and_indexes(
     56     filepath,
     57     loadable_variables=loadable_variables,
   (...)
     62     decode_times=decode_times,
     63 )
     65 return construct_virtual_dataset(
     66     virtual_vars=virtual_vars,
     67     loadable_vars=loadable_vars,
   (...)
     70     attrs=attrs,
     71 )

File /srv/conda/envs/notebook/lib/python3.12/site-packages/virtualizarr/translators/kerchunk.py:34, in virtual_vars_and_metadata_from_kerchunk_refs(vds_refs, loadable_variables, drop_variables, virtual_array_class, fs_root)
     17 def virtual_vars_and_metadata_from_kerchunk_refs(
     18     vds_refs: KerchunkStoreRefs,
     19     loadable_variables,
   (...)
     22     fs_root: str | None = None,
     23 ) -> tuple[Mapping[str, Variable], dict[str, Any], list[str]]:
     24     """
     25     Parses all useful information from a set kerchunk references (for a single group).
     26 
   (...)
     31         Required if any paths are relative in order to turn them into absolute paths (which virtualizarr requires).
     32     """
---> 34     virtual_vars = virtual_vars_from_kerchunk_refs(
     35         vds_refs,
     36         drop_variables=drop_variables + loadable_variables,
     37         virtual_array_class=virtual_array_class,
     38         fs_root=fs_root,
     39     )
     40     ds_attrs = fully_decode_arr_refs(vds_refs["refs"]).get(".zattrs", {})
     41     coord_names = ds_attrs.pop("coordinates", [])

File /srv/conda/envs/notebook/lib/python3.12/site-packages/virtualizarr/translators/kerchunk.py:110, in virtual_vars_from_kerchunk_refs(refs, drop_variables, virtual_array_class, fs_root)
    104     drop_variables = []
    105 var_names_to_keep = [
    106     var_name for var_name in var_names if var_name not in drop_variables
    107 ]
    109 vars = {
--> 110     var_name: variable_from_kerchunk_refs(
    111         refs, var_name, virtual_array_class, fs_root=fs_root
    112     )
    113     for var_name in var_names_to_keep
    114 }
    115 return vars

File /srv/conda/envs/notebook/lib/python3.12/site-packages/virtualizarr/translators/kerchunk.py:164, in variable_from_kerchunk_refs(refs, var_name, virtual_array_class, fs_root)
    161 """Create a single xarray Variable by reading specific keys of a kerchunk references dict."""
    163 arr_refs = extract_array_refs(refs, var_name)
--> 164 chunk_dict, zarray, zattrs = parse_array_refs(arr_refs)
    165 # we want to remove the _ARRAY_DIMENSIONS from the final variables' .attrs
    166 dims = zattrs.pop("_ARRAY_DIMENSIONS")

File /srv/conda/envs/notebook/lib/python3.12/site-packages/virtualizarr/translators/kerchunk.py:258, in parse_array_refs(arr_refs)
    255 def parse_array_refs(
    256     arr_refs: KerchunkArrRefs,
    257 ) -> tuple[dict, ZArray, ZAttrs]:
--> 258     zarray = ZArray.from_kerchunk_refs(arr_refs.pop(".zarray"))
    259     zattrs = arr_refs.pop(".zattrs", {})
    260     chunk_dict = arr_refs

KeyError: '.zarray'

It looks like VirtualiZarr is assuming that all child nodes in the hierarchy are arrays, not groups.
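One way to picture the failure mode: kerchunk returns a single flat dict of references, in which a subgroup shows up as extra slash-prefixed keys, and a translator that treats every top-level name as an array will then look up a `.zarray` key that doesn't exist for the group. A minimal sketch of telling the two apart (the refs dict below is illustrative, not kerchunk's exact output, and the helper names are hypothetical, not VirtualiZarr's API):

```python
# Illustrative flat kerchunk-style refs for the file above: the root group
# holds the array "foo" plus the group "subgroup", whose contents are
# nested under a "subgroup/" key prefix.
refs = {
    ".zgroup": '{"zarr_format": 2}',
    "foo/.zarray": "{...}",
    "foo/0": ["test.nc", 4000, 12],
    "subgroup/.zgroup": '{"zarr_format": 2}',
    "subgroup/foo/.zarray": "{...}",
    "subgroup/foo/0": ["test.nc", 8000, 12],
}

def root_level_names(refs):
    """Collect the first path component of every non-metadata key."""
    return {key.split("/")[0] for key in refs if "/" in key}

def is_array(refs, name):
    """A child node is an array iff it has a .zarray key; groups have .zgroup."""
    return f"{name}/.zarray" in refs

names = root_level_names(refs)
arrays = {name for name in names if is_array(refs, name)}
groups = names - arrays
```

Here `arrays` is `{"foo"}` and `groups` is `{"subgroup"}`; a translator that skips (or recurses into) the `groups` set instead of treating everything as an array would avoid the `KeyError: '.zarray'` above.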


I'm also curious why we are not going through the new non-kerchunk backend for HDF5/netCDF4 files instead of the kerchunk backend. How do you turn that on? (I'm on 1.2.0.)

Related to #84

@TomNicholas added the "bug" and "references generation" labels Dec 7, 2024

TomNicholas commented Dec 7, 2024

Hmm, a very similar error was raised in #159, and I thought it was fixed in #165. Tagging @scottyhq in case he has any insights.

There might be something about this file that causes kerchunk to return references in a slightly different format than for the other files we have tested (you can see it's failing at the point of trying to translate kerchunk's in-memory references format to virtualizarr's ManifestArrays).

I'm also curious why we are not going through the new non-kerchunk backend for HDF5/netcdf4 files (#87) instead of the kerchunk backend. How do you turn that on? (I'm on 1.2.0.)

We decided it wasn't mature enough to switch to as the default yet. But you can turn it on by importing this class and passing it in:

from virtualizarr.readers.hdf import HDFVirtualBackend

vds = open_virtual_dataset('file.nc', backend=HDFVirtualBackend)

@TomNicholas

Also is there a reason you're doing open_virtual_dataset("test.nc", group='') with group='' instead of group='subgroup'?


rabernat commented Dec 8, 2024

Because I want to make a virtual dataset for the root group, not the subgroup. If I leave out the group argument entirely, it complains about "multiple groups found."

@TomNicholas

Because I want to make a virtual dataset for the root group, not the subgroup. If I leave out the group argument entirely, it complains about "multiple groups found."

Right. Now that you say it, it seems obvious that the behaviour should instead match xarray's, where the default is to return the contents of the root group. I'm looking at what went wrong here now.
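For reference, the xarray-consistent default discussed here can be sketched as a tiny resolver (`resolve_group` is a hypothetical helper for illustration, not VirtualiZarr's actual API):

```python
def resolve_group(group, available):
    """Hypothetical sketch: when no group is requested, fall back to the
    root group ("") like xr.open_dataset does, instead of raising
    "multiple groups found"."""
    if group is None:
        return ""  # default to the root group, matching xarray
    if group not in available:
        raise KeyError(f"group {group!r} not found; available: {sorted(available)}")
    return group

# the groups present in the test file from this issue
groups = {"", "subgroup"}
```

With this behaviour, `resolve_group(None, groups)` returns `""` (the root group) and an explicit `group="subgroup"` still selects the subgroup.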

@TomNicholas mentioned this issue Dec 9, 2024

TomNicholas commented Dec 9, 2024

@rabernat #338 should fix this, and change the "multiple groups found" behaviour to be more consistent with xarray.

@TomNicholas added the "Kerchunk" label Dec 9, 2024