KeyError: 'refs' opening TIFF file #160

Closed
scottyhq opened this issue Jun 25, 2024 · 3 comments · Fixed by #162
Labels
bug (Something isn't working), Kerchunk (Relating to the kerchunk library / specification itself), references generation (Reading byte ranges from archival files), upstream issue

Comments

@scottyhq
Contributor

scottyhq commented Jun 25, 2024

import kerchunk.tiff
from virtualizarr import open_virtual_dataset

url = 'https://github.com/fsspec/kerchunk/raw/main/kerchunk/tests/lcmap_tiny_cog_2020.tif'
kerchunk.tiff.tiff_to_zarr(url)

vds = open_virtual_dataset(url, reader_options={})
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[6], line 1
----> 1 vds = open_virtual_dataset(url, reader_options={})

File ~/GitHub/VirtualiZarr/virtualizarr/xarray.py:114, in open_virtual_dataset(filepath, filetype, drop_variables, loadable_variables, indexes, virtual_array_class, reader_options)
    106 else:
    107     # this is the only place we actually always need to use kerchunk directly
    108     # TODO avoid even reading byte ranges for variables that will be dropped later anyway?
    109     vds_refs = kerchunk.read_kerchunk_references_from_file(
    110         filepath=filepath,
    111         filetype=filetype,
    112         reader_options=reader_options,
    113     )
--> 114     virtual_vars = virtual_vars_from_kerchunk_refs(
    115         vds_refs,
    116         drop_variables=drop_variables + loadable_variables,
    117         virtual_array_class=virtual_array_class,
    118     )
    119     ds_attrs = kerchunk.fully_decode_arr_refs(vds_refs["refs"]).get(".zattrs", {})
    120     coord_names = ds_attrs.pop("coordinates", [])

File ~/GitHub/VirtualiZarr/virtualizarr/xarray.py:239, in virtual_vars_from_kerchunk_refs(refs, drop_variables, virtual_array_class)
    222 def virtual_vars_from_kerchunk_refs(
    223     refs: KerchunkStoreRefs,
    224     drop_variables: list[str] | None = None,
    225     virtual_array_class=ManifestArray,
    226 ) -> Mapping[str, xr.Variable]:
    227     """
    228     Translate a store-level kerchunk reference dict into aaset of xarray Variables containing virtualized arrays.
    229 
   (...)
    236         Currently can only be ManifestArray, but once VirtualZarrArray is implemented the default should be changed to that.
    237     """
--> 239     var_names = kerchunk.find_var_names(refs)
    240     if drop_variables is None:
    241         drop_variables = []

File ~/GitHub/VirtualiZarr/virtualizarr/kerchunk.py:154, in find_var_names(ds_reference_dict)
    151 def find_var_names(ds_reference_dict: KerchunkStoreRefs) -> list[str]:
    152     """Find the names of zarr variables in this store/group."""
--> 154     refs = ds_reference_dict["refs"]
    155     found_var_names = [key.split("/")[0] for key in refs.keys() if "/" in key]
    156     return found_var_names

KeyError: 'refs'
@TomNicholas
Member

Thanks for raising this @scottyhq! I don't think anyone has actually tried to open a tiff file with virtualizarr before!

Running your example, the output of kerchunk.tiff.tiff_to_zarr(url) looks like

{
  '.zgroup': '{\n "zarr_format": 2\n}',
  '.zattrs': '{"multiscales":[{"datasets":[{"path":"0"},{"path":"1"},{"path":"2"}],"metadata":{},"name":"","version":"0.1"}],"OVR_RESAMPLING_ALG":"NEAREST","LAYOUT":"IFDS_BEFORE_DATA","BLOCK_ORDER":"ROW_MAJOR","BLOCK_LEADER":"SIZE_AS_UINT4","BLOCK_TRAILER":"LAST_4_BYTES_REPEATED","KNOWN_INCOMPATIBLE_EDITION":"NO","KeyDirectoryVersion":1,"KeyRevision":1,"KeyRevisionMinor":0,"GTModelTypeGeoKey":1,"GTRasterTypeGeoKey":1,"GTCitationGeoKey":"Albers","GeographicTypeGeoKey":4326,"GeogCitationGeoKey":"WGS 84","GeogAngularUnitsGeoKey":9102,"GeogSemiMajorAxisGeoKey":6378140.0,"GeogInvFlatteningGeoKey":298.256999999996,"ProjectedCSTypeGeoKey":32767,"ProjectionGeoKey":32767,"ProjCoordTransGeoKey":11,"ProjLinearUnitsGeoKey":9001,"ProjStdParallel1GeoKey":29.5,"ProjStdParallel2GeoKey":45.5,"ProjNatOriginLongGeoKey":-96.0,"ProjNatOriginLatGeoKey":23.0,"ProjFalseEastingGeoKey":0.0,"ProjFalseNorthingGeoKey":0.0,"ModelPixelScale":[30.0,30.0,0.0],"ModelTiepoint":[0.0,0.0,0.0,-1801185.0,2700405.0,0.0]}',
  '0/.zattrs': '{\n "_ARRAY_DIMENSIONS": [\n  "Y",\n  "X"\n ]\n}',
  '0/.zarray': '{\n "chunks": [\n  512,\n  512\n ],\n "compressor": {\n  "id": "zlib"\n },\n "dtype": "|u1",\n "fill_value": 0,\n "filters": null,\n "order": "C",\n "shape": [\n  2048,\n  2048\n ],\n "zarr_format": 2\n}',
  ...,
}

It looks like this is not the same structure that e.g. kerchunk.hdf.SingleHdf5ToZarr returns.

What virtualizarr is expecting (and what the kerchunk docs promise...) is that the keys of the outermost dictionary are 'refs' and 'version'. This kerchunk.tiff.tiff_to_zarr(url) function seems to have jumped straight to giving us the contents that would normally be underneath the 'refs' key.

We could either fix this upstream in kerchunk, or just work around it here by special-casing tiffs to add that top-level {'refs': ...} ourselves. I vote for the latter.
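For illustration, the workaround could look roughly like the sketch below. This is only a sketch under assumptions, not what virtualizarr actually does: `ensure_store_level_refs` is a hypothetical helper name, `"version": 1` is assumed from the kerchunk reference spec, and `find_var_names` is copied from the traceback above so the example runs on its own with a hand-written stand-in for the `tiff_to_zarr` output.

```python
from typing import Any, Mapping


def ensure_store_level_refs(kerchunk_output: Mapping[str, Any]) -> dict[str, Any]:
    """Hypothetical helper: wrap flat references in the store-level layout.

    If the mapping already has a top-level 'refs' key it is returned as a
    plain dict; otherwise the whole mapping is assumed to be the contents
    that would normally live under 'refs'.
    """
    if "refs" in kerchunk_output:
        return dict(kerchunk_output)
    # Assumption: version 1 of the kerchunk reference spec.
    return {"version": 1, "refs": dict(kerchunk_output)}


def find_var_names(ds_reference_dict: Mapping[str, Any]) -> list[str]:
    """Same logic as virtualizarr.kerchunk.find_var_names in the traceback."""
    refs = ds_reference_dict["refs"]
    return [key.split("/")[0] for key in refs.keys() if "/" in key]


# Hand-written stand-in for the flat dict that tiff_to_zarr returned above.
flat = {
    ".zgroup": '{"zarr_format": 2}',
    "0/.zattrs": '{"_ARRAY_DIMENSIONS": ["Y", "X"]}',
    "0/.zarray": '{"shape": [2048, 2048]}',
}

wrapped = ensure_store_level_refs(flat)
var_names = sorted(set(find_var_names(wrapped)))  # no KeyError any more
```

The check on `"refs"` should be safe because a zarr array or group that happened to be named `refs` would show up as keys like `refs/.zarray`, not as a bare `refs` key.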

TomNicholas added the bug, Kerchunk, upstream issue, and references generation labels on Jun 25, 2024
@scottyhq
Contributor Author

Makes sense @TomNicholas. It seems like an upstream fix is the right approach if someone is motivated. I just came across this while working on #143, so I figured I'd document it. For what it's worth, the situation is currently the same with FITS:

url = 'https://fits.gsfc.nasa.gov/samples/WFPC2u5780205r_c0fx.fits'
kerchunk.fits.process_file(url)

@TomNicholas
Member

Yeah thanks for documenting it!

We should raise an issue upstream to report it, but as long as there are no other differences in the structure then working around it here should be very simple.
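One rough way to check for "other differences in the structure" before relying on a workaround is to scan the flat references for values that do not look like kerchunk entries. The sketch below assumes the version 1 spec shapes (inline data as a string or bytes, or a `[url]` / `[url, offset, length]` list); `find_unexpected_entries` is a hypothetical name, and the sample dict is made up for the example, not real `tiff_to_zarr` or `fits.process_file` output.

```python
from typing import Any, Mapping


def find_unexpected_entries(refs: Mapping[str, Any]) -> list[str]:
    """Hypothetical check: return keys whose values have an unexpected shape."""
    unexpected = []
    for key, value in refs.items():
        # Inline metadata or data (assumed to be str, or bytes in memory).
        if isinstance(value, (str, bytes)):
            continue
        # Whole-file reference [url] or byte range [url, offset, length].
        if (
            isinstance(value, (list, tuple))
            and len(value) in (1, 3)
            and isinstance(value[0], str)
            and all(isinstance(n, int) for n in value[1:])
        ):
            continue
        unexpected.append(key)
    return unexpected


# Made-up sample: three well-formed entries and two malformed ones.
sample_refs = {
    ".zgroup": '{"zarr_format": 2}',
    "0/0.0": ["lcmap_tiny_cog_2020.tif", 512, 1024],
    "0/0.1": ["lcmap_tiny_cog_2020.tif"],
    "0/0.2": ["lcmap_tiny_cog_2020.tif", 512],
    "0/0.3": {"oops": 1},
}

problems = find_unexpected_entries(sample_refs)
```

If a list like `problems` comes back empty for both the TIFF and FITS outputs, wrapping them under `'refs'` is probably all that is needed; anything it flags would be worth mentioning in the upstream issue.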
