Non-kerchunk backend for HDF5/netcdf4 files. (#87)
* Generate chunk manifest backed variable from HDF5 dataset.

* Transfer dataset attrs to variable.

* Get virtual variables dict from HDF5 file.

* Update virtual_vars_from_hdf to use fsspec and drop_variables arg.

* mypy fix to use ChunkKey and empty dimensions list.

* Extract attributes from hdf5 root group.

* Use hdf reader for netcdf4 files.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix ruff complaints.

* First steps for handling HDF5 filters.

* Initial step for hdf5plugin supported codecs.

* Small commit to check compression support in CI environment.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix mypy complaints for hdf_filters.

* Local pre-commit fix for hdf_filters.

* Use fsspec reader_options introduced in #37.

* Fix incorrect zarr_v3 if block position from merge commit ef0d7a8.

* Fix early return from hdf _extract_attrs.

* Test that _extract_attrs correctly handles multiple attributes.

* Initial attempt at scale and offset via numcodecs.

* Tests for cfcodec_from_dataset.

* Temporarily relax integration tests to assert_allclose.

* Add blosc_lz4 fixture parameterization to confirm libnetcdf environment.

* Check for compatibility with netcdf4 engine.

* Use separate fixtures for h5netcdf and netcdf4 compression styles.

* Print libhdf5 and libnetcdf versions to confirm compiled environment.

* Skip netcdf4 style compression tests when libhdf5 < 1.14.

* Include imagecodecs.numcodecs to support HDF5 lzf filters.

* Remove test that verifies call to read_kerchunk_references_from_file.

* Add additional codec support structures for imagecodecs and numcodecs.

* Add codec config test for Zstd.

* Include initial cf decoding tests.

* Revert typo for scale_factor retrieval.

* Update reader to use new numpy manifest representation.

* Temporarily skip test until blosc netcdf4 issue is solved.

* Fix Pydantic 2 migration warnings.

* Include hdf5plugin and imagecodecs-numcodecs in mamba test environment.

* Mamba attempt with imagecodecs rather than imagecodecs-numcodecs.

* Mamba attempt with latest imagecodecs release.

* Use correct iter_chunks callback function signature.

* Include pip based imagecodecs-numcodecs until conda-forge availability.

* Handle non-coordinate dims which are serialized to hdf as an empty dataset.

* Use reader_options for filetype check and update failing kerchunk call.

* Fix chunkmanifest shaping for chunked datasets.

* Handle scale_factor attribute serialization for compressed files.

* Include chunked roundtrip fixture.

* Standardize xarray integration tests for hdf filters.

* Update reader selection logic for new filetype determination.

* Use decode_times for integration test.

* Standardize fixture names for hdf5 vs netcdf4 file types.

* Handle array add_offset property for compressed data.

* Include h5py shuffle filter.

* Make ScaleAndOffset codec last in filters list.

* Apply ScaleAndOffset codec to _FillValue since its value is now downstream.

* Coerce scale and add_offset values to native float for JSON serialization.

* Temporarily xfail integration tests for main.

* Remove pydantic dependency as per pull/210.

* Update test for new kerchunk reader module location.

* Fix branch typing errors.

* Re-include automatic file type determination.

* Handle various hdf flavors of _FillValue storage.

* Include loadable variables in drop variables list.

* Mock readers.hdf.virtual_vars_from_hdf to verify option passing.

* Convert numpy _FillValue to native Python for serialization support.

* Support groups with HDF5 reader.

* Handle empty variables with a shape.

* Import top-level version of xarray classes.

* Add option to explicitly specify use of an experimental hdf backend (see the usage sketch after this commit message).

* Include imagecodecs and hdf5plugin in all CI environments.

* Add test_hdf_integration tests to be skipped for non-kerchunk env.

* Include imagecodecs in dependencies.

* Diagnose imagecodecs-numcodecs installation failures in CI.

* Ignore mypy complaints for VirtualBackend.

* Remove checksum assert which varies across different zstd versions.

* Temporarily xfail integration tests with coordinate inconsistency.

* Remove backend arg for non-hdf network file tests.

* Fix mypy comment moved by ruff formatting.

* Make HDF reader dependencies optional.

* Handle optional imagecodecs and hdf5plugin dependency imports for tests.

* Prevent conflicts with explicit filetype and backend args.

* Correctly convert root coordinate attributes to a list.

* Clarify that method extracts attrs from any specified group.

* Restructure hdf reader and codec filters into a module namespace.

* Improve docstrings for hdf and filter modules.

* Explicitly specify HDF5VirtualBackend for test parameter.

* Include issue references for xfailed tests.

* Use soft import strategy for optional dependencies (see xarray/issues/9554).

* Handle mypy for soft imports.

* Attempt at nested optional dependency usage.

* Handle use of soft import sub modules for typing.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
sharkinsspatial and pre-commit-ci[bot] authored Nov 19, 2024
1 parent 5c3e204 commit 647d175
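
Before the file-by-file diff, a minimal usage sketch of what this change enables. It is illustrative only: the file path "air.nc" is hypothetical and not part of this PR, and the comment describes the expected result rather than captured output.

    # Explicitly opt into the new experimental non-kerchunk HDF reader.
    from virtualizarr import open_virtual_dataset
    from virtualizarr.readers import HDFVirtualBackend

    vds = open_virtual_dataset("air.nc", backend=HDFVirtualBackend)
    print(vds)  # an xarray.Dataset whose variables wrap ManifestArray chunk manifests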
Showing 18 changed files with 1,439 additions and 64 deletions.
6 changes: 6 additions & 0 deletions ci/environment.yml
@@ -13,6 +13,8 @@ dependencies:
- ujson
- packaging
- universal_pathlib
- hdf5plugin
- numcodecs
# Testing
- codecov
- pre-commit
@@ -27,7 +29,11 @@ dependencies:
- fsspec
- s3fs
- fastparquet
- imagecodecs>=2024.6.1
# for opening tiff files
- tifffile
# for opening FITS files
- astropy
- pip
- pip:
- imagecodecs-numcodecs==2024.6.1
1 change: 0 additions & 1 deletion ci/min-deps.yml
@@ -3,7 +3,6 @@ channels:
- conda-forge
- nodefaults
dependencies:
- h5netcdf
- h5py
- hdf5
- netcdf4
4 changes: 4 additions & 0 deletions ci/upstream.yml
@@ -12,6 +12,9 @@ dependencies:
- packaging
- ujson
- universal_pathlib
- hdf5plugin
- numcodecs
- imagecodecs>=2024.6.1
# Testing
- codecov
- pre-commit
@@ -27,3 +30,4 @@ dependencies:
- pip:
- icechunk # Installs zarr v3 as dependency
# - git+https://github.com/fsspec/kerchunk@main # kerchunk is currently incompatible with zarr-python v3 (https://github.com/fsspec/kerchunk/pull/516)
- imagecodecs-numcodecs==2024.6.1
11 changes: 10 additions & 1 deletion pyproject.toml
@@ -30,15 +30,23 @@ dependencies = [
]

[project.optional-dependencies]
hdf_reader = [
"fsspec",
"h5py",
"hdf5plugin",
"imagecodecs",
"imagecodecs-numcodecs==2024.6.1",
"numcodecs"
]
test = [
"codecov",
"fastparquet",
"fsspec",
"h5netcdf",
"h5py",
"kerchunk>=0.2.5",
"mypy",
"netcdf4",
"numcodecs",
"pandas-stubs",
"pooch",
"pre-commit",
Expand All @@ -48,6 +56,7 @@ test = [
"ruff",
"s3fs",
"scipy",
"virtualizarr[hdf_reader]"
]


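With the new hdf_reader extra above, the optional reader dependencies install via standard pip extras syntax, e.g. pip install "virtualizarr[hdf_reader]" (the same spelling the test extra now uses to pull them in).
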
16 changes: 12 additions & 4 deletions virtualizarr/backend.py
@@ -16,8 +16,10 @@
HDF5VirtualBackend,
KerchunkVirtualBackend,
NetCDF3VirtualBackend,
TIFFVirtualBackend,
ZarrV3VirtualBackend,
)
from virtualizarr.readers.common import VirtualBackend
from virtualizarr.utils import _FsspecFSFromFilepath, check_for_collisions

# TODO add entrypoint to allow external libraries to add to this mapping
@@ -26,10 +28,10 @@
"zarr_v3": ZarrV3VirtualBackend,
"dmrpp": DMRPPVirtualBackend,
# all the below call one of the kerchunk backends internally (https://fsspec.github.io/kerchunk/reference.html#file-format-backends)
"netcdf3": NetCDF3VirtualBackend,
"hdf5": HDF5VirtualBackend,
"netcdf4": HDF5VirtualBackend, # note this is the same as for hdf5
# "tiff": TIFFVirtualBackend,
"netcdf3": NetCDF3VirtualBackend,
"tiff": TIFFVirtualBackend,
"fits": FITSVirtualBackend,
}

@@ -112,6 +114,7 @@ def open_virtual_dataset(
indexes: Mapping[str, Index] | None = None,
virtual_array_class=ManifestArray,
reader_options: Optional[dict] = None,
backend: Optional[VirtualBackend] = None,
) -> Dataset:
"""
Open a file or store as an xarray Dataset wrapping virtualized zarr arrays.
@@ -173,15 +176,20 @@
if reader_options is None:
reader_options = {}

if backend and filetype:
raise ValueError("Cannot pass both a filetype and an explicit VirtualBackend")

if filetype is not None:
# if filetype is user defined, convert to FileType
filetype = FileType(filetype)
else:
filetype = automatically_determine_filetype(
filepath=filepath, reader_options=reader_options
)

backend_cls = VIRTUAL_BACKENDS.get(filetype.name.lower())
if backend:
backend_cls = backend
else:
backend_cls = VIRTUAL_BACKENDS.get(filetype.name.lower()) # type: ignore

if backend_cls is None:
raise NotImplementedError(f"Unsupported file type: {filetype.name}")
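
The hunk above also makes filetype and an explicit backend mutually exclusive. A short sketch of that failure mode (the file path is hypothetical; the error message is taken from the diff above):

    from virtualizarr import open_virtual_dataset
    from virtualizarr.readers import HDFVirtualBackend

    try:
        open_virtual_dataset("air.nc", filetype="netcdf4", backend=HDFVirtualBackend)
    except ValueError as err:
        print(err)  # Cannot pass both a filetype and an explicit VirtualBackend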
2 changes: 2 additions & 0 deletions virtualizarr/readers/__init__.py
@@ -1,5 +1,6 @@
from virtualizarr.readers.dmrpp import DMRPPVirtualBackend
from virtualizarr.readers.fits import FITSVirtualBackend
from virtualizarr.readers.hdf import HDFVirtualBackend
from virtualizarr.readers.hdf5 import HDF5VirtualBackend
from virtualizarr.readers.kerchunk import KerchunkVirtualBackend
from virtualizarr.readers.netcdf3 import NetCDF3VirtualBackend
@@ -9,6 +10,7 @@
__all__ = [
"DMRPPVirtualBackend",
"FITSVirtualBackend",
"HDFVirtualBackend",
"HDF5VirtualBackend",
"KerchunkVirtualBackend",
"NetCDF3VirtualBackend",
11 changes: 11 additions & 0 deletions virtualizarr/readers/hdf/__init__.py
@@ -0,0 +1,11 @@
from .hdf import (
HDFVirtualBackend,
construct_virtual_dataset,
open_loadable_vars_and_indexes,
)

__all__ = [
"HDFVirtualBackend",
"construct_virtual_dataset",
"open_loadable_vars_and_indexes",
]
195 changes: 195 additions & 0 deletions virtualizarr/readers/hdf/filters.py
@@ -0,0 +1,195 @@
import dataclasses
from typing import TYPE_CHECKING, List, Tuple, TypedDict, Union

import numcodecs.registry as registry
import numpy as np
from numcodecs.abc import Codec
from numcodecs.fixedscaleoffset import FixedScaleOffset
from xarray.coding.variables import _choose_float_dtype

from virtualizarr.utils import soft_import

if TYPE_CHECKING:
import h5py # type: ignore
from h5py import Dataset, Group # type: ignore

h5py = soft_import("h5py", "For reading hdf files", strict=False)
if h5py:
Dataset = h5py.Dataset
Group = h5py.Group
else:
Dataset = dict()
Group = dict()

hdf5plugin = soft_import(
"hdf5plugin", "For reading hdf files with filters", strict=False
)
imagecodecs = soft_import(
"imagecodecs", "For reading hdf files with filters", strict=False
)


_non_standard_filters = {
"gzip": "zlib",
"lzf": "imagecodecs_lzf",
}

_hdf5plugin_imagecodecs = {"lz4": "imagecodecs_lz4h5", "bzip2": "imagecodecs_bz2"}


@dataclasses.dataclass
class BloscProperties:
blocksize: int
clevel: int
shuffle: int
cname: str

def __post_init__(self):
blosc_compressor_codes = {
value: key
for key, value in hdf5plugin._filters.Blosc._Blosc__COMPRESSIONS.items()
}
self.cname = blosc_compressor_codes[self.cname]


@dataclasses.dataclass
class ZstdProperties:
level: int


@dataclasses.dataclass
class ShuffleProperties:
elementsize: int


@dataclasses.dataclass
class ZlibProperties:
level: int


class CFCodec(TypedDict):
target_dtype: np.dtype
codec: Codec


def _filter_to_codec(
filter_id: str, filter_properties: Union[int, None, Tuple] = None
) -> Codec:
"""
Convert an h5py filter to an equivalent numcodec
Parameters
----------
filter_id: str
An h5py filter id code.
filter_properties : int or None or Tuple
A single or Tuple of h5py filter configuration codes.
Returns
-------
A numcodec codec
"""
id_int = None
id_str = None
try:
id_int = int(filter_id)
except ValueError:
id_str = filter_id
conf = {}
if id_str:
if id_str in _non_standard_filters.keys():
id = _non_standard_filters[id_str]
else:
id = id_str
if id == "zlib":
zlib_props = ZlibProperties(level=filter_properties) # type: ignore
conf = dataclasses.asdict(zlib_props)
if id == "shuffle" and isinstance(filter_properties, tuple):
shuffle_props = ShuffleProperties(elementsize=filter_properties[0])
conf = dataclasses.asdict(shuffle_props)
conf["id"] = id # type: ignore[assignment]
if id_int:
filter = hdf5plugin.get_filters(id_int)[0]
id = filter.filter_name
if id in _hdf5plugin_imagecodecs.keys():
id = _hdf5plugin_imagecodecs[id]
if id == "blosc" and isinstance(filter_properties, tuple):
blosc_fields = [field.name for field in dataclasses.fields(BloscProperties)]
blosc_props = BloscProperties(
**{k: v for k, v in zip(blosc_fields, filter_properties[-4:])}
)
conf = dataclasses.asdict(blosc_props)
if id == "zstd" and isinstance(filter_properties, tuple):
zstd_props = ZstdProperties(level=filter_properties[0])
conf = dataclasses.asdict(zstd_props)
conf["id"] = id
codec = registry.get_codec(conf)
return codec


def cfcodec_from_dataset(dataset: Dataset) -> Codec | None:
"""
Converts select h5py dataset CF convention attrs to CFCodec
Parameters
----------
dataset: h5py.Dataset
An h5py dataset.
Returns
-------
CFCodec
A CFCodec.
"""
attributes = {attr: dataset.attrs[attr] for attr in dataset.attrs}
mapping = {}
if "scale_factor" in attributes:
try:
scale_factor = attributes["scale_factor"][0]
except IndexError:
scale_factor = attributes["scale_factor"]
mapping["scale_factor"] = float(1 / scale_factor)
else:
mapping["scale_factor"] = 1
if "add_offset" in attributes:
try:
offset = attributes["add_offset"][0]
except IndexError:
offset = attributes["add_offset"]
mapping["add_offset"] = float(offset)
else:
mapping["add_offset"] = 0
if mapping["scale_factor"] != 1 or mapping["add_offset"] != 0:
float_dtype = _choose_float_dtype(dtype=dataset.dtype, mapping=mapping)
target_dtype = np.dtype(float_dtype)
codec = FixedScaleOffset(
offset=mapping["add_offset"],
scale=mapping["scale_factor"],
dtype=target_dtype,
astype=dataset.dtype,
)
cfcodec = CFCodec(target_dtype=target_dtype, codec=codec)
return cfcodec
else:
return None


def codecs_from_dataset(dataset: Dataset) -> List[Codec]:
"""
Extracts a list of numcodecs from an h5py dataset
Parameters
----------
dataset: h5py.Dataset
An h5py dataset.
Returns
-------
list
A list of numcodecs codecs.
"""
codecs = []
for filter_id, filter_properties in dataset._filters.items():
codec = _filter_to_codec(filter_id, filter_properties)
codecs.append(codec)
return codecs
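
A hedged sketch of how the two public helpers above are meant to be used together. The file name "scaled.nc", variable name "air", and the gzip level in the comment are hypothetical; the scale arithmetic follows numcodecs' FixedScaleOffset semantics (decode is x / scale + offset).

    import h5py

    from virtualizarr.readers.hdf.filters import (
        cfcodec_from_dataset,
        codecs_from_dataset,
    )

    # netCDF4 files are HDF5 files, so h5py can open them directly.
    with h5py.File("scaled.nc", "r") as f:
        ds = f["air"]
        codecs = codecs_from_dataset(ds)    # e.g. [Zlib(level=4)] for gzip-compressed data
        cfcodec = cfcodec_from_dataset(ds)  # None unless scale_factor/add_offset attrs are present
        if cfcodec is not None:
            # A CF scale_factor of 0.01 is stored above as scale = 1 / 0.01 = 100.0,
            # so decoding yields stored * 0.01 + add_offset, matching CF conventions.
            print(cfcodec["target_dtype"], cfcodec["codec"])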
(Diff truncated: 10 of the 18 changed files are not shown above.)
