added kerchunk as backend documentation #9163

Merged (13 commits) on Aug 2, 2024
1 change: 1 addition & 0 deletions ci/requirements/doc.yml
@@ -8,6 +8,7 @@ dependencies:
- bottleneck
- cartopy
- cfgrib
- kerchunk
- dask-core>=2022.1
- dask-expr
- hypothesis>=6.75.8
30 changes: 30 additions & 0 deletions doc/combined.json
@@ -0,0 +1,30 @@
{
"version": 1,
"refs": {
".zgroup": "{\"zarr_format\":2}",
"foo/.zarray": "{\"chunks\":[4,5],\"compressor\":null,\"dtype\":\"<f8\",\"fill_value\":\"NaN\",\"filters\":null,\"order\":\"C\",\"shape\":[4,5],\"zarr_format\":2}",
"foo/.zattrs": "{\"_ARRAY_DIMENSIONS\":[\"x\",\"y\"],\"coordinates\":\"z\"}",
"foo/0.0": [
"saved_on_disk.h5",
8192,
160
],
"x/.zarray": "{\"chunks\":[4],\"compressor\":null,\"dtype\":\"<i8\",\"fill_value\":null,\"filters\":null,\"order\":\"C\",\"shape\":[4],\"zarr_format\":2}",
"x/.zattrs": "{\"_ARRAY_DIMENSIONS\":[\"x\"]}",
"x/0": [
"saved_on_disk.h5",
8352,
32
],
"y/.zarray": "{\"chunks\":[5],\"compressor\":null,\"dtype\":\"<i8\",\"fill_value\":null,\"filters\":null,\"order\":\"C\",\"shape\":[5],\"zarr_format\":2}",
"y/.zattrs": "{\"_ARRAY_DIMENSIONS\":[\"y\"],\"calendar\":\"proleptic_gregorian\",\"units\":\"days since 2000-01-01 00:00:00\"}",
"y/0": [
"saved_on_disk.h5",
8384,
40
],
"z/.zarray": "{\"chunks\":[4],\"compressor\":null,\"dtype\":\"|O\",\"fill_value\":null,\"filters\":[{\"allow_nan\":true,\"check_circular\":true,\"encoding\":\"utf-8\",\"ensure_ascii\":true,\"id\":\"json2\",\"indent\":null,\"separators\":[\",\",\":\"],\"skipkeys\":false,\"sort_keys\":true,\"strict\":true}],\"order\":\"C\",\"shape\":[4],\"zarr_format\":2}",
"z/0": "[\"a\",\"b\",\"c\",\"d\",\"|O\",[4]]",
"z/.zattrs": "{\"_ARRAY_DIMENSIONS\":[\"x\"]}"
}
}
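
The byte ranges in this reference file can be sanity-checked against the array metadata. The sketch below is a minimal check, assuming the chunks are uncompressed and unfiltered (as the ``"compressor":null`` and ``"filters":null`` entries for ``foo``, ``x`` and ``y`` indicate), so each chunk's length should equal the number of elements in the chunk times the item size. The ``.zarray`` strings are trimmed copies of the entries above, and the helper ``expected_nbytes`` is a hypothetical name used only for illustration.

```python
import json
import math

# Metadata and chunk references copied from combined.json, with the
# .zarray entries trimmed to the fields this check needs.
refs = {
    "foo/.zarray": '{"chunks":[4,5],"compressor":null,"dtype":"<f8","shape":[4,5]}',
    "foo/0.0": ["saved_on_disk.h5", 8192, 160],
    "x/.zarray": '{"chunks":[4],"compressor":null,"dtype":"<i8","shape":[4]}',
    "x/0": ["saved_on_disk.h5", 8352, 32],
    "y/.zarray": '{"chunks":[5],"compressor":null,"dtype":"<i8","shape":[5]}',
    "y/0": ["saved_on_disk.h5", 8384, 40],
}


def expected_nbytes(zarray_json):
    """Bytes in one uncompressed chunk: number of elements times item size."""
    meta = json.loads(zarray_json)
    itemsize = int(meta["dtype"][2:])  # "<f8" -> 8, "<i8" -> 8
    return math.prod(meta["chunks"]) * itemsize


# Map each variable to (length recorded in the reference, length expected).
checks = {}
for name, chunk_key in [("foo", "foo/0.0"), ("x", "x/0"), ("y", "y/0")]:
    _path, _offset, length = refs[chunk_key]
    checks[name] = (length, expected_nbytes(refs[name + "/.zarray"]))

print(checks)
```

The same reasoning does not apply to ``z``, whose data is stored inline in the reference file rather than as a byte range.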
53 changes: 53 additions & 0 deletions doc/user-guide/io.rst
@@ -985,6 +985,59 @@ reads. Because this fall-back option is so much slower, xarray issues a
instead of falling back to try reading non-consolidated metadata.


.. _io.kerchunk:

Kerchunk
--------

`Kerchunk <https://fsspec.github.io/kerchunk/index.html>`_ is a Python library
that allows you to access chunked and compressed data formats (such as NetCDF3, NetCDF4, HDF5, GRIB2, TIFF & FITS),
which are the primary formats of many data archives, by viewing the
whole archive as an ephemeral `Zarr`_ dataset that allows parallel, chunk-specific access.

Instead of creating a new copy of the dataset in the Zarr format or
downloading the files locally, Kerchunk reads through the data archive, extracts the
byte range and compression information of each chunk, and saves them as a ``reference``.
These references are then stored as ``json`` files or, more efficiently, as ``parquet`` files
for later use. You can view some of these stored in the ``references``
directory `here <https://github.com/pydata/xarray-data>`_.
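
To make the idea of a reference concrete, the sketch below resolves reference entries by hand using only the standard library. It assumes the two entry forms described in the Kerchunk reference specification, an inline string or a ``[path, offset, length]`` list. The file ``example.bin``, the keys, and the helper ``read_reference`` are hypothetical, and real readers such as fsspec additionally handle templates, encoded inline data, compression, and remote files.

```python
import os
import struct
import tempfile

# Build a small binary file standing in for an archive file (hypothetical data).
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "example.bin")
values = (1.0, 2.0, 3.0, 4.0)
with open(path, "wb") as f:
    f.write(b"\x00" * 16)                 # 16 bytes of header before the chunk
    f.write(struct.pack("<4d", *values))  # one uncompressed chunk of four float64 values

# A reference maps a chunk key either to inline data or to [path, offset, length].
refs = {
    "var/.zattrs": '{"_ARRAY_DIMENSIONS":["x"]}',  # small metadata stored inline
    "var/0": [path, 16, 32],                       # chunk stored as a byte range
}


def read_reference(refs, key):
    """Return the bytes that a reference entry points to."""
    entry = refs[key]
    if isinstance(entry, str):
        return entry.encode("utf-8")
    file_path, offset, length = entry
    with open(file_path, "rb") as f:
        f.seek(offset)
        return f.read(length)


chunk = struct.unpack("<4d", read_reference(refs, "var/0"))
print(chunk)
```

Only the 32 bytes of the requested chunk are read, which is what makes chunk-specific access to a large archive file possible without copying or converting it.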


.. note::
    These references follow this `specification <https://fsspec.github.io/kerchunk/spec.html>`_.
    Packages like `kerchunk`_ and `virtualizarr <https://github.com/zarr-developers/VirtualiZarr>`_
    help in creating and reading these references.


Reading these data archives becomes much simpler with ``kerchunk`` in combination
with ``xarray``, especially when the archives are large. A single combined
reference can refer to thousands of the original data files present in these archives.
You can view the whole dataset from this combined reference using the above packages.

The following example shows opening a combined reference generated from an HDF5 (``.h5``) file stored locally.

.. ipython:: python

    storage_options = {
        "target_protocol": "file",
    }

    # add the `remote_protocol` key in `storage_options` if you're accessing a file remotely

    ds1 = xr.open_dataset(
        "./combined.json",
        engine="kerchunk",
        storage_options=storage_options,
    )

    ds1
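
For references whose data files live in remote storage, the options might look like the sketch below. It assumes that the ``kerchunk`` engine forwards ``storage_options`` to fsspec's reference filesystem, which accepts ``remote_protocol`` and ``remote_options``. The ``s3`` protocol, the anonymous-access option, and the helper ``open_remote_reference`` are illustrative assumptions, not part of the example above.

```python
# Options for a reference file stored locally whose chunks live in remote storage.
storage_options = {
    "target_protocol": "file",         # where the reference file itself is read from
    "remote_protocol": "s3",           # assumption: the referenced data files are on S3
    "remote_options": {"anon": True},  # assumption: the bucket allows anonymous reads
}


def open_remote_reference(reference_path):
    """Open a reference file with the options above (hypothetical helper)."""
    import xarray as xr  # imported lazily; needs xarray, kerchunk and s3fs installed

    return xr.open_dataset(
        reference_path,
        engine="kerchunk",
        storage_options=storage_options,
    )
```

Calling ``open_remote_reference`` requires network access and credentials suitable for the storage being read, so adjust ``remote_options`` to match your own archive.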

.. note::

    You can refer to the `Project Pythia Kerchunk cookbook <https://projectpythia.org/kerchunk-cookbook/README.html>`_
    and the `Pangeo guide on Kerchunk <https://guide.cloudnativegeo.org/kerchunk/intro.html>`_ for more information.


.. _io.iris:

Iris