added kerchunk as backend documentation #9163

Merged: 13 commits, Aug 2, 2024
2 changes: 2 additions & 0 deletions ci/requirements/doc.yml
@@ -8,6 +8,8 @@ dependencies:
- bottleneck
- cartopy
- cfgrib
- kerchunk
- s3fs
- dask-core>=2022.1
- dask-expr
- hypothesis>=6.75.8
1 change: 1 addition & 0 deletions doc/combined.json
@@ -0,0 +1 @@
{"version": 1, "refs": {".zgroup": "{\"zarr_format\":2}", "foo/.zarray": "{\"chunks\":[4,5],\"compressor\":null,\"dtype\":\"<f8\",\"fill_value\":\"NaN\",\"filters\":null,\"order\":\"C\",\"shape\":[4,5],\"zarr_format\":2}", "foo/.zattrs": "{\"_ARRAY_DIMENSIONS\":[\"x\",\"y\"],\"coordinates\":\"z\"}", "foo/0.0": "base64:Dxw+e/RK6z/Iau/qGyXOPxjE5O6St8I/GnghejTg4j9GXx3xTV7iPwBbNsnGXq8/y7DgEsXk4j9U8rt0n2fPPw67F5DZydU/wj5q2OWC7z/VC9L/32ztP/C4P4XPVqM/Ry/m0M+R6z9cC0BpPB3oP0RRdt9y7tk/9GDp81P81T8w7i8oneDFP0YWAN0XQtk/rfwIfoeI5D+6TEUh7JLRPw==", "x/.zarray": "{\"chunks\":[4],\"compressor\":null,\"dtype\":\"<i8\",\"fill_value\":null,\"filters\":null,\"order\":\"C\",\"shape\":[4],\"zarr_format\":2}", "x/.zattrs": "{\"_ARRAY_DIMENSIONS\":[\"x\"]}", "x/0": "\n\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0014\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u001e\u0000\u0000\u0000\u0000\u0000\u0000\u0000(\u0000\u0000\u0000\u0000\u0000\u0000\u0000", "y/.zarray": "{\"chunks\":[5],\"compressor\":null,\"dtype\":\"<i8\",\"fill_value\":null,\"filters\":null,\"order\":\"C\",\"shape\":[5],\"zarr_format\":2}", "y/.zattrs": "{\"_ARRAY_DIMENSIONS\":[\"y\"],\"calendar\":\"proleptic_gregorian\",\"units\":\"days since 2000-01-01 00:00:00\"}", "y/0": "\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0001\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0002\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0003\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0004\u0000\u0000\u0000\u0000\u0000\u0000\u0000", "z/.zarray": "{\"chunks\":[4],\"compressor\":null,\"dtype\":\"|O\",\"fill_value\":null,\"filters\":[{\"allow_nan\":true,\"check_circular\":true,\"encoding\":\"utf-8\",\"ensure_ascii\":true,\"id\":\"json2\",\"indent\":null,\"separators\":[\",\",\":\"],\"skipkeys\":false,\"sort_keys\":true,\"strict\":true}],\"order\":\"C\",\"shape\":[4],\"zarr_format\":2}", "z/0": "[\"a\",\"b\",\"c\",\"d\",\"|O\",[4]]", "z/.zattrs": "{\"_ARRAY_DIMENSIONS\":[\"x\"]}"}}
52 changes: 52 additions & 0 deletions doc/user-guide/io.rst
@@ -985,6 +985,57 @@ reads. Because this fall-back option is so much slower, xarray issues a
instead of falling back to try reading non-consolidated metadata.


.. _io.kerchunk:

Kerchunk
--------

`Kerchunk <https://fsspec.github.io/kerchunk/index.html>`_ is a Python library
that provides access to chunked and compressed data formats (such as NetCDF3, NetCDF4, HDF5, GRIB2, TIFF & FITS),
many of which are the primary formats of data archives, by presenting the
whole archive as an ephemeral `Zarr`_ dataset that supports parallel, chunk-specific access.

Instead of creating a new copy of the dataset in the Zarr format or
downloading the files locally, Kerchunk reads through the data archive, extracts the
byte range and compression information of each chunk, and saves them as a ``reference``.
These references are then stored as ``json`` files or, more efficiently, as ``parquet`` files
for later use. You can view some of these references in the **references**
directory `here <https://github.com/pydata/xarray-data>`_.
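
To make the idea of a reference more concrete, the following minimal sketch
inspects a small, hand-written reference set using only the standard library.
It assumes the version 1 layout, in which each value is either inline data
(a string, optionally prefixed with ``base64:``) or a list of the form
``[url, offset, length]``. The ``x/0`` entry mirrors the one in
``doc/combined.json``; the ``w/0`` and ``data/0.0`` entries, the bucket URL,
and the helper names ``kind`` and ``inline_bytes`` are hypothetical and exist
only for illustration.

```python
import base64
import struct

PAD = "\x00" * 7  # seven zero bytes that complete each little-endian int64

refs = {
    ".zgroup": '{"zarr_format":2}',
    # Inline chunk mirroring the "x/0" entry of doc/combined.json.
    "x/0": "\n" + PAD + "\x14" + PAD + "\x1e" + PAD + "(" + PAD,
    # Hypothetical inline binary chunk, stored with a "base64:" prefix.
    "w/0": "base64:" + base64.b64encode(struct.pack("<2d", 0.5, 1.5)).decode("ascii"),
    # Hypothetical byte-range entry: [url, offset, length].
    "data/0.0": ["s3://example-bucket/file.nc", 8192, 1024],
}


def kind(value):
    """Classify one reference entry (hypothetical helper)."""
    if isinstance(value, str):
        return "inline"
    return "byte range" if len(value) == 3 else "whole file"


def inline_bytes(value):
    """Return the raw bytes of an inline entry (hypothetical helper)."""
    if value.startswith("base64:"):
        return base64.b64decode(value[len("base64:"):])
    # Assumes plain inline strings only contain ASCII-range characters.
    return value.encode("utf-8")


kinds = {key: kind(value) for key, value in refs.items()}
# "x" is declared as dtype "<i8" with shape [4] and no compressor.
x_values = struct.unpack("<4q", inline_bytes(refs["x/0"]))
w_values = struct.unpack("<2d", inline_bytes(refs["w/0"]))

print(kinds)
print(x_values)
```

Only the ``data/0.0`` entry would require a read from the original file; the
inline entries are served directly from the reference set.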


.. note::
    These references follow this `specification <https://fsspec.github.io/kerchunk/spec.html>`_.
    Packages like `kerchunk`_ and `virtualizarr <https://github.com/zarr-developers/VirtualiZarr>`_
    help in creating and reading these references.

Reading these data archives becomes much easier with ``kerchunk`` in combination
with ``xarray``, especially when the archives are large. A single combined
reference can refer to thousands of the original data files in these archives.
You can view the whole dataset from this `combined reference` using the above packages.
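
As a rough illustration of how one combined reference spans several original
files, the sketch below collects the distinct source files that the byte-range
entries point at. The reference contents, the bucket URLs, and the helper name
``source_files`` are all hypothetical; a real combined reference would
typically be loaded from a ``json`` or ``parquet`` file instead.

```python
# Hypothetical combined reference: keys are Zarr chunk keys, values are
# inline strings or [url, offset, length] byte ranges.
combined_refs = {
    ".zgroup": '{"zarr_format":2}',
    "temp/0.0": ["s3://example-bucket/day1.nc", 4096, 2048],
    "temp/0.1": ["s3://example-bucket/day1.nc", 6144, 2048],
    "temp/1.0": ["s3://example-bucket/day2.nc", 4096, 2048],
    "temp/1.1": ["s3://example-bucket/day2.nc", 6144, 2048],
}


def source_files(refs):
    """Return the sorted, de-duplicated files behind the byte-range entries."""
    return sorted({value[0] for value in refs.values() if isinstance(value, list)})


files = source_files(combined_refs)
print(files)
```

Here four chunks of a single ``temp`` variable resolve to two original files,
which is the same pattern a large archive follows at a much bigger scale.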

The following example shows how to open a combined reference generated from a ``.hdf`` file.

.. ipython:: python

    # Options forwarded to fsspec's reference filesystem.
    storage_options = {
        "remote_protocol": "s3",  # protocol of the original data files
        "skip_instance_cache": True,
        "target_protocol": "file",  # protocol of the reference file itself
    }

    ds = xr.open_dataset(
        "./combined.json",
        engine="kerchunk",
        storage_options=storage_options,
    )

    ds

.. note::

    You can refer to the `Project Pythia kerchunk cookbook <https://projectpythia.org/kerchunk-cookbook/README.html>`_ for more information.


.. _io.iris:

Iris
@@ -1151,6 +1202,7 @@ The simplest way to serialize an xarray object is to use Python's built-in pickle
module:

.. ipython:: python
    :okexcept:

    import pickle
