Split kerchunk reader up #261
Note to self: I should just use an ABC with an EDIT: done
for more information, see https://pre-commit.ci
virtualizarr/readers/kerchunk.py
from pathlib import Path
from typing import Any, MutableMapping, Optional, cast
from typing import Iterable, Mapping, Optional

import ujson
from xarray import Dataset
from xarray.core.indexes import Index
from xarray.core.variable import Variable

from virtualizarr.backend import FileType, separate_coords
from virtualizarr.manifests import ChunkManifest, ManifestArray
from virtualizarr.readers.common import VirtualBackend
from virtualizarr.translators.kerchunk import dataset_from_kerchunk_refs
from virtualizarr.types.kerchunk import (
    KerchunkArrRefs,
    KerchunkStoreRefs,
)
from virtualizarr.utils import _FsspecFSFromFilepath
Notice the kerchunk reader never imports kerchunk @mpiannucci
class VirtualBackend(ABC):
    @staticmethod
    def open_virtual_dataset(
I'm not sure that a @staticmethod is the best way to do this, but for now it's just an internal implementation detail. The important thing is that the open_virtual_dataset function signature gets standardized by this approach.
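For readers following along, here is a minimal sketch of the pattern being discussed: an ABC whose static method pins down a single signature for every reader. The parameter names, the `dict` return type, and both subclasses are assumptions made for illustration (the real method returns an xarray `Dataset` and its full signature is not shown in this thread).

```python
from abc import ABC, abstractmethod
from typing import Iterable, Mapping, Optional


class VirtualBackend(ABC):
    """Base class: every reader exposes the same open_virtual_dataset signature."""

    @staticmethod
    @abstractmethod
    def open_virtual_dataset(
        filepath: str,
        drop_variables: Optional[Iterable[str]] = None,  # assumed parameter
        reader_options: Optional[Mapping] = None,  # assumed parameter
    ) -> dict:
        # The real readers return an xarray Dataset; a dict keeps this sketch
        # free of third-party imports.
        raise NotImplementedError


class ToyVirtualBackend(VirtualBackend):
    """Hypothetical reader used only to show the shared signature."""

    @staticmethod
    def open_virtual_dataset(filepath, drop_variables=None, reader_options=None) -> dict:
        return {"filepath": filepath, "dropped": sorted(drop_variables or [])}


class IncompleteBackend(VirtualBackend):
    """Hypothetical reader that forgot to implement the method."""
```

Because the method is static, callers never need an instance: `ToyVirtualBackend.open_virtual_dataset("a.nc")` works directly. One caveat of this design is that `abstractmethod` is only enforced on instantiation, so `IncompleteBackend()` raises `TypeError`, but nothing stops someone from referencing the unimplemented static method on the class itself.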
@keewis you might have opinions on this PR - I'm trying to move towards a system of virtual "backend readers" like xarray has backends. My proposal (see #245) is to eventually have one actual xarray backend that calls virtualizarr, and virtualizarr calls one of a number of registered virtual backends depending on the filetype.
Note: ensure warning is raised on
I'm still somewhat split on the idea to create a However, I totally support the creation of a plugin system for For an actual code review I'd have to read the changes you're proposing here, which I will try to do at some point during the weekend (but the PR is huge, so it might take me some time).
I agree actually. I'm still not sure that it's a good idea. Maybe
👍
No worries! I'm pretty sure this is fine, it can also always be refactored further.
(the failing tests are caused by a bad merge, as the changes to
Thanks for that heads up!
Actually the warning should be on
# TODO add entrypoint to allow external libraries to add to this mapping
VIRTUAL_BACKENDS = {
    "kerchunk": KerchunkVirtualBackend,
    "zarr_v3": ZarrV3VirtualBackend,
    "dmrpp": DMRPPVirtualBackend,
    # all the below call one of the kerchunk backends internally (https://fsspec.github.io/kerchunk/reference.html#file-format-backends)
    "netcdf3": NetCDF3VirtualBackend,
    "hdf5": HDF5VirtualBackend,
    "netcdf4": HDF5VirtualBackend,  # note this is the same as for hdf5
    "tiff": TIFFVirtualBackend,
    "fits": FITSVirtualBackend,
}
@maxrjones I think this is what you were suggesting.
I'm going to merge this now because I know @sharkinsspatial and @ayushnag would like to work off of it, but @keewis feel free to add any comments / thoughts here and I will address them in follow-up PRs.
Instead of trying to treat all uses of kerchunk references as one "reader", this PR splits them up so that each separate filetype is considered one "reader", even if many of those readers call kerchunk code.
This means every "reader" is now a separate definition of a function open_virtual_dataset, which the top-level open_virtual_dataset picks between with one big match case statement. Generalizing that match to be pluggable by third-party libraries would close #245.
This also makes the structure of our dependence on kerchunk much clearer - @mpiannucci it should now be a lot easier for you to try out the kerchunk-to-icechunk use case mentioned in #258 (comment).
dataset_from_kerchunk_refs before returning a virtual dataset #257, addresses Make kerchunk dependency entirely optional #258 (comment), and is a big step towards Make readers pluggable via entrypoint system #245.
Tests added
docs/releases.rst
New functions/methods are listed in api.rst
New functionality has documentation
cc @norlandrhagen @sharkinsspatial @ghidalgo3