VirtualDataset class instead of "virtual xr.Dataset" #171
Comments
@ghidalgo that is a totally fair complaint: that distinction between an air-quotes "virtual dataset" and a normal xarray dataset not being reflected in the top-level type is definitely confusing. Whilst the whole point of this package was to avoid having to write a new data structure with that re-implements

    class VirtualDataset(xr.Dataset):
        def to_kerchunk(self):
            ...

        def to_zarr(self):
            ...

(Could you teach such a class that calling Subclassing is only kind of supported in xarray, but it could be, and this is a fairly simple application of it (see pydata/xarray#3980). The downsides of this idea are: |
Another way to close this would be to just add more explanation to the documentation, e.g. a section like:

### Virtual datasets are different to normal xarray Datasets!
A "virtual dataset" is any `xr.Dataset` which wraps one or more `ManifestArray` objects.
Although the top-level type is still `xr.Dataset`, they are intended only as an abstract representation of a set of files, not as something you can do analysis with.
They only support a very limited subset of normal xarray operations, particularly:
- `concat`
- `merge`
- `.rename`... |
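As a rough illustration of the definition proposed above ("any `xr.Dataset` which wraps one or more `ManifestArray` objects"), here is a minimal sketch of how one might test for "virtual-ness". It deliberately uses hypothetical stand-in classes (`ManifestArrayStandIn`, `VariableStandIn`) and a hypothetical helper `is_virtual` so that it runs with only the standard library; none of these names are part of VirtualiZarr or xarray. With the real libraries, the analogous check would presumably inspect each variable's wrapped array, but that is an assumption rather than a documented API.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass
class ManifestArrayStandIn:
    """Hypothetical stand-in for a ManifestArray: chunk references, no values."""
    chunk_paths: tuple


@dataclass
class VariableStandIn:
    """Hypothetical stand-in for an xarray variable wrapping some array."""
    data: Any


def is_virtual(variables: Mapping[str, VariableStandIn]) -> bool:
    """Return True if at least one variable wraps a manifest-style array."""
    return any(
        isinstance(var.data, ManifestArrayStandIn) for var in variables.values()
    )


# One "virtual" collection of variables and one holding ordinary values.
virtual = {"air": VariableStandIn(ManifestArrayStandIn(("file1.nc",)))}
loaded = {"air": VariableStandIn([1.0, 2.0, 3.0])}

print(is_virtual(virtual), is_virtual(loaded))
```

The point of the sketch is only that the distinction lives in what the variables wrap, not in the top-level type, which is why it is easy to miss.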
FWIW here my intuition is that subclassing The reason we don't need to subclass in this case is that we are not adding anything to the xarray data model, we're just using the same data model for a different purpose. |
I do like adding that wording to the documentation to make it very explicit that a virtual dataset is an |
This is a really interesting point and caused me some confusion as well. I am wondering if this is more reason to keep the "special" |
We're realistically not going to actually do |
Thanks @chuckwondo for pointing me back to this from #389. I was writing up this whole thing about how xarray has made progress on subclassing but then I looked at I am still grappling with this though because conceptually it feels to me like a "virtual Dataset" is a different thing from the "data Datasets" that you normally have in xarray. For instance writing a virtual Dataset has to do with writing manifests and opening a virtual Dataset has to do with reading or creating manifests. When you open a "virtual Dataset" after the manifest has been stored to disk the data itself isn't readable at all. Functions like Feel free to ignore this. I just needed to write it down. |
There's an Xarray issue I opened tracking what would be needed for subclassing. AFAIK no progress has been made since then, because nobody has stepped up to the plate (I really hoped the UXarray people would but alas).
It could be though. See Ayush's issue about loading data from a ManifestArray.
Yes - VirtualiZarr makes quite a big leap in how it sees what Xarray's purpose is. So far all the use cases were for wrapping in-memory (e.g. numpy, cupy, sparse) arrays or lazily-computed arrays (dask, cubed), which may or may not be distributed. But VirtualiZarr wraps a non-computable array. We are now using Xarray literally just as an organizing layer for managing the manipulation of abstract arrays. In our case these abstract arrays happen to map to the Zarr data model. The advantage of this wrapping approach is that you don't have to rewrite any named-dimension-handling logic, manipulation functions like concat/merge/combine, or data structures for storing multiple variables (i.e. `xr.Dataset`). Kerchunk attempted to rewrite all of these, and the result was a one-off, highly unintuitive, and very buggy API. The disadvantage is that you create footguns that trigger whenever a user attempts to do something that can only be done using a computable array. We have some work to do here to better guard against these and raise clearer errors when they are triggered. |
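To make the "footgun" point above concrete, here is a small sketch of what guarding a non-computable array could look like. The class `NonComputableArray` and its error message are hypothetical and are not VirtualiZarr's actual implementation; the only real convention relied on is that NumPy-style conversion goes through an `__array__` hook, which the sketch overrides to fail loudly with an explanation.

```python
class NonComputableArray:
    """Hypothetical array-like that holds chunk references instead of values."""

    def __init__(self, shape, chunk_paths):
        self.shape = tuple(shape)
        self.chunk_paths = list(chunk_paths)

    def __array__(self, dtype=None, copy=None):
        # This hook is what gets called when something tries to materialize
        # the values, so it is a natural place to raise a clear error.
        raise NotImplementedError(
            "This array only stores references to chunks on disk and cannot be "
            "loaded into memory. Write the references out and reopen them with "
            "xarray to access the data."
        )


arr = NonComputableArray(shape=(2, 3), chunk_paths=["file1.nc"])

try:
    arr.__array__()  # simulate an operation that needs real values
except NotImplementedError as err:
    message = str(err)
    print(message)
```

Metadata such as `shape` stays cheap to access, while anything that needs actual values fails immediately with a message that says what to do instead.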
What was unclear to me was that there are 2 kinds of `xr.Dataset` objects I encountered with VirtualiZarr: the one returned by `virtualizarr.open_virtual_dataset` and the one returned by `xr.open_dataset(virtualizarr_produced_kerchunk_or_zarr_store)`.

The first `xr.Dataset` is good for concatenating/merging and writing to storage so that it can be read by xarray later into a new `xr.Dataset` which can be read. I know this is said on this page, but only in hindsight did I understand the implication:

If I could wave a magic wand to make it better I would not want the return value of `open_virtual_dataset` to be an `xr.Dataset` because that dataset doesn't behave like other `xr.Dataset` objects, and instead return something like a `VirtualiZarr.Dataset` which only has the functions that do work, but I understand that `xr.Dataset` is close enough and users (me) shouldn't expect it to load data.

Originally posted by @ghidalgo3 in #114 (comment)