Inferring concatenation order from coordinate data values #18
My understanding (/ my recollection 😅) of the code is that this would mean we would need to both read the kerchunk references and actually read data values from the file at dataset construction time. We would want a function flag to opt in to this behaviour.
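For reference, a sketch of what such an opt-in looks like in practice, based on the session later in this thread (the `loadable_variables` kwarg names the coordinates whose values should actually be read at construction time; the filename is just the example used below):

```python
# Keep data variables virtual, but actually read the values of the
# named coordinates so that indexes can be built for them.
from virtualizarr import open_virtual_dataset

vds = open_virtual_dataset(
    "air1.nc",  # example file from the session below
    loadable_variables=["time", "lat", "lon"],
)
```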
Okay I prematurely closed this, as it was not actually as simple as it seemed.
This is what I implemented in #52, but I didn't remember to check this part:
Turns out that doesn't work yet - it fails at some point within the combine step. In the meantime I should remove that section from the docs so as not to commit false advertising! EDIT: Removed in 2c5be3f
In general I wonder if it would be useful to have some custom Xarray index here. Basically a simple subclass of, or a very lightweight wrapper around, `PandasIndex`.
Would such decoupling of the index from the underlying coordinate data help here? The tricky part is that we still need to refactor Xarray to support it.
Writing the coordinate values directly into the references would speed up reading from the consolidated files - Kerchunk calls this "inlining". E.g. you wouldn't want to read ~90k time steps from 90k files just to construct a 90k-long time coordinate.
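A minimal kerchunk inlining sketch, assuming a local netCDF file: `inline_threshold` is kerchunk's knob for this, and chunks smaller than the threshold get their bytes embedded in the reference set itself, so readers never have to open the original file for them.

```python
import fsspec
from kerchunk.hdf import SingleHdf5ToZarr

# Chunks smaller than inline_threshold bytes (e.g. a small "time"
# coordinate) have their values embedded directly in the references,
# rather than being stored as pointers back into the original file.
with fsspec.open("air1.nc", "rb") as f:
    refs = SingleHdf5ToZarr(f, "air1.nc", inline_threshold=500).translate()
```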
Yeah that sounds like what we need! I basically need the […]
I think this is what I was hoping I could achieve with pydata/xarray#8872, but clearly I was missing some subtleties.
Naively, I don't really understand why this use case requires a custom Xarray index. I am still happy to use […]
@dcherian if we have retained both the index and the […]
IIUC yes.
Xarray indexes have full control over the data wrapped by their corresponding indexed coordinates, via `Index.create_variables()`. Actually, it might be enough to have a custom index that is a lightweight wrapper around any Xarray index (`PandasIndex` or other), where the wrapped index is only built on demand.
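A rough sketch of that wrapper idea, under the assumption that xarray's explicit-index API is used; import paths and hook signatures have shifted between xarray versions, so treat this as illustrative only, not the actual implementation:

```python
from xarray.core.indexes import Index, PandasIndex


class DeferredIndex(Index):
    """Wraps another index and defers building it until first use."""

    def __init__(self, variables, options):
        self._variables = dict(variables)
        self._options = dict(options)
        self._wrapped = None  # the real index, built lazily

    @classmethod
    def from_variables(cls, variables, *, options):
        # Crucially, don't touch the (possibly virtual) data here.
        return cls(variables, options)

    def _build(self):
        if self._wrapped is None:
            # Only now do we read values and build the real index.
            self._wrapped = PandasIndex.from_variables(
                self._variables, options=self._options
            )
        return self._wrapped

    def sel(self, labels, **kwargs):
        # Label-based selection forces the build.
        return self._build().sel(labels, **kwargs)
```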
Okay thanks. So implementing such an index would essentially involve subclassing `PandasIndex`?
This seems like something that might be useful in general - a sort of […]
Alternatively, we could add a build option to `PandasIndex`.
A build option as in a new kwarg to […]?
I don't think it is? Currently the only reason I want these indexes is so that `xr.combine_by_coords` can use them to infer the correct concatenation order.
Rather a new argument to […]
Yeah, I don't know. Maybe to eventually allow users to manually set custom indexes and do some domain-specific operations before "saving" the resulting virtual dataset?
Both this and the subclassing-`PandasIndex` approach […]
Yeah quite possibly. One thing at a time though! I appreciate your input here.
Yes sure! :-)
@benbovy do we actually need a subclass / build option at all? Can't we just change the implementation of `PandasIndex` itself?
I guess in theory such memory optimization might make sense for other, in-memory duck arrays? In practice I don't know much. Exposing some control to advanced users for this behaviour a priori doesn't seem like a bad idea to me.
I realized that this mostly already works if you load the coordinates via `loadable_variables`:

```
In [1]: from virtualizarr import open_virtual_dataset
In [2]: import xarray as xr
In [3]: ds1 = open_virtual_dataset('air1.nc', loadable_variables=['time', 'lat', 'lon'], cftime_variables=['time'])
In [4]: ds2 = open_virtual_dataset('air2.nc', loadable_variables=['time', 'lat', 'lon'], cftime_variables=['time'])
In [5]: xr.combine_by_coords([ds2, ds1], coords='minimal', compat='override')
Out[5]:
<xarray.Dataset> Size: 8MB
Dimensions: (time: 2920, lat: 25, lon: 53)
Coordinates:
* lat (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
* lon (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
* time (time) float32 12kB 1.867e+06 1.867e+06 ... 1.885e+06 1.885e+06
Data variables:
air (time, lat, lon) int16 8MB ManifestArray<shape=(2920, 25, 53), d...
Attributes:
Conventions: COARDS
description: Data is from NMC initialized reanalysis\n(4x/day). These a...
platform: Model
references: http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
title: 4x daily NMC reanalysis (1948)
```

The only issue with this result is that somehow the dtype of the `time` coordinate has not been decoded - it is still `float32` in the output above, despite passing `cftime_variables=['time']`. This approach doesn't solve the original issue, but it might also be fine in a lot of cases.
* add warning when user passes indexes=None
* expect warnings in tests
* new test to show how #18 doesn't fully work yet
* release notes
* lint
In hindsight I think this was an unnecessary and overly-complicated requirement. It's often totally fine to just load dimension coordinates into memory and write them into the Zarr (now Icechunk) store. #357 changes the default to create indexes for any dimension coordinate which appears in `loadable_variables`.
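With that default, the end-to-end flow looks roughly like this - a sketch reusing the kwargs from the session above; the `.virtualize.to_kerchunk` accessor is VirtualiZarr's documented way to persist references, though the exact writer API has evolved since:

```python
import xarray as xr
from virtualizarr import open_virtual_dataset

# Load dimension coordinates so real indexes exist for them.
vds1 = open_virtual_dataset("air1.nc", loadable_variables=["time", "lat", "lon"])
vds2 = open_virtual_dataset("air2.nc", loadable_variables=["time", "lat", "lon"])

# The indexes let xr.combine_by_coords infer the concatenation order,
# even if the datasets are passed in the "wrong" order.
combined = xr.combine_by_coords([vds2, vds1], coords="minimal", compat="override")

# Persist the references; the loaded coordinates are written as actual values.
combined.virtualize.to_kerchunk("combined.json", format="json")
```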
* remove warning
* fix test by loading dimension coordinates
* fix other test by passing loadable_variables
* move loadable variables docs section before combining
* remove recommendation to not create indexes
* de-emphasise avoiding creating indexes
* document using xr.combine_by_coords
* clarify todo
* signpost segue
* remove extra line
* release notes
* correct some PR numbers in release notes
* refer to #18 in release notes
If `xr.combine_by_coords` is used, then indexes must be created for dimension coordinates in order to use them to infer the correct order of concatenation. This requires loading the data for each such coordinate in order to create its index. We can do this, but I think we will then end up with a Dataset which contains some ManifestArrays but also some numpy arrays for the dimension coordinates. This isn't itself a problem - we have that information, and we can load it from the files.

The tricky part is that we presumably don't want to write those actual data values back out during the serialization step; we want to write only the chunk manifest for each dimension coordinate. How can we regain access to this information, or carry it around ready for serialization at the end?
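To make that mixed-Dataset situation concrete, a small sketch (assuming a combined virtual dataset named `combined`, like the one built above) that distinguishes which variables are still virtual and which have been loaded into memory:

```python
from virtualizarr.manifests import ManifestArray

for name, var in combined.variables.items():
    if isinstance(var.data, ManifestArray):
        print(f"{name}: virtual (only a chunk manifest, cheap to serialize)")
    else:
        print(f"{name}: loaded into memory ({type(var.data).__name__})")
```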