Concatenating datasets with staggered grids #362
When I actually tried this on my toy example locally I got an error:

```python
import ujson
from pathlib import Path

import fsspec
import kerchunk.combine
from kerchunk.combine import MultiZarrToZarr
from kerchunk.hdf import SingleHdf5ToZarr

# save the example datasets (ds1, ds2 from above) as netCDF
ds1.to_netcdf('./staggered_kerchunk_test/0.nc')
ds2.to_netcdf('./staggered_kerchunk_test/1.nc')

fs2 = fsspec.filesystem('')  # local file system to save final jsons to

def gen_json(in_filepath: str):
    in_filename = Path(in_filepath).stem
    h5chunks = SingleHdf5ToZarr(in_filepath, inline_threshold=300)
    out_filepath = f'staggered_kerchunk_test/{in_filename}.json'
    with fs2.open(out_filepath, 'wb') as f:
        f.write(ujson.dumps(h5chunks.translate()).encode())

for file in ['staggered_kerchunk_test/0.nc', 'staggered_kerchunk_test/1.nc']:
    gen_json(file)

# concatenate only variable 'a' along dimension 'x_center'
mzz = MultiZarrToZarr(
    ['staggered_kerchunk_test/0.json', 'staggered_kerchunk_test/1.json'],
    concat_dims=['x_center'],
    preprocess=kerchunk.combine.drop('b'),
)
d = mzz.translate('staggered_kerchunk_test/combined.json')
```

```
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[99], line 1
----> 1 d = mzz.translate('staggered_kerchunk_test/combined.json')

File /glade/work/tnicholas/conda-envs/kerchunk3.9/lib/python3.9/site-packages/kerchunk/combine.py:496, in MultiZarrToZarr.translate(self, filename, storage_options)
    490 """Perform all stages and return the resultant references dict
    491
    492 If filename and storage options are given, the output is written to this
    493 file using ujson and fsspec instead of being returned.
    494 """
    495 if 1 not in self.done:
--> 496     self.first_pass()
    497 if 2 not in self.done:
    498     self.store_coords()

File /glade/work/tnicholas/conda-envs/kerchunk3.9/lib/python3.9/site-packages/kerchunk/combine.py:258, in MultiZarrToZarr.first_pass(self)
    256 z = zarr.open_group(fs.get_mapper(""))
    257 for var in self.concat_dims:
--> 258     value = self._get_value(i, z, var, fn=self._paths[i])
    259 if isinstance(value, np.ndarray):
    260     value = value.ravel()

File /glade/work/tnicholas/conda-envs/kerchunk3.9/lib/python3.9/site-packages/kerchunk/combine.py:226, in MultiZarrToZarr._get_value(self, index, z, var, fn)
    224     o = z[var].attrs[item]
    225 elif selector.startswith("data:"):
--> 226     o = z[selector.split(":", 1)[1]][...]
    227 elif selector.startswith("cf:"):
    228     import cftime

File /glade/work/tnicholas/conda-envs/kerchunk3.9/lib/python3.9/site-packages/zarr/hierarchy.py:500, in Group.__getitem__(self, item)
    498     raise KeyError(item)
    499 else:
--> 500     raise KeyError(item)

KeyError: 'x_center'
```

This doesn't make sense to me - if I only drop variable 'b', then only 'a' should be left, so why can't it find `'x_center'`?
MultiZarrToZarr is very much oriented towards a netCDF-style merger of sub-datasets, wherein the variables share coordinates with one another, as opposed to the linear concat going on here. Other styles of merging are completely reasonable, and we should develop ways to run them and to help users choose between them.

Before going further: I believe the exception you are seeing is because x_center has no defined values (they become [0, 1, 2] in each input and aren't represented in the zarr version of each input at all), so kerchunk has no way to know where each of the constituent datasets is supposed to go. If you define them as

> This particular dataset has variable
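For illustration, here is a minimal pure-Python sketch (hypothetical helper, not kerchunk's API) of why undefined coordinate values are ambiguous: if every input reports the same values, there is no way to order the pieces, whereas globally increasing values fix the placement.

```python
def concat_order(per_file_values):
    """Order files by the first coordinate value of each file
    (a toy stand-in for what any combiner must be able to do).
    Raises if the values do not disambiguate the ordering."""
    firsts = [vals[0] for vals in per_file_values]
    if len(set(firsts)) != len(firsts):
        raise ValueError("coordinate values do not disambiguate file order")
    return sorted(range(len(firsts)), key=lambda i: firsts[i])

# Undefined coords degrade to [0, 1, 2] in every file: ambiguous.
try:
    concat_order([[0, 1, 2], [0, 1, 2]])
except ValueError:
    print("ambiguous")

# Explicitly defined, globally increasing coords: order is recoverable.
assert concat_order([[3, 4, 5], [0, 1, 2]]) == [1, 0]
```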
Is there some other tool that I should be using instead?
Use
Why is
In this case presumably it should still work because my chunks are of the form {X, X, X, ..., Y}, where Y <= X. I understand this general restriction, though.
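That {X, X, ..., Y} restriction can be checked mechanically; a short sketch with a hypothetical helper name:

```python
def valid_concat_chunks(lengths):
    """True if per-file lengths along the concat dim can form a single
    zarr chunk grid: all equal, except possibly a smaller final chunk."""
    if not lengths:
        return False
    first = lengths[0]
    return all(n == first for n in lengths[:-1]) and lengths[-1] <= first

assert valid_concat_chunks([3, 3, 3, 2])   # {X, X, X, Y} with Y <= X: fine
assert not valid_concat_chunks([3, 2, 3])  # partial chunk mid-sequence: invalid
```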
Simply that we haven't made a case for it. We either take the values as they appear ([0, 1, 2] for both sets) or take the index of the filename (which would be [0, 0, 0], [1, 1, 1]!) but not a combination. I agree that we should be able to spot the case where the coord is missing from the input completely and cope with that - but this can only work when concatenating on exactly one axis.
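The two available strategies described above, in miniature (hypothetical helper, not kerchunk's actual API):

```python
def coo_values(n_files, length_per_file, strategy):
    """Produce per-file coordinate values under the two strategies:
    'data' takes the values as stored in each file; 'index' takes the
    index of the file, repeated for every element."""
    if strategy == "data":
        return [list(range(length_per_file)) for _ in range(n_files)]
    if strategy == "index":
        return [[i] * length_per_file for i in range(n_files)]
    raise ValueError(strategy)

assert coo_values(2, 3, "data") == [[0, 1, 2], [0, 1, 2]]    # ambiguous duplicates
assert coo_values(2, 3, "index") == [[0, 0, 0], [1, 1, 1]]   # ordered but degenerate

# The combination one would actually want - a global index like
# [0, 1, 2], [3, 4, 5] - is neither of the above.
```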
Unfortunately, no. Zarr does allow the last chunk to be partial, but it still saves a buffer equivalent to a whole chunk and then slices off the needed parts.
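A sketch of that storage behaviour (hypothetical helper name): the final chunk may be logically partial, but the stored buffer is always full-sized.

```python
import math

def chunk_extents(size, chunk):
    """(key, valid_length, stored_length) for each chunk along one axis.
    Zarr writes a full chunk-sized buffer even for a partial final chunk;
    readers slice off the valid part."""
    n = math.ceil(size / chunk)
    return [
        (str(i), min((i + 1) * chunk, size) - i * chunk, chunk)
        for i in range(n)
    ]

# A length-7 axis with chunk size 3: the final chunk holds only one
# valid element but still occupies a full 3-element buffer on disk.
assert chunk_extents(7, 3) == [("0", 3, 3), ("1", 3, 3), ("2", 1, 3)]
```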
So if I promoted my dimensions to be coordinate variables then I could do a multi-dimensional concatenation like I suggested above?
That would be nice. It is what
The way we dealt with that ambiguity in xarray was to allow
Sorry, you're saying that whilst Zarr can represent arrays where the final chunk is partial, you currently cannot build a kerchunk reference from netCDF files where the final chunk is smaller, because of the way Zarr actually handles that partial chunk?
An example: note how keys "0" and "1" have the same number of bytes. They are expected to be the size of the full chunk on read.
Here are two toy datasets designed to represent sections of a dataset that has variables living on a staggered grid. This type of dataset is common in fluid modelling (handling staggered grids is why xGCM exists). I have netCDF output files from an ocean model (UCLA-ROMS) that have this basic structure.

Combining these types of datasets seems like a bit of a pain to do with kerchunk at the moment. To concatenate along the x direction, I actually need to concatenate `a` along `x_center`, and `b` along `x_outer`. So presumably I have to call `MultiZarrToZarr` once for each variable (or group of variables) that needs to be concatenated along a common dimension. My real dataset is split along multiple dimensions and has multiple staggered grid locations for each dimension, meaning I have to call `MultiZarrToZarr` something like 6 times.

This problem is analogous to what happens inside `xarray.combine_by_coords`, which automatically splits datasets up into sets consisting of the same variable from each dataset, concatenates each set separately (along multiple dimensions in general), then merges the results.

Is that approach (call `MultiZarrToZarr` multiple times then call `kerchunk.combine.merge_vars`) the recommended way to handle this case currently? Could we imagine some improvement to the `kerchunk.combine` API that might make this easier?