Thank you for all the work you've done on Kerchunk! The prospect of accessing data spread between tons of tiny model output files in a cloud-performant manner is really exciting!
I'm currently running into some difficulties using Kerchunk to access GEFS output from AWS (https://registry.opendata.aws/noaa-gefs/). I'm following what I think is the "standard" pattern for accessing GRIB files (e.g. https://nbviewer.org/gist/peterm790/92eb1df3d58ba41d3411f8a840be2452, #150) but am getting a confusing error related to coordinates when I try to open the files with xarray. I've gathered that Kerchunk doesn't always play well with GEFS/GFS GRIB files and am happy to experiment with cogrib, but thought I'd post the issue in case there's an easy fix.
Here's my code:
import xarray as xr
import ujson
from kerchunk.grib2 import scan_grib
from kerchunk.combine import MultiZarrToZarr
import fsspec
import dask
import dask.distributed  # needed for LocalCluster/Client below

# Set details for GEFS files, variables.
# For simplicity, only want first 6 hours for a single variable and ensemble
# member (control) in a single run.
run_date = '20221011'
cycle = '00'
vbl_collection = 'pgrb2ap5'
type_of_level = 'heightAboveGround'
level = 2
grib_name = 't2m'
fhr_range = (0, 6)
json_dir = '/Users/lucien/Documents/projects/kerchunk'

s3_so = {
    'anon': True,
    'skip_instance_cache': True
}
afilter = {
    'typeOfLevel': type_of_level,
    'level': level,
    'cfVarName': grib_name
}

s3_fs = fsspec.filesystem('s3', **s3_so)
local_fs = fsspec.filesystem('file', auto_mkdir=True)

# Get list of remote files.
files = [
    f"s3://{f}"
    for f in s3_fs.glob(f"s3://noaa-gefs-pds/gefs.{run_date}/{cycle}/atmos/{vbl_collection}/*")
    if f[-4:] != '.idx' and '/gec00.' in f and fhr_range[0] <= int(f[-3:]) <= fhr_range[1]
]
print(files)
def make_json_name(file_url: str, message_number: int) -> str:
    grib_name = file_url.split('/')[-1]
    return f"{json_dir}/{grib_name}_message{message_number}.json"

def gen_json(file_url: str) -> None:
    out = scan_grib(file_url, storage_options=s3_so, filter=afilter)
    for i, message in enumerate(out):
        out_file_name = make_json_name(file_url, i)
        with local_fs.open(out_file_name, 'w') as f:
            f.write(ujson.dumps(message))
# Create fresh local directory for JSON output.
local_fs.rm(json_dir, recursive=True)
local_fs.mkdir(json_dir)

# Write JSONs in parallel.
with dask.distributed.LocalCluster(n_workers=3, processes=True) as cluster, \
        dask.distributed.Client(cluster) as client:
    print('JSON generation running on dask server:', client.dashboard_link)
    jobs = [
        dask.delayed(gen_json)(f)
        for f in files[:2]  # Testing with first two files for now.
    ]
    _ = dask.compute(*jobs, retries=10)
messages = local_fs.ls(json_dir)
m_list = list()
json_text = list()
for m in messages:
    with local_fs.open(m, 'r') as f:
        txt = ujson.load(f)
    json_text.append(txt)
    m_list.append(
        fsspec.get_mapper(
            'reference://',
            fo=txt,
            remote_protocol='s3',
            remote_options={'anon': True}
        )
    )

ds = xr.open_mfdataset(
    m_list,
    engine='zarr',
    combine='nested',
    concat_dim='valid_time',
    coords='minimal',
    data_vars='minimal',
    compat='override'
)
And here's the error I'm getting. It's being raised on the final xr.open_mfdataset line.
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/Users/lucien/Documents/projects/kerchunk_test/gefs_test.ipynb Cell 2 in <cell line: 88>()
     78 json_text.append(txt)
     79 m_list.append(
     80 fsspec.get_mapper(
     81 'reference://',
    (...)
     85 )
     86 )
---> 88 ds = xr.open_mfdataset(
     89 m_list,
     90 engine='zarr',
     91 combine='nested',
     92 concat_dim='valid_time',
     93 coords='minimal',
     94 data_vars='minimal',
     95 compat='override'
     96 )
File /opt/anaconda3/envs/kerchunk_test/lib/python3.9/site-packages/xarray/backends/api.py:996, in open_mfdataset(paths, chunks, concat_dim, compat, preprocess, engine, data_vars, coords, combine, parallel, join, attrs_file, combine_attrs, **kwargs)
993 open_ = open_dataset
994 getattr_ = getattr
--> 996 datasets = [open_(p, **open_kwargs) for p in paths]
997 closers = [getattr_(ds, "_close") for ds in datasets]
998 if preprocess is not None:
File /opt/anaconda3/envs/kerchunk_test/lib/python3.9/site-packages/xarray/backends/api.py:996, in <listcomp>(.0)
993 open_ = open_dataset
994 getattr_ = getattr
--> 996 datasets = [open_(p, **open_kwargs) for p in paths]
997 closers = [getattr_(ds, "_close") for ds in datasets]
998 if preprocess is not None:
File /opt/anaconda3/envs/kerchunk_test/lib/python3.9/site-packages/xarray/backends/api.py:539, in open_dataset(filename_or_obj, engine, chunks, cache, decode_cf, mask_and_scale, decode_times, decode_timedelta, use_cftime, concat_characters, decode_coords, drop_variables, inline_array, backend_kwargs, **kwargs)
527 decoders = _resolve_decoders_kwargs(
528 decode_cf,
529 open_backend_dataset_parameters=backend.open_dataset_parameters,
(...)
535 decode_coords=decode_coords,
536 )
538 overwrite_encoded_chunks = kwargs.pop("overwrite_encoded_chunks", None)
--> 539 backend_ds = backend.open_dataset(
540 filename_or_obj,
541 drop_variables=drop_variables,
542 **decoders,
543 **kwargs,
544 )
545 ds = _dataset_from_backend_dataset(
546 backend_ds,
547 filename_or_obj,
(...)
555 **kwargs,
556 )
557 return ds
File /opt/anaconda3/envs/kerchunk_test/lib/python3.9/site-packages/xarray/backends/zarr.py:854, in ZarrBackendEntrypoint.open_dataset(self, filename_or_obj, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, use_cftime, decode_timedelta, group, mode, synchronizer, consolidated, chunk_store, storage_options, stacklevel)
852 store_entrypoint = StoreBackendEntrypoint()
853 with close_on_error(store):
--> 854 ds = store_entrypoint.open_dataset(
855 store,
856 mask_and_scale=mask_and_scale,
857 decode_times=decode_times,
858 concat_characters=concat_characters,
859 decode_coords=decode_coords,
860 drop_variables=drop_variables,
861 use_cftime=use_cftime,
862 decode_timedelta=decode_timedelta,
863 )
864 return ds
File /opt/anaconda3/envs/kerchunk_test/lib/python3.9/site-packages/xarray/backends/store.py:26, in StoreBackendEntrypoint.open_dataset(self, store, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, use_cftime, decode_timedelta)
14 def open_dataset(
15 self,
16 store,
(...)
24 decode_timedelta=None,
25 ):
---> 26 vars, attrs = store.load()
27 encoding = store.get_encoding()
29 vars, attrs, coord_names = conventions.decode_cf_variables(
30 vars,
31 attrs,
(...)
38 decode_timedelta=decode_timedelta,
39 )
File /opt/anaconda3/envs/kerchunk_test/lib/python3.9/site-packages/xarray/backends/common.py:128, in AbstractDataStore.load(self)
106 def load(self):
107 """
108 This loads the variables and attributes simultaneously.
109 A centralized loading function makes it easier to create
(...)
125 are requested, so care should be taken to make sure its fast.
126 """
...
617 f"number of data dimensions, ndim={self.ndim}"
618 )
619 return dims
ValueError: dimensions ('latitude',) must have the same length as the number of data dimensions, ndim=2
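In case it helps narrow things down, a single message's references can presumably be opened in isolation, skipping the multi-file combine entirely; a minimal sketch, assuming the failure happens per reference store rather than in the combine step:

# Hypothetical minimal reproduction: open one message's mapper directly,
# without open_mfdataset, to check whether a single store already fails.
ds_single = xr.open_dataset(m_list[0], engine='zarr', consolidated=False)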
I'm currently using fsspec 2022.8.2, kerchunk 0.0.8, s3fs 2022.8.2, cfgrib 0.9.10.2, and xarray 2022.10.0. I think these are all the latest or fairly recent versions.
And here's the content of one of the JSON message files being created:
I'm having some difficulty interpreting the error. The underlying data files clearly have two dimensions (latitude, longitude), which I've confirmed by opening the GRIB files directly. xarray seems to pick up on this (it says ndim=2), but for some reason only one dimension name ('latitude') is being attached to a 2-D array. I don't have much experience interpreting the reference .json files, but I'm not seeing any obvious problems (I've included a rough sanity check below). I'm also unaware of any args/kwargs to scan_grib or open_mfdataset that would address the underlying issue.
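The sanity check assumes the standard kerchunk reference layout (metadata values under 'refs' are JSON-encoded strings) and compares each variable's _ARRAY_DIMENSIONS attribute against the rank of its .zarray shape. I haven't verified this is the right way to introspect the references, so treat it as a sketch:

# Sketch: compare dimension names vs. array rank for each variable in one
# of the generated reference files. Assumes the standard kerchunk layout,
# where metadata values are JSON-encoded strings.
with local_fs.open(messages[0], 'r') as f:
    refs = ujson.load(f)['refs']

for key, val in refs.items():
    if not key.endswith('/.zattrs'):
        continue  # only look at per-variable attribute entries
    var = key[:-len('/.zattrs')]
    dims = ujson.loads(val).get('_ARRAY_DIMENSIONS', [])
    shape = ujson.loads(refs[var + '/.zarray'])['shape']
    if len(dims) != len(shape):
        print(f"{var}: dims={dims} but shape={shape}")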
I also tried to open the files via MultiZarrToZarr and xr.open_dataset as in the HRRR tutorial but got the same error.
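For completeness, that attempt looked roughly like this (reconstructed from memory, so the exact kwargs may not match what I actually ran):

# Combine the per-message references and open the result as one dataset.
mzz = MultiZarrToZarr(
    messages,                      # per-message reference JSONs from above
    remote_protocol='s3',
    remote_options={'anon': True},
    concat_dims=['valid_time'],
)
translated = mzz.translate()

ds = xr.open_dataset(
    'reference://',
    engine='zarr',
    backend_kwargs={
        'consolidated': False,
        'storage_options': {
            'fo': translated,
            'remote_protocol': 's3',
            'remote_options': {'anon': True},
        },
    },
)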
Is this a problem that can be fixed easily? Or is there some kind of underlying incompatibility with GEFS grib files at the moment? And if the current version of scan_grib won't work, do you think going the cogrib route has a chance of solving the problem?
Thanks in advance for your help!
Lucien