Problem using scan_grib with GEFS output #238

Closed
lucien-sim opened this issue Oct 19, 2022 · 2 comments · Fixed by #239

Comments

@lucien-sim

Thank you for all the work you've done on Kerchunk! The prospect of accessing data spread across tons of tiny model output files in a cloud-performant manner is really exciting!

I'm currently running into some difficulties using Kerchunk to access GEFS output from AWS (https://registry.opendata.aws/noaa-gefs/). I'm following what I think is the "standard" pattern for accessing GRIB files (e.g. https://nbviewer.org/gist/peterm790/92eb1df3d58ba41d3411f8a840be2452, #150) but am getting a confusing error related to coordinates when I try to open the files with xarray. I've gathered that Kerchunk doesn't always play well with GEFS/GFS GRIB files and am happy to experiment with cogrib, but thought I'd post the issue in case there's an easy fix.

Here's my code:

import xarray as xr
import ujson
from kerchunk.grib2 import scan_grib
from kerchunk.combine import MultiZarrToZarr
import fsspec
import dask
import dask.distributed  # LocalCluster/Client live in the distributed subpackage

# Set details for GEFS files, variables. 
# For simplicity, only want first 6 hours for a single variable and ensemble
# member (control) in a single run. 
run_date = '20221011'
cycle = '00'
vbl_collection = 'pgrb2ap5'
type_of_level = 'heightAboveGround'
level = 2
grib_name = 't2m'
fhr_range = (0, 6)
json_dir = '/Users/lucien/Documents/projects/kerchunk'

s3_so = {
    'anon': True, 
    'skip_instance_cache': True
}

afilter = {
    'typeOfLevel': type_of_level, 
    'level': level, 
    'cfVarName': grib_name
}

s3_fs = fsspec.filesystem('s3', **s3_so)
local_fs = fsspec.filesystem('file', auto_mkdir=True)

# Get list of remote files. 
files = [
    f"s3://{f}"
    for f in s3_fs.glob(f"s3://noaa-gefs-pds/gefs.{run_date}/{cycle}/atmos/{vbl_collection}/*")
    if f[-4:] != '.idx'
    and '/gec00.' in f
    and fhr_range[0] <= int(f[-3:]) <= fhr_range[1]
]

print(files)

def make_json_name(file_url: str, message_number: int) -> str: 
    # Use the remote file's basename; avoid shadowing the global grib_name.
    file_name = file_url.split('/')[-1]
    return f"{json_dir}/{file_name}_message{message_number}.json"


def gen_json(file_url: str) -> None:
    out = scan_grib(file_url, storage_options=s3_so, filter=afilter) 
    for i, message in enumerate(out): 
        out_file_name = make_json_name(file_url, i)
        with local_fs.open(out_file_name, 'w') as f: 
            f.write(ujson.dumps(message))

# Create a fresh local directory for JSON output. 
if local_fs.exists(json_dir):
    local_fs.rm(json_dir, recursive=True)
local_fs.mkdir(json_dir)

# Write JSONs in parallel
with dask.distributed.LocalCluster(n_workers=3, processes=True) as cluster, \
        dask.distributed.Client(cluster) as client: 
    print('JSON generation running on dask server:', client.dashboard_link)
    jobs = [
        dask.delayed(gen_json)(f)
        for f in files[:2]  # Testing with first two files for now. 
    ]
    _ = dask.compute(*jobs, retries=10)

messages = local_fs.ls(json_dir)
m_list = list()
json_text = list()
for m in messages: 
    with local_fs.open(m, 'r') as f: 
        txt = ujson.load(f)
        json_text.append(txt)
        m_list.append(
            fsspec.get_mapper(
                'reference://', 
                fo=txt, 
                remote_protocol='s3', 
                remote_options={'anon': True}
            )
        )

ds = xr.open_mfdataset(
    m_list, 
    engine='zarr', 
    combine='nested', 
    concat_dim='valid_time', 
    coords='minimal', 
    data_vars='minimal', 
    compat='override'
)

And here's the error I'm getting. It's being raised on the final xr.open_mfdataset line.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/Users/lucien/Documents/projects/kerchunk_test/gefs_test.ipynb Cell 2 in <cell line: 88>()
     [78](vscode-notebook-cell:/Users/lucien/Documents/projects/kerchunk_test/gefs_test.ipynb#W1sZmlsZQ%3D%3D?line=77)         json_text.append(txt)
     [79](vscode-notebook-cell:/Users/lucien/Documents/projects/kerchunk_test/gefs_test.ipynb#W1sZmlsZQ%3D%3D?line=78)         m_list.append(
     [80](vscode-notebook-cell:/Users/lucien/Documents/projects/kerchunk_test/gefs_test.ipynb#W1sZmlsZQ%3D%3D?line=79)             fsspec.get_mapper(
     [81](vscode-notebook-cell:/Users/lucien/Documents/projects/kerchunk_test/gefs_test.ipynb#W1sZmlsZQ%3D%3D?line=80)                 'reference://', 
   (...)
     [85](vscode-notebook-cell:/Users/lucien/Documents/projects/kerchunk_test/gefs_test.ipynb#W1sZmlsZQ%3D%3D?line=84)             )
     [86](vscode-notebook-cell:/Users/lucien/Documents/projects/kerchunk_test/gefs_test.ipynb#W1sZmlsZQ%3D%3D?line=85)         )
---> [88](vscode-notebook-cell:/Users/lucien/Documents/projects/kerchunk_test/gefs_test.ipynb#W1sZmlsZQ%3D%3D?line=87) ds = xr.open_mfdataset(
     [89](vscode-notebook-cell:/Users/lucien/Documents/projects/kerchunk_test/gefs_test.ipynb#W1sZmlsZQ%3D%3D?line=88)     m_list, 
     [90](vscode-notebook-cell:/Users/lucien/Documents/projects/kerchunk_test/gefs_test.ipynb#W1sZmlsZQ%3D%3D?line=89)     engine='zarr', 
     [91](vscode-notebook-cell:/Users/lucien/Documents/projects/kerchunk_test/gefs_test.ipynb#W1sZmlsZQ%3D%3D?line=90)     combine='nested', 
     [92](vscode-notebook-cell:/Users/lucien/Documents/projects/kerchunk_test/gefs_test.ipynb#W1sZmlsZQ%3D%3D?line=91)     concat_dim='valid_time', 
     [93](vscode-notebook-cell:/Users/lucien/Documents/projects/kerchunk_test/gefs_test.ipynb#W1sZmlsZQ%3D%3D?line=92)     coords='minimal', 
     [94](vscode-notebook-cell:/Users/lucien/Documents/projects/kerchunk_test/gefs_test.ipynb#W1sZmlsZQ%3D%3D?line=93)     data_vars='minimal', 
     [95](vscode-notebook-cell:/Users/lucien/Documents/projects/kerchunk_test/gefs_test.ipynb#W1sZmlsZQ%3D%3D?line=94)     compat='override'
     [96](vscode-notebook-cell:/Users/lucien/Documents/projects/kerchunk_test/gefs_test.ipynb#W1sZmlsZQ%3D%3D?line=95) )

File /opt/anaconda3/envs/kerchunk_test/lib/python3.9/site-packages/xarray/backends/api.py:996, in open_mfdataset(paths, chunks, concat_dim, compat, preprocess, engine, data_vars, coords, combine, parallel, join, attrs_file, combine_attrs, **kwargs)
    993     open_ = open_dataset
    994     getattr_ = getattr
--> 996 datasets = [open_(p, **open_kwargs) for p in paths]
    997 closers = [getattr_(ds, "_close") for ds in datasets]
    998 if preprocess is not None:

File /opt/anaconda3/envs/kerchunk_test/lib/python3.9/site-packages/xarray/backends/api.py:996, in <listcomp>(.0)
    993     open_ = open_dataset
    994     getattr_ = getattr
--> 996 datasets = [open_(p, **open_kwargs) for p in paths]
    997 closers = [getattr_(ds, "_close") for ds in datasets]
    998 if preprocess is not None:

File /opt/anaconda3/envs/kerchunk_test/lib/python3.9/site-packages/xarray/backends/api.py:539, in open_dataset(filename_or_obj, engine, chunks, cache, decode_cf, mask_and_scale, decode_times, decode_timedelta, use_cftime, concat_characters, decode_coords, drop_variables, inline_array, backend_kwargs, **kwargs)
    527 decoders = _resolve_decoders_kwargs(
    528     decode_cf,
    529     open_backend_dataset_parameters=backend.open_dataset_parameters,
   (...)
    535     decode_coords=decode_coords,
    536 )
    538 overwrite_encoded_chunks = kwargs.pop("overwrite_encoded_chunks", None)
--> 539 backend_ds = backend.open_dataset(
    540     filename_or_obj,
    541     drop_variables=drop_variables,
    542     **decoders,
    543     **kwargs,
    544 )
    545 ds = _dataset_from_backend_dataset(
    546     backend_ds,
    547     filename_or_obj,
   (...)
    555     **kwargs,
    556 )
    557 return ds

File /opt/anaconda3/envs/kerchunk_test/lib/python3.9/site-packages/xarray/backends/zarr.py:854, in ZarrBackendEntrypoint.open_dataset(self, filename_or_obj, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, use_cftime, decode_timedelta, group, mode, synchronizer, consolidated, chunk_store, storage_options, stacklevel)
    852 store_entrypoint = StoreBackendEntrypoint()
    853 with close_on_error(store):
--> 854     ds = store_entrypoint.open_dataset(
    855         store,
    856         mask_and_scale=mask_and_scale,
    857         decode_times=decode_times,
    858         concat_characters=concat_characters,
    859         decode_coords=decode_coords,
    860         drop_variables=drop_variables,
    861         use_cftime=use_cftime,
    862         decode_timedelta=decode_timedelta,
    863     )
    864 return ds

File /opt/anaconda3/envs/kerchunk_test/lib/python3.9/site-packages/xarray/backends/store.py:26, in StoreBackendEntrypoint.open_dataset(self, store, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, use_cftime, decode_timedelta)
     14 def open_dataset(
     15     self,
     16     store,
   (...)
     24     decode_timedelta=None,
     25 ):
---> 26     vars, attrs = store.load()
     27     encoding = store.get_encoding()
     29     vars, attrs, coord_names = conventions.decode_cf_variables(
     30         vars,
     31         attrs,
   (...)
     38         decode_timedelta=decode_timedelta,
     39     )

File /opt/anaconda3/envs/kerchunk_test/lib/python3.9/site-packages/xarray/backends/common.py:128, in AbstractDataStore.load(self)
    106 def load(self):
    107     """
    108     This loads the variables and attributes simultaneously.
    109     A centralized loading function makes it easier to create
   (...)
    125     are requested, so care should be taken to make sure its fast.
    126     """
...
    617         f"number of data dimensions, ndim={self.ndim}"
    618     )
    619 return dims

ValueError: dimensions ('latitude',) must have the same length as the number of data dimensions, ndim=2

I'm currently using fsspec 2022.8.2, kerchunk 0.0.8, s3fs 2022.8.2, cfgrib 0.9.10.2, and xarray 2022.10.0. I think these are all the latest or fairly recent versions.

And here's the content of one of the JSON message files being created:

{
    "version": 1,
    "refs": {
        ".zgroup": "{\n    \"zarr_format\": 2\n}",
        ".zattrs": "{\n    \"centre\": \"kwbc\",\n    \"centreDescription\": \"US National Weather Service - NCEP\",\n    \"edition\": 2,\n    \"subCentre\": 2\n}",
        "2t\/.zarray": "{\n    \"chunks\": [\n        361,\n        720\n    ],\n    \"compressor\": null,\n    \"dtype\": \"<f8\",\n    \"fill_value\": 3.4028234663852886e+38,\n    \"filters\": [\n        {\n            \"dtype\": \"float64\",\n            \"id\": \"grib\",\n            \"var\": \"2t\"\n        }\n    ],\n    \"order\": \"C\",\n    \"shape\": [\n        361,\n        720\n    ],\n    \"zarr_format\": 2\n}",
        "2t\/0.0": [
            "{{u}}",
            11856882,
            242000
        ],
        "2t\/.zattrs": "{\n    \"NV\": 0,\n    \"_ARRAY_DIMENSIONS\": [\n        \"longitude\",\n        \"latitude\"\n    ],\n    \"cfName\": \"air_temperature\",\n    \"cfVarName\": \"t2m\",\n    \"dataDate\": 20221011,\n    \"dataTime\": 0,\n    \"dataType\": \"cf\",\n    \"endStep\": 0,\n    \"gridDefinitionDescription\": \"Latitude\/longitude. Also called equidistant cylindrical, or Plate Carree\",\n    \"gridType\": \"regular_ll\",\n    \"missingValue\": 3.4028234663852886e+38,\n    \"name\": \"2 metre temperature\",\n    \"numberOfPoints\": 259920,\n    \"paramId\": 167,\n    \"shortName\": \"2t\",\n    \"stepType\": \"instant\",\n    \"stepUnits\": 1,\n    \"totalNumber\": 30,\n    \"typeOfLevel\": \"heightAboveGround\",\n    \"units\": \"K\"\n}",
        "heightAboveGround\/.zarray": "{\n    \"chunks\": [\n        1\n    ],\n    \"compressor\": null,\n    \"dtype\": \"<i8\",\n    \"fill_value\": null,\n    \"filters\": null,\n    \"order\": \"C\",\n    \"shape\": [\n        1\n    ],\n    \"zarr_format\": 2\n}",
        "heightAboveGround\/0": "\u0002\u0000\u0000\u0000\u0000\u0000\u0000\u0000",
        "heightAboveGround\/.zattrs": "{\n    \"_ARRAY_DIMENSIONS\": [\n        \"heightAboveGround\"\n    ],\n    \"long_name\": \"height above the surface\",\n    \"positive\": \"up\",\n    \"standard_name\": \"height\",\n    \"units\": \"m\"\n}",
        "latitude\/.zarray": "{\n    \"chunks\": [\n        361,\n        720\n    ],\n    \"compressor\": null,\n    \"dtype\": \"<f8\",\n    \"fill_value\": null,\n    \"filters\": [\n        {\n            \"dtype\": \"float64\",\n            \"id\": \"grib\",\n            \"var\": \"latitude\"\n        }\n    ],\n    \"order\": \"C\",\n    \"shape\": [\n        361,\n        720\n    ],\n    \"zarr_format\": 2\n}",
        "latitude\/0.0": [
            "{{u}}",
            11856882,
            242000
        ],
        "latitude\/.zattrs": "{\n    \"_ARRAY_DIMENSIONS\": [\n        \"latitude\"\n    ],\n    \"long_name\": \"latitude\",\n    \"standard_name\": \"latitude\",\n    \"units\": \"degrees_north\"\n}",
        "longitude\/.zarray": "{\n    \"chunks\": [\n        361,\n        720\n    ],\n    \"compressor\": null,\n    \"dtype\": \"<f8\",\n    \"fill_value\": null,\n    \"filters\": [\n        {\n            \"dtype\": \"float64\",\n            \"id\": \"grib\",\n            \"var\": \"longitude\"\n        }\n    ],\n    \"order\": \"C\",\n    \"shape\": [\n        361,\n        720\n    ],\n    \"zarr_format\": 2\n}",
        "longitude\/0.0": [
            "{{u}}",
            11856882,
            242000
        ],
        "longitude\/.zattrs": "{\n    \"_ARRAY_DIMENSIONS\": [\n        \"longitude\"\n    ],\n    \"long_name\": \"longitude\",\n    \"standard_name\": \"longitude\",\n    \"units\": \"degrees_east\"\n}",
        "number\/.zarray": "{\n    \"chunks\": [\n        1\n    ],\n    \"compressor\": null,\n    \"dtype\": \"<i8\",\n    \"fill_value\": null,\n    \"filters\": null,\n    \"order\": \"C\",\n    \"shape\": [\n        1\n    ],\n    \"zarr_format\": 2\n}",
        "number\/0": "\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000",
        "number\/.zattrs": "{\n    \"_ARRAY_DIMENSIONS\": [\n        \"number\"\n    ],\n    \"long_name\": \"ensemble member numerical id\",\n    \"standard_name\": \"realization\",\n    \"units\": \"1\"\n}",
        "step\/.zarray": "{\n    \"chunks\": [\n        1\n    ],\n    \"compressor\": null,\n    \"dtype\": \"<f8\",\n    \"fill_value\": null,\n    \"filters\": null,\n    \"order\": \"C\",\n    \"shape\": [\n        1\n    ],\n    \"zarr_format\": 2\n}",
        "step\/0": "\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000",
        "step\/.zattrs": "{\n    \"_ARRAY_DIMENSIONS\": [\n        \"step\"\n    ],\n    \"long_name\": \"time since forecast_reference_time\",\n    \"standard_name\": \"forecast_period\",\n    \"units\": \"hours\"\n}",
        "time\/.zarray": "{\n    \"chunks\": [\n        1\n    ],\n    \"compressor\": null,\n    \"dtype\": \"<i8\",\n    \"fill_value\": null,\n    \"filters\": null,\n    \"order\": \"C\",\n    \"shape\": [\n        1\n    ],\n    \"zarr_format\": 2\n}",
        "time\/0": "base64:ALJEYwAAAAA=",
        "time\/.zattrs": "{\n    \"_ARRAY_DIMENSIONS\": [\n        \"time\"\n    ],\n    \"calendar\": \"proleptic_gregorian\",\n    \"long_name\": \"initial time of forecast\",\n    \"standard_name\": \"forecast_reference_time\",\n    \"units\": \"seconds since 1970-01-01T00:00:00\"\n}",
        "valid_time\/.zarray": "{\n    \"chunks\": [\n        1\n    ],\n    \"compressor\": null,\n    \"dtype\": \"<i8\",\n    \"fill_value\": null,\n    \"filters\": null,\n    \"order\": \"C\",\n    \"shape\": [\n        1\n    ],\n    \"zarr_format\": 2\n}",
        "valid_time\/0": "base64:ALJEYwAAAAA=",
        "valid_time\/.zattrs": "{\n    \"_ARRAY_DIMENSIONS\": [\n        \"valid_time\"\n    ],\n    \"calendar\": \"proleptic_gregorian\",\n    \"long_name\": \"time\",\n    \"standard_name\": \"time\",\n    \"units\": \"seconds since 1970-01-01T00:00:00\"\n}"
    },
    "templates": {
        "u": "s3:\/\/noaa-gefs-pds\/gefs.20221011\/00\/atmos\/pgrb2ap5\/gec00.t00z.pgrb2a.0p50.f000"
    }
}

I'm having some difficulty interpreting the error. The underlying data files clearly have two dimensions (latitude, longitude), which I've confirmed by opening the GRIB files directly. xarray picks up on this (it says ndim=2), but for some reason only one of the dimensions is being passed to something that needs both. I don't have much experience interpreting these reference JSONs, but one thing does stand out: the latitude and longitude entries declare shape [361, 720] in their .zarray while their _ARRAY_DIMENSIONS list only a single dimension, which seems consistent with the error. I'm not aware of any args/kwargs to scan_grib or open_mfdataset that would address the underlying issue.
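
Here's a quick check over one of the reference JSONs (the file name follows from my make_json_name above, so adjust as needed); it flags any variable whose .zarray shape rank doesn't match the length of its _ARRAY_DIMENSIONS:

import json

with open(f"{json_dir}/gec00.t00z.pgrb2a.0p50.f000_message0.json") as f:
    refs = json.load(f)["refs"]

# Compare each variable's declared shape rank to its _ARRAY_DIMENSIONS.
for key, zarray in refs.items():
    if not key.endswith("/.zarray"):
        continue
    var = key[:-len("/.zarray")]
    shape = json.loads(zarray)["shape"]
    dims = json.loads(refs[var + "/.zattrs"])["_ARRAY_DIMENSIONS"]
    if len(shape) != len(dims):
        print(f"{var}: shape {shape} but dims {dims}")

For the message above, this should flag both latitude and longitude.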

I also tried to open the files via MultiZarrToZarr and xr.open_dataset as in the HRRR tutorial but got the same error.
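
For completeness, that attempt looked roughly like this (a sketch adapted from the tutorial, so the exact kwargs may differ slightly):

mzz = MultiZarrToZarr(
    json_text,  # the per-message reference dicts collected above
    remote_protocol='s3',
    remote_options={'anon': True},
    concat_dims=['valid_time'],
)
combined = mzz.translate()

ds = xr.open_dataset(
    'reference://',
    engine='zarr',
    backend_kwargs={
        'consolidated': False,
        'storage_options': {
            'fo': combined,
            'remote_protocol': 's3',
            'remote_options': {'anon': True},
        },
    },
)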

Is this a problem that can be fixed easily? Or is there some kind of underlying incompatibility with GEFS grib files at the moment? And if the current version of scan_grib won't work, do you think going the cogrib route has a chance of solving the problem?

Thanks in advance for your help!

Lucien

@martindurant
Member

It seems that cfgrib and eccodes treat 1D lat/lon arrays differently... Please try the linked PR.
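
The proper fix is in the PR, but as an illustration of the mismatch, an untested workaround on older kerchunk versions might be to patch the references by hand so the coordinates really are 1D. Here refs is the "refs" dict from the message JSON, and the values assume the regular 0.5-degree GEFS grid shown above (361 lats, 720 lons):

import base64
import numpy as np
import ujson

# Hypothetical sketch: replace the 2D-referenced lat/lon entries with
# inline 1D arrays for the known regular 0.5-degree grid.
lat = np.linspace(90, -90, 361)
lon = np.arange(0, 360, 0.5)

for name, values in [('latitude', lat), ('longitude', lon)]:
    n = int(values.size)
    refs[f'{name}/.zarray'] = ujson.dumps({
        'chunks': [n], 'compressor': None, 'dtype': '<f8',
        'fill_value': None, 'filters': None, 'order': 'C',
        'shape': [n], 'zarr_format': 2,
    })
    # Inline the coordinate values directly in the reference set.
    refs[f'{name}/0'] = 'base64:' + base64.b64encode(values.tobytes()).decode()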

@lucien-sim
Author

That worked! The data loads now. Thank you!
