Problem using scan_grib with GEFS output #238

Closed
lucien-sim opened this issue Oct 19, 2022 · 2 comments · Fixed by #239

Comments

@lucien-sim

Thank you for all the work you've done on Kerchunk! The prospect of accessing data spread across tons of tiny model output files in a cloud-performant manner is really exciting!

I'm currently running into some difficulties using Kerchunk to access GEFS output from AWS (https://registry.opendata.aws/noaa-gefs/). I'm following what I think is the "standard" pattern for accessing GRIB files (e.g. https://nbviewer.org/gist/peterm790/92eb1df3d58ba41d3411f8a840be2452, #150) but am getting a confusing error related to coordinates when I try to open the files with xarray. I've gathered that Kerchunk doesn't always play well with GEFS/GFS GRIB files and am happy to experiment with cogrib, but thought I'd post the issue in case there's an easy fix.

Here's my code:

import xarray as xr
import ujson
from kerchunk.grib2 import scan_grib
from kerchunk.combine import MultiZarrToZarr
import fsspec
import dask
import dask.distributed  # LocalCluster/Client live in the distributed subpackage

# Set details for GEFS files, variables. 
# For simplicity, only want first 6 hours for a single variable and ensemble
# member (control) in a single run. 
run_date = '20221011'
cycle = '00'
vbl_collection = 'pgrb2ap5'
type_of_level = 'heightAboveGround'
level = 2
grib_name = 't2m'
fhr_range = (0, 6)
json_dir = '/Users/lucien/Documents/projects/kerchunk'

s3_so = {
    'anon': True, 
    'skip_instance_cache': True
}

afilter = {
    'typeOfLevel': type_of_level, 
    'level': level, 
    'cfVarName': grib_name
}

s3_fs = fsspec.filesystem('s3', **s3_so)
local_fs = fsspec.filesystem('file', auto_mkdir=True)

# Get list of remote files. 
files = [
    f"s3://{f}"
    for f in s3_fs.glob(f"s3://noaa-gefs-pds/gefs.{run_date}/{cycle}/atmos/{vbl_collection}/*")
    if f[-4:] != '.idx'
    and '/gec00.' in f
    and fhr_range[0] <= int(f[-3:]) <= fhr_range[1]
]

print(files)

def make_json_name(file_url: str, message_number: int) -> str: 
    # Use the remote file's basename; avoid shadowing the global grib_name.
    file_name = file_url.split('/')[-1]
    return f"{json_dir}/{file_name}_message{message_number}.json"


def gen_json(file_url: str) -> None:
    out = scan_grib(file_url, storage_options=s3_so, filter=afilter) 
    for i, message in enumerate(out): 
        out_file_name = make_json_name(file_url, i)
        with local_fs.open(out_file_name, 'w') as f: 
            f.write(ujson.dumps(message))

# Create a fresh local directory for JSON output. 
if local_fs.exists(json_dir):
    local_fs.rm(json_dir, recursive=True)
local_fs.mkdir(json_dir)

# Write JSONs in parallel
with dask.distributed.LocalCluster(n_workers=3, processes=True) as cluster, \
        dask.distributed.Client(cluster) as client: 
    print('JSON generation running on dask server:', client.dashboard_link)
    jobs = [
        dask.delayed(gen_json)(f)
        for f in files[:2]  # Testing with first two files for now. 
    ]
    _ = dask.compute(*jobs, retries=10)

messages = local_fs.ls(json_dir)
m_list = list()
json_text = list()
for m in messages: 
    with local_fs.open(m, 'r') as f: 
        txt = ujson.load(f)
        json_text.append(txt)
        m_list.append(
            fsspec.get_mapper(
                'reference://', 
                fo=txt, 
                remote_protocol='s3', 
                remote_options={'anon': True}
            )
        )

ds = xr.open_mfdataset(
    m_list, 
    engine='zarr', 
    combine='nested', 
    concat_dim='valid_time', 
    coords='minimal', 
    data_vars='minimal', 
    compat='override'
)

And here's the error I'm getting. It's being raised on the final xr.open_mfdataset line.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/Users/lucien/Documents/projects/kerchunk_test/gefs_test.ipynb Cell 2 in <cell line: 88>()
     [78](vscode-notebook-cell:/Users/lucien/Documents/projects/kerchunk_test/gefs_test.ipynb#W1sZmlsZQ%3D%3D?line=77)         json_text.append(txt)
     [79](vscode-notebook-cell:/Users/lucien/Documents/projects/kerchunk_test/gefs_test.ipynb#W1sZmlsZQ%3D%3D?line=78)         m_list.append(
     [80](vscode-notebook-cell:/Users/lucien/Documents/projects/kerchunk_test/gefs_test.ipynb#W1sZmlsZQ%3D%3D?line=79)             fsspec.get_mapper(
     [81](vscode-notebook-cell:/Users/lucien/Documents/projects/kerchunk_test/gefs_test.ipynb#W1sZmlsZQ%3D%3D?line=80)                 'reference://', 
   (...)
     [85](vscode-notebook-cell:/Users/lucien/Documents/projects/kerchunk_test/gefs_test.ipynb#W1sZmlsZQ%3D%3D?line=84)             )
     [86](vscode-notebook-cell:/Users/lucien/Documents/projects/kerchunk_test/gefs_test.ipynb#W1sZmlsZQ%3D%3D?line=85)         )
---> [88](vscode-notebook-cell:/Users/lucien/Documents/projects/kerchunk_test/gefs_test.ipynb#W1sZmlsZQ%3D%3D?line=87) ds = xr.open_mfdataset(
     [89](vscode-notebook-cell:/Users/lucien/Documents/projects/kerchunk_test/gefs_test.ipynb#W1sZmlsZQ%3D%3D?line=88)     m_list, 
     [90](vscode-notebook-cell:/Users/lucien/Documents/projects/kerchunk_test/gefs_test.ipynb#W1sZmlsZQ%3D%3D?line=89)     engine='zarr', 
     [91](vscode-notebook-cell:/Users/lucien/Documents/projects/kerchunk_test/gefs_test.ipynb#W1sZmlsZQ%3D%3D?line=90)     combine='nested', 
     [92](vscode-notebook-cell:/Users/lucien/Documents/projects/kerchunk_test/gefs_test.ipynb#W1sZmlsZQ%3D%3D?line=91)     concat_dim='valid_time', 
     [93](vscode-notebook-cell:/Users/lucien/Documents/projects/kerchunk_test/gefs_test.ipynb#W1sZmlsZQ%3D%3D?line=92)     coords='minimal', 
     [94](vscode-notebook-cell:/Users/lucien/Documents/projects/kerchunk_test/gefs_test.ipynb#W1sZmlsZQ%3D%3D?line=93)     data_vars='minimal', 
     [95](vscode-notebook-cell:/Users/lucien/Documents/projects/kerchunk_test/gefs_test.ipynb#W1sZmlsZQ%3D%3D?line=94)     compat='override'
     [96](vscode-notebook-cell:/Users/lucien/Documents/projects/kerchunk_test/gefs_test.ipynb#W1sZmlsZQ%3D%3D?line=95) )

File /opt/anaconda3/envs/kerchunk_test/lib/python3.9/site-packages/xarray/backends/api.py:996, in open_mfdataset(paths, chunks, concat_dim, compat, preprocess, engine, data_vars, coords, combine, parallel, join, attrs_file, combine_attrs, **kwargs)
    993     open_ = open_dataset
    994     getattr_ = getattr
--> 996 datasets = [open_(p, **open_kwargs) for p in paths]
    997 closers = [getattr_(ds, "_close") for ds in datasets]
    998 if preprocess is not None:

File /opt/anaconda3/envs/kerchunk_test/lib/python3.9/site-packages/xarray/backends/api.py:996, in <listcomp>(.0)
    993     open_ = open_dataset
    994     getattr_ = getattr
--> 996 datasets = [open_(p, **open_kwargs) for p in paths]
    997 closers = [getattr_(ds, "_close") for ds in datasets]
    998 if preprocess is not None:

File /opt/anaconda3/envs/kerchunk_test/lib/python3.9/site-packages/xarray/backends/api.py:539, in open_dataset(filename_or_obj, engine, chunks, cache, decode_cf, mask_and_scale, decode_times, decode_timedelta, use_cftime, concat_characters, decode_coords, drop_variables, inline_array, backend_kwargs, **kwargs)
    527 decoders = _resolve_decoders_kwargs(
    528     decode_cf,
    529     open_backend_dataset_parameters=backend.open_dataset_parameters,
   (...)
    535     decode_coords=decode_coords,
    536 )
    538 overwrite_encoded_chunks = kwargs.pop("overwrite_encoded_chunks", None)
--> 539 backend_ds = backend.open_dataset(
    540     filename_or_obj,
    541     drop_variables=drop_variables,
    542     **decoders,
    543     **kwargs,
    544 )
    545 ds = _dataset_from_backend_dataset(
    546     backend_ds,
    547     filename_or_obj,
   (...)
    555     **kwargs,
    556 )
    557 return ds

File /opt/anaconda3/envs/kerchunk_test/lib/python3.9/site-packages/xarray/backends/zarr.py:854, in ZarrBackendEntrypoint.open_dataset(self, filename_or_obj, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, use_cftime, decode_timedelta, group, mode, synchronizer, consolidated, chunk_store, storage_options, stacklevel)
    852 store_entrypoint = StoreBackendEntrypoint()
    853 with close_on_error(store):
--> 854     ds = store_entrypoint.open_dataset(
    855         store,
    856         mask_and_scale=mask_and_scale,
    857         decode_times=decode_times,
    858         concat_characters=concat_characters,
    859         decode_coords=decode_coords,
    860         drop_variables=drop_variables,
    861         use_cftime=use_cftime,
    862         decode_timedelta=decode_timedelta,
    863     )
    864 return ds

File /opt/anaconda3/envs/kerchunk_test/lib/python3.9/site-packages/xarray/backends/store.py:26, in StoreBackendEntrypoint.open_dataset(self, store, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, use_cftime, decode_timedelta)
     14 def open_dataset(
     15     self,
     16     store,
   (...)
     24     decode_timedelta=None,
     25 ):
---> 26     vars, attrs = store.load()
     27     encoding = store.get_encoding()
     29     vars, attrs, coord_names = conventions.decode_cf_variables(
     30         vars,
     31         attrs,
   (...)
     38         decode_timedelta=decode_timedelta,
     39     )

File /opt/anaconda3/envs/kerchunk_test/lib/python3.9/site-packages/xarray/backends/common.py:128, in AbstractDataStore.load(self)
    106 def load(self):
    107     """
    108     This loads the variables and attributes simultaneously.
    109     A centralized loading function makes it easier to create
   (...)
    125     are requested, so care should be taken to make sure its fast.
    126     """
...
    617         f"number of data dimensions, ndim={self.ndim}"
    618     )
    619 return dims

ValueError: dimensions ('latitude',) must have the same length as the number of data dimensions, ndim=2

I'm currently using fsspec 2022.8.2, kerchunk 0.0.8, s3fs 2022.8.2, cfgrib 0.9.10.2, and xarray 2022.10.0. I think these are all the latest or fairly recent versions.

And here's the content of one of the JSON message files being created:

{
    "version": 1,
    "refs": {
        ".zgroup": "{\n    \"zarr_format\": 2\n}",
        ".zattrs": "{\n    \"centre\": \"kwbc\",\n    \"centreDescription\": \"US National Weather Service - NCEP\",\n    \"edition\": 2,\n    \"subCentre\": 2\n}",
        "2t\/.zarray": "{\n    \"chunks\": [\n        361,\n        720\n    ],\n    \"compressor\": null,\n    \"dtype\": \"<f8\",\n    \"fill_value\": 3.4028234663852886e+38,\n    \"filters\": [\n        {\n            \"dtype\": \"float64\",\n            \"id\": \"grib\",\n            \"var\": \"2t\"\n        }\n    ],\n    \"order\": \"C\",\n    \"shape\": [\n        361,\n        720\n    ],\n    \"zarr_format\": 2\n}",
        "2t\/0.0": [
            "{{u}}",
            11856882,
            242000
        ],
        "2t\/.zattrs": "{\n    \"NV\": 0,\n    \"_ARRAY_DIMENSIONS\": [\n        \"longitude\",\n        \"latitude\"\n    ],\n    \"cfName\": \"air_temperature\",\n    \"cfVarName\": \"t2m\",\n    \"dataDate\": 20221011,\n    \"dataTime\": 0,\n    \"dataType\": \"cf\",\n    \"endStep\": 0,\n    \"gridDefinitionDescription\": \"Latitude\/longitude. Also called equidistant cylindrical, or Plate Carree\",\n    \"gridType\": \"regular_ll\",\n    \"missingValue\": 3.4028234663852886e+38,\n    \"name\": \"2 metre temperature\",\n    \"numberOfPoints\": 259920,\n    \"paramId\": 167,\n    \"shortName\": \"2t\",\n    \"stepType\": \"instant\",\n    \"stepUnits\": 1,\n    \"totalNumber\": 30,\n    \"typeOfLevel\": \"heightAboveGround\",\n    \"units\": \"K\"\n}",
        "heightAboveGround\/.zarray": "{\n    \"chunks\": [\n        1\n    ],\n    \"compressor\": null,\n    \"dtype\": \"<i8\",\n    \"fill_value\": null,\n    \"filters\": null,\n    \"order\": \"C\",\n    \"shape\": [\n        1\n    ],\n    \"zarr_format\": 2\n}",
        "heightAboveGround\/0": "\u0002\u0000\u0000\u0000\u0000\u0000\u0000\u0000",
        "heightAboveGround\/.zattrs": "{\n    \"_ARRAY_DIMENSIONS\": [\n        \"heightAboveGround\"\n    ],\n    \"long_name\": \"height above the surface\",\n    \"positive\": \"up\",\n    \"standard_name\": \"height\",\n    \"units\": \"m\"\n}",
        "latitude\/.zarray": "{\n    \"chunks\": [\n        361,\n        720\n    ],\n    \"compressor\": null,\n    \"dtype\": \"<f8\",\n    \"fill_value\": null,\n    \"filters\": [\n        {\n            \"dtype\": \"float64\",\n            \"id\": \"grib\",\n            \"var\": \"latitude\"\n        }\n    ],\n    \"order\": \"C\",\n    \"shape\": [\n        361,\n        720\n    ],\n    \"zarr_format\": 2\n}",
        "latitude\/0.0": [
            "{{u}}",
            11856882,
            242000
        ],
        "latitude\/.zattrs": "{\n    \"_ARRAY_DIMENSIONS\": [\n        \"latitude\"\n    ],\n    \"long_name\": \"latitude\",\n    \"standard_name\": \"latitude\",\n    \"units\": \"degrees_north\"\n}",
        "longitude\/.zarray": "{\n    \"chunks\": [\n        361,\n        720\n    ],\n    \"compressor\": null,\n    \"dtype\": \"<f8\",\n    \"fill_value\": null,\n    \"filters\": [\n        {\n            \"dtype\": \"float64\",\n            \"id\": \"grib\",\n            \"var\": \"longitude\"\n        }\n    ],\n    \"order\": \"C\",\n    \"shape\": [\n        361,\n        720\n    ],\n    \"zarr_format\": 2\n}",
        "longitude\/0.0": [
            "{{u}}",
            11856882,
            242000
        ],
        "longitude\/.zattrs": "{\n    \"_ARRAY_DIMENSIONS\": [\n        \"longitude\"\n    ],\n    \"long_name\": \"longitude\",\n    \"standard_name\": \"longitude\",\n    \"units\": \"degrees_east\"\n}",
        "number\/.zarray": "{\n    \"chunks\": [\n        1\n    ],\n    \"compressor\": null,\n    \"dtype\": \"<i8\",\n    \"fill_value\": null,\n    \"filters\": null,\n    \"order\": \"C\",\n    \"shape\": [\n        1\n    ],\n    \"zarr_format\": 2\n}",
        "number\/0": "\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000",
        "number\/.zattrs": "{\n    \"_ARRAY_DIMENSIONS\": [\n        \"number\"\n    ],\n    \"long_name\": \"ensemble member numerical id\",\n    \"standard_name\": \"realization\",\n    \"units\": \"1\"\n}",
        "step\/.zarray": "{\n    \"chunks\": [\n        1\n    ],\n    \"compressor\": null,\n    \"dtype\": \"<f8\",\n    \"fill_value\": null,\n    \"filters\": null,\n    \"order\": \"C\",\n    \"shape\": [\n        1\n    ],\n    \"zarr_format\": 2\n}",
        "step\/0": "\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000",
        "step\/.zattrs": "{\n    \"_ARRAY_DIMENSIONS\": [\n        \"step\"\n    ],\n    \"long_name\": \"time since forecast_reference_time\",\n    \"standard_name\": \"forecast_period\",\n    \"units\": \"hours\"\n}",
        "time\/.zarray": "{\n    \"chunks\": [\n        1\n    ],\n    \"compressor\": null,\n    \"dtype\": \"<i8\",\n    \"fill_value\": null,\n    \"filters\": null,\n    \"order\": \"C\",\n    \"shape\": [\n        1\n    ],\n    \"zarr_format\": 2\n}",
        "time\/0": "base64:ALJEYwAAAAA=",
        "time\/.zattrs": "{\n    \"_ARRAY_DIMENSIONS\": [\n        \"time\"\n    ],\n    \"calendar\": \"proleptic_gregorian\",\n    \"long_name\": \"initial time of forecast\",\n    \"standard_name\": \"forecast_reference_time\",\n    \"units\": \"seconds since 1970-01-01T00:00:00\"\n}",
        "valid_time\/.zarray": "{\n    \"chunks\": [\n        1\n    ],\n    \"compressor\": null,\n    \"dtype\": \"<i8\",\n    \"fill_value\": null,\n    \"filters\": null,\n    \"order\": \"C\",\n    \"shape\": [\n        1\n    ],\n    \"zarr_format\": 2\n}",
        "valid_time\/0": "base64:ALJEYwAAAAA=",
        "valid_time\/.zattrs": "{\n    \"_ARRAY_DIMENSIONS\": [\n        \"valid_time\"\n    ],\n    \"calendar\": \"proleptic_gregorian\",\n    \"long_name\": \"time\",\n    \"standard_name\": \"time\",\n    \"units\": \"seconds since 1970-01-01T00:00:00\"\n}"
    },
    "templates": {
        "u": "s3:\/\/noaa-gefs-pds\/gefs.20221011\/00\/atmos\/pgrb2ap5\/gec00.t00z.pgrb2a.0p50.f000"
    }
}

I'm having some difficulty interpreting the error. The underlying data files clearly have two dimensions (latitude, longitude), which I've confirmed by opening the GRIB files directly. xarray picks up on this (it says ndim=2), but for some reason only one of the dimensions is being passed to something that needs both. I don't have much experience interpreting these reference JSONs, but one thing does stand out: the latitude and longitude entries declare shape [361, 720] in their .zarray while their _ARRAY_DIMENSIONS list only a single dimension, which seems consistent with the error. I'm not aware of any args/kwargs to scan_grib or open_mfdataset that would address the underlying issue.
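
Here's a quick check over one of the reference JSONs (the file name follows from my make_json_name above, so adjust as needed); it flags any variable whose .zarray shape rank doesn't match the length of its _ARRAY_DIMENSIONS:

import json

with open(f"{json_dir}/gec00.t00z.pgrb2a.0p50.f000_message0.json") as f:
    refs = json.load(f)["refs"]

# Compare each variable's declared shape rank to its _ARRAY_DIMENSIONS.
for key, zarray in refs.items():
    if not key.endswith("/.zarray"):
        continue
    var = key[:-len("/.zarray")]
    shape = json.loads(zarray)["shape"]
    dims = json.loads(refs[var + "/.zattrs"])["_ARRAY_DIMENSIONS"]
    if len(shape) != len(dims):
        print(f"{var}: shape {shape} but dims {dims}")

For the message above, this should flag both latitude and longitude.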

I also tried to open the files via MultiZarrToZarr and xr.open_dataset as in the HRRR tutorial but got the same error.
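
For completeness, that attempt looked roughly like this (a sketch adapted from the tutorial, so the exact kwargs may differ slightly):

mzz = MultiZarrToZarr(
    json_text,  # the per-message reference dicts collected above
    remote_protocol='s3',
    remote_options={'anon': True},
    concat_dims=['valid_time'],
)
combined = mzz.translate()

ds = xr.open_dataset(
    'reference://',
    engine='zarr',
    backend_kwargs={
        'consolidated': False,
        'storage_options': {
            'fo': combined,
            'remote_protocol': 's3',
            'remote_options': {'anon': True},
        },
    },
)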

Is this a problem that can be fixed easily? Or is there some kind of underlying incompatibility with GEFS grib files at the moment? And if the current version of scan_grib won't work, do you think going the cogrib route has a chance of solving the problem?

Thanks in advance for your help!

Lucien

@martindurant
Member

It seems that cfgrib and eccodes treat 1D lat/lon arrays differently... Please try the linked PR.
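
The proper fix is in the PR, but as an illustration of the mismatch, an untested workaround on older kerchunk versions might be to patch the references by hand so the coordinates really are 1D. Here refs is the "refs" dict from the message JSON, and the values assume the regular 0.5-degree GEFS grid shown above (361 lats, 720 lons):

import base64
import numpy as np
import ujson

# Hypothetical sketch: replace the 2D-referenced lat/lon entries with
# inline 1D arrays for the known regular 0.5-degree grid.
lat = np.linspace(90, -90, 361)
lon = np.arange(0, 360, 0.5)

for name, values in [('latitude', lat), ('longitude', lon)]:
    n = int(values.size)
    refs[f'{name}/.zarray'] = ujson.dumps({
        'chunks': [n], 'compressor': None, 'dtype': '<f8',
        'fill_value': None, 'filters': None, 'order': 'C',
        'shape': [n], 'zarr_format': 2,
    })
    # Inline the coordinate values directly in the reference set.
    refs[f'{name}/0'] = 'base64:' + base64.b64encode(values.tobytes()).decode()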

@lucien-sim
Author

That worked! The data loads now. Thank you!
