Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: Getting data CF-compliant for regridding (with xesmf) #718

Open
pochedls opened this issue Dec 20, 2024 · 8 comments
Open

[Bug]: Getting data CF-compliant for regridding (with xesmf) #718

pochedls opened this issue Dec 20, 2024 · 8 comments
Assignees
Labels
type: bug Inconsistencies or issues which will cause an issue or problem for users or implementors.

Comments

@pochedls
Copy link
Collaborator

pochedls commented Dec 20, 2024

What happened?

In regridding some ocean data, many (not all) datasets are returning:

ValueError: dataset must include lon/lat or be CF-compliant

I'm having trouble getting the datasets to be CF-compliant on xcdat 0.6.1 / 0.7.1 / 0.7.3. I think xcdat used to better handle ocean grids (though I'm not sure when this might have changed).

What did you expect to happen? Are there are possible answers you came across?

Ideally xcdat could figure out the input grid automatically. If not, this issue is still useful for sorting out what we need to specify in the metadata to make the dataset CF-compliant for xesmf.

Minimal Complete Verifiable Example (MVCE)

# imports
import xcdat as xc
import numpy as np

# target grid
nlat = xc.create_axis('lat', np.arange(-88.5, 90, 2.5))
nlon = xc.create_axis('lon', np.arange(1.25, 360, 2.5))
ngrid = xc.create_grid(x=nlon, y=nlat)

# open / regrid
p = '/p/css03/esgf_publish/CMIP6/ScenarioMIP/NCAR/CESM2-WACCM/ssp585/r1i1p1f1/Omon/fgco2/gn/v20200702/'
ds = xc.open_mfdataset(p)
ds = ds.regridder.horizontal('fgco2', ngrid, tool='xesmf', method='conservative_normed', periodic=True)

Relevant log output

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/bin/anaconda3/envs/xcdat/lib/python3.12/site-packages/xesmf/frontend.py:75, in _get_lon_lat(ds)
     74 try:
---> 75     lon = ds.cf['longitude']
     76     lat = ds.cf['latitude']

File ~/bin/anaconda3/envs/xcdat/lib/python3.12/site-packages/cf_xarray/accessor.py:2343, in CFDatasetAccessor.__getitem__(self, key)
   2312 """
   2313 Index into a Dataset making use of CF attributes.
   2314
   (...)
   2341 Add additional keys by specifying "custom criteria". See :ref:`custom_criteria` for more.
   2342 """
-> 2343 return _getitem(self, key)

File ~/bin/anaconda3/envs/xcdat/lib/python3.12/site-packages/cf_xarray/accessor.py:936, in _getitem(accessor, key, skip)
    935 except KeyError:
--> 936     raise KeyError(
    937         f"{kind}.cf does not understand the key {k!r}. "
    938         f"Use 'repr({kind}.cf)' (or '{kind}.cf' in a Jupyter environment) to see a list of key names that can be interpreted."
    939     ) from None

KeyError: "Dataset.cf does not understand the key 'longitude'. Use 'repr(Dataset.cf)' (or 'Dataset.cf' in a Jupyter environment) to see a list of key names that can be interpreted."

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
Cell In[1], line 13
     11 p = '/p/css03/esgf_publish/CMIP6/ScenarioMIP/NCAR/CESM2-WACCM/ssp585/r1i1p1f1/Omon/fgco2/gn/v20200702/'
     12 ds = xc.open_mfdataset(p)
---> 13 ds = ds.regridder.horizontal('fgco2', ngrid, tool='xesmf', method='conservative_normed', periodic=True)

File ~/bin/anaconda3/envs/xcdat/lib/python3.12/site-packages/xcdat/regridder/accessor.py:205, in RegridderAccessor.horizontal(self, data_var, output_grid, tool, **options)
    203 input_grid = _get_input_grid(self._ds, data_var, ["X", "Y"])
    204 regridder = regrid_tool(input_grid, output_grid, **options)
--> 205 output_ds = regridder.horizontal(data_var, self._ds)
    207 return output_ds

File ~/bin/anaconda3/envs/xcdat/lib/python3.12/site-packages/xcdat/regridder/xesmf.py:158, in XESMFRegridder.horizontal(self, data_var, ds)
    153 if input_da is None:
    154     raise KeyError(
    155         f"The data variable '{data_var}' does not exist in the dataset."
    156     )
--> 158 regridder = xe.Regridder(
    159     self._input_grid,
    160     self._output_grid,
    161     method=self._method,
    162     **self._extra_options,
    163 )
    165 output_da = regridder(input_da, keep_attrs=True)
    167 output_ds = xr.Dataset({data_var: output_da}, attrs=ds.attrs)

File ~/bin/anaconda3/envs/xcdat/lib/python3.12/site-packages/xesmf/frontend.py:919, in Regridder.__init__(self, ds_in, ds_out, method, locstream_in, locstream_out, periodic, parallel, **kwargs)
    917     grid_in, shape_in, input_dims = ds_to_ESMFlocstream(ds_in)
    918 else:
--> 919     grid_in, shape_in, input_dims = ds_to_ESMFgrid(
    920         ds_in, need_bounds=need_bounds, periodic=periodic
    921     )
    922 if locstream_out:
    923     grid_out, shape_out, output_dims = ds_to_ESMFlocstream(ds_out)

File ~/bin/anaconda3/envs/xcdat/lib/python3.12/site-packages/xesmf/frontend.py:145, in ds_to_ESMFgrid(ds, need_bounds, periodic, append)
    115 """
    116 Convert xarray DataSet or dictionary to ESMF.Grid object.
    117
   (...)
    141
    142 """
    143 # use np.asarray(dr) instead of dr.values, so it also works for dictionary
--> 145 lon, lat = _get_lon_lat(ds)
    146 if hasattr(lon, 'dims'):
    147     if lon.ndim == 1:

File ~/bin/anaconda3/envs/xcdat/lib/python3.12/site-packages/xesmf/frontend.py:80, in _get_lon_lat(ds)
     76     lat = ds.cf['latitude']
     77 except (KeyError, AttributeError, ValueError):
     78     # KeyError if cfxr doesn't detect the coords
     79     # AttributeError if ds is a dict
---> 80     raise ValueError('dataset must include lon/lat or be CF-compliant')
     82 return lon, lat

ValueError: dataset must include lon/lat or be CF-compliant

Anything else we need to know?

Note the same code works with p = '/p/css03/esgf_publish/CMIP6/ScenarioMIP/BCC/BCC-CSM2-MR/ssp585/r1i1p1f1/Omon/fgco2/gn/v20190319/' (though I now think this may be because the grid is rectilinear).

Environment

xcdat 0.7.3
xesmf 0.8.8

@pochedls pochedls added the type: bug Inconsistencies or issues which will cause an issue or problem for users or implementors. label Dec 20, 2024
@pochedls
Copy link
Collaborator Author

I see in xesmf docs that:

periodic : bool, optional
Periodic in longitude? Default to False.
Only useful for global grids with non-conservative regridding.
Will be forced to False for conservative regridding.

Updating this to False doesn't fix the issue.

@pochedls
Copy link
Collaborator Author

The following seems to be working as a work-around:

# imports
import xesmf as xe
import xcdat as xc
import numpy as np

# target grid
nlat = xc.create_axis('lat', np.arange(-88.5, 90, 2.5))
nlon = xc.create_axis('lon', np.arange(1.25, 360, 2.5))
ngrid = xc.create_grid(x=nlon, y=nlat)

# open / regrid
p = '/p/css03/esgf_publish/CMIP6/ScenarioMIP/NCAR/CESM2-WACCM/ssp585/r1i1p1f1/Omon/fgco2/gn/v20200702/'
ds = xc.open_mfdataset(p)

# regrid with xesmf directly
regridder = xe.Regridder(ds, ngrid, "conservative_normed", ignore_degenerate=True)
ds = regridder(ds)

# regenerate bounds
for b in ['bounds_lat', 'bounds_lon', 'lat_bnds', 'lon_bnds']:
        if b in ds.data_vars:
                ds = ds.drop_vars(b)
ds = ds.bounds.add_missing_bounds()

I think this might indicate that something changed with how xcdat passes information to xesmf (or xesmf changed).

@tomvothecoder
Copy link
Collaborator

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/bin/anaconda3/envs/xcdat/lib/python3.12/site-packages/xesmf/frontend.py:75, in _get_lon_lat(ds)
     74 try:
---> 75     lon = ds.cf['longitude']
     76     lat = ds.cf['latitude']

I've isolated the code block in xESMF where the KeyError is being raised and found that it has not changed in 5 years.

https://github.com/pangeo-data/xESMF/blob/feebcfc11147eb00e3256924f0fded1485b7df60/xesmf/frontend.py#L68-L82

Also your workaround above might indicate that xCDAT is dropping or not copying over attributes to the lat/lon (which is a regression in behavior from previous versions I think?).

@tomvothecoder
Copy link
Collaborator

tomvothecoder commented Jan 7, 2025

Looking at the CF info via cf_xarray, both ngrid and ds have CF compliant information before calling the xCDAT regridder.

ngrid.cf

Coordinates:
             CF Axes: * X: ['lon']
                      * Y: ['lat']
                        Z, T: n/a

      CF Coordinates: * longitude: ['lon']
                      * latitude: ['lat']
                        vertical, time: n/a

       Cell Measures:   area, volume: n/a

      Standard Names:   n/a

              Bounds:   n/a

       Grid Mappings:   n/a

Data Variables:
       Cell Measures:   area, volume: n/a

      Standard Names:   n/a

              Bounds:   X: ['lon_bnds']
                        Y: ['lat_bnds']
                        lat: ['lat_bnds']
                        latitude: ['lat_bnds']
                        lon: ['lon_bnds']
                        longitude: ['lon_bnds']

       Grid Mappings:   n/a
ds.cf

Coordinates:
             CF Axes:   X: ['lon', 'nlon']
                        Y: ['lat', 'nlat']
                      * T: ['time']
                        Z: n/a

      CF Coordinates:   longitude: ['lon']
                        latitude: ['lat']
                      * time: ['time']
                        vertical: n/a

       Cell Measures:   area, volume: n/a

      Standard Names:   latitude: ['lat']
                        longitude: ['lon']
                      * time: ['time']

              Bounds:   n/a

       Grid Mappings:   n/a

Data Variables:
       Cell Measures:   area, volume: n/a

      Standard Names:   surface_downward_mass_flux_of_carbon_dioxide_expressed_as_carbon: ['fgco2']

              Bounds:   T: ['time_bnds']
                        X: ['lon_bnds', 'nlon_bnds']
                        Y: ['lat_bnds']
                        lat: ['lat_bnds']
                        latitude: ['lat_bnds']
                        lon: ['lon_bnds']
                        longitude: ['lon_bnds']
                        nlon: ['nlon_bnds']
                        time: ['time_bnds']

       Grid Mappings:   n/a

@pochedls
Copy link
Collaborator Author

pochedls commented Jan 7, 2025

@tomvothecoder – thank you for digging into this...before you get too far – can you reproduce the error? Maybe the issue is on my end (weird environment and old/new cf-xarray or something)?

@tomvothecoder
Copy link
Collaborator

@pochedls I can reproduce your issue with the latest main and xcdat 0.7.0.

I found the root cause of this issue. The dataset has two X and Y axes (nlat/lat, nlon/lon).

<xarray.Dataset> Size: 2GB
Dimensions:    (time: 3420, nlat: 384, nlon: 320, d2: 2, vertices: 4, bnds: 2)
Coordinates:
    lat        (nlat, nlon) float64 983kB dask.array<chunksize=(384, 320), meta=np.ndarray>
    lon        (nlat, nlon) float64 983kB dask.array<chunksize=(384, 320), meta=np.ndarray>
  * nlat       (nlat) int32 2kB 1 2 3 4 5 6 7 8 ... 378 379 380 381 382 383 384
  * nlon       (nlon) int32 1kB 1 2 3 4 5 6 7 8 ... 314 315 316 317 318 319 320
  * time       (time) object 27kB 2015-01-15 13:00:00.000007 ... 2299-12-15 1...

The input grid that is extracted (code here) from the dataset consists of nlat and nlon since they are the dimension coordinates (indicated by *). nlat and nlon do not have CF-compliant attributes, resulting in the error ValueError: dataset must include lon/lat or be CF-compliant raised by xESMF/CF-Xarray.

ds.nlat.attrs
{'long_name': 'cell index along second dimension', 'units': '1'}

ds.nlon.attrs
{'long_name': 'cell index along first dimension', 'units': '1', 'bounds': 'nlon_bnds'}

Meanwhile, lat and lon have the correct CF-compliant attributes, but they are auxiliary coordinates that are not considered.

ds.lat.attrs
{'axis': 'Y', 'bounds': 'lat_bnds', 'standard_name': 'latitude', 'title': 'Latitude', 'type': 'double', 'units': 'degrees_north', 'valid_max': np.float64(90.0), 'valid_min': np.float64(-90.0)}

ds.lon.attrs
{'axis': 'X', 'bounds': 'lon_bnds', 'standard_name': 'longitude', 'title': 'Longitude', 'type': 'double', 'units': 'degrees_east', 'valid_max': np.float64(360.0), 'valid_min': np.float64(0.0)}

Are lat and lon supposed to be the dimension coordinates rather than auxiliary coordinates?

@pochedls
Copy link
Collaborator Author

pochedls commented Jan 8, 2025

Got it. I think xcdat used to be able to sort this out since this data structure is what is used for unstructured grids (which can be regridded with ESMF). I'll see if I can find examples in which this did work in the past (I know we tested regridding sea ice or sea surface temperature).

@pochedls
Copy link
Collaborator Author

pochedls commented Jan 8, 2025

It is hard to trace, but I think that an example dataset that we used to be able to regrid, but is now problematic is this one:

/p/css03/esgf_publish/CMIP6/CMIP/NCAR/CESM2-WACCM/historical/r1i1p1f1/SImon/siconc/gn/v20190227/

I think @jasonb5 was able to get this working by resolving this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type: bug Inconsistencies or issues which will cause an issue or problem for users or implementors.
Projects
Status: Todo
Development

No branches or pull requests

3 participants