Make `mllam-data-prep` aware of geographical coordinates (by introducing projection info) #33

leifdenby · 2024-11-01T10:39:31Z

The problem

Currently, mllam-data-prep has no understanding of projection information, which means that it doesn't know how to define the latitude and longitude coordinate values for a given input dataset or how to define the same in zarr datasets produced by mllam-data-prep. We could of course require that latitude and longitude values are given for example as variables with the specific names (lat and lon for example) in each input dataset, but that seems limiting to me since deriving latitude and longitude coordinate values from the coordinate values actually used in a dataset is exactly what the projection information is for. If a given input dataset is in fact given on a lat/lon grid I vote we make that explicit (by requiring that the projection be set as cartopy.crs.PlateCarree) - give example configs below.

To my mind being able to define the latitude and longitude coordinate values is necessary for:

in neural-lam these are required so that lat / lon can be returned by a datastore and the projection used when plotting
being able to define the lat/lon coordinate convex hull that @TomasLandelius will use in implementing the geographical coordinate polygon-based approach to combining datasets on different spatial domains (e.g. DANRA and ERA5), Distance-Based Indexing of Boundary Dataset Points weather-model-graphs#30
creating derived variables that require the lat/lon coordinates to compute the value of a derived variable as @ealerskans is working on in Add ability to derive fields from input datasets #29

Proposed solution

I propose to provide two ways of deriving projection information in mllam-data-prep:

Implement functionality to parse the projection information from input datasets where projection information is stored in a CF-compliant manner.
Introduce a new projection information config parameter available on InputDataset's in the config, where the type of projection (given as a cartopy.crs.Projection ClassName, e.g. LambertConformal)

The new projection section to the input section would have three values to set the dimension names for the dimensions used in the projection (called dims), the cartopy.crs.Projection-derived projection class name (class_name) and the kwargs passed to this ProjectionClass to create an instance of the projection object. In the case where we can parse the projection from from CF-compliant attributes and the user has also defined projection in the config we warn the user that they are overwriting the projection information.

I will give two examples of what this would look like:

a) For DANRA (which is given on a Lambert Conformal grid) this would look like the following:

inputs:
  danra_surface:
    path: https://mllam-test-data.s3.eu-north-1.amazonaws.com/single_levels.zarr
    projection:
      dims: [x, y]
      class_name: LambertConformal
      kwargs:
        central_longitude: 15.0
        central_latitude: 63.0
        standard_parallels: [63.0, 63.0]
  ...

b) for ERA5 (which is stored on a lat/lon grid) stored in Google's S3 bucket:

inputs:
  era5_arco:
    path: gs://gcp-public-data-arco-era5/ar/1959-2022-full_37-1h-0p25deg-chunk-1.zarr-v2
    projection:
      dims: [lon, lat]
      class_name: PlateCarree
  ...

Having this new config option will then make it possible to implement a function (say get_input_dataset_lat_lons) that can be used to get the lat/lon values for each grid-point. The output of this function can then be used when implementing th dataset merging that @TomasLandelius is working on.

This would have the following limitation which could be handled later: All input datasets would have to be given on the same projection. This limitation primarily stems from the fact that mllam-data-prep requires that all input datasets share the exact same grid coordinates. We could in future implement reprojection and interpolation in mllam-data-prep, but I was going to aim for that in this piece of work. This would only be for adding "how to extract the lat lon coordinates for a given input dataset" and "given that all input datasets share the same projection and coordinates how to write this projection information to the output dataset".

As I mentioned in #31 if we have a new __common__ section in the config then the user would only have to write this projection info once in the config.

I would greatly appreciate your input @joeloskarsson, @khintz, @observingClouds, @sadamov, @SimonKamuk

The text was updated successfully, but these errors were encountered:

observingClouds · 2024-11-01T11:58:35Z

This sounds to me like a good and useful proposal!

This functionality will also be of interest for the derived variable section (#29) which might rely on something like get_input_dataset_lat_lons as well to create the right inputs for the derivation function (@ealerskans)

joeloskarsson · 2024-11-01T13:54:13Z

My first thought would probably be to just:

We could of course require that latitude and longitude values are given for example as variables with the specific names (lat and lon for example) in each input dataset, but that seems limiting to me since deriving latitude and longitude coordinate values from the coordinate values actually used in a dataset is exactly what the projection information is for.

But if there is interest to build this functionality into mdp then I think that is great! If the most common situation is that you have your LAM data sitting without lat-lons then you would anyhow have to do this conversion, so if mdp can do it for you that is really nice.

One thing to keep in mind is that neural-lam will eventually need this projection information explicitly in the config: https://github.com/mllam/neural-lam/blob/7112013f24ad36d8d9d19b4b5f853b11a2bbebf4/neural_lam/data_config.yaml#L59-L64 (or something similar after mllam/neural-lam#66 is merged). So in that sense the user never gets away from specifying the projection, even in the case that it is correctly stored in the input dataset. But this seems a bit unnecessary, so maybe a better option would be to have mdp always store the projection information in the output zarr and let neural-lam just use that? (more a question for neural-lam really, but becomes relevant for this)

leifdenby · 2024-11-07T09:42:02Z

@observingClouds just adding a few links here that I found it looking more into how to best define projection information. The key thing I learnt is that cartopy (as of version 2.2) uses pyproj classes under-the-hood to create projection objects and do transforms, and it also supports going between cartopy.crs.Projection-derived classes and pyproj.Projection classes.

links:

on how to go between different representations of projections in code, include cartopy to/from pyproj: https://pyproj4.github.io/pyproj/stable/crs_compatibility.html
metpy discussion on moving from using cartopy for transform calculations to pyproj directly: Use PyProj instead of CartoPy for calculations Unidata/MetPy#1455 and Use PyProj instead of Cartopy for internal coordinate transforms and rename crs coordinate Unidata/MetPy#1483

This makes me feel like that instead of using cartopy to handle projections here in mllam-data-prep we should use pyproj since pyproj is used by cartopy anyway under-the-hood, cartopy includes plotting functionality that we don't need here (so will also include matplotlib) and thus is quite a heavy dependency. This would mean that the projection info should be defined by the means that pyproj uses rather than cartopy. I think that would likely be the WKT format string, but there is also something called WKT2 apparently. It would be nice to work out how/if this plays together with the CF-conventions too, i.e. if for example we could use the WKT string and set an attribute and the CF-compliant downstream tools understand that, that would be amazing.

I would suggest we then add a convenience function so that if someone has a projection as a cartopy.crs.Projection object we include directions on how to produce the pyproj projection representation, i.e. we include instructions/code to generate the WKT string from a cartopy.crs.Projection object. Maybe that could just be a line of code in the README where we discuss these things about projections.

observingClouds · 2024-11-08T12:31:36Z

From a CF-convention standpoint:

The crs_wkt attribute is intended to act as a supplement to other single-property CF grid mapping attributes [...]; it is not intended to replace those attributes. [...] the CRS should be described as thoroughly as possible with the single-property grid mapping attributes as well as by crs_wkt.

values shown in https://github.com/cf-convention/cf-conventions/wiki/Mapping-from-CF-Grid-Mapping-Attributes-to-CRS-WKT-Elements should be preferred; these are derived from the OGP/EPSG registry of geodetic parameters, which is considered to represent the definitive authority as regards CRS property names and values.

From pyproj view:

converter is offered to convert coordinates and projection from CF to CRS and vice versa.

When importing a CRS from [...CF], you need both the grid mapping as well as the coordinate system. If you don’t use the coordinate system, then you will lose the units of your projection. [Except if] the CF crs_wkt attribute is available, the coordinate system is inside of the WKT and can be used to create the CRS in a single step.

https://proj.org/en/9.5/faq.html#what-is-the-best-format-for-describing-coordinate-reference-systems

Preliminary conclusions:

crs_wkt seems a good way to store coordinate system and projection information
to ensure CF compliant output files, both crs_wkt and single-property CF grid mapping attributes shall be set
output with CF compliant attributes can be created with pyproj's functionality

observingClouds · 2024-11-10T13:41:27Z

Based on brainstorming with @SimonKamuk the introduction of projection information raises a few sub-questions:

do we want to support CF-conform projection information of the input datasets?
do we want to ensure that the datasets created by mllam-data-prep contains CF-conform proj info?
do we want to be able to add/manipulate the projection information of the incoming dataset?
do we want to support several projections per input dataset?
do we want to support several projections per output dataset?
- example: satellite observations and model variables on different projections
- example: ocean model and atmospheric model on different grids

A bit of background information to answer these questions:

as mentioned in previous comment, CF requires at minimum single-property CF grid mapping attributes
crs_wkt is optional but WKT is best format to save/exchange projection information
crs information has to be assigned to each variable with grid_mapping attribute
grid_mapping attribute links variable to randomly named projection variable
- note: projections are saved as variables (not global attributes)
- note: crs varibables can have any name (not necessarily crs)

Implementation questions/suggestions:

How to get projection information
- Metpy
  - Example
  - xr.Dataset().metpy.parse_cf() parses CF conform project information and allows access to cartopy_crs or pyproj_crs for each variable, e.g. ds[var].metpy.pyproj_crs
  - crs_wkt attribute will be used over single-entry CF attributes
  - introduces dependencies to several packages including matplotlib, pyproj, cartopy
- Pyproj
  - Example
  - identification of projection variable needs to be done manually via checking each variable for the grid_mapping attribute
  - pyproj.CRS.from_cf() allows to create pyproj_crs from identified crs variable attributes
  - crs_wkt attribute will be used over single-entry CF attributes
  - pyproj_CRS.from_wkt() allows to interpret crs_wkt string directly
  - offers option to create cf-conform attribute dictionary from projection with crs.to_cf()
- Xarray
  - Example
  - identifies crs variables based on grid_mapping attribute and assigns them as coordinates
  - cf_xarray
    - Example
    - is able to automatically get grid_mapping variables
    Based on these arguments the pyproj implementation seems to be the most lightweight solution. The check of the grid_mapping attribute can easily be implemented.

How to configure the input projection

Input file specifications
- projection(s) are centrally defined as variables
- each variable needs a grid_mapping attribute to be connected with a projection
- if proj provided in input file, cf-conform single-value attributes or crs_wkt needs to be provided, with the later being prioritized

Config file implementation

inputs:
  danra_surface:
    path: https://mllam-test-data.s3.eu-north-1.amazonaws.com/single_levels.zarr
    variables:
      u:
        coordinates:
          altitude:
            values: [100,]
        attributes:  # will update ds.u.attrs
          grid_mapping: LambertConformal
    projections:
      LambertConformal:  # will become grid_mapping variable name
        dims: [x, y]
        kwargs:  # will be interpreted by pyproj.CRS.from_cf()
          crs_wkt: "GEODCRS['WGS 84',DATUM['World Geodetic System 1984', ELLIPSOID['WGS 84',6378137,..." # optional if single-attributes are provided
          central_longitude: 15.0  # optional if crs_wkt is given
          central_latitude: 63.0
          standard_parallels: [63.0, 63.0]

This emphasizes again the need to restructure the variables section and intruce keys like coordinates.

How to write the projection information to the output file
- identify variables which are linked to projections (grid_mapping attr present in input) with cf_xarray
- ensure grid_mapping is written to output
- use pyprojs crs.to_cf() to create CF-conform projection information for each projection
- use xarray to create empty variable and attach output of crs.to_cf() as attributes (ds[crs.name].attrs = crs.to_cf())

Note

As one can see, adding cf-compliant projection information is not as straightforward as maybe initially thought. Nevertheless, thanks to the great work done in the external packages an implementation seems doable. It should be noted though that the libraries and specifications are still far from perfect and things will fail.

I created a small repository to test the different packages. My observations are:

Some of the CF-convention example CRSs are either incorrect or pyproj cannot read them/misses the database by default
reading cf-compliant crs input may result in different crs output when using pyproj.CRS.from_CF() and crs.to_cf(), mostly with additional attributes
the grid_mapping attribute can have different formats that are not yet always implemented, e.g. in xarray

observingClouds · 2024-11-10T13:42:01Z

@SimonKamuk feel free to add/edit in case I missed something

leifdenby · 2024-11-11T08:29:56Z

Thanks for this @observingClouds and @SimonKamuk! Really great overview.

In terms of the questions you ask I would suggest we try and stage the implementation.

do we want to support CF-conform projection information of the input datasets?
do we want to ensure that the datasets created by mllam-data-prep contains CF-conform proj info?
do we want to be able to add/manipulate the projection information of the incoming dataset?
do we want to support several projections per input dataset?
do we want to support several projections per output dataset?

How about the following plan:

first revision:
- CF-conforming projection info in output of mllam-data-prep
- Ability to define in mllam-data-prep config: projection info that will set (overwrite if present) for all selected variables that have the projection coordinates as dimensions
- only support one projection in output, no support for reading projection from input
second revision:
- add support for reading projection from input. Still only support one projection
third revision:
- add support for reading/writing in multiple projections. Will need change of config, but also rethink of how mllam-data-prep works since it is currently assumed that all variables and datasets share the exact same coordinates

One thing I am unsure about:

Does it even make sense to try and store the projection info in the output of mllam-data-prep in a cf-compability manner? I ask because the dimensions used in the output will typically be [grid_index,time,state_feature], but the projection coordinates will be "auxiliary" coordinates, i.e. be set in ds.coords but not actually used as a dimension of any of the variables in the output.

observingClouds · 2024-11-11T09:41:51Z

The plan sounds great!

Does it even make sense to try and store the projection info in the output of mllam-data-prep in a cf-compability manner?

Maybe I return this question and ask if we want to store this information in general. If so, then the CF-convention are a great solution even if we break them per variable (e.g. because they are auxiliary coords) but use the convention in the crs variable.

leifdenby · 2024-11-11T10:01:56Z

Maybe I return this question and ask if we want to store this information in general. If so, then the CF-convention are a great solution even if we break them per variable (e.g. because they are auxiliary coords) but use the convention in the crs variable.

Yes, sorry. I meant to say whether it is actually possible, not whether it makes sense. Maybe you could start there to see if you simple set the correct attributes in the output from mllam-data-prep (for example just the example DANRA config file) and then get pyproj to parse the info? If that works then I think we have a plan 😄 Thanks again!

leifdenby · 2024-11-12T15:51:43Z

@observingClouds I didn't manage to comment on your config suggestion change. If we assume that all inputs use the same projection could we start with the following:

inputs:
  danra_surface:
    path: https://mllam-test-data.s3.eu-north-1.amazonaws.com/single_levels.zarr
    variables:
      u:
        altitude:
          values: [100,]
    projection:
      dims: [x, y]
      crs_wkt: "GEODCRS['WGS 84',DATUM['World Geodetic System 1984', ELLIPSOID['WGS 84',6378137,..."

?

I appreciate that you would like to have inputs.{dataset_name}.coordinates and inputs.{dataset_name}.attributes as we're discussing in #30, but I feel a little unsure about changing (as in breaking existing configs) rather than introducing a new optional field to the config. We can also discuss this more if you'd like :) maybe we need voice it to more people to get some feedback.

observingClouds · 2024-11-12T16:17:34Z

Well, I really would like to get the grid_mapping attribute set. I can assume that for now being set for each variable, but I think having the option to set that directly in the config makes stage 2 straight forward (is included basically for free). With keeping the config sections general and mimic xarrays settings (coords, encoding, attrs) the config remains simple and generalizes. Let me do the implementation PR and we can continue the discussion to see if my suggested change is a bad idea ;)

observingClouds · 2024-11-15T07:52:29Z

@leifdenby #38 follows now mostly the above mentioned structure, with just two minor adjustments:

key name projections instead of projection (to allow for future non-breaking changes)
consequently an additional subsection with the projection name

In summary:

inputs:
  danra_surface:
    path: https://mllam-test-data.s3.eu-north-1.amazonaws.com/single_levels.zarr
    variables:
      u:
        altitude:
          values: [100,]
    projections:
      my_crs_name:
          dims: [x, y]
          crs_wkt: "GEODCRS['WGS 84',DATUM['World Geodetic System 1984', ELLIPSOID['WGS 84',6378137,..."

observingClouds · 2024-11-19T17:12:09Z

Because we do not only want to read and save complete CF-conform projection information, but also be able to plot the data on a given projection this issue and connected PR depend on upstream issues: SciTools/cartopy#2479

observingClouds · 2024-11-20T08:10:37Z

After some deeper dive, it turns out that by providing a valid BBOX entry to the WKT strings (e.g. do not include the poles for cylindrical projections), plotable cartopy Projection objects can be created from WKT strings.

leifdenby mentioned this issue Nov 8, 2024

add section for extra info in config #18

Merged

observingClouds self-assigned this Nov 11, 2024

leifdenby mentioned this issue Nov 12, 2024

Add ability to derive fields from input datasets #29

Open

observingClouds linked a pull request Nov 13, 2024 that will close this issue

Add support for reading, defining and outputting CF-compliant projection info #38

Open

leifdenby mentioned this issue Nov 19, 2024

Add "datastores" to represent input data from zarr, npy, etc mllam/neural-lam#66

Merged

20 tasks

observingClouds mentioned this issue Nov 20, 2024

Allow non-regularly gridded and lat-lon coordinates mllam/weather-model-graphs#32

Merged

20 tasks

ealerskans added a commit to ealerskans/mllam-data-prep that referenced this issue Dec 9, 2024

Add dummy function for getting lat,lon (preparation for mllam#33)

fbb6065

ealerskans mentioned this issue Dec 11, 2024

Add ability to derive variables and add selected derived forcings #34

Open

4 tasks

leifdenby mentioned this issue Dec 12, 2024

Add support for cropping domain with padded lat/lon convex hull #45

Open

20 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make `mllam-data-prep` aware of geographical coordinates (by introducing projection info) #33

Make `mllam-data-prep` aware of geographical coordinates (by introducing projection info) #33

leifdenby commented Nov 1, 2024 •

edited

Loading

observingClouds commented Nov 1, 2024

joeloskarsson commented Nov 1, 2024

leifdenby commented Nov 7, 2024

observingClouds commented Nov 8, 2024 •

edited

Loading

observingClouds commented Nov 10, 2024 •

edited

Loading

observingClouds commented Nov 10, 2024

leifdenby commented Nov 11, 2024

observingClouds commented Nov 11, 2024

leifdenby commented Nov 11, 2024

leifdenby commented Nov 12, 2024 •

edited

Loading

observingClouds commented Nov 12, 2024

observingClouds commented Nov 15, 2024

observingClouds commented Nov 19, 2024

observingClouds commented Nov 20, 2024

Make mllam-data-prep aware of geographical coordinates (by introducing projection info) #33

Make mllam-data-prep aware of geographical coordinates (by introducing projection info) #33

Comments

leifdenby commented Nov 1, 2024 • edited Loading

The problem

Proposed solution

observingClouds commented Nov 1, 2024

joeloskarsson commented Nov 1, 2024

leifdenby commented Nov 7, 2024

observingClouds commented Nov 8, 2024 • edited Loading

observingClouds commented Nov 10, 2024 • edited Loading

Implementation questions/suggestions:

Note

observingClouds commented Nov 10, 2024

leifdenby commented Nov 11, 2024

observingClouds commented Nov 11, 2024

leifdenby commented Nov 11, 2024

leifdenby commented Nov 12, 2024 • edited Loading

observingClouds commented Nov 12, 2024

observingClouds commented Nov 15, 2024

observingClouds commented Nov 19, 2024

observingClouds commented Nov 20, 2024

Make `mllam-data-prep` aware of geographical coordinates (by introducing projection info) #33

Make `mllam-data-prep` aware of geographical coordinates (by introducing projection info) #33

leifdenby commented Nov 1, 2024 •

edited

Loading

observingClouds commented Nov 8, 2024 •

edited

Loading

observingClouds commented Nov 10, 2024 •

edited

Loading

leifdenby commented Nov 12, 2024 •

edited

Loading