
Make mllam-data-prep aware of geographical coordinates (by introducing projection info) #33

Open
leifdenby opened this issue Nov 1, 2024 · 14 comments · May be fixed by #38

@leifdenby commented Nov 1, 2024

The problem

Currently, mllam-data-prep has no understanding of projection information, which means it doesn't know how to define latitude and longitude coordinate values for a given input dataset, or how to define the same in the zarr datasets it produces. We could of course require that latitude and longitude values are provided as variables with specific names (e.g. lat and lon) in each input dataset, but that seems limiting to me, since deriving latitude and longitude coordinate values from the coordinate values actually used in a dataset is exactly what projection information is for. If a given input dataset is in fact given on a lat/lon grid, I vote we make that explicit (by requiring that the projection be set to cartopy.crs.PlateCarree); I give example configs below.

To my mind being able to define the latitude and longitude coordinate values is necessary for:

  1. neural-lam, where lat/lon must be returned by a datastore and the projection is used when plotting
  2. defining the lat/lon convex hull that @TomasLandelius will use in implementing the geographical-coordinate polygon-based approach to combining datasets on different spatial domains (e.g. DANRA and ERA5), Distance-Based Indexing of Boundary Dataset Points weather-model-graphs#30
  3. creating derived variables that require the lat/lon coordinates to compute their values, as @ealerskans is working on in Add ability to derive fields from input datasets #29

Proposed solution

I propose to provide two ways of deriving projection information in mllam-data-prep:

  1. Implement functionality to parse the projection information from input datasets where projection information is stored in a CF-compliant manner.
  2. Introduce a new projection config parameter available on InputDataset entries in the config, where the type of projection is given as a cartopy.crs.Projection class name (e.g. LambertConformal), together with the keyword arguments needed to instantiate it.

The new projection section within each input section would have three values: the dimension names for the dimensions used in the projection (dims), the name of the cartopy.crs.Projection-derived projection class (class_name), and the kwargs passed to this projection class to create an instance of the projection object. In cases where we can parse the projection from CF-compliant attributes and the user has also defined a projection in the config, we warn the user that they are overwriting the projection information.

I will give two examples of what this would look like:

a) For DANRA (which is given on a Lambert Conformal grid) this would look like the following:

inputs:
  danra_surface:
    path: https://mllam-test-data.s3.eu-north-1.amazonaws.com/single_levels.zarr
    projection:
      dims: [x, y]
      class_name: LambertConformal
      kwargs:
        central_longitude: 15.0
        central_latitude: 63.0
        standard_parallels: [63.0, 63.0]
  ...

b) for ERA5 (which is given on a lat/lon grid), stored in Google's public cloud bucket:

inputs:
  era5_arco:
    path: gs://gcp-public-data-arco-era5/ar/1959-2022-full_37-1h-0p25deg-chunk-1.zarr-v2
    projection:
      dims: [lon, lat]
      class_name: PlateCarree
  ...

Having this new config option will then make it possible to implement a function (say get_input_dataset_lat_lons) that can be used to get the lat/lon values for each grid-point. The output of this function can then be used when implementing the dataset merging that @TomasLandelius is working on.
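
As a rough sketch (not a finished API; it simply assumes the projection section proposed above with dims, class_name and kwargs), such a function could look something like this:

import cartopy.crs as ccrs
import numpy as np
import xarray as xr

def get_input_dataset_lat_lons(ds: xr.Dataset, projection_config: dict):
    """Return (lat, lon) arrays for every grid-point of ds, based on the
    proposed projection config section (dims, class_name, kwargs)."""
    proj_cls = getattr(ccrs, projection_config["class_name"])
    proj = proj_cls(**projection_config.get("kwargs", {}))
    dim_x, dim_y = projection_config["dims"]
    xs, ys = np.meshgrid(ds[dim_x].values, ds[dim_y].values)
    # transform_points returns an (..., 3) array of (lon, lat, z) in the target projection
    lonlatz = ccrs.PlateCarree().transform_points(proj, xs, ys)
    return lonlatz[..., 1], lonlatz[..., 0]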

This would have the following limitation, which could be handled later: all input datasets would have to be given on the same projection. This limitation primarily stems from the fact that mllam-data-prep requires that all input datasets share the exact same grid coordinates. We could in future implement reprojection and interpolation in mllam-data-prep, but I was not going to aim for that in this piece of work. This piece of work would only cover "how to extract the lat/lon coordinates for a given input dataset" and "given that all input datasets share the same projection and coordinates, how to write this projection information to the output dataset".

As I mentioned in #31 if we have a new __common__ section in the config then the user would only have to write this projection info once in the config.

I would greatly appreciate your input @joeloskarsson, @khintz, @observingClouds, @sadamov, @SimonKamuk

@observingClouds

This sounds to me like a good and useful proposal!

This functionality will also be of interest for the derived variable section (#29) which might rely on something like get_input_dataset_lat_lons as well to create the right inputs for the derivation function (@ealerskans)

@joeloskarsson

My first thought would probably be to just:

We could of course require that latitude and longitude values are provided as variables with specific names (e.g. lat and lon) in each input dataset, but that seems limiting to me, since deriving latitude and longitude coordinate values from the coordinate values actually used in a dataset is exactly what projection information is for.

But if there is interest in building this functionality into mdp then I think that is great! If the most common situation is that you have your LAM data sitting without lat-lons, then you would anyhow have to do this conversion, so if mdp can do it for you that is really nice.

One thing to keep in mind is that neural-lam will eventually need this projection information explicitly in the config: https://github.com/mllam/neural-lam/blob/7112013f24ad36d8d9d19b4b5f853b11a2bbebf4/neural_lam/data_config.yaml#L59-L64 (or something similar after mllam/neural-lam#66 is merged). So in that sense the user never gets away from specifying the projection, even in the case that it is correctly stored in the input dataset. But this seems a bit unnecessary, so maybe a better option would be to have mdp always store the projection information in the output zarr and let neural-lam just use that? (more a question for neural-lam really, but becomes relevant for this)

@leifdenby

@observingClouds just adding a few links here that I found while looking more into how best to define projection information. The key thing I learnt is that cartopy (since version 0.20) uses pyproj classes under the hood to create projection objects and do transforms, and it also supports going between cartopy.crs.Projection-derived classes and pyproj.CRS objects.


This makes me feel that instead of using cartopy to handle projections in mllam-data-prep we should use pyproj: pyproj is used by cartopy under the hood anyway, and cartopy includes plotting functionality that we don't need here (so it also pulls in matplotlib), which makes it quite a heavy dependency. This would mean that the projection info should be defined in the form pyproj uses rather than cartopy's. I think that would likely be a WKT format string (there is also a newer WKT2 format, apparently). It would be nice to work out how/if this plays together with the CF conventions too, i.e. if we could for example store the WKT string as an attribute and have CF-compliant downstream tools understand it, that would be amazing.

I would also suggest that, for anyone who has a projection as a cartopy.crs.Projection object, we include directions on how to produce the pyproj representation, i.e. instructions/code to generate the WKT string from a cartopy.crs.Projection object. Maybe that could just be a line of code in the README where we discuss projections.
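
For example (assuming a recent cartopy, >= 0.20, where cartopy.crs.Projection subclasses pyproj.CRS), generating the WKT string could be as simple as:

import cartopy.crs as ccrs

proj = ccrs.LambertConformal(
    central_longitude=15.0, central_latitude=63.0, standard_parallels=(63.0, 63.0)
)
crs_wkt = proj.to_wkt()  # WKT2 string understood by pyproj, usable in the config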

@observingClouds commented Nov 8, 2024

From a CF-convention standpoint:

The crs_wkt attribute is intended to act as a supplement to other single-property CF grid mapping attributes [...]; it is not intended to replace those attributes. [...] the CRS should be described as thoroughly as possible with the single-property grid mapping attributes as well as by crs_wkt.

values shown in https://github.com/cf-convention/cf-conventions/wiki/Mapping-from-CF-Grid-Mapping-Attributes-to-CRS-WKT-Elements should be preferred; these are derived from the OGP/EPSG registry of geodetic parameters, which is considered to represent the definitive authority as regards CRS property names and values.

From pyproj view:

  • a converter is offered to convert coordinates and projections from CF to CRS and vice versa.

When importing a CRS from [...CF], you need both the grid mapping as well as the coordinate system. If you don’t use the coordinate system, then you will lose the units of your projection. [Except if] the CF crs_wkt attribute is available, the coordinate system is inside of the WKT and can be used to create the CRS in a single step.

Preliminary conclusions:

  • crs_wkt seems a good way to store coordinate system and projection information
  • to ensure CF compliant output files, both crs_wkt and single-property CF grid mapping attributes shall be set
  • output with CF-compliant attributes can be created with pyproj's functionality (see the sketch below)
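
A minimal sketch of that pyproj round trip (the CF attribute values below are just illustrative, roughly matching the DANRA example above):

from pyproj import CRS

cf_attrs = {
    "grid_mapping_name": "lambert_conformal_conic",
    "standard_parallel": [63.0, 63.0],
    "longitude_of_central_meridian": 15.0,
    "latitude_of_projection_origin": 63.0,
}
crs = CRS.from_cf(cf_attrs)   # build CRS from single-property CF attributes
out_attrs = crs.to_cf()       # CF attribute dict for the output, includes "crs_wkt"
wkt = out_attrs["crs_wkt"]    # WKT string, also available via crs.to_wkt()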

@observingClouds commented Nov 10, 2024

Based on brainstorming with @SimonKamuk the introduction of projection information raises a few sub-questions:

  • do we want to support CF-conform projection information of the input datasets?
  • do we want to ensure that the datasets created by mllam-data-prep contain CF-conform proj info?
  • do we want to be able to add/manipulate the projection information of the incoming dataset?
  • do we want to support several projections per input dataset?
  • do we want to support several projections per output dataset?
    • example: satellite observations and model variables on different projections
    • example: ocean model and atmospheric model on different grids

A bit of background information to answer these questions:

  • as mentioned in the previous comment, CF requires at minimum the single-property CF grid mapping attributes
  • crs_wkt is optional, but WKT is the best format for saving/exchanging projection information
  • crs information has to be assigned to each variable via the grid_mapping attribute
  • the grid_mapping attribute links a variable to an arbitrarily named projection variable
    • note: projections are saved as variables (not global attributes)
    • note: crs variables can have any name (not necessarily crs)

Implementation questions/suggestions:

  • How to get projection information
    • Metpy
      • Example
      • xr.Dataset().metpy.parse_cf() parses CF-conform projection information and allows access to cartopy_crs or pyproj_crs for each variable, e.g. ds[var].metpy.pyproj_crs
      • crs_wkt attribute will be used over single-entry CF attributes
      • introduces dependencies to several packages including matplotlib, pyproj, cartopy
    • Pyproj
      • Example
      • identification of the projection variable needs to be done manually by checking each variable for the grid_mapping attribute
      • pyproj.CRS.from_cf() allows creating a pyproj CRS from the identified crs variable's attributes
      • the crs_wkt attribute will be used over single-entry CF attributes
      • pyproj.CRS.from_wkt() allows interpreting the crs_wkt string directly
      • offers the option to create a CF-conform attribute dictionary from a projection with crs.to_cf()
    • Xarray
      • Example
      • identifies crs variables based on grid_mapping attribute and assigns them as coordinates
      • cf_xarray
        - Example
        - is able to automatically get grid_mapping variables
    • Based on these arguments, the pyproj implementation seems to be the most lightweight solution. The check for the grid_mapping attribute can easily be implemented (a sketch of this flow follows after this list).
  • How to configure the input projection
    • Input file specifications
      • projection(s) are centrally defined as variables
      • each variable needs a grid_mapping attribute to be connected with a projection
      • if a projection is provided in the input file, CF-conform single-value attributes or crs_wkt need to be provided, with the latter being prioritized
    • Config file implementation
      inputs:
        danra_surface:
          path: https://mllam-test-data.s3.eu-north-1.amazonaws.com/single_levels.zarr
          variables:
            u:
              coordinates:
                altitude:
                  values: [100,]
              attributes:  # will update ds.u.attrs
                grid_mapping: LambertConformal
          projections:
            LambertConformal:  # will become grid_mapping variable name
              dims: [x, y]
              kwargs:  # will be interpreted by pyproj.CRS.from_cf()
                crs_wkt: "GEODCRS['WGS 84',DATUM['World Geodetic System 1984', ELLIPSOID['WGS 84',6378137,..." # optional if single-attributes are provided
                central_longitude: 15.0  # optional if crs_wkt is given
                central_latitude: 63.0
                standard_parallels: [63.0, 63.0]
      
      This again emphasizes the need to restructure the variables section and introduce keys like coordinates.
  • How to write the projection information to the output file
    • identify variables which are linked to projections (grid_mapping attr present in input) with cf_xarray
    • ensure grid_mapping is written to output
    • use pyproj's crs.to_cf() to create CF-conform projection information for each projection
    • use xarray to create empty variable and attach output of crs.to_cf() as attributes (ds[crs.name].attrs = crs.to_cf())
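
A rough sketch of this pyproj-based flow (the helper names are illustrative, not the mllam-data-prep API):

import xarray as xr
from pyproj import CRS

def find_crs_variables(ds: xr.Dataset) -> dict:
    """Map each data variable name to the name of its grid_mapping variable."""
    return {
        name: da.attrs["grid_mapping"]
        for name, da in ds.data_vars.items()
        if "grid_mapping" in da.attrs
    }

def read_crs(ds: xr.Dataset, crs_var: str) -> CRS:
    """Build a pyproj CRS from a grid-mapping variable (crs_wkt wins if present)."""
    attrs = ds[crs_var].attrs
    if "crs_wkt" in attrs:
        return CRS.from_wkt(attrs["crs_wkt"])
    return CRS.from_cf(attrs)

def write_crs(ds_out: xr.Dataset, crs: CRS, crs_var: str = "crs") -> xr.Dataset:
    """Attach an empty variable carrying CF-conform projection attributes."""
    ds_out[crs_var] = xr.DataArray(0, attrs=crs.to_cf())  # single-property attrs + crs_wkt
    for name, da in ds_out.data_vars.items():
        if name != crs_var:
            da.attrs.setdefault("grid_mapping", crs_var)
    return ds_out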

Note

As one can see, adding CF-compliant projection information is not as straightforward as one might initially think. Nevertheless, thanks to the great work done in the external packages, an implementation seems doable. It should be noted, though, that the libraries and specifications are still far from perfect and things will fail.

I created a small repository to test the different packages. My observations are:

  • Some of the CF-convention example CRSs are either incorrect, or pyproj cannot read them / is missing the required database by default
  • reading CF-compliant crs input may result in different crs output when round-tripping through pyproj.CRS.from_cf() and crs.to_cf(), mostly in the form of additional attributes
  • the grid_mapping attribute can have different formats that are not yet always implemented, e.g. in xarray

@observingClouds

@SimonKamuk feel free to add/edit in case I missed something

@leifdenby

Thanks for this @observingClouds and @SimonKamuk! Really great overview.

In terms of the questions you ask I would suggest we try and stage the implementation.

do we want to support CF-conform projection information of the input datasets?
do we want to ensure that the datasets created by mllam-data-prep contain CF-conform proj info?
do we want to be able to add/manipulate the projection information of the incoming dataset?
do we want to support several projections per input dataset?
do we want to support several projections per output dataset?

How about the following plan:

  • first revision:
    • CF-conforming projection info in output of mllam-data-prep
    • Ability to define, in the mllam-data-prep config, projection info that will be set (overwriting it if present) for all selected variables that have the projection coordinates as dimensions
    • only support one projection in output, no support for reading projection from input
  • second revision:
    • add support for reading projection from input. Still only support one projection
  • third revision:
    • add support for reading/writing multiple projections. Will need a change of config, but also a rethink of how mllam-data-prep works, since it is currently assumed that all variables and datasets share the exact same coordinates

One thing I am unsure about:

  • Does it even make sense to try and store the projection info in the output of mllam-data-prep in a CF-compatible manner? I ask because the dimensions used in the output will typically be [grid_index,time,state_feature], but the projection coordinates will be "auxiliary" coordinates, i.e. set in ds.coords but not actually used as a dimension of any of the variables in the output (a rough sketch of what I mean is given below).
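
To make the question concrete, here is a hypothetical sketch (not actual mllam-data-prep output; values are illustrative) of what such an output could look like, with lat/lon as auxiliary coordinates along grid_index and the original projection carried by a separate crs variable even though x/y are no longer dimensions:

import numpy as np
import xarray as xr
from pyproj import CRS

n_grid, n_time, n_feat = 4, 3, 2
ds = xr.Dataset(
    {"state": (("grid_index", "time", "state_feature"), np.zeros((n_grid, n_time, n_feat)))},
    coords={
        "lat": ("grid_index", np.linspace(55.0, 58.0, n_grid)),
        "lon": ("grid_index", np.linspace(8.0, 12.0, n_grid)),
    },
)
# projection kept in a crs variable, referenced from the data variable via grid_mapping;
# xarray writes the CF "coordinates" attribute for the auxiliary lat/lon itself on save
crs = CRS.from_cf({
    "grid_mapping_name": "lambert_conformal_conic",
    "standard_parallel": [63.0, 63.0],
    "longitude_of_central_meridian": 15.0,
    "latitude_of_projection_origin": 63.0,
})
ds["crs"] = xr.DataArray(0, attrs=crs.to_cf())
ds["state"].attrs["grid_mapping"] = "crs"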

@observingClouds

The plan sounds great!

Does it even make sense to try and store the projection info in the output of mllam-data-prep in a CF-compatible manner?

Maybe I should turn this question around and ask whether we want to store this information in general. If so, then the CF conventions are a great solution, even if we break them per variable (e.g. because they are auxiliary coords) but still follow the convention in the crs variable.

@leifdenby

Maybe I should turn this question around and ask whether we want to store this information in general. If so, then the CF conventions are a great solution, even if we break them per variable (e.g. because they are auxiliary coords) but still follow the convention in the crs variable.

Yes, sorry. I meant to ask whether it is actually possible, not whether it makes sense. Maybe you could start there to see if you can simply set the correct attributes in the output from mllam-data-prep (for example just for the example DANRA config file) and then get pyproj to parse the info? If that works then I think we have a plan 😄 Thanks again!

@leifdenby commented Nov 12, 2024

@observingClouds I didn't manage to comment on your suggested config change. If we assume that all inputs use the same projection, could we start with the following:

inputs:
  danra_surface:
    path: https://mllam-test-data.s3.eu-north-1.amazonaws.com/single_levels.zarr
    variables:
      u:
        altitude:
          values: [100,]
    projection:
      dims: [x, y]
      crs_wkt: "GEODCRS['WGS 84',DATUM['World Geodetic System 1984', ELLIPSOID['WGS 84',6378137,..."

?

I appreciate that you would like to have inputs.{dataset_name}.coordinates and inputs.{dataset_name}.attributes as we're discussing in #30, but I feel a little unsure about changing the config schema (as in breaking existing configs) rather than introducing a new optional field. We can also discuss this more if you'd like :) maybe we need to voice it to more people to get some feedback.

@observingClouds

Well, I really would like to get the grid_mapping attribute set. I can assume for now that it is set for each variable, but I think having the option to set it directly in the config makes stage 2 straightforward (it is basically included for free). By keeping the config sections general and mimicking xarray's settings (coords, encoding, attrs), the config remains simple and generalizes well. Let me do the implementation PR and we can continue the discussion to see if my suggested change is a bad idea ;)

@observingClouds

@leifdenby #38 now mostly follows the above-mentioned structure, with just two minor adjustments:

  • key name projections instead of projection (to allow for future non-breaking changes)
  • consequently an additional subsection with the projection name

In summary:

inputs:
  danra_surface:
    path: https://mllam-test-data.s3.eu-north-1.amazonaws.com/single_levels.zarr
    variables:
      u:
        altitude:
          values: [100,]
    projections:
      my_crs_name:
          dims: [x, y]
          crs_wkt: "GEODCRS['WGS 84',DATUM['World Geodetic System 1984', ELLIPSOID['WGS 84',6378137,..."

@observingClouds

Because we not only want to read and save complete CF-conform projection information, but also want to be able to plot the data on a given projection, this issue and the connected PR depend on an upstream issue: SciTools/cartopy#2479

@observingClouds

After a deeper dive, it turns out that by providing a valid BBOX entry in the WKT strings (e.g. not including the poles for cylindrical projections), plottable cartopy Projection objects can be created from WKT strings.
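
A hedged sketch of what that looks like in practice (assuming cartopy >= 0.20, where cartopy.crs.Projection accepts anything pyproj.CRS accepts; EPSG:3035 is just an example CRS whose WKT carries a BBOX/area of use):

import cartopy.crs as ccrs
from pyproj import CRS

wkt = CRS.from_epsg(3035).to_wkt()  # WKT2 string including a USAGE/BBOX entry
proj = ccrs.Projection(wkt)         # cartopy derives plotting bounds from the BBOX
print(proj.bounds)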
