
Add support for cropping domain with padded lat/lon convex hull #45

Open

wants to merge 7 commits into base: main

Conversation

leifdenby
Member

Describe your changes

The aim of this PR is to add support for cropping one dataset using the convex hull of lat/lon coordinates of grid-points in another dataset.

This is motivated by the need, when doing LAM (limited area) forecasting, to define forcing inputs in a padding region around the limited area. For example, when training on the DANRA reanalysis one might want to force on the boundary with the ERA5 reanalysis. This work builds on @TomasLandelius's proof-of-concept implementation (mllam/weather-model-graphs#30), which takes the convex hull of the lat/lon coordinates of grid-points in the limited-area (e.g. DANRA) domain, splits the points in the boundary domain by whether they are interior or exterior to this convex hull (again using the lat/lon coordinates), and defines points in the boundary region as those within a given distance (an angle in radians, measured from Earth's center) of the closest point inside the convex hull.

So far I have taken the implementation from @TomasLandelius and structured it into a notebook (using DANRA and ERA5 zarr datasets in cloud storage).

Below I have made an example where ERA5 sample points are selected that are within 0.2 radians of the convex hull of grid-points in DANRA:

[figure: selected ERA5 sample points within 0.2 radians of the convex hull of DANRA grid-points]
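For reference, here is a minimal numpy sketch of the angular-distance idea described above (the names latlon_to_unit_sphere_xyz, lats_boundary etc. are illustrative, not the PR's exact API; the convex-hull interior/exterior split itself uses the spherical-geometry package mentioned below):

import numpy as np

def latlon_to_unit_sphere_xyz(lat_deg, lon_deg):
    # convert lat/lon in degrees to points on the unit sphere, shape (..., 3)
    lat, lon = np.deg2rad(lat_deg), np.deg2rad(lon_deg)
    return np.stack(
        [np.cos(lat) * np.cos(lon), np.cos(lat) * np.sin(lon), np.sin(lat)],
        axis=-1,
    )

# angular distance (in radians) from each boundary candidate point to the
# closest interior (e.g. DANRA) grid-point
xyz_boundary = latlon_to_unit_sphere_xyz(lats_boundary, lons_boundary)
xyz_interior = latlon_to_unit_sphere_xyz(lats_interior, lons_interior)
angles = np.arccos(np.clip(xyz_boundary @ xyz_interior.T, -1.0, 1.0))
mask = angles.min(axis=1) <= 0.2  # keep points within 0.2 radians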

For the lat/lon convex hull calculations the spherical-geometry package is required; for plotting, of course, matplotlib and cartopy. I am thinking of putting the visualisation-related packages in an optional [visualisation] extra of the requirements, but I am not sure if it makes sense to always require the spherical-geometry package.

Issue Link

Implements mllam/weather-model-graphs#30

Type of change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • 📖 Documentation (Addition or improvements to documentation)

Checklist before requesting a review

  • My branch is up-to-date with the target branch - if not update your fork with the changes from the target branch (use pull with --rebase option if possible).
  • I have performed a self-review of my code
  • For any new/modified functions/classes I have added docstrings that clearly describe its purpose, expected inputs and returned values
  • I have placed in-line comments to clarify the intent of any hard-to-understand passages of my code
  • I have updated the documentation to cover introduced code changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have given the PR a name that clearly describes the change, written in imperative form (context).
  • I have requested a reviewer and an assignee (assignee is responsible for merging)

Checklist for reviewers

Each PR comes with its own improvements and flaws. The reviewer should check the following:

  • the code is readable
  • the code is well tested
  • the code is documented (including return types and parameters)
  • the code is easy to maintain

Author checklist after completed review

  • I have added a line to the CHANGELOG describing this change, in a section
    reflecting type of change (add section where missing):
    • added: when you have added new functionality
    • changed: when default behaviour of the code has been changed
    • fixes: when your contribution fixes a bug

Checklist for assignee

  • PR is up to date with the base branch
  • the tests pass
  • author has added an entry to the changelog (and designated the change as added, changed or fixed)
  • Once the PR is ready to be merged, squash commits and merge the PR.

@joeloskarsson joeloskarsson added this to the v0.7.0 (proposed) milestone Dec 10, 2024
@leifdenby
Member Author

This is getting there. I have added some tests now so I can ensure the different steps of the algorithm work. Here's a nice plot of the convex-hull interior/exterior mask:

[figure: convex-hull interior/exterior mask]

Separating out the mask calculation will also make it possible to add a cropping method that just crops using inside/outside the convex hull in the future.

@TomasLandelius

TomasLandelius commented Dec 10, 2024 via email

@joeloskarsson
Contributor

This is looking great! What's your idea for how to specify the dataset to base the mask on (i.e. where to get the coordinates of the interior region from)? Would that be an already processed mdp zarr, or any arbitrary zarr? Should the path to that be specified in the mdp config file for the boundary dataset?

@leifdenby
Member Author

leifdenby commented Dec 12, 2024

This is looking great! What's your idea for how to specify the dataset to base the mask on (i.e. where to get the coordinates of the interior region from)? Would that be an already processed mdp zarr, or any arbitrary zarr? Should the path to that be specified in the mdp config file for the boundary dataset?

Thank you :)

What I have done so far is adapt the output part of the config to make it possible to point to the path of a config for a dataset that should be treated as the interior domain. Like so:

output:
  ...
  domain_cropping:
    margin_width_degrees: 0.2
    interior_dataset_config_path: example.danra.yaml

https://github.com/mllam/mllam-data-prep/pull/45/files#diff-33f36098ed98a22bbf317f860675cc821a3ccbb56e1650f087c8cc4b3b245038R30

I adapted @TomasLandelius's code so that one can define the angle in degrees instead of radians (since I think that is a bit more intuitive to reason about). And I've left room for other cropping methods to be defined later (for example, one could use the convex hull to make two overlapping domains). I am very open to comments on the changes above though :)

Also, the test for DANRA + weatherbench ERA5 currently fails because #33 isn't complete yet. I think it would be best to get that one in first. But other than that it is complete 🥳

Contributor


This file is great; however, it is mainly documentation of the implementation of domain-cropping, rather than documentation of how to do it in practice (for a user). It is still good to have, but I think it would be good to clarify in here that the user should not do something like what's in this file, but actually just specify a line in the config.

Member Author


Yes, good point. I've mostly left this in here because I was using it during development and I thought the examples might be useful later. Maybe calling it documentation isn't quite right. In general though it would be nice to have some documentation on how to use mllam_data_prep not from the command line but by importing the package. I haven't had a chance to write that yet, but this notebook was supposed to be a start along those lines.

Contributor


Also document the new config fields in the README.

Comment on lines +132 to +133
    # only consider points that are external to the convex hull
    ds_exterior = ds.where(~da_ch_mask, drop=True)
Contributor


Now this always removes points that are within the convex hull. I would really want the functionality to still keep these (overlapping regions). It seems that should not be a major step to implement as an option here? If it is not, I think we could try to do that in this PR as well. We just need to figure out a useful way to configure it in the yaml-file.

Member Author

@leifdenby leifdenby Dec 16, 2024


Yes, good point. A convenient way to do this might be to already implement a number of different "methods" of cropping, named something like:

  • convex_hull_margin: only includes points within a defined margin width of the convex hull
  • convex_hull_interior_and_margin: includes points within convex hull and within a defined margin width of the convex hull
  • convex_hull: only within the convex hull

Then one could refer to these names in the config. I would then create a config dataclass for each (say ConvexHullInteriorAndMargin etc), and there would be a separate function for each that calls the functions I have already written (with some refactoring to make the choice of masking flexible, addressing what you've identified above). A rough sketch of the idea is below.
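(For illustration, what those per-method config dataclasses could look like; the class names follow the suggestion above but are otherwise hypothetical, and the fields just mirror the example config:)

from dataclasses import dataclass

@dataclass
class ConvexHullMargin:
    # only points within margin_width_degrees of the convex hull
    margin_width_degrees: float
    interior_dataset_config_path: str

@dataclass
class ConvexHullInteriorAndMargin:
    # points inside the hull plus those within the margin
    margin_width_degrees: float
    interior_dataset_config_path: str

@dataclass
class ConvexHull:
    # only points inside the convex hull, no margin needed
    interior_dataset_config_path: str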

  end: 1990-09-09T00:00
  domain_cropping:
    margin_width_degrees: 0.2
    interior_dataset_config_path: example.danra.yaml
Contributor


Would it not be possible to have this (optionally) point to an already saved zarr, instead of a config file? If I understand correctly, this now rebuilds the whole reference xr.Dataset from the config given here, and uses that for the cropping. I understand that this doesn't actually download or save all the data again, but it still seems a bit unnecessary to redo all the Xarray operations, given that this would almost always be used after the reference data has already been processed and saved to disk as zarr.

Member Author

@leifdenby leifdenby Dec 16, 2024


Yes, in theory we could do that. I was a bit in doubt about the recomputing too. But until we have a way to save, in the produced dataset, the config the dataset was built from (something we have previously discussed, but I don't think it has been worked on yet; I've added an issue for this: #46), I wasn't too sure about doing this (it feels like cache invalidation, which is always a tricky thing to get right). How about I add functionality which checks for a zarr dataset named after the config (e.g. example.danra.zarr for example.danra.yaml) and, if it exists, uses that instead, but issues a warning to the user that we are not checking the contents? I can then remove that warning once we have functionality in place to check that the dataset on disk actually was made from the config file we were hoping for.
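(A minimal sketch of that check; the helper name and fallback are hypothetical, only the example.danra.yaml -> example.danra.zarr naming convention comes from the suggestion above:)

from pathlib import Path
import warnings

import xarray as xr

def open_interior_dataset(config_path):
    # e.g. example.danra.yaml -> example.danra.zarr
    zarr_path = Path(config_path).with_suffix(".zarr")
    if zarr_path.exists():
        warnings.warn(
            f"using existing {zarr_path} without checking that it was "
            f"actually produced from {config_path}"
        )
        return xr.open_zarr(zarr_path)
    # otherwise fall back to rebuilding the dataset from the config
    ...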

Comment on lines +22 to +24
        return (da.latitude, da.longitude)
    elif "lat" in da.coords and "lon" in da.coords:
        return (da.lat, da.lon)
Contributor


Better to return in longitude, latitude order to be consistent with the x, y ordering in the rest of the codebase.

Member Author


I am not sure we can do that. In some cases the latitude and longitude will not just be variables of a dataset; they can be coordinates themselves (as is the case for the weatherbench ERA5 dataset) and so they don't have a set of coordinates (like x and y). Said differently, the latitude and longitude values are 1D arrays that span the 2D spatial shape of the fields. As I am writing the code here I am trying to make it general so that it works independently of the underlying coordinates (as in xr.DataArray coords) that the latitude and longitude values are given on. What I have written now is a bit brittle because it relies on the variables containing the lat/lon values being named lat/lon or latitude/longitude. The better way would be to require the CF-convention attribute that indicates which variables contain the lat/lon values, but that is probably more work than we want to do right now.
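(A rough sketch of that CF-attribute based lookup, for illustration; matching on the standard_name attribute is the usual CF approach, but this helper is hypothetical and not part of the PR:)

import xarray as xr

def _find_coord_by_standard_name(ds: xr.Dataset, standard_name: str) -> xr.DataArray:
    # return the first coordinate whose CF standard_name attribute matches
    for coord in ds.coords.values():
        if coord.attrs.get("standard_name") == standard_name:
            return coord
    raise KeyError(f"no coordinate with standard_name == {standard_name!r}")

# e.g. da_lat = _find_coord_by_standard_name(ds, "latitude")
#      da_lon = _find_coord_by_standard_name(ds, "longitude")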

Contributor


I'm not sure what you mean above; I am just talking about the ordering of the return arguments.

Suggested change
-        return (da.latitude, da.longitude)
-    elif "lat" in da.coords and "lon" in da.coords:
-        return (da.lat, da.lon)
+        return (da.longitude, da.latitude)
+    elif "lat" in da.coords and "lon" in da.coords:
+        return (da.lon, da.lat)

(and swapping that ordering in later assignments that call this function)

Member Author


Ah I see :) I thought you meant the order of the coordinates within the xr.DataArrays that are returned. Ok yes, we can change that. I just thought it made sense to have that order since people say "lat/lon-values", but we can make it explicit that longitude is the first element of the tuple.

    xarray.DataArray
        A boolean mask indicating which points are interior to the convex hull.
    """
    da_lat, da_lon = _get_latlon_coords(ds)
Contributor


When testing this out I get an error here. Now the example ERA5 config renames latitude+longitude to y+x, which I think is needed to make the output work with neural-lam? It does however mean that there are no lat/lon in the coords of the dataset here.

Contributor


Diff for local changes I had to do to get it running. With this I got rid of the errors and was able to save an ERA5 zarr.

--- a/mllam_data_prep/ops/cropping.py
+++ b/mllam_data_prep/ops/cropping.py
@@ -22,6 +22,8 @@ def _get_latlon_coords(da: xr.DataArray) -> tuple:
         return (da.latitude, da.longitude)
     elif "lat" in da.coords and "lon" in da.coords:
         return (da.lat, da.lon)
+    elif "y" in da.coords and "x" in da.coords:
+        return (da.y, da.x)
     else:
         import ipdb
 
@@ -132,9 +134,9 @@ def distance_to_convex_hull_boundary(
     # only consider points that are external to the convex hull
     ds_exterior = ds.where(~da_ch_mask, drop=True)
 
-    da_xyz = _latlon_to_unit_sphere_xyz(ds_exterior.lat, ds_exterior.lon)
+    da_xyz = _latlon_to_unit_sphere_xyz(ds_exterior.y, ds_exterior.x)
     da_xyz_ref = _latlon_to_unit_sphere_xyz(
-        ds_reference_separate_gridindex.lat, ds_reference_separate_gridindex.lon
+        ds_reference_separate_gridindex.y, ds_reference_separate_gridindex.x
     )

Contributor


The above makes the code run with the config, but the distance computation then breaks. I can't figure out why, but I think it relates to a mixing of xy and latlon coordinates somewhere.

Member Author


Thanks! Assuming that the latitude and longitude values are available as .lat and .lon on the dataset was a mistake (I meant to wrap all access to the latitude and longitude coordinates with the _get_latlon_coords function, but missed this one; if I had implemented the testing earlier I would have caught this...).

The code in neural-lam actually sets the cartesian coordinates (or maybe they should be called "regular grid coordinates") when loading the dataset: https://github.com/mllam/neural-lam/blob/main/neural_lam/datastore/mdp.py#L108 and uses the config to work out the name of the coordinates that were stacked. So the coordinates that were stacked don't have to be renamed to x and y (that is the idea anyway)

Contributor


Note that in neural-lam the x and y coords are also explicitly referenced (e.g. https://github.com/mllam/neural-lam/blob/1a128266fd144eada9d8e74724402c6e0e84267d/neural_lam/datastore/mdp.py#L446), making them a requirement as is.

Comment on lines +140 to +151
    def calc_dist(da_pt_xyz):
        dotproduct = np.dot(da_xyz_ref, da_pt_xyz.T)
        val = np.min(np.arccos(dotproduct))
        return val

    da_mindist_to_ref = xr.apply_ufunc(
        calc_dist,
        da_xyz,
        input_core_dims=[["xyz"]],
        output_core_dims=[[]],
        vectorize=True,
    )
Contributor


I'm scared about the scaling of this, being $O(NM)$ (number of points in both sets). I understand that this can be parallelized, but we can do better algorithmically.

Once we have computed the convex hull, why don't we compute the distance to this polygon rather than to all points? This does not seem directly implemented in spherical_geometry, but maybe https://spherely.readthedocs.io/en/latest/_api_generated/spherely.distance.html#spherely.distance does this correctly?

Alternatively, if we can't measure the distance to the hull, can we do some kd-tree for this, e.g. with https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html + haversine distance?
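(A rough sketch of that kd-tree alternative, assuming plain numpy arrays of coordinates; sklearn's BallTree with the haversine metric expects (lat, lon) pairs in radians, and ref_lat, ext_lat etc. are illustrative names:)

import numpy as np
from sklearn.neighbors import BallTree

# interior (reference) and exterior (query) points, degrees -> radians
ref = np.deg2rad(np.column_stack([ref_lat, ref_lon]))
query = np.deg2rad(np.column_stack([ext_lat, ext_lon]))

tree = BallTree(ref, metric="haversine")
dist_rad, _ = tree.query(query, k=1)  # great-circle distances in radians
mask = dist_rad[:, 0] <= max_dist_rad  # keep points within the margin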


We can get the points defining the polygon from chull_lam_lon, chull_lam_lat = list(chull_lam.to_lonlat())[0]. Then we only need to compute the minimum distance from (a reasonable subset of) the points in the border to the arcs defined by the polygon segments of the hull. I think the minimum distance can be computed with something like:

import numpy as np

# helper assumed by the snippet below: convert lat/lon (degrees) to
# unit-sphere cartesian coordinates, shape (..., 3)
def lat_lon_to_cartesian(lat, lon):
    lat, lon = np.deg2rad(lat), np.deg2rad(lon)
    return np.stack(
        [np.cos(lat) * np.cos(lon), np.cos(lat) * np.sin(lon), np.sin(lat)],
        axis=-1,
    )

def shortest_distance_to_arc(P_lat, P_lon, A_lat, A_lon, B_lat, B_lon, radius=6371):
    P = lat_lon_to_cartesian(P_lat, P_lon)  # (N, 3) array of query points
    A = lat_lon_to_cartesian(A_lat, A_lon)  # arc start point, shape (3,)
    B = lat_lon_to_cartesian(B_lat, B_lon)  # arc end point, shape (3,)

    # Calculate normal vector to the plane of the great circle
    n = np.cross(A, B)
    n = n / np.linalg.norm(n)  # Normalize

    # Project P onto the plane
    P_proj = P - np.dot(P, n)[:, np.newaxis] * n

    # Normalize to get Q on the sphere's surface
    Q = P_proj / np.linalg.norm(P_proj, axis=1)[:, np.newaxis]

    # Calculate the angle between P and Q
    theta = np.arccos(np.clip(np.sum(P * Q, axis=1), -1, 1))

    # Check if Q is between A and B
    Q_between = (np.dot(np.cross(A, Q), n) >= 0) & (np.dot(np.cross(Q, B), n) >= 0)

    # Calculate distances PA and PB
    PA = np.arccos(np.clip(np.dot(P, A), -1, 1))
    PB = np.arccos(np.clip(np.dot(P, B), -1, 1))

    # Choose the appropriate distance
    distances = np.where(Q_between, theta, np.minimum(PA, PB))

    return radius * distances

Some technical admin is needed to loop over the border points and the segments and find the minimum (a sketch follows below). Worth the effort?
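(That loop might look something like this, reusing shortest_distance_to_arc from above; min_distance_to_hull is a hypothetical name:)

def min_distance_to_hull(P_lat, P_lon, hull_lat, hull_lon, radius=6371):
    # minimum over all hull segments of the distance from each border point
    dists = np.full(np.shape(P_lat), np.inf)
    n_vertices = len(hull_lat)
    for i in range(n_vertices):
        j = (i + 1) % n_vertices  # wrap around to close the polygon
        d = shortest_distance_to_arc(
            P_lat, P_lon,
            hull_lat[i], hull_lon[i],
            hull_lat[j], hull_lon[j],
            radius=radius,
        )
        dists = np.minimum(dists, d)
    return dists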

Member Author


Yes, agreed. It would be good to optimise this. I was aiming to create a working implementation with what we have first; a later PR could then do the extra work to optimise parts of it.

Member Author


I was thinking that I would take a quick look to see if I could get just the coordinates of the polygon (as you also suggested) and implement that here if it turns out to be easy.

Contributor


I think that just getting the coordinates of the points that make up the polygon and restricting the distance computation to those would be an optimization that's completely sufficient. Great if you can give that a try. If it ends up being too complicated I am also fine with just optimizing this later.

Comment on lines +256 to +260
    ds = crop_to_within_convex_hull_margin(
        ds=ds,
        ds_reference=ds_interior_domain,
        max_dist=domain_cropping.margin_width_degrees,
    )
Contributor


At the moment this seems to introduce a grid_index coordinate on the ds["splits"] variable that should not be there. This also causes errors downstream in neural-lam.

Contributor


Digging a bit more here, I am not sure if it actually is the cropping that causes this. It is still an issue, but I can't figure out what causes it.

Member Author


I will look into it
