Add support for cropping domain with padded lat/lon convex hull #45
Conversation
This is getting there. I've added some tests now so I can ensure the different steps of the algorithm work. Here's a nice plot of the convex-hull interior/exterior mask: [plot: convex-hull interior/exterior mask]

Separating out the mask calculation will also make it possible to add a cropping method that just crops using inside/outside the convex hull in the future.
Looks great!
/Tomas
This is looking great! What's your idea for how to specify the dataset to base the mask on (i.e. where to get the coordinates of the interior region from)? Would that be an already processed mdp zarr, or any arbitrary zarr? Should the path to that be specified in the mdp config file for the boundary dataset?
Thank you :) What I have done so far is adapt the output config:

...
domain_cropping:
  margin_width_degrees: 0.2
  interior_dataset_config_path: example.danra.yaml

I adapted @TomasLandelius's code so that one can define the angle in degrees instead of radians (since I think that is a bit more intuitive to reason about). And I've left space for the possibility of other cropping methods to be defined later (for example, one could use the convex hull to make two overlapping domains). I am very open to comments on the changes above though :) Also, the test for DANRA + weatherbench ERA5 currently fails because #33 isn't complete yet. I think it would be good to get that one in first. But other than that it is complete 🥳
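(Just for context, a minimal sketch of the conversion this implies; this is an assumption about the internals rather than a quote of the PR code:)

import numpy as np

# hypothetical illustration: the margin configured in degrees would be
# converted to radians before comparing against great-circle distances
margin_width_radians = np.deg2rad(0.2)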
docs/domain-cropping.ipynb
This file is great; however, it is mainly documentation of the implementation of domain-cropping, rather than documentation of how to do it in practice (for a user). It is still good to have, but I think it might be good to clarify in here that the user should not do something like what's in the file, but should actually just specify a line in the config.
Yes, good point. I've mostly left this in here because I was using it during development and I thought the examples might be useful later. Maybe calling it documentation isn't quite right. In general though it would be nice to have some documentation on how to use mllam_data_prep not from the command line but by importing the package. I haven't had a chance to write that yet, but this notebook was supposed to be a start along those lines.
docs/domain-cropping.ipynb
Document also the new config fields in the README
# only consider points that are external to the convex hull
ds_exterior = ds.where(~da_ch_mask, drop=True)
Now this always removes points that are within the convex hull. I would really want the functionality to still keep these (overlapping regions). It seems that this should not be a major step to implement as an option here? If it's not, I think we could try to do that in this PR as well. We just need to figure out a useful way to configure that in the yaml file.
Yes, good point. A convenient way to do this might be to implement a number of different "methods" of cropping already, named something like:

convex_hull_margin: only includes points within a defined margin width of the convex hull
convex_hull_interior_and_margin: includes points within the convex hull and within a defined margin width of it
convex_hull: only includes points within the convex hull

Then one could refer to these names in the config. I would then create a config dataclass for each (say ConvexHullInteriorAndMargin etc.; see the sketch below) and there would be a separate function for each that calls the functions I have already written (with some refactoring to make the choice of masking flexible in the way you've identified above).
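A minimal sketch of what those config dataclasses could look like (names and fields are illustrative, not the final API):

from dataclasses import dataclass
from typing import Union

@dataclass
class ConvexHull:
    """Keep only points inside the convex hull."""

@dataclass
class ConvexHullMargin:
    """Keep only points within margin_width_degrees of the hull boundary."""
    margin_width_degrees: float

@dataclass
class ConvexHullInteriorAndMargin:
    """Keep points inside the hull plus those within the margin."""
    margin_width_degrees: float

# the config would then select one of these by method name
CroppingMethod = Union[ConvexHull, ConvexHullMargin, ConvexHullInteriorAndMargin]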
end: 1990-09-09T00:00
domain_cropping:
  margin_width_degrees: 0.2
  interior_dataset_config_path: example.danra.yaml
Would it not be possible to have this (optionally) point to an already saved zarr, instead of a config file? If I understand correctly, this now rebuilds the whole reference xr.Dataset from the config given here, and uses that for the cropping. I understand that this doesn't actually download or save all the data again, but it still seems a bit unnecessary to do all the Xarray operations again, given that this would almost always be used after the reference data has already been processed and saved to disk as zarr.
Yes, in theory we could do that. I was a bit in doubt too about recomputing. But until we have a way to save in the produced dataset the config it was built from (which is something we have previously discussed, but I don't think it has been worked on yet; I've added an issue for this: #46), I wasn't too sure about doing this (it feels like cache invalidation, which is always a tricky thing to get right). How about I add functionality which checks for a zarr dataset named after the config (e.g. example.danra.zarr for example.danra.yaml) and, if it exists, uses that instead but issues a warning to the user that we are not checking the contents? I can then remove that warning once we have functionality in place to check that the dataset on disk actually was made from the config file that we were hoping for.
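A rough sketch of what that lookup could be (create_dataset_from_config here is a hypothetical stand-in for the existing dataset-creation entrypoint, not an existing function name):

from pathlib import Path
import warnings
import xarray as xr

def _load_interior_dataset(config_path: str) -> xr.Dataset:
    # reuse e.g. example.danra.zarr if it sits next to example.danra.yaml
    zarr_path = Path(config_path).with_suffix(".zarr")
    if zarr_path.exists():
        warnings.warn(
            f"using cached {zarr_path}; not checking that it matches {config_path}"
        )
        return xr.open_zarr(zarr_path)
    # otherwise rebuild the reference dataset from the config as before
    return create_dataset_from_config(config_path)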
    return (da.latitude, da.longitude)
elif "lat" in da.coords and "lon" in da.coords:
    return (da.lat, da.lon)
Better to return longitude, latitude ordering to be consistent with the x, y ordering of the rest of the codebase.
I am not sure we can do that. In some cases the latitude and longitude will not just be variables of a dataset, but can be coordinates themselves (as is the case for the weatherbench ERA5 dataset), and so they don't have a set of coordinates of their own (like x and y). Or said differently, the latitude and longitude values are 1D arrays that span the 2D spatial shape of the fields. As I am writing the code here I am trying to make it general so that it would work independently of the underlying coordinates (as in xr.DataArray coords) that the latitude and longitude values are given on. What I have written now is a bit brittle because it relies on the variables containing the lat/lon values being named lat/lon or latitude/longitude. The better way would be to require the cf-convention attribute that is used for indicating which variables contain the lat/lon values, but that is probably more work than we want to do right now.
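For reference, a minimal sketch of what that cf-convention based lookup could look like (illustrative only, not part of this PR):

import xarray as xr

def _get_latlon_coords_cf(ds: xr.Dataset) -> tuple:
    # find lat/lon via their CF standard_name attributes rather than by name
    lat = lon = None
    for coord in ds.coords.values():
        standard_name = coord.attrs.get("standard_name")
        if standard_name == "latitude":
            lat = coord
        elif standard_name == "longitude":
            lon = coord
    if lat is None or lon is None:
        raise ValueError("no CF latitude/longitude coordinates found")
    return (lat, lon)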
I'm not sure what you mean above, I am just talking about the ordering of return arguments
Suggested change:

-    return (da.latitude, da.longitude)
+    return (da.longitude, da.latitude)
 elif "lat" in da.coords and "lon" in da.coords:
-    return (da.lat, da.lon)
+    return (da.lon, da.lat)
(and swapping that ordering in later assignments that call this function)
Ah, I see :) I thought you meant the order of the coordinates within the xr.DataArrays that are returned. Ok yes, we can change that; I just thought it made sense to have that order since people say "lat/lon-values", but we can make it explicit that the longitude values are the first element of the tuple.
    xarray.DataArray
        A boolean mask indicating which points are interior to the convex hull.
    """
    da_lat, da_lon = _get_latlon_coords(ds)
When testing this out I get an error here. Now the example ERA5 config renames latitude+longitude to y+x, which I think is needed to make the output work with neural-lam? It does however mean that there are no lat/lon in the coords of the dataset here.
Diff for local changes I had to do to get it running. With this I got rid of the errors and was able to save an ERA5 zarr.
--- a/mllam_data_prep/ops/cropping.py
+++ b/mllam_data_prep/ops/cropping.py
@@ -22,6 +22,8 @@ def _get_latlon_coords(da: xr.DataArray) -> tuple:
         return (da.latitude, da.longitude)
     elif "lat" in da.coords and "lon" in da.coords:
         return (da.lat, da.lon)
+    elif "y" in da.coords and "x" in da.coords:
+        return (da.y, da.x)
     else:
         import ipdb
@@ -132,9 +134,9 @@ def distance_to_convex_hull_boundary(
     # only consider points that are external to the convex hull
     ds_exterior = ds.where(~da_ch_mask, drop=True)
-    da_xyz = _latlon_to_unit_sphere_xyz(ds_exterior.lat, ds_exterior.lon)
+    da_xyz = _latlon_to_unit_sphere_xyz(ds_exterior.y, ds_exterior.x)
     da_xyz_ref = _latlon_to_unit_sphere_xyz(
-        ds_reference_separate_gridindex.lat, ds_reference_separate_gridindex.lon
+        ds_reference_separate_gridindex.y, ds_reference_separate_gridindex.x
     )
The above makes the code run with the config, but the distance computation then breaks. I can't figure out why, but I think it relates to a mixing of x/y and lat/lon coordinates somewhere.
Thanks! Assuming that the latitude and longitude values are available as .lat and .lon on the dataset was a mistake (I meant to wrap all access to the latitude and longitude coordinates with the _get_latlon_coords function, but missed this one; if I had implemented the testing earlier I would have caught this...)

The code in neural-lam actually sets the cartesian coordinates (or maybe they should be called "regular grid coordinates") when loading the dataset (https://github.com/mllam/neural-lam/blob/main/neural_lam/datastore/mdp.py#L108) and uses the config to work out the names of the coordinates that were stacked. So the coordinates that were stacked don't have to be renamed to x and y (that is the idea anyway).
Note that in neural-lam the x and y coords are also explicitly referenced (e.g. https://github.com/mllam/neural-lam/blob/1a128266fd144eada9d8e74724402c6e0e84267d/neural_lam/datastore/mdp.py#L446), making them a requirement as is.
def calc_dist(da_pt_xyz):
    dotproduct = np.dot(da_xyz_ref, da_pt_xyz.T)
    val = np.min(np.arccos(dotproduct))
    return val

da_mindist_to_ref = xr.apply_ufunc(
    calc_dist,
    da_xyz,
    input_core_dims=[["xyz"]],
    output_core_dims=[[]],
    vectorize=True,
)
I'm scared about the scaling of this, being (as far as I can tell) a pairwise distance computation between every exterior point and every reference point. Once we have computed the convex hull, why don't we compute the distance to this polygon rather than to all points? This does not seem to be directly implemented in spherical_geometry, but maybe https://spherely.readthedocs.io/en/latest/_api_generated/spherely.distance.html#spherely.distance does this correctly?

Alternatively, if we can't measure the distance to the hull, can we do some kd-tree for this, e.g. with https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html + haversine distance?
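If we go the kd-tree route, a rough sketch of what that could look like (a minimal illustration, not this PR's code; it uses scikit-learn's BallTree, which backs NearestNeighbors for the haversine metric, and assumes 1D lat/lon arrays in degrees):

import numpy as np
from sklearn.neighbors import BallTree

def min_haversine_distances(lat_query, lon_query, lat_ref, lon_ref):
    # BallTree with the haversine metric expects (lat, lon) in radians
    ref = np.deg2rad(np.column_stack([lat_ref, lon_ref]))
    query = np.deg2rad(np.column_stack([lat_query, lon_query]))
    tree = BallTree(ref, metric="haversine")
    # distances come back in radians (the angle subtended at Earth's centre)
    dist, _ = tree.query(query, k=1)
    return dist[:, 0]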
We can get the points defining the polygon from chull_lam_lon, chull_lam_lat = list(chull_lam.to_lonlat())[0]. Then we only need to compute the minimum distance from (a reasonable subset of) the points in the border to the arcs defined by the polygon segments of the hull. I think the minimum distance can be computed with something like:
import numpy as np

# assumes a lat_lon_to_cartesian helper that maps lat/lon to unit-sphere xyz;
# P may be an (N, 3) array of points, A and B are single arc endpoints
def shortest_distance_to_arc(P_lat, P_lon, A_lat, A_lon, B_lat, B_lon, radius=6371):
    P = lat_lon_to_cartesian(P_lat, P_lon)
    A = lat_lon_to_cartesian(A_lat, A_lon)
    B = lat_lon_to_cartesian(B_lat, B_lon)
    # Calculate normal vector to the plane of the great circle
    n = np.cross(A, B)
    n = n / np.linalg.norm(n)  # Normalize
    # Project P onto the plane
    P_proj = P - np.dot(P, n)[:, np.newaxis] * n
    # Normalize to get Q on the sphere's surface
    Q = P_proj / np.linalg.norm(P_proj, axis=1)[:, np.newaxis]
    # Calculate the angle between P and Q
    theta = np.arccos(np.clip(np.sum(P * Q, axis=1), -1, 1))
    # Check if Q is between A and B
    Q_between = (np.dot(np.cross(A, Q), n) >= 0) & (np.dot(np.cross(Q, B), n) >= 0)
    # Calculate distances PA and PB
    PA = np.arccos(np.clip(np.dot(P, A), -1, 1))
    PB = np.arccos(np.clip(np.dot(P, B), -1, 1))
    # Choose the appropriate distance
    distances = np.where(Q_between, theta, np.minimum(PA, PB))
    return radius * distances
Some technical admin is needed to loop over the border points and the segments and find the minimum (see the sketch below). Worth the effort?
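A rough sketch of that loop, assuming shortest_distance_to_arc as above and the hull vertices as 1D arrays (illustrative only):

import numpy as np

def min_distance_to_hull(P_lat, P_lon, hull_lat, hull_lon, radius=6371):
    # track the running minimum distance of each border point to any segment
    dists = np.full(len(P_lat), np.inf)
    n_vertices = len(hull_lat)
    for i in range(n_vertices):
        j = (i + 1) % n_vertices  # wrap around to close the polygon
        d = shortest_distance_to_arc(
            P_lat, P_lon, hull_lat[i], hull_lon[i], hull_lat[j], hull_lon[j], radius
        )
        dists = np.minimum(dists, d)
    return dists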
Yes, agreed. It would be good to optimise this. I was going for creating a working implementation with what we have first, and then a later PR could do the extra work to optimise parts of this.
I was thinking that I would take a quick look to see if I could get just the coordinates of the polygon though (as you also suggested) and implement that here if it turns out to be easy.
I think that just getting the coordinates of the points that make up the polygon and restricting the distance computation to those would be an optimization that's completely sufficient. Great if you can give that a try. If it ends up being too complicated I am also fine with just optimizing this later.
ds = crop_to_within_convex_hull_margin(
    ds=ds,
    ds_reference=ds_interior_domain,
    max_dist=domain_cropping.margin_width_degrees,
)
At the moment this seems to introduce a grid_index coordinate for the ds["splits"] variable, which should not be there. This also causes errors downstream in neural-lam.
Digging a bit more here I am not sure if it actually is the cropping that causes this. It still is an issue, but I can't figure out what causes it.
I will look into it
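In the meantime, a minimal sketch of a possible workaround, assuming the grid_index coordinate really is unused on the splits variable (hypothetical, not a confirmed fix):

# drop the stray coordinate from the splits variable before writing the zarr
ds["splits"] = ds["splits"].drop_vars("grid_index", errors="ignore")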
Describe your changes
The aim of this PR is to add support for cropping one dataset using the convex hull of lat/lon coordinates of grid-points in another dataset.
This is motivated by the need, when doing LAM (limited area) based forecasting, to be able to define forcing input in a padding region around the limited area. For example, when training on the DANRA reanalysis one might want to force on the boundary with the ERA5 reanalysis. This work builds on @TomasLandelius's proof-of-concept implementation (mllam/weather-model-graphs#30), which takes the convex hull of the lat/lon coordinates of grid-points in the limited area (e.g. DANRA) domain, splits the points in the boundary domain by whether they are interior or exterior to this convex hull (again using the lat/lon coordinates), and defines points in the boundary region as those that are within a given distance (in radians from Earth's center) of the closest point within the convex hull.
So far I have taken the implementation from @TomasLandelius and structured it into a notebook (using DANRA and ERA5 zarr datasets in cloud storage).
Below I have made an example where ERA5 sample points are selected that are within 0.2 radians of the convex hull of grid-points in DANRA: [plot: selected ERA5 points around the DANRA convex hull]
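A hedged usage sketch based on the call shown in the review thread above (the zarr paths are placeholders, and the import location is taken from the diff in this PR):

import xarray as xr
from mllam_data_prep.ops.cropping import crop_to_within_convex_hull_margin

ds_interior = xr.open_zarr("danra.zarr")  # limited-area (interior) dataset
ds_boundary = xr.open_zarr("era5.zarr")   # dataset to crop
ds_cropped = crop_to_within_convex_hull_margin(
    ds=ds_boundary,
    ds_reference=ds_interior,
    max_dist=0.2,  # margin width; in radians in the original proof-of-concept
)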
For the lat/lon convex hull calculations the spherical-geometry package is required. For plotting, of course, matplotlib and cartopy. I am thinking of putting the visualisation-related packages in an optional [visualisation] extras requirement, but I am not sure if it makes sense to always require the spherical-geometry package.

Issue Link
Implements mllam/weather-model-graphs#30