Add support for cropping domain with padded lat/lon convex hull #45
Conversation
This is getting there. I've added some tests now so I can ensure the different steps of the algorithm work. Here's a nice plot of the convex-hull interior/exterior mask: [plot: convex-hull interior/exterior mask]

Separating out the mask calculation will also make it possible to add a cropping method that just crops using inside/outside the convex hull in the future.
Looks great!
/Tomas
This is looking great! What's your idea for how to specify the dataset to base the mask on (i.e. where to get the coordinates of the interior region from)? Would that be an already processed mdp zarr, or any arbitrary zarr? Should the path to that be specified in the mdp config file for the boundary dataset?
Thank you :) What I have done so far is adapt the output config:

...
domain_cropping:
  margin_width_degrees: 0.2
  interior_dataset_config_path: example.danra.yaml

I adapted @TomasLandelius's code so that one can define the angle in degrees instead of radians (since I think that is a bit more intuitive to reason about). And I've left space for the possibility of other cropping methods to be defined later (for example, one could use the convex hull to make two overlapping domains). I am very open to comments on the changes above though :) Also, the test for DANRA + weatherbench ERA5 currently fails because #33 isn't complete yet. I think it would be good to get that one in first. But other than that it is complete 🥳
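(Just for context, a minimal sketch of the conversion this implies; this is an assumption about the internals rather than a quote of the PR code:)

import numpy as np

# hypothetical illustration: the margin configured in degrees would be
# converted to radians before comparing against great-circle distances
margin_width_radians = np.deg2rad(0.2)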
docs/domain-cropping.ipynb
This file is great; however, it is mainly documentation of the implementation of domain-cropping, rather than documentation of how to do it in practice (for a user). It is still good to have, but I think it might be good to clarify in here that the user should not do something like what's in the file, but should actually just specify a line in the config.
Yes, good point. I've mostly left this in here because I was using it during development and I thought the examples might be useful later. Maybe calling it documentation isn't quite right. In general though it would be nice to have some documentation on how to use mllam_data_prep not from the command line but by importing the package. I haven't had a chance to write that yet, but this notebook was supposed to be a start along those lines.
docs/domain-cropping.ipynb
Document also the new config fields in the README
# only consider points that are external to the convex hull
ds_exterior = ds.where(~da_ch_mask, drop=True)
Now this always removes points that are within the convex hull. I would really want the functionality to still keep these (overlapping regions). It seems that this should not be a major step to implement as an option here? If it's not, I think we could try to do that in this PR as well. We just need to figure out a useful way to configure that in the yaml file.
Yes, good point. A convenient way to do this might be to implement a number of different "methods" of cropping already, named something like:

convex_hull_margin: only includes points within a defined margin width of the convex hull
convex_hull_interior_and_margin: includes points within the convex hull and within a defined margin width of it
convex_hull: only includes points within the convex hull

Then one could refer to these names in the config. I would then create a config dataclass for each (say ConvexHullInteriorAndMargin etc.; see the sketch below) and there would be a separate function for each that calls the functions I have already written (with some refactoring to make the choice of masking flexible in the way you've identified above).
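A minimal sketch of what those config dataclasses could look like (names and fields are illustrative, not the final API):

from dataclasses import dataclass
from typing import Union

@dataclass
class ConvexHull:
    """Keep only points inside the convex hull."""

@dataclass
class ConvexHullMargin:
    """Keep only points within margin_width_degrees of the hull boundary."""
    margin_width_degrees: float

@dataclass
class ConvexHullInteriorAndMargin:
    """Keep points inside the hull plus those within the margin."""
    margin_width_degrees: float

# the config would then select one of these by method name
CroppingMethod = Union[ConvexHull, ConvexHullMargin, ConvexHullInteriorAndMargin]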
end: 1990-09-09T00:00
domain_cropping:
  margin_width_degrees: 0.2
  interior_dataset_config_path: example.danra.yaml
Would it not be possible to have this (optionally) point to an already saved zarr, instead of a config file? If I understand correctly, this now rebuilds the whole reference xr.Dataset from the config given here, and uses that for the cropping. I understand that this doesn't actually download or save all the data again, but it still seems a bit unnecessary to do all the Xarray operations again, given that this would almost always be used after the reference data has already been processed and saved to disk as zarr.
Yes, in theory we could do that. I was a bit in doubt too about recomputing. But until we have a way to save in the produced dataset the config it was built from (which is something we have previously discussed, but I don't think it has been worked on yet; I've added an issue for this: #46), I wasn't too sure about doing this (it feels like cache invalidation, which is always a tricky thing to get right). How about I add functionality which checks for a zarr dataset named after the config (e.g. example.danra.zarr for example.danra.yaml) and, if it exists, uses that instead but issues a warning to the user that we are not checking the contents? I can then remove that warning once we have functionality in place to check that the dataset on disk actually was made from the config file that we were hoping for.
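A rough sketch of what that lookup could be (create_dataset_from_config here is a hypothetical stand-in for the existing dataset-creation entrypoint, not an existing function name):

from pathlib import Path
import warnings
import xarray as xr

def _load_interior_dataset(config_path: str) -> xr.Dataset:
    # reuse e.g. example.danra.zarr if it sits next to example.danra.yaml
    zarr_path = Path(config_path).with_suffix(".zarr")
    if zarr_path.exists():
        warnings.warn(
            f"using cached {zarr_path}; not checking that it matches {config_path}"
        )
        return xr.open_zarr(zarr_path)
    # otherwise rebuild the reference dataset from the config as before
    return create_dataset_from_config(config_path)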
    return (da.latitude, da.longitude)
elif "lat" in da.coords and "lon" in da.coords:
    return (da.lat, da.lon)
Better to return longitude, latitude ordering to be consistent with the x, y ordering of the rest of the codebase.
I am not sure we can do that. In some cases the latitude and longitude will not just be variables of a dataset, but can be coordinates themselves (as is the case for the weatherbench ERA5 dataset), and so they don't have a set of coordinates of their own (like x and y). Or said differently, the latitude and longitude values are 1D arrays that span the 2D spatial shape of the fields. As I am writing the code here I am trying to make it general so that it would work independently of the underlying coordinates (as in xr.DataArray coords) that the latitude and longitude values are given on. What I have written now is a bit brittle because it relies on the variables containing the lat/lon values being named lat/lon or latitude/longitude. The better way would be to require the cf-convention attribute that is used for indicating which variables contain the lat/lon values, but that is probably more work than we want to do right now.
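For reference, a minimal sketch of what that cf-convention based lookup could look like (illustrative only, not part of this PR):

import xarray as xr

def _get_latlon_coords_cf(ds: xr.Dataset) -> tuple:
    # find lat/lon via their CF standard_name attributes rather than by name
    lat = lon = None
    for coord in ds.coords.values():
        standard_name = coord.attrs.get("standard_name")
        if standard_name == "latitude":
            lat = coord
        elif standard_name == "longitude":
            lon = coord
    if lat is None or lon is None:
        raise ValueError("no CF latitude/longitude coordinates found")
    return (lat, lon)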
I'm not sure what you mean above, I am just talking about the ordering of return arguments
Suggested change:

-    return (da.latitude, da.longitude)
+    return (da.longitude, da.latitude)
 elif "lat" in da.coords and "lon" in da.coords:
-    return (da.lat, da.lon)
+    return (da.lon, da.lat)
(and swapping that ordering in later assignments that call this function)
Ah, I see :) I thought you meant the order of the coordinates within the xr.DataArrays that are returned. Ok yes, we can change that; I just thought it made sense to have that order since people say "lat/lon-values", but we can make it explicit that the longitude values are the first element of the tuple.
    xarray.DataArray
        A boolean mask indicating which points are interior to the convex hull.
    """
    da_lat, da_lon = _get_latlon_coords(ds)
When testing this out I get an error here. Now the example ERA5 config renames latitude+longitude to y+x, which I think is needed to make the output work with neural-lam? It does however mean that there are no lat/lon in the coords of the dataset here.
Diff for local changes I had to do to get it running. With this I got rid of the errors and was able to save an ERA5 zarr.
--- a/mllam_data_prep/ops/cropping.py
+++ b/mllam_data_prep/ops/cropping.py
@@ -22,6 +22,8 @@ def _get_latlon_coords(da: xr.DataArray) -> tuple:
         return (da.latitude, da.longitude)
     elif "lat" in da.coords and "lon" in da.coords:
         return (da.lat, da.lon)
+    elif "y" in da.coords and "x" in da.coords:
+        return (da.y, da.x)
     else:
         import ipdb
@@ -132,9 +134,9 @@ def distance_to_convex_hull_boundary(
     # only consider points that are external to the convex hull
     ds_exterior = ds.where(~da_ch_mask, drop=True)
-    da_xyz = _latlon_to_unit_sphere_xyz(ds_exterior.lat, ds_exterior.lon)
+    da_xyz = _latlon_to_unit_sphere_xyz(ds_exterior.y, ds_exterior.x)
     da_xyz_ref = _latlon_to_unit_sphere_xyz(
-        ds_reference_separate_gridindex.lat, ds_reference_separate_gridindex.lon
+        ds_reference_separate_gridindex.y, ds_reference_separate_gridindex.x
     )
The above makes the code run with the config, but the distance computation then breaks. I can't figure out why, but I think it relates to a mixing of x/y and lat/lon coordinates somewhere.
Thanks! Assuming that the latitude and longitude values are available as .lat and .lon on the dataset was a mistake (I meant to wrap all access to the latitude and longitude coordinates with the _get_latlon_coords function, but missed this one; if I had implemented the testing earlier I would have caught this...)

The code in neural-lam actually sets the cartesian coordinates (or maybe they should be called "regular grid coordinates") when loading the dataset (https://github.com/mllam/neural-lam/blob/main/neural_lam/datastore/mdp.py#L108) and uses the config to work out the names of the coordinates that were stacked. So the coordinates that were stacked don't have to be renamed to x and y (that is the idea anyway).
Note that in neural-lam the x and y coords are also explicitly referenced (e.g. https://github.com/mllam/neural-lam/blob/1a128266fd144eada9d8e74724402c6e0e84267d/neural_lam/datastore/mdp.py#L446), making them a requirement as is.
def calc_dist(da_pt_xyz):
    dotproduct = np.dot(da_xyz_ref, da_pt_xyz.T)
    val = np.min(np.arccos(dotproduct))
    return val

da_mindist_to_ref = xr.apply_ufunc(
    calc_dist,
    da_xyz,
    input_core_dims=[["xyz"]],
    output_core_dims=[[]],
    vectorize=True,
)
I'm scared about the scaling of this, being (as far as I can tell) a pairwise distance computation between every exterior point and every reference point. Once we have computed the convex hull, why don't we compute the distance to this polygon rather than to all points? This does not seem to be directly implemented in spherical_geometry, but maybe https://spherely.readthedocs.io/en/latest/_api_generated/spherely.distance.html#spherely.distance does this correctly?

Alternatively, if we can't measure the distance to the hull, can we do some kd-tree for this, e.g. with https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html + haversine distance?
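If we go the kd-tree route, a rough sketch of what that could look like (a minimal illustration, not this PR's code; it uses scikit-learn's BallTree, which backs NearestNeighbors for the haversine metric, and assumes 1D lat/lon arrays in degrees):

import numpy as np
from sklearn.neighbors import BallTree

def min_haversine_distances(lat_query, lon_query, lat_ref, lon_ref):
    # BallTree with the haversine metric expects (lat, lon) in radians
    ref = np.deg2rad(np.column_stack([lat_ref, lon_ref]))
    query = np.deg2rad(np.column_stack([lat_query, lon_query]))
    tree = BallTree(ref, metric="haversine")
    # distances come back in radians (the angle subtended at Earth's centre)
    dist, _ = tree.query(query, k=1)
    return dist[:, 0]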
We can get the points defining the polygon from chull_lam_lon, chull_lam_lat = list(chull_lam.to_lonlat())[0]. Then we only need to compute the minimum distance from (a reasonable subset of) the points in the border to the arcs defined by the polygon segments of the hull. I think the minimum distance can be computed with something like:
import numpy as np

# assumes a lat_lon_to_cartesian helper that maps lat/lon to unit-sphere xyz;
# P may be an (N, 3) array of points, A and B are single arc endpoints
def shortest_distance_to_arc(P_lat, P_lon, A_lat, A_lon, B_lat, B_lon, radius=6371):
    P = lat_lon_to_cartesian(P_lat, P_lon)
    A = lat_lon_to_cartesian(A_lat, A_lon)
    B = lat_lon_to_cartesian(B_lat, B_lon)
    # Calculate normal vector to the plane of the great circle
    n = np.cross(A, B)
    n = n / np.linalg.norm(n)  # Normalize
    # Project P onto the plane
    P_proj = P - np.dot(P, n)[:, np.newaxis] * n
    # Normalize to get Q on the sphere's surface
    Q = P_proj / np.linalg.norm(P_proj, axis=1)[:, np.newaxis]
    # Calculate the angle between P and Q
    theta = np.arccos(np.clip(np.sum(P * Q, axis=1), -1, 1))
    # Check if Q is between A and B
    Q_between = (np.dot(np.cross(A, Q), n) >= 0) & (np.dot(np.cross(Q, B), n) >= 0)
    # Calculate distances PA and PB
    PA = np.arccos(np.clip(np.dot(P, A), -1, 1))
    PB = np.arccos(np.clip(np.dot(P, B), -1, 1))
    # Choose the appropriate distance
    distances = np.where(Q_between, theta, np.minimum(PA, PB))
    return radius * distances
Some technical admin is needed to loop over the border points and the segments and find the minimum (see the sketch below). Worth the effort?
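A rough sketch of that loop, assuming shortest_distance_to_arc as above and the hull vertices as 1D arrays (illustrative only):

import numpy as np

def min_distance_to_hull(P_lat, P_lon, hull_lat, hull_lon, radius=6371):
    # track the running minimum distance of each border point to any segment
    dists = np.full(len(P_lat), np.inf)
    n_vertices = len(hull_lat)
    for i in range(n_vertices):
        j = (i + 1) % n_vertices  # wrap around to close the polygon
        d = shortest_distance_to_arc(
            P_lat, P_lon, hull_lat[i], hull_lon[i], hull_lat[j], hull_lon[j], radius
        )
        dists = np.minimum(dists, d)
    return dists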
Yes, agreed. It would be good to optimise this. I was going for creating a working implementation with what we have first, and then a later PR could do the extra work to optimise parts of this.
I was thinking that I would take a quick look to see if I could get just the coordinates of the polygon though (as you also suggested) and implement that here if it turns out to be easy.
I think that just getting the coordinates of the points that make up the polygon and restricting the distance computation to those would be an optimization that's completely sufficient. Great if you can give that a try. If it ends up being too complicated I am also fine with just optimizing this later.
ds = crop_to_within_convex_hull_margin(
    ds=ds,
    ds_reference=ds_interior_domain,
    max_dist=domain_cropping.margin_width_degrees,
)
At the moment this seems to introduce a grid_index coordinate for the ds["splits"] variable, which should not be there. This also causes errors downstream in neural-lam.
Digging a bit more here I am not sure if it actually is the cropping that causes this. It still is an issue, but I can't figure out what causes it.
I will look into it
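In the meantime, a minimal sketch of a possible workaround, assuming the grid_index coordinate really is unused on the splits variable (hypothetical, not a confirmed fix):

# drop the stray coordinate from the splits variable before writing the zarr
ds["splits"] = ds["splits"].drop_vars("grid_index", errors="ignore")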
Describe your changes
The aim of this PR is to add support for cropping one dataset using the convex hull of lat/lon coordinates of grid-points in another dataset.
This is motivated by the need, when doing LAM (limited area) based forecasting, to be able to define forcing input in a padding region around the limited area. For example, when training on the DANRA reanalysis one might want to force on the boundary with the ERA5 reanalysis. This work builds on @TomasLandelius's proof-of-concept implementation (mllam/weather-model-graphs#30), which takes the convex hull of the lat/lon coordinates of grid-points in the limited area (e.g. DANRA) domain, splits the points in the boundary domain by whether they are interior or exterior to this convex hull (again using the lat/lon coordinates), and defines points in the boundary region as those that are within a given distance (in radians from Earth's center) of the closest point within the convex hull.
So far I have taken the implementation from @TomasLandelius and structured it into a notebook (using DANRA and ERA5 zarr datasets in cloud storage).
Below I have made an example where ERA5 sample points are selected that are within 0.2 radians of the convex hull of grid-points in DANRA: [plot: selected ERA5 points around the DANRA convex hull]
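A hedged usage sketch based on the call shown in the review thread above (the zarr paths are placeholders, and the import location is taken from the diff in this PR):

import xarray as xr
from mllam_data_prep.ops.cropping import crop_to_within_convex_hull_margin

ds_interior = xr.open_zarr("danra.zarr")  # limited-area (interior) dataset
ds_boundary = xr.open_zarr("era5.zarr")   # dataset to crop
ds_cropped = crop_to_within_convex_hull_margin(
    ds=ds_boundary,
    ds_reference=ds_interior,
    max_dist=0.2,  # margin width; in radians in the original proof-of-concept
)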
For the lat/lon convex hull calculations the spherical-geometry package is required. For plotting, of course, matplotlib and cartopy. I am thinking of putting the visualisation-related packages in an optional [visualisation] extras requirement, but I am not sure if it makes sense to always require the spherical-geometry package.

Issue Link
Implements mllam/weather-model-graphs#30