Set default value for `n_closest` for finding nearest neighbors with `get_closest_points_to_point` #51

JochenSeidel · 2024-05-13T13:17:25Z

`n_closest` is a required input for the `plg.spatial.get_closest_points_to_point` function. I would suggest making it optional, since the number of neighbouring stations might be irrelevant for some calculations, e.g. the indicator correlation. Alternatively, the default could be set to a very high number...

PS: I wanted to label this as "discussion" but I don't know how that works....

cchwala · 2024-05-14T07:39:13Z

The underlying function `scipy.spatial.KDTree.query` takes the number of neighbors as `k`, which defaults to `k=1`.

If we make `n_closest` optional, I suggest that the default returns only the first neighbor. We can of course select another default, but it would IMO be strange to set `n_closest` to 100. Why not 1000? Also, the point of the nearest-neighbor lookup is to have some limit, because the size of the returned matrix is `N_stations x n_closest`.

We cannot omit `n_closest`, because `scipy.spatial.KDTree.query` always gets a `k`, which defaults to `k=1`, as written above.

So, the only two options I see are:

1. Require `n_closest` as function argument, which is the current behavior.
2. Set default `n_closest=1`.

Since I expect that most people will rarely want `n_closest=1`, e.g. when working with "buddy checks" of PWS, I prefer option 1. But this can be discussed, e.g. during the meeting today.
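
To illustrate the point about `k` and the size of the result, here is a minimal sketch that calls `scipy.spatial.KDTree.query` directly on made-up coordinates. The station count, the coordinates and the 2.0 distance limit are assumptions for illustration only, and this is not the poligrain implementation.

```python
import numpy as np
from scipy.spatial import KDTree

# Hypothetical station coordinates (6 stations, x/y in arbitrary units)
rng = np.random.default_rng(0)
xy = rng.uniform(0, 10, size=(6, 2))
tree = KDTree(xy)

# Without k, only the single nearest neighbor is returned (k=1)
dist_default, idx_default = tree.query(xy)
print(dist_default.shape)  # (6,)

# With k=3 the result grows to N_stations x k
dist_k, idx_k = tree.query(xy, k=3, distance_upper_bound=2.0)
print(dist_k.shape)  # (6, 3)

# Neighbor slots that cannot be filled within the distance limit are
# marked with an infinite distance and the index tree.n
missing = idx_k == tree.n
```

Because the query points are the tree's own points here, the first neighbor of every station is the station itself at distance 0.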

cchwala · 2024-05-14T07:41:48Z

Maybe @lepetersson could have a look at this discussion before the meeting today.

JochenSeidel · 2024-05-14T08:09:05Z

Would it make sense to set `n_closest` by default based on the number of stations within `max_distance`?

lepetersson · 2024-05-14T08:41:13Z

I think it would be more intuitive not to have to set `n_closest`, but rather (for example) to loop through the stations and the distance matrix and create an xr dataset in which the neighbour list per station is stored. That way the user doesn't need to care about selecting `n_closest`. Perhaps you could look for the highest number of neighbours first and then set `n_closest` to that. Let's discuss later. Referring to this documentation

cchwala · 2024-05-14T09:35:14Z

Conclusion based on discussion today:

- By default, `n_closest` should be the number of potential neighboring stations.
- `max_distance` should be a required parameter.
- Add a note to the documentation that `n_closest` has to be set if the input datasets are really large, because the resulting data will be even larger.
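
One possible reading of the first point, sketched with plain NumPy and SciPy rather than poligrain: count the stations within `max_distance` from a distance matrix and use the largest count as the default. The coordinates, the variable names and the interpretation of "number of potential neighboring stations" as the maximum count within `max_distance` are assumptions, not the agreed implementation.

```python
import numpy as np
from scipy.spatial import KDTree
from scipy.spatial.distance import cdist

# Hypothetical station coordinates: one cluster of three, one pair far away
xy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
max_distance = 2.0

# Full distance matrix; its size grows as N_stations x N_stations
dist_matrix = cdist(xy, xy)

# Stations within max_distance per station (the station itself is included)
counts = (dist_matrix <= max_distance).sum(axis=1)

# Assumed default: the largest neighbor count found for any station
n_closest = int(counts.max())

tree = KDTree(xy)
dist, idx = tree.query(xy, k=n_closest, distance_upper_bound=max_distance)

# Number of filled neighbor slots per station; unfilled slots have inf distance
found = np.isfinite(dist).sum(axis=1)
```

The distance matrix itself scales with the square of the number of stations, which is the reason for the documentation note about very large input datasets.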

eoydvin · 2024-05-14T17:55:19Z

Adding some thoughts to the discussion: one workaround using the current code could be to select the station you want to find all neighbors of, like this:

import matplotlib.pyplot as plt
import poligrain as plg

# For every PWS, look up the single selected station `pws_id` within max_distance
closest_neighbors = plg.spatial.get_closest_points_to_point(
    ds_points=ds_pws,
    ds_points_neighbors=ds_pws.sel(id=pws_id),
    max_distance=max_distance,
    n_closest=1,  # not relevant, we select the first station regardless
).isel(n_closest=0)

Then all closest neighbors to station `pws_id` can be found like this:

# Keep only the stations for which a neighbor was found
neighbor_ids = closest_neighbors.where(closest_neighbors.neighbor_id != None, drop=True).id
ds_pws_neighbors = ds_pws.sel(id=neighbor_ids)
plt.scatter(ds_pws_neighbors.x, ds_pws_neighbors.y, c="C0", s=30)

It's not the most intuitive syntax, and for it to work you would have to loop through all the PWSs, at the cost of computation time. Another option could be to use `kd_tree.query_ball_point` like I do in #43, possibly by calculating all distances at once or by looking through all stations... If the behavior of `get_closest_points_to_point` is changed, should we consider adapting the output in #43 to be similar?
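
For comparison, a minimal sketch of the `query_ball_point` idea for all stations in one call, using `scipy.spatial.KDTree` directly. The station ids and coordinates are made up, and this is not the code from #43.

```python
import numpy as np
from scipy.spatial import KDTree

# Hypothetical station ids and coordinates
station_ids = ["pws_a", "pws_b", "pws_c", "pws_d", "pws_e"]
xy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
max_distance = 2.0

tree = KDTree(xy)

# One call for all stations: an array of index lists, one list per station,
# with as many entries as there are stations within max_distance
neighbor_idx = tree.query_ball_point(xy, r=max_distance)

# Map indices back to ids and drop the station itself from its own list
neighbors = {
    station_ids[i]: sorted(station_ids[j] for j in idx if j != i)
    for i, idx in enumerate(neighbor_idx)
}
```

The result is a ragged list per station, so no `n_closest` is needed, but it does not fit directly into a fixed-size `N_stations x n_closest` array.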

cchwala added the discussion label May 13, 2024

cchwala changed the title ~~n_closest for finding neighbors from distance matrix~~ Set default value for n_closest for finding nearest neighbors with get_closest_points_to_point May 14, 2024

cchwala added maintenance and removed discussion labels May 14, 2024

cchwala added this to Software dev till joint meeting in Milano, June 2024 May 14, 2024

cchwala moved this to Todo in Software dev till joint meeting in Milano, June 2024 May 14, 2024

cchwala removed this from Software dev till joint meeting in Milano, June 2024 Jun 18, 2024
