Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set default value for n_closest for finding nearest neighbors with get_closest_points_to_point #51

Open
JochenSeidel opened this issue May 13, 2024 · 6 comments

Comments

@JochenSeidel
Copy link
Contributor

JochenSeidel commented May 13, 2024

n_closest is a required input for the plg.spatial.get_closest_points_to_point function. I would suggest to make it optional since the number of neighbouring stations might be irrelevant for some calculations, e.g. the indicator correlation. Optionally, the number could be set to a very high number...

PS: I wanted to label this as "discussion" but I don't know how that works....

@cchwala
Copy link
Member

cchwala commented May 14, 2024

The default for n_closest in the underlying function scipy.spatial.KDTree.query is k=1, where k is the number of neighbors.

If we make n_closest optional, I suggest the default would be to return only the first neighbor. We can of course select another default. But it would IMO be strange to set n_closest to 100. Because, why not 1000? Also, the point of doing the nearest neighbor lookup is to have some limits, because the size of the returned matrix is N_stations x n_closest.

We cannot omit n_closest, because scipy.spatial.KDTree,query always gets a k, which is k=1 as default, as written above.

So, the only two options I see are:

  1. Require n_closest as function argument, which is the current behavior.
  2. Set default n_closest=1

Since, I expect that most people do not want n_closest=1 very often, e.g. when working with "buddy checks" of PWS, I prefer option 1. But this can be discussed, e.g. during the meeting today.

@cchwala
Copy link
Member

cchwala commented May 14, 2024

Maybe @lepetersson could have a look at this discussion before the meeting today.

@JochenSeidel
Copy link
Contributor Author

The default for n_closest in the underlying function scipy.spatial.KDTree.query is k=1, where k is the number of neighbors.

If we make n_closest optional, I suggest the default would be to return only the first neighbor. We can of course select another default. But it would IMO be strange to set n_closest to 100. Because, why not 1000? Also, the point of doing the nearest neighbor lookup is to have some limits, because the size of the returned matrix is N_stations x n_closest.

We cannot omit n_closest, because scipy.spatial.KDTree,query always gets a k, which is k=1 as default, as written above.

So, the only two options I see are:

1. Require `n_closest` as function argument, which is the current behavior.

2. Set default `n_closest=1`

Since, I expect that most people do not want n_closest=1 very often, e.g. when working with "buddy checks" of PWS, I prefer option 1. But this can be discussed, e.g. during the meeting today.

Would it make sense to setn_closest based on the number of stations within the max_distance as default?

@lepetersson
Copy link

The default for n_closest in the underlying function scipy.spatial.KDTree.query is k=1, where k is the number of neighbors.
If we make n_closest optional, I suggest the default would be to return only the first neighbor. We can of course select another default. But it would IMO be strange to set n_closest to 100. Because, why not 1000? Also, the point of doing the nearest neighbor lookup is to have some limits, because the size of the returned matrix is N_stations x n_closest.
We cannot omit n_closest, because scipy.spatial.KDTree,query always gets a k, which is k=1 as default, as written above.
So, the only two options I see are:

1. Require `n_closest` as function argument, which is the current behavior.

2. Set default `n_closest=1`

Since, I expect that most people do not want n_closest=1 very often, e.g. when working with "buddy checks" of PWS, I prefer option 1. But this can be discussed, e.g. during the meeting today.

Would it make sense to setn_closest based on the number of stations within the max_distance as default?

I think it would be more intutitive to not have to set n_closest, but rather (for example) loop through the stations and the distance matrix and create a xr dataset where the neighbour list per station is stored. In this way the user don't need to care about selecting n_closest. Perhaps you can look for the highest number of neighbours first and then set n_closest to that. Let's discuss later. Referring to this documentation

@cchwala
Copy link
Member

cchwala commented May 14, 2024

Conclusion based on discussion today:

  1. as default n_closest should be the number of potential neighboring stations
  2. max_distance should be a required parameter
  3. add to documentation info that n_closest has to be set if input datasets are really large, because resulting data will be even larger

@cchwala cchwala changed the title n_closest for finding neighbors from distance matrix Set default value for n_closest for finding nearest neighbors with get_closest_points_to_point May 14, 2024
@eoydvin
Copy link
Contributor

eoydvin commented May 14, 2024

Adding some thoughts to the discussion: One workaround using current code could be to select the station you want to find all neighbors to like this:

closest_neigbors = plg.spatial.get_closest_points_to_point(
    ds_points=ds_pws,
    ds_points_neighbors=ds_pws.sel(id = pws_id),
    max_distance=max_distance,
    n_closest=1, # not relevant, we select the first station regardless
).isel(n_closest = 0)

Then all closest neighbors to station "pws_id" can be found like this:

neighbor_ids = closest_neigbors.where(closest_neigbors.neighbor_id != None,  drop = True).id
ds_pws_neighbors = ds_pws.sel(id=neighbor_ids)
plt.scatter(ds_pws_neighbors.x, ds_pws_neighbors.y, c="C0", s=30)

Its not the most intuitive syntax and for it to work you would have to loop through all the PWSs, at the cost of computational time. Another potential option could be to use the kd_tree.query_ball_point like I do in #43, possibly by calculating all distances at once or by looking through all stations... If the behavior of get_closest_points_to_point is changed we should consider adapting the output in #43 to be similar?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants