Add clustering functions #688
I have thought that a nice interface for clustering would be to return a list of cluster ids for the list of input geometries. That way, the client can (a) know which input geometry belongs to each cluster, and (b) choose whether or not to materialize the clusters as new geometries (and a utility function could support this). This would also allow the functionality to be used to back window functions in PostGIS. Does this PR support this kind of interface?
What is the use case for the envelope clustering functions?
Is ClusterGeometryIntersecting just a special case of ClusterGeometryDistance (with distance = 0)?
I wouldn't expect this code to be desirable for PostGIS because PostGIS already has its own implementation that does not require converting everything into GEOS types. But maybe it would be useful for other bindings or for algorithms within GEOS.
At the C++ level, it can and should be modified to support this. In the C API... do you want to post signatures for your ideal C interface? I think at the time I wrote this, I was trying to avoid introducing new types to the C API while also making the common use case of "assemble these geometries into clusters" relatively painless. Assembling clusters from a list of IDs for each input is actually a bit painful in C (you can see how I implemented it for PostGIS at https://github.com/postgis/postgis/blob/1dbae611ab087f5c78d94c4a8955196bf887fa9c/liblwgeom/lwgeom_geos_cluster.c#L542). But maybe nobody is likely to be doing this; they'd be assembling the clusters in Python etc.
Fast alternative to distance or intersection for coarse partitioning of data.
Like in PostGIS, it's implemented using the intersection predicate, which is not always consistent with distance == 0.
I changed the C++ interface a bit and the following function is now public:
this can be transformed into a vector of cluster IDs whose elements correspond to |
which I now realize will not work for DBSCAN clustering because not all inputs are assigned a cluster. Would you want some placeholder cluster ID value used in this case? Or an array of booleans to accompany the array of IDs? Edit: I guess the two-array method is how PostGIS does it internally.
That's not good. Do you have an example? Is this a true semantic difference, or just a robustness issue?
An ID indicating "no cluster assigned" seems pretty simple. E.g. ID = -1.
How about an array of ints? I suppose the size is needed. Infer from the input size? Provide as an out parameter? Indicate via a sentinel value in the array?
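To make the cluster-ID idea concrete, here is a minimal C++ sketch of how a client might regroup inputs from a per-input ID array. It is not code from this PR and does not use any GEOS API. The names `NO_CLUSTER` and `groupByCluster` are hypothetical, and using -1 as the "no cluster assigned" value is only the suggestion made above.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical sentinel for inputs that belong to no cluster
// (e.g. DBSCAN noise points). Not a GEOS constant.
constexpr int NO_CLUSTER = -1;

// Group input indices by cluster ID, where ids[i] is the cluster of input i.
// Inputs marked NO_CLUSTER are skipped.
std::map<int, std::vector<std::size_t>>
groupByCluster(const std::vector<int>& ids) {
    std::map<int, std::vector<std::size_t>> clusters;
    for (std::size_t i = 0; i < ids.size(); ++i) {
        assert(ids[i] >= NO_CLUSTER);  // only -1 or non-negative IDs expected
        if (ids[i] == NO_CLUSTER) {
            continue;
        }
        clusters[ids[i]].push_back(i);
    }
    return clusters;
}
```

With this shape the output length equals the input length, so no separate size or boolean array is needed; a caller can decide afterwards whether to materialize each group as a geometry collection.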
That said, maybe
I'd be fine with removing the C API for now. This interface would have been painful for my use case for the reasons I described above, but maybe I'm the only one.
Not sure I understand. A lot of use of GEOS comes through the C API, so that seems important to provide?
I wonder whether the input should be provided as an array of geometries, rather than a |
I guess you could confirm by testing using this PR? Would be interesting to know if there is a discrepancy. (Also why this happens in PostGIS, but I think the distance code and predicate code there are of different origins, so maybe they have different contracts?)
Understood. I'm curious if there are any papers or discussions about using this technique. Envelopes are a pretty coarse approximation, so it seems like this might have some undesirable characteristics in practice.
I'm really trying to follow the contours of the existing C API, which uses
It's ideal to expose it to the C API, but it doesn't have to all happen at once. If you don't like the API I've proposed and you don't have a detailed alternative, I don't think there's anything wrong with punting.
I used it to optimize raster reads from sparse polygons (think USA or France with overseas territories). I'm not aware of any papers or discussion of it.
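For readers unfamiliar with the idea, envelope clustering can be sketched as a union-find over bounding boxes. This is an illustration only, not the implementation in this PR: `Box`, `boxesIntersect`, `findRoot`, and `clusterByEnvelope` are hypothetical names, and the pairwise loop is quadratic, where a real implementation would presumably use a spatial index.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical axis-aligned bounding box (stand-in for a geometry envelope).
struct Box { double xmin, ymin, xmax, ymax; };

// Boxes that merely touch count as intersecting.
bool boxesIntersect(const Box& a, const Box& b) {
    return a.xmin <= b.xmax && b.xmin <= a.xmax &&
           a.ymin <= b.ymax && b.ymin <= a.ymax;
}

// Union-find root lookup with path halving.
std::size_t findRoot(std::vector<std::size_t>& parent, std::size_t i) {
    while (parent[i] != i) {
        parent[i] = parent[parent[i]];
        i = parent[i];
    }
    return i;
}

// Return one cluster ID per input box; boxes connected through a chain
// of intersecting envelopes share an ID. IDs are numbered from 0 in
// order of first appearance.
std::vector<int> clusterByEnvelope(const std::vector<Box>& boxes) {
    const std::size_t n = boxes.size();
    std::vector<std::size_t> parent(n);
    std::iota(parent.begin(), parent.end(), std::size_t{0});

    // O(n^2) pairwise test, kept simple for illustration.
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = i + 1; j < n; ++j) {
            if (boxesIntersect(boxes[i], boxes[j])) {
                std::size_t ri = findRoot(parent, i);
                std::size_t rj = findRoot(parent, j);
                if (ri != rj) parent[rj] = ri;
            }
        }
    }

    // Relabel union-find roots as consecutive cluster IDs.
    std::vector<int> rootId(n, -1);
    std::vector<int> ids(n);
    int next = 0;
    for (std::size_t i = 0; i < n; ++i) {
        std::size_t r = findRoot(parent, i);
        if (rootId[r] < 0) rootId[r] = next++;
        ids[i] = rootId[r];
    }
    return ids;
}
```

Because only envelopes are compared, two geometries whose boxes overlap end up in the same cluster even if the geometries themselves are far apart, which is the coarseness raised above; the upside is that each test is a handful of comparisons.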
Here's an example where envelope pre-clustering improves
using test |
Although swapping in a |
Flattening the inputs, per the latest commit, brings the performance gain to 60% for this dataset.
I've gone ahead and added pre-clustering by intersection to |
Also added pre-clustering to |
It would be nice if the PostGIS functions |
I would expect PostGIS to stick with its existing code which already provides this capability (look at |
I've removed all changes to the C API and Union operations. Will post the union changes in a separate PR.
This PR is a rebase of #388, which died on the vine in 2021. It adds the DBSCAN, simple distance, and intersection clustering methods available in PostGIS. It also adds clustering based on envelope intersection and envelope distance.