Now that RBC is showing great potential for scale and speedups, I've been thinking of ways the RBC algorithm could be used to scale and accelerate several other algorithms:
K-means: The FAISS block-select can be replaced with a single register, and only the distances between the data points and their closest landmarks need to be computed in order to find the closest centroids. Each point would then visit the neighborhoods of its closest landmarks, using the triangle inequality to skip computing any distances that could not possibly yield its nearest neighbor (see the sketch after this list).
DBSCAN: Rather than computing the k-closest landmarks, the pairwise distance matrix between the training data points and the landmarks is computed and then thresholded by the eps hyperparameter to determine which landmarks might potentially contain the epsilon neighborhoods. Since the radii of the landmarks are also computed, entire landmark neighborhoods can be pruned. Another benefit of this approach is that the algorithm also knows how many points are inside each landmark's neighborhood, so we can issue warnings or tell whether eps has been opened up wide enough that GPU memory might become an issue. For example, this could also be used to batch automatically when we know the entire epsilon neighborhood graph will be close to n^2 and too large to compute all at once (a sketch of the pruning appears after this list).
HDBSCAN: This is very straightforward but will probably provide the greatest opportunity for acceleration, since the k-nearest neighbors need to be computed once in order to get the core distances and again in order to project the neighborhoods into mutual reachability space. What's nice is that this could be an opportunity to reuse the landmarks across both runs, eliminating the need to sample and construct the landmark 1nns twice (the calling pattern is sketched after this list).
KNN: This is a direct benefit to NearestNeighbors as well as the regression and classification algorithms that build on top of it. Further, this is a benefit to the MNMG KNN. While it won't necessarily limit communication, it will significantly reduce the time spent computing the individual KNNs.
UMAP: This benefit is two-fold: it will speed up the all-neighbors KNN computation to provide faster embeddings, and it will allow the sparse landmark 1nn index (which is a very small data structure) to be stored off and reused for future inference.
T-SNE: The brute-force KNN can be replaced directly.
SLHC: The brute-force KNN can be replaced directly.
Spectral Clustering: The brute-force KNN can be replaced directly.
radius_neighbors(): This is something we've been wanting to support for some time in the NearestNeighbors estimator but have not been able to because of the n^2 requirement of our current epsilon neighborhood prim. What's neat about the random ball cover is that the cardinalities of the neighborhoods around each landmark are known ahead of time, which makes it easier to know when the (sparse) output neighborhood array could be on the order of n^2, as can happen when the epsilon radius is set too large (see the size-estimation sketch after this list).
** The approximate variant can also be used in DBSCAN, HDBSCAN, KNN, UMAP, T-SNE, SLHC, and spectral clustering to get further speedups. The approximate algorithm just applies an additional weight that controls the radius within which landmark balls are considered for neighborhoods (balls outside that radius are skipped entirely).
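To make the K-means item concrete, here is a minimal NumPy sketch (CPU, Euclidean distance) of the triangle-inequality pruning it describes; the index layout and the `build_ball_index` / `assign_clusters` helpers are hypothetical illustrations, not cuML's implementation or the actual FAISS block-select replacement.

```python
# Hypothetical sketch: landmark balls over the centroids, plus the triangle
# inequality, used to skip centroid distance computations during assignment.
import numpy as np


def build_ball_index(centroids, n_landmarks, rng):
    """Sample landmarks from the centroids and group centroids into balls."""
    landmark_ids = rng.choice(len(centroids), n_landmarks, replace=False)
    landmarks = centroids[landmark_ids]
    # 1-NN of every centroid among the landmarks
    d = np.linalg.norm(centroids[:, None, :] - landmarks[None, :, :], axis=2)
    owner = d.argmin(axis=1)
    radii = np.array([d[owner == j, j].max() if np.any(owner == j) else 0.0
                      for j in range(n_landmarks)])
    members = [np.where(owner == j)[0] for j in range(n_landmarks)]
    return landmarks, radii, members


def assign_clusters(points, centroids, index):
    landmarks, radii, members = index
    labels = np.empty(len(points), dtype=np.int64)
    for i, x in enumerate(points):
        d_land = np.linalg.norm(landmarks - x, axis=1)
        best_dist, best_id = np.inf, -1
        # visit balls closest-landmark first so the bound tightens quickly
        for j in np.argsort(d_land):
            # triangle inequality: every centroid c in ball j satisfies
            # d(x, c) >= d(x, landmark_j) - radius_j, so the ball can be skipped
            if d_land[j] - radii[j] >= best_dist:
                continue
            for c in members[j]:
                dist = np.linalg.norm(centroids[c] - x)
                if dist < best_dist:
                    best_dist, best_id = dist, c
        labels[i] = best_id
    return labels


rng = np.random.default_rng(0)
centroids = rng.normal(size=(256, 16))
points = rng.normal(size=(1000, 16))
index = build_ball_index(centroids, n_landmarks=16, rng=rng)
labels = assign_clusters(points, centroids, index)
brute = np.linalg.norm(points[:, None] - centroids[None], axis=2).argmin(axis=1)
assert np.array_equal(labels, brute)  # pruning is exact, not approximate
```

A ball is only skipped when the lower bound d(x, landmark) - radius already exceeds the best distance found so far, so the assignment stays exact.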
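The DBSCAN item boils down to one thresholding step, sketched below under the assumption of Euclidean distance and a landmark index built by 1-NN assignment; `candidate_balls` is a hypothetical helper name, not an existing cuML primitive.

```python
# Hedged illustration: threshold the point-to-landmark distance matrix by
# eps plus each landmark's radius to decide which landmark balls could
# contribute to each point's epsilon neighborhood; failing balls are pruned.
import numpy as np


def candidate_balls(X, landmarks, radii, eps):
    """Boolean (n_points, n_landmarks) mask of balls that may intersect
    each point's eps-neighborhood."""
    # pairwise distances between the data points and the landmarks only
    d = np.linalg.norm(X[:, None, :] - landmarks[None, :, :], axis=2)
    # triangle inequality: a point y inside ball j satisfies
    # d(x, y) >= d(x, landmark_j) - radius_j, so the whole ball can be
    # skipped whenever d(x, landmark_j) > eps + radius_j
    return d <= eps + radii[None, :]


rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 8))
landmarks = X[rng.choice(len(X), 64, replace=False)]
d_land = np.linalg.norm(X[:, None, :] - landmarks[None, :, :], axis=2)
owner = d_land.argmin(axis=1)
radii = np.array([d_land[owner == j, j].max() for j in range(64)])

mask = candidate_balls(X, landmarks, radii, eps=0.5)
# only points inside the flagged balls ever need an exact distance check
print("average candidate balls per point:", mask.sum(axis=1).mean(), "of", 64)
```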
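The HDBSCAN item is mostly about call structure, so the sketch below only shows the reuse pattern with a purely hypothetical `RBCIndex` class: the index is built once and queried twice, first under the original metric to get core distances and then under the mutual-reachability metric. The exhaustive k-NN inside `query` is a stand-in for what an RBC index would prune.

```python
# Minimal sketch of reusing one landmark index for both k-NN passes.
import numpy as np


class RBCIndex:
    """Stand-in for a random-ball-cover index; landmark sampling and the
    landmark 1-NN construction would happen here, exactly once."""

    def __init__(self, X):
        self.X = X

    def query(self, k, metric):
        # exhaustive k-NN placeholder; an RBC index would prune these distances
        n = len(self.X)
        d = np.array([[metric(i, j) for j in range(n)] for i in range(n)])
        idx = np.argsort(d, axis=1)[:, :k]
        return np.take_along_axis(d, idx, axis=1), idx


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
pair = np.linalg.norm(X[:, None] - X[None], axis=2)
index = RBCIndex(X)          # built once, reused for both passes below

k = 5
# pass 1: original metric, yielding core distances
core_d, _ = index.query(k, metric=lambda i, j: pair[i, j])
core = core_d[:, -1]         # distance to the k-th closest point (self included here)

# pass 2: same index, mutual-reachability metric
mreach_d, mreach_idx = index.query(
    k, metric=lambda i, j: max(core[i], core[j], pair[i, j]))
```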
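For the radius_neighbors() item, the useful property is that the ball cardinalities bound the sparse output size before anything is materialized. The sketch below (hypothetical `plan_radius_query` helper; byte sizes, thresholds, and memory budget are illustrative assumptions, not cuML defaults) turns that bound into a warning and an automatic batch count.

```python
# Hedged sketch: bound the number of nonzeros in the sparse radius_neighbors
# output from the landmark ball cardinalities, warn when it approaches n^2,
# and pick a batch count that fits a memory budget.
import warnings
import numpy as np


def plan_radius_query(X, landmarks, radii, ball_sizes, radius,
                      bytes_per_entry=12, memory_budget_bytes=2**30):
    d = np.linalg.norm(X[:, None, :] - landmarks[None, :, :], axis=2)
    candidate = d <= radius + radii[None, :]            # balls that may intersect
    nnz_bound = (candidate * ball_sizes[None, :]).sum(axis=1)  # per-query bound
    total = int(nnz_bound.sum())

    if total > 0.5 * len(X) ** 2:
        warnings.warn("radius is large enough that the neighborhood graph "
                      "approaches n^2; consider a smaller radius")
    n_batches = max(1, int(np.ceil(total * bytes_per_entry / memory_budget_bytes)))
    return total, n_batches


rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 8))
lm = X[rng.choice(len(X), 128, replace=False)]
d_lm = np.linalg.norm(X[:, None, :] - lm[None, :, :], axis=2)
owner = d_lm.argmin(axis=1)
radii = np.array([d_lm[owner == j, j].max() for j in range(128)])
sizes = np.bincount(owner, minlength=128)

total, n_batches = plan_radius_query(X, lm, radii, sizes, radius=1.0)
print(f"upper bound on output entries: {total}, planned batches: {n_batches}")
```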