
[FEA] Scaling several neighborhood methods w/ RBC #4161

Open · 9 tasks

cjnolet opened this issue Aug 13, 2021 · 2 comments
cjnolet commented Aug 13, 2021

Now that RBC (random ball cover) is showing great potential, scale, and speedups, I've been thinking of ways the RBC algorithm can be used to scale and accelerate several different algorithms:

  1. K-means: The FAISS block-select can be replaced w/ a single register, and only the distances between the data points and their closest landmarks need to be computed in order to find the closest centroids. Each point would then visit the neighborhoods of its closest landmarks, using the triangle inequality to skip computing any distances that could not possibly be its nearest neighbor.

  2. DBSCAN: Rather than computing the k-closest landmarks, the pairwise distance matrix between the training data points and the landmarks is computed and then thresholded by the eps hyperparameter to determine which landmarks might potentially contain the epsilon neighborhoods (see the sketch after this list). Since the radii of the landmarks are also computed, it's possible to prune entire landmark neighborhoods. Another added benefit of this approach is that the algorithm also knows how many points are inside each landmark's neighborhood, so we can issue warnings or tell whether eps has been opened up wide enough that GPU memory might become an issue. For example, this could also be used to batch automatically when we know the entire epsilon neighborhood graph will be close to n^2 and too large to compute all at once.

  3. HDBSCAN: This is very straightforward but will probably provide the greatest opportunity for acceleration, since the k-nearest neighbors need to be computed once to get the core distances and again to project the neighborhoods into mutual reachability space. What's nice is that this could be an opportunity to reuse the landmarks across both runs, eliminating the need to sample and construct the landmark 1nns twice.

  4. KNN: This is a direct benefit to NearestNeighbors as well as the regression and classification algorithms that build on top of it. Further, this is a benefit to the MNMG KNN. While it won't necessarily limit communication, it will significantly reduce the time spent computing the individual KNNs.

  5. UMAP: The benefit here is two-fold: it will speed up the all-neighbors KNN computation to provide faster embeddings, and it will allow storing off and reusing the sparse landmark 1nn index (a very small data structure) for future inference.

  6. T-SNE: The brute-force KNN can be replaced directly.

  7. SLHC: The brute-force KNN can be replaced directly.

  8. Spectral Clustering: The brute-force KNN can be replaced directly.

  9. radius_neighbors(): This is something we've been wanting to support for some time in the NearestNeighbors estimator but have not been able to because of the n^2 requirement of our current epsilon neighborhood prim. What's neat about the random ball cover is that the cardinalities of the neighborhoods around each landmark are known ahead of time, which makes it easier to know when the (sparse) output neighborhood array could end up on the order of n^2, which can happen when the epsilon radius is set to too large of a value (see the sketch after this list).
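
To make the landmark-pruning idea in items 2 and 9 concrete, here's a minimal NumPy sketch of an RBC-style radius_neighbors. All names here are illustrative assumptions, not cuML's or RAFT's actual API, and the brute-force distance computations stand in for the optimized CUDA kernels:

```python
# Minimal NumPy sketch of RBC-style landmark pruning for radius_neighbors.
# Illustrative only: names and structure are assumptions, not cuML/RAFT code.
import numpy as np

def rbc_radius_neighbors(X, queries, eps, n_landmarks=32, seed=0):
    rng = np.random.default_rng(seed)
    # Sample landmarks and assign every training point to its closest landmark.
    landmarks = X[rng.choice(len(X), n_landmarks, replace=False)]
    d_pl = np.linalg.norm(X[:, None, :] - landmarks[None, :, :], axis=2)
    owner = d_pl.argmin(axis=1)
    # Radius of each landmark ball = distance to its farthest member
    # (these per-ball radii/cardinalities are what enable pruning and the
    #  output-size warnings described above).
    radii = np.array([d_pl[owner == l, l].max() if (owner == l).any() else 0.0
                      for l in range(n_landmarks)])
    results = []
    for q in queries:
        d_ql = np.linalg.norm(landmarks - q, axis=1)
        # Triangle inequality: a ball can only hold eps-neighbors of q
        # when dist(q, landmark) <= eps + radius(landmark).
        candidate_balls = np.where(d_ql <= eps + radii)[0]
        idx = np.where(np.isin(owner, candidate_balls))[0]
        d = np.linalg.norm(X[idx] - q, axis=1)
        results.append(idx[d <= eps])
    return results
```

Because `np.bincount(owner)` gives the number of points per ball before any exact distances are computed, the expected output size can be bounded up front, which is where the batching/warning logic for DBSCAN and radius_neighbors would hook in.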

** The approximate variant can also be used in DBSCAN, HDBSCAN, KNN, UMAP, T-SNE, SLHC, and spectral clustering to get further speedups. The approximate algorithm just applies an additional weight that selects the radius within which landmark balls are considered for neighborhoods (balls outside the radius are skipped entirely); a minimal sketch of this follows below.
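
Here's a hedged sketch of how that weight could restrict which balls are searched; the `weight` parameter name and the exact scaling are assumptions for illustration, not the actual implementation:

```python
# Illustrative sketch of the approximate variant's ball selection: `weight`
# scales the radius within which landmark balls are still searched.
# weight = 1.0 recovers the exact triangle-inequality bound from the sketch
# above; weight < 1.0 skips more balls, trading recall for speed.
# (Parameter name and scaling are assumptions, not cuML's API.)
import numpy as np

def approx_candidate_balls(d_ql, radii, eps, weight=1.0):
    # d_ql:  distances from one query to each landmark
    # radii: radius of each landmark ball
    return np.where(d_ql <= weight * (eps + radii))[0]
```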

cjnolet added the feature request and ? - Needs Triage labels on Aug 13, 2021
cjnolet added the CUDA / C++ label and removed ? - Needs Triage on Aug 13, 2021
cjnolet changed the title from "[FEA] Scaling several neighborhood algorithms w/ RBC" to "[FEA] Scaling several neighborhood methods w/ RBC" on Aug 13, 2021
@github-actions

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.
