Now that RBC is showing great potential for scale and speedups, I've been thinking of ways the RBC algorithm could be used to scale and accelerate several other algorithms:
K-means: The FAISS block-select can be replaced with a single register, and only the distances between the data points and their closest landmarks need to be computed in order to find the closest centroids. Each point would then visit the neighborhoods of its closest landmarks, using the triangle inequality to skip computing any distances that could not possibly yield its nearest neighbor (see the sketch after this list).
DBSCAN: Rather than computing the k-closest landmarks, the pairwise distance matrix between the training data points and the landmarks is computed and then thresholded by the eps hyperparameter to determine which landmarks might potentially contain the epsilon neighborhoods. Since the radii of the landmarks are also computed, entire landmark neighborhoods can be pruned. Another benefit of this approach is that the algorithm also knows how many points are inside each landmark's neighborhood, so we can issue warnings or tell whether eps has been opened up wide enough that GPU memory might become an issue. For example, this could also be used to batch automatically when we know the entire epsilon neighborhood graph will be close to n^2 and too large to compute all at once (a sketch of the pruning appears after this list).
HDBSCAN: This is very straightforward but will probably provide the greatest opportunity for acceleration, since the k-nearest neighbors need to be computed once in order to get the core distances and again in order to project the neighborhoods into mutual reachability space. What's nice is that this could be an opportunity to reuse the landmarks across both runs, eliminating the need to sample and construct the landmark 1nns twice (the calling pattern is sketched after this list).
KNN: This is a direct benefit to NearestNeighbors as well as the regression and classification algorithms that build on top of it. Further, this is a benefit to the MNMG KNN. While it won't necessarily limit communication, it will significantly reduce the time spent computing the individual KNNs.
UMAP: This benefit is two-fold: it will speed up the all-neighbors KNN computation to provide faster embeddings, and it will allow the sparse landmark 1nn index (which is a very small data structure) to be stored off and reused for future inference.
T-SNE: The brute-force KNN can be replaced directly.
SLHC: The brute-force KNN can be replaced directly.
Spectral Clustering: The brute-force KNN can be replaced directly.
radius_neighbors(): This is something we've been wanting to support for some time in the NearestNeighbors estimator but have not been able to because of the n^2 requirement of our current epsilon neighborhood prim. What's neat about the random ball cover is that the cardinalities of the neighborhoods around each landmark are known ahead of time, which makes it easier to know when the (sparse) output neighborhood array could be on the order of n^2, as can happen when the epsilon radius is set too large (see the size-estimation sketch after this list).
** The approximate variant can also be used in DBSCAN, HDBSCAN, KNN, UMAP, T-SNE, SLHC, and spectral clustering to get further speedups. The approximate algorithm just applies an additional weight that controls the radius within which landmark balls are considered for neighborhoods (balls outside that radius are skipped entirely).
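To make the K-means item concrete, here is a minimal NumPy sketch (CPU, Euclidean distance) of the triangle-inequality pruning it describes; the index layout and the `build_ball_index` / `assign_clusters` helpers are hypothetical illustrations, not cuML's implementation or the actual FAISS block-select replacement.

```python
# Hypothetical sketch: landmark balls over the centroids, plus the triangle
# inequality, used to skip centroid distance computations during assignment.
import numpy as np


def build_ball_index(centroids, n_landmarks, rng):
    """Sample landmarks from the centroids and group centroids into balls."""
    landmark_ids = rng.choice(len(centroids), n_landmarks, replace=False)
    landmarks = centroids[landmark_ids]
    # 1-NN of every centroid among the landmarks
    d = np.linalg.norm(centroids[:, None, :] - landmarks[None, :, :], axis=2)
    owner = d.argmin(axis=1)
    radii = np.array([d[owner == j, j].max() if np.any(owner == j) else 0.0
                      for j in range(n_landmarks)])
    members = [np.where(owner == j)[0] for j in range(n_landmarks)]
    return landmarks, radii, members


def assign_clusters(points, centroids, index):
    landmarks, radii, members = index
    labels = np.empty(len(points), dtype=np.int64)
    for i, x in enumerate(points):
        d_land = np.linalg.norm(landmarks - x, axis=1)
        best_dist, best_id = np.inf, -1
        # visit balls closest-landmark first so the bound tightens quickly
        for j in np.argsort(d_land):
            # triangle inequality: every centroid c in ball j satisfies
            # d(x, c) >= d(x, landmark_j) - radius_j, so the ball can be skipped
            if d_land[j] - radii[j] >= best_dist:
                continue
            for c in members[j]:
                dist = np.linalg.norm(centroids[c] - x)
                if dist < best_dist:
                    best_dist, best_id = dist, c
        labels[i] = best_id
    return labels


rng = np.random.default_rng(0)
centroids = rng.normal(size=(256, 16))
points = rng.normal(size=(1000, 16))
index = build_ball_index(centroids, n_landmarks=16, rng=rng)
labels = assign_clusters(points, centroids, index)
brute = np.linalg.norm(points[:, None] - centroids[None], axis=2).argmin(axis=1)
assert np.array_equal(labels, brute)  # pruning is exact, not approximate
```

A ball is only skipped when the lower bound d(x, landmark) - radius already exceeds the best distance found so far, so the assignment stays exact.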
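The DBSCAN item boils down to one thresholding step, sketched below under the assumption of Euclidean distance and a landmark index built by 1-NN assignment; `candidate_balls` is a hypothetical helper name, not an existing cuML primitive.

```python
# Hedged illustration: threshold the point-to-landmark distance matrix by
# eps plus each landmark's radius to decide which landmark balls could
# contribute to each point's epsilon neighborhood; failing balls are pruned.
import numpy as np


def candidate_balls(X, landmarks, radii, eps):
    """Boolean (n_points, n_landmarks) mask of balls that may intersect
    each point's eps-neighborhood."""
    # pairwise distances between the data points and the landmarks only
    d = np.linalg.norm(X[:, None, :] - landmarks[None, :, :], axis=2)
    # triangle inequality: a point y inside ball j satisfies
    # d(x, y) >= d(x, landmark_j) - radius_j, so the whole ball can be
    # skipped whenever d(x, landmark_j) > eps + radius_j
    return d <= eps + radii[None, :]


rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 8))
landmarks = X[rng.choice(len(X), 64, replace=False)]
d_land = np.linalg.norm(X[:, None, :] - landmarks[None, :, :], axis=2)
owner = d_land.argmin(axis=1)
radii = np.array([d_land[owner == j, j].max() for j in range(64)])

mask = candidate_balls(X, landmarks, radii, eps=0.5)
# only points inside the flagged balls ever need an exact distance check
print("average candidate balls per point:", mask.sum(axis=1).mean(), "of", 64)
```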
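The HDBSCAN item is mostly about call structure, so the sketch below only shows the reuse pattern with a purely hypothetical `RBCIndex` class: the index is built once and queried twice, first under the original metric to get core distances and then under the mutual-reachability metric. The exhaustive k-NN inside `query` is a stand-in for what an RBC index would prune.

```python
# Minimal sketch of reusing one landmark index for both k-NN passes.
import numpy as np


class RBCIndex:
    """Stand-in for a random-ball-cover index; landmark sampling and the
    landmark 1-NN construction would happen here, exactly once."""

    def __init__(self, X):
        self.X = X

    def query(self, k, metric):
        # exhaustive k-NN placeholder; an RBC index would prune these distances
        n = len(self.X)
        d = np.array([[metric(i, j) for j in range(n)] for i in range(n)])
        idx = np.argsort(d, axis=1)[:, :k]
        return np.take_along_axis(d, idx, axis=1), idx


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
pair = np.linalg.norm(X[:, None] - X[None], axis=2)
index = RBCIndex(X)          # built once, reused for both passes below

k = 5
# pass 1: original metric, yielding core distances
core_d, _ = index.query(k, metric=lambda i, j: pair[i, j])
core = core_d[:, -1]         # distance to the k-th closest point (self included here)

# pass 2: same index, mutual-reachability metric
mreach_d, mreach_idx = index.query(
    k, metric=lambda i, j: max(core[i], core[j], pair[i, j]))
```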
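For the radius_neighbors() item, the useful property is that the ball cardinalities bound the sparse output size before anything is materialized. The sketch below (hypothetical `plan_radius_query` helper; byte sizes, thresholds, and memory budget are illustrative assumptions, not cuML defaults) turns that bound into a warning and an automatic batch count.

```python
# Hedged sketch: bound the number of nonzeros in the sparse radius_neighbors
# output from the landmark ball cardinalities, warn when it approaches n^2,
# and pick a batch count that fits a memory budget.
import warnings
import numpy as np


def plan_radius_query(X, landmarks, radii, ball_sizes, radius,
                      bytes_per_entry=12, memory_budget_bytes=2**30):
    d = np.linalg.norm(X[:, None, :] - landmarks[None, :, :], axis=2)
    candidate = d <= radius + radii[None, :]            # balls that may intersect
    nnz_bound = (candidate * ball_sizes[None, :]).sum(axis=1)  # per-query bound
    total = int(nnz_bound.sum())

    if total > 0.5 * len(X) ** 2:
        warnings.warn("radius is large enough that the neighborhood graph "
                      "approaches n^2; consider a smaller radius")
    n_batches = max(1, int(np.ceil(total * bytes_per_entry / memory_budget_bytes)))
    return total, n_batches


rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 8))
lm = X[rng.choice(len(X), 128, replace=False)]
d_lm = np.linalg.norm(X[:, None, :] - lm[None, :, :], axis=2)
owner = d_lm.argmin(axis=1)
radii = np.array([d_lm[owner == j, j].max() for j in range(128)])
sizes = np.bincount(owner, minlength=128)

total, n_batches = plan_radius_query(X, lm, radii, sizes, radius=1.0)
print(f"upper bound on output entries: {total}, planned batches: {n_batches}")
```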