[FEA] request for HDBSCAN clustering #1783
Comments
HDBSCAN is on our radar, but it is not on our roadmap currently. That being said, it certainly helps us with future prioritization to know that individuals in the community are interested in this.
Any development regarding HDBSCAN? Using UMAP + HDBSCAN is quite the wonder!
@noob-procrastinator there has absolutely been progress on HDBSCAN. We are currently integrating single-linkage hierarchical clustering (rapidsai/raft#140) as a core building block of HDBSCAN. The major challenge here was being able to scale these algorithms with a knn graph to lower the [...] I have a draft PR open for HDBSCAN so you can track progress: #3546
An experimental version of HDBSCAN was merged into version 21.06, and we are continuing to improve it in 21.08 and beyond. Please let us know if you find bugs or if there are any missing options or features you'd like to use.
tl;dr My data is a few million data points in a 768-dimensional space. I also used UMAP to reduce the dimensionality, but it wasn't enough. I was benchmarking both agglomerative clustering from the fastcluster library and hdbscan.
The current HDBSCAN implementation in cuML uses brute-force nearest neighbors rather than materializing the full pairwise distance matrix in memory. We also built a primitive to connect the KNN graph in order to get the exact solution (and guarantee convergence of the MST). We are planning to use approximate nearest neighbors in a future release to speed up the computation further, but the brute-force approach works well for the initial version. The GoogleNews Word2Vec dataset (3M x 300) takes ~22 min on my V100 (32 GB). We are going to work to make it even faster, but that's still very good considering the Scikit-learn contrib version had to be stopped after a day of running. Also, our initial version supports only the base set of features in the algorithm, so please let us know if there are missing features you'd like to see, such as fuzzy clustering or inference on out-of-sample data points.
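For readers following along, here is a toy, CPU-only sketch of the idea described above: build a brute-force kNN graph (keeping only k neighbors per point instead of the full pairwise matrix), weight its edges by mutual reachability distance, and run Kruskal's algorithm over it. This is not cuML's implementation. The function names are hypothetical, and taking the core distance as the distance to the k-th neighbor (excluding the point itself) is an assumption. The example data is chosen so the kNN graph is disconnected, which is the situation where an extra step to connect the graph would be needed before a single spanning tree exists.

```python
import math

def knn_graph(points, k):
    """Brute-force kNN: for each point keep only its k nearest neighbors."""
    neighbors = []
    for i, p in enumerate(points):
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        neighbors.append(dists[:k])
    return neighbors

def mutual_reachability_edges(neighbors):
    """Weight each kNN edge by max(core(i), core(j), d(i, j))."""
    # Assumption: core distance = distance to the k-th neighbor (self excluded).
    core = [nbrs[-1][0] for nbrs in neighbors]
    edges = set()
    for i, nbrs in enumerate(neighbors):
        for d, j in nbrs:
            w = max(core[i], core[j], d)
            edges.add((w, min(i, j), max(i, j)))  # undirected, deduplicated
    return core, sorted(edges)

def spanning_forest(n, edges):
    """Kruskal with union-find; returns chosen edges and component count."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    forest = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            forest.append((w, i, j))
    return forest, n - len(forest)

# Two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
neighbors = knn_graph(points, k=2)
core, edges = mutual_reachability_edges(neighbors)
forest, n_components = spanning_forest(len(points), edges)
print("spanning forest edges:", forest)
print("components in kNN graph:", n_components)
```

With k=2 every neighbor of a point lies in its own group, so Kruskal yields a forest with more than one component rather than a single MST. An exact solution would need additional cross-component edges before the hierarchy can be built.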
@cjnolet Thank you for the quick response. 22 min seems promising! We will take a look, although inference is a must-have feature for us ;/
Will you be including the soft clustering capability?
@beckernick, yep I definitely think it makes sense to close this issue. I also want to link relevant issues here for reference:
Are there any plans to get HDBSCAN implemented?
https://github.com/scikit-learn-contrib/hdbscan