[FEA] request for HDBSCAN clustering #1783

Closed
mvss80 opened this issue Mar 2, 2020 · 10 comments
Labels: ? - Needs Triage, feature request, New Algorithm

Comments

@mvss80

mvss80 commented Mar 2, 2020

Are there any plans to get HDBSCAN implemented?

https://github.com/scikit-learn-contrib/hdbscan

mvss80 added the ? - Needs Triage and feature request labels on Mar 2, 2020
@cjnolet
Member

cjnolet commented Mar 2, 2020

HDBSCAN is on our radar, but it is not on our roadmap currently.

That being said, it certainly helps us with future prioritization to know that individuals in the community are interested in this.

teju85 added the New Algorithm label on Apr 14, 2020
@noob-procrastinator

Any developments regarding HDBSCAN? Using UMAP + HDBSCAN is quite the wonder!

@cjnolet
Member

cjnolet commented Mar 11, 2021

@noob-procrastinator there has absolutely been progress on HDBSCAN. We are currently in the process of integrating single-linkage hierarchical clustering (rapidsai/raft#140) as a core building block of HDBSCAN. The major challenge here was being able to scale these algorithms with a k-NN graph to lower the n^2 memory requirement.

I have a draft PR open for HDBSCAN so you can track progress: #3546
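
To illustrate the idea (a toy pure-Python sketch, not the RAFT or cuML code): single-linkage clustering only needs a minimum spanning tree, and that tree can be built from the sparse edges of a k-nearest-neighbor graph instead of all n^2 pairwise distances. The helper names `knn_edges` and `kruskal`, the sample points, and `k=2` are assumptions made up for this example.

```python
import math

def knn_edges(points, k):
    """Return sorted undirected edges (dist, i, j) linking each point to its k nearest neighbors."""
    edges = set()
    for i, p in enumerate(points):
        # Brute-force distances from point i to every other point, one row at a time.
        row = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        for d, j in row[:k]:
            edges.add((d, min(i, j), max(i, j)))  # store each undirected edge once
    return sorted(edges)

def kruskal(n, edges):
    """Minimum spanning forest over pre-sorted edges, using union-find."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:          # edge joins two different components
            parent[ri] = rj
            tree.append((d, i, j))
    return tree

# Hypothetical data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
edges = knn_edges(points, k=2)
forest = kruskal(len(points), edges)
print(len(edges), len(forest))
```

With these points and `k=2` the k-NN graph has no edge between the two groups, so Kruskal returns a forest of 4 edges rather than a spanning tree of 5. A complete implementation has to add cross-component edges before the tree is usable for single-linkage.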

@cjnolet
Member

cjnolet commented Jun 16, 2021

@noob-procrastinator,

An experimental version of HDBSCAN was merged into version 21.06 and we are continuing to improve it in 21.08 and beyond. Please let us know if you find bugs and whether there are any missing options / features you'd like to use.

@versis

versis commented Jul 28, 2021

tl;dr
@cjnolet does your implementation require more memory than the original hdbscan implementation (https://hdbscan.readthedocs.io)?

My data is a few million data points in a 768-dimensional space. I also used UMAP to reduce the dimensionality, but it wasn't enough.

I was benchmarking both agglomerative clustering from the fastcluster lib and hdbscan.
AC was much faster, but required too much RAM.
HDBSCAN used just a little memory, but required much more time.
RAPIDSAI seems to be much faster, but is the memory usage the same?

@cjnolet
Member

cjnolet commented Jul 29, 2021

@versis,

The current HDBSCAN implementation in cuML uses brute-force nearest neighbors instead of having to materialize the full pairwise distance matrix in memory. We also built a primitive to connect the KNN graph in order to get the exact solution (and guarantee convergence of the MST). We are planning to use approximate nearest neighbors in a future release to speed up the computation further, but brute force is pretty great for the initial version.
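
As a rough illustration of why brute-force nearest neighbors avoids the full matrix (a pure-Python sketch, not how cuML does it on the GPU): distances are computed one row at a time and only the `k` smallest per row are kept, so the stored result is O(n * k) rather than O(n^2). The function name `brute_force_knn`, the sample points, and `k=2` are assumptions for this example.

```python
import heapq
import math

def brute_force_knn(points, k):
    """For each point, return its k nearest neighbors as (distance, index) pairs.

    Each row of distances is a lazy generator that is consumed immediately,
    so a full n x n distance matrix is never held in memory.
    """
    neighbors = []
    for i, p in enumerate(points):
        row = ((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        neighbors.append(heapq.nsmallest(k, row))  # keeps only the k smallest, sorted
    return neighbors

# Hypothetical data for illustration.
points = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0), (6.0, 8.0)]
knn = brute_force_knn(points, k=2)
for i, row in enumerate(knn):
    print(i, [j for _, j in row])
```

The work is still quadratic in the number of points, which is why approximate nearest neighbors would be the next step for speed; only the memory footprint changes here.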

The GoogleNews Word2Vec dataset (3M x 300) takes ~22 min on my V100 (32 GB). We are going to work to make it even faster, but that's still very good considering the Scikit-learn contrib version had to be stopped after a day of running. Also, our initial version supports only the base set of features in the algorithm, but please let us know if there are missing features you'd like to see, such as fuzzy clustering or inference on out-of-sample data points.
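
For a sense of scale, here is some back-of-the-envelope arithmetic on that dataset. The float32 storage, the int32 neighbor indices, and `k = 15` are assumptions picked for illustration, not settings taken from cuML.

```python
n, dim = 3_000_000, 300   # GoogleNews Word2Vec shape quoted above
bytes_per_float = 4       # assumption: float32 storage
bytes_per_index = 4       # assumption: int32 neighbor indices
k = 15                    # hypothetical neighbor count, for illustration only

data_gb = n * dim * bytes_per_float / 1e9                       # the input vectors
full_matrix_tb = n * n * bytes_per_float / 1e12                 # dense pairwise distances
knn_graph_gb = n * k * (bytes_per_float + bytes_per_index) / 1e9  # distance + index per edge

print(f"input data:         {data_gb:.2f} GB")
print(f"full pairwise dist: {full_matrix_tb:.2f} TB")
print(f"k-NN graph (k={k}):  {knn_graph_gb:.2f} GB")
```

Under these assumptions the input is about 3.6 GB and the k-NN graph about 0.36 GB, while a dense pairwise matrix would be about 36 TB, far beyond a 32 GB card.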

@versis

versis commented Aug 2, 2021

@cjnolet Thank you for the quick response. 22 min seems promising! We will take a look, although inference is a must-have feature for us ;/

@sparkdoc

sparkdoc commented Aug 3, 2021

Will you be including the soft clustering capability?
https://hdbscan.readthedocs.io/en/latest/soft_clustering.html#soft-clustering-for-hdbscan

@beckernick
Member

beckernick commented Nov 29, 2021

@dantegd @cjnolet does it make sense to close this issue and instead file separate issues for different requests related to HDBSCAN (soft clustering, different algorithms, inference, prediction_data, etc.)? Or perhaps collect various individual requests in this issue?

@cjnolet
Member

cjnolet commented Jan 12, 2022

@beckernick, yep I definitely think it makes sense to close this issue.

I also want to link relevant issues here for reference:

  1. soft clustering: [FEA] Soft clustering with HDBSCAN #4467
  2. prediction / inference: [FEA] Need a approximate_predict function for cuml HDBSCAN #4448
  3. different distance measures (and additional planned features): [TASK] Post HDBSCAN merge tasks #3879
  4. speeding up algorithm w/ 3-15 dimensions: [FEA] Scaling several neighborhood methods w/ RBC #4161

7 participants