[FEA] request for HDBSCAN clustering #1783

mvss80 · 2020-03-02T19:47:56Z

Are there any plans to get HDBSCAN implemented?

https://github.com/scikit-learn-contrib/hdbscan

cjnolet · 2020-03-02T23:37:42Z

HDBSCAN is on our radar, but it is not on our roadmap currently.

That being said, it certainly helps us with future prioritization to know that individuals in the community are interested in this.

noob-procrastinator · 2021-03-11T17:17:04Z

Any developments regarding HDBSCAN? Using UMAP + HDBSCAN is quite the wonder!

cjnolet · 2021-03-11T17:25:47Z

@noob-procrastinator there has absolutely been progress on HDBSCAN. We are currently in the process of integrating single-linkage hierarchical clustering (rapidsai/raft#140) as a core building block of HDBSCAN. The major challenge here was scaling these algorithms with a kNN graph so that they avoid the n^2 memory requirement of a full pairwise distance matrix.

I have a draft PR open for HDBSCAN so you can track progress: #3546
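To make the n^2 memory point concrete, here is a rough back-of-the-envelope sketch. The dataset size (one million points), `k = 15` neighbors, and 4-byte floats and indices are illustrative assumptions, not figures from this thread or from cuML.

```python
def dense_distance_bytes(n, bytes_per_value=4):
    """Memory for a full n x n pairwise distance matrix."""
    return n * n * bytes_per_value


def knn_graph_bytes(n, k, bytes_per_distance=4, bytes_per_index=4):
    """Memory for a kNN graph: k (distance, neighbor index) pairs per point."""
    return n * k * (bytes_per_distance + bytes_per_index)


n = 1_000_000  # hypothetical dataset size
k = 15         # hypothetical neighborhood size

dense_tb = dense_distance_bytes(n) / 1e12
knn_gb = knn_graph_bytes(n, k) / 1e9

print(f"dense matrix: {dense_tb:.2f} TB")  # dense matrix: 4.00 TB
print(f"kNN graph:    {knn_gb:.2f} GB")    # kNN graph:    0.12 GB
```

Under these assumptions the dense matrix grows quadratically with the number of points, while the kNN graph grows linearly, which is why the graph-based formulation fits on a single GPU where the dense one cannot.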

cjnolet · 2021-06-16T17:30:05Z

@noob-procrastinator,

An experimental version of HDBSCAN was merged into version 21.06 and we are continuing to improve it in 21.08 and beyond. Please let us know if you find bugs or if there are any missing options / features you'd like to use.

versis · 2021-07-28T08:25:58Z

tl;dr
@cjnolet does your implementation require more memory than the original hdbscan implementation (https://hdbscan.readthedocs.io)?

My data is a few million data points in a 768-dimensional space. I also used UMAP to reduce the dimensionality, but it wasn't enough.

I was benchmarking both agglomerative clustering (AC) from the fastcluster library and hdbscan.
AC was much faster, but required too much RAM.
HDBSCAN used just a little memory, but required much more time.
RAPIDS seems to be much faster, but is the memory usage the same?

cjnolet · 2021-07-29T22:03:21Z

@versis,

The current HDBSCAN implementation in cuML uses brute-force nearest neighbors, so it does not have to materialize the full pairwise distance matrix in memory. We also built a primitive to connect the KNN graph in order to get the exact solution (and guarantee convergence of the MST). We are planning to use approximate nearest neighbors in a future release to speed up the computation further, but brute-force is pretty great for the initial version.

The GoogleNews Word2Vec dataset (3Mx300) takes ~22 min on my V100 (32 GB). We are going to work to make it even faster, but that's still very good considering the scikit-learn-contrib version had to be stopped after a day of running. Also, our initial version supports only the base set of features in the algorithm, but please let us know if there are missing features you'd like to see, such as fuzzy clustering or inference on out-of-sample data points.
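The remark above about connecting the KNN graph can be illustrated with a toy sketch. This is not cuML's implementation: the function names are hypothetical, the core distance is simplified to the distance to the k-th neighbor, and plain Kruskal with union-find stands in for whatever MST algorithm the library uses. The point it shows is that a kNN graph over well-separated groups can be disconnected, so an MST built from it alone is only a forest and extra cross-component edges are needed before the hierarchy is complete.

```python
import math


def brute_force_knn(points, k):
    """For each point, return its k nearest neighbors as sorted (distance, index) pairs."""
    knn = []
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        knn.append(dists[:k])
    return knn


def mutual_reachability_edges(knn):
    """Weight each kNN edge (i, j) by max(core(i), core(j), d(i, j))."""
    # Simplifying assumption: core distance = distance to the k-th neighbor.
    core = [neighbors[-1][0] for neighbors in knn]
    edges = []
    for i, neighbors in enumerate(knn):
        for d, j in neighbors:
            edges.append((max(core[i], core[j], d), i, j))
    return edges


def kruskal_forest(n, edges):
    """Kruskal's algorithm; returns the spanning forest and its component count."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    forest = []
    for w, i, j in sorted(edges):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            forest.append((w, i, j))
    components = len({find(x) for x in range(n)})
    return forest, components


# Two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]

knn = brute_force_knn(points, k=2)
forest, components = kruskal_forest(len(points), mutual_reachability_edges(knn))

print(len(forest), components)  # 4 2
```

With `k=2` every point's neighbors lie inside its own group, so the forest has four edges and two components instead of the five edges a spanning tree over six points requires. A connection step would add the cheapest edge between the two components to finish the tree.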

versis · 2021-08-02T09:41:21Z

@cjnolet Thank you for the quick response. 22 min seems promising! We will take a look, although inference is a must-have feature for us ;/

sparkdoc · 2021-08-03T19:21:39Z

Will you be including the soft clustering capability?
https://hdbscan.readthedocs.io/en/latest/soft_clustering.html#soft-clustering-for-hdbscan

beckernick · 2021-11-29T15:36:33Z

@dantegd @cjnolet does it make sense to close this issue and instead file separate issues for the different requests related to HDBSCAN (soft clustering, different algorithms, inference, prediction_data, etc.)? Or perhaps collect the various individual requests in this issue?

cjnolet · 2022-01-12T23:21:36Z

@beckernick, yep I definitely think it makes sense to close this issue.

I also want to link relevant issues here for reference:

soft clustering: [FEA] Soft clustering with HDBSCAN #4467
prediction / inference: [FEA] Need a approximate_predict function for cuml HDBSCAN #4448
different distance measures (and additional planned features): [TASK] Post HDBSCAN merge tasks #3879
speeding up algorithm w/ 3-15 dimensions: [FEA] Scaling several neighborhood methods w/ RBC #4161

mvss80 added the "? - Needs Triage" and "feature request" labels Mar 2, 2020

teju85 added the "New Algorithm" label Apr 14, 2020

teju85 mentioned this issue Apr 14, 2020

[FEA] Is there any plan to add HDBSCAN in cuml in the future? #1715

Closed

cjnolet mentioned this issue Aug 20, 2020

[FEA] Take advantage of MST in hierarchical clustering #2727

Closed

cjnolet closed this as completed Jan 12, 2022

jcfaracco mentioned this issue May 4, 2022

[QST] Why does scikit HDBSCAN return different results when we compare with CuML's HDBSCAN? #4723

Open
