[Question] Clustering on UMAP output #25
I would certainly not apply K-Means to the results of UMAP (or t-SNE) output, since they rarely provide nice spherical clusters. On the other hand, I feel the linked answer is perhaps too cautious -- it's not that you can't apply a density based clustering algorithm to the results of t-SNE, so much as that one needs to be careful in interpreting the results. t-SNE can certainly "create" sub-clusters that aren't entirely there (by separating parts of a cluster), and t-SNE does certainly discard some density information, so again, care is needed. In this sense I believe it is perfectly acceptable to perform clustering on the result, provided you are going to submit the clusters to further analysis and verification. As long as you are not simply taking the results of clustering at face value (and you shouldn't really ever do that anyway), the results can provide useful information about your data.

Now, having said all of that: UMAP does offer some improvements over t-SNE on this front. It is significantly less likely to create sub-clusters in the way t-SNE does, and it will do a better job of preserving density (though far from perfect, and this requires small `min_dist` values).

If you want evidence that this can work: using HDBSCAN on a UMAP embedding of the MNIST digits dataset (with suitable parameter choices for each algorithm) gave me an ARI of 0.92, which is remarkably good for a purely unsupervised approach, and is clearly capturing real information about the data.

My biggest caveat is with regard to noise in the data: UMAP and t-SNE will both tend to contract noise into clusters. If you have noisy data then UMAP and t-SNE will hide that from you, so it pays to have some awareness of what your data is like before simply trusting a clustering (again, as is true of all clustering).
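For concreteness, here is a minimal sketch of the workflow described above: embed with UMAP, cluster the embedding with HDBSCAN, and compare against the true labels with ARI. The dataset (scikit-learn's small 8x8 digits) and every parameter value are illustrative stand-ins, not the exact settings behind the 0.92 figure quoted in the comment.

```python
# Sketch only: dataset and parameter values are illustrative assumptions.
import umap
import hdbscan
from sklearn.datasets import load_digits
from sklearn.metrics import adjusted_rand_score

digits = load_digits()  # small 8x8 digits as a stand-in for full MNIST

# A larger n_neighbors and min_dist=0.0 favour cluster structure over fine local detail.
embedding = umap.UMAP(n_neighbors=30, min_dist=0.0, random_state=42).fit_transform(digits.data)

# HDBSCAN on the 2D embedding; points it cannot place get the noise label -1.
labels = hdbscan.HDBSCAN(min_cluster_size=50).fit_predict(embedding)

# ARI against the known digit labels (noise points simply count as their own "cluster" here).
print("ARI:", adjusted_rand_score(digits.target, labels))
```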
Thank you so much for the deep answer. Very useful!
I have a related question: my intuition suggests using large `n_neighbors` values (and a low `min_dist`) when the goal is to cluster on the UMAP output. Is that the right way to think about it?
It is certainly true that small `n_neighbors` values can break a cluster into several pieces, so from a clustering point of view a larger `n_neighbors` value is the safer choice. A low `min_dist` will pack points together more tightly, which tends to suit density based clustering.

With regard to clustering parameters I would suggest it would be useful to use a low `min_samples` value for HDBSCAN.

In fun news, I think I can now describe HDBSCAN in the same primitives as UMAP, so the two may be more connected than one might think.
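One way this advice might translate into code. The specific numbers below are assumptions for illustration, not values taken from the thread; tune them against your own data.

```python
# Clustering-oriented settings (illustrative values only).
import umap
import hdbscan

reducer = umap.UMAP(
    n_neighbors=50,  # larger values emphasise global/cluster structure over fine local detail
    min_dist=0.0,    # allow tight packing so density information is retained
)

clusterer = hdbscan.HDBSCAN(
    min_samples=10,        # lower values make the clustering less conservative about noise
    min_cluster_size=100,  # smallest grouping you are willing to call a cluster
)

# embedding = reducer.fit_transform(X)       # X: your feature matrix
# labels = clusterer.fit_predict(embedding)
```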
Another question: what about the `n_components` parameter? Can we use values higher than 2, like 3, 4, ...?
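For reference, UMAP's `n_components` parameter does accept target dimensions above 2, which can be useful when the embedding feeds a clusterer rather than a plot. A small sketch with illustrative values:

```python
import umap

# Embed to more than two dimensions; 2D is only special for visualisation,
# and a density-based clusterer is happy to work in 3, 4, or more dimensions.
reducer_4d = umap.UMAP(n_components=4, n_neighbors=30, min_dist=0.0)
# embedding_4d = reducer_4d.fit_transform(X)  # shape: (n_samples, 4)
```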
edit: dumb question: once I have cluster labels from the embedding, how do I map them back to the original samples so I can interpret the clusters in terms of the original features?
Since the order of samples is preserved under UMAP and then clustering, you can assign cluster labels directly to the original source data and interpret the clusters there -- this would really be the recommended approach.
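A minimal sketch of that point, with a synthetic DataFrame standing in for the original source data (all names and values here are hypothetical):

```python
import pandas as pd
import umap
import hdbscan
from sklearn.datasets import make_blobs

# Hypothetical stand-in for "the original source data".
X, _ = make_blobs(n_samples=500, centers=4, n_features=10, random_state=0)
df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(X.shape[1])])

embedding = umap.UMAP(n_neighbors=30, min_dist=0.0).fit_transform(df.values)
labels = hdbscan.HDBSCAN(min_cluster_size=20).fit_predict(embedding)

# Row order is preserved through both steps, so labels line up with the original rows.
df["cluster"] = labels

# Interpret clusters in terms of the original features, not the embedding coordinates.
print(df.groupby("cluster").mean())
```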
Hi,
when using t-SNE, it is usually not recommended to perform clustering on the "reduced space" with algorithms such as k-means or DBSCAN (and HDBSCAN?), because the dimensionality reduction applied by t-SNE doesn't preserve properties like relative distances and density (see https://stats.stackexchange.com/questions/263539/k-means-clustering-on-the-output-of-t-sne).
Would it make sense to perform such clustering (with k-means, DBSCAN, HDBSCAN etc.) on the UMAP output?
Thank you very much.