[Question] Clustering on UMAP output #25
I would certainly not apply K-Means to the results of UMAP (or t-SNE) output, since they rarely provide nice spherical clusters. On the other hand, I feel the linked answer is perhaps too cautious -- it's not that you can't apply a density based clustering algorithm to the results of t-SNE, so much as that one needs to be careful in interpreting the results. t-SNE can certainly "create" sub-clusters that aren't entirely there (by separating parts of a cluster), and t-SNE does certainly discard some density information, so again, care is needed. In this sense I believe it is perfectly acceptable to perform clustering on the result, provided you are going to submit the clusters to further analysis and verification. As long as you are not simply taking the results of clustering at face value (and you shouldn't really ever do that anyway), the results can provide useful information about your data.

Now, having said all of that: UMAP does offer some improvements over t-SNE on this front. It is significantly less likely to create sub-clusters in the way t-SNE does, and it will do a better job of preserving density (though far from perfect, and this requires small `min_dist` values).

If you want evidence that this can work: using HDBSCAN on a UMAP embedding of the MNIST digits dataset (with suitable parameter choices for each algorithm) gave me an ARI of 0.92, which is remarkably good for a purely unsupervised approach, and is clearly capturing real information about the data.

My biggest caveat is with regard to noise in the data: UMAP and t-SNE will both tend to contract noise into clusters. If you have noisy data then UMAP and t-SNE will hide that from you, so it pays to have some awareness of what your data is like before simply trusting a clustering (again, as is true of all clustering).
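For concreteness, here is a minimal sketch of the workflow described above: embed with UMAP, cluster the embedding with HDBSCAN, and compare against the true labels with ARI. The dataset (scikit-learn's small 8x8 digits) and every parameter value are illustrative stand-ins, not the exact settings behind the 0.92 figure quoted in the comment.

```python
# Sketch only: dataset and parameter values are illustrative assumptions.
import umap
import hdbscan
from sklearn.datasets import load_digits
from sklearn.metrics import adjusted_rand_score

digits = load_digits()  # small 8x8 digits as a stand-in for full MNIST

# A larger n_neighbors and min_dist=0.0 favour cluster structure over fine local detail.
embedding = umap.UMAP(n_neighbors=30, min_dist=0.0, random_state=42).fit_transform(digits.data)

# HDBSCAN on the 2D embedding; points it cannot place get the noise label -1.
labels = hdbscan.HDBSCAN(min_cluster_size=50).fit_predict(embedding)

# ARI against the known digit labels (noise points simply count as their own "cluster" here).
print("ARI:", adjusted_rand_score(digits.target, labels))
```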
Thank you so much for the deep answer. Very useful!
I have a related question: my intuition suggests using large `n_neighbors` values (and a low `min_dist`) when the goal is to cluster on the UMAP output. Is that the right way to think about it?
It is certainly true that small `n_neighbors` values can break a cluster into several pieces, so from a clustering point of view a larger `n_neighbors` value is the safer choice. A low `min_dist` will pack points together more tightly, which tends to suit density based clustering.

With regard to clustering parameters I would suggest it would be useful to use a low `min_samples` value for HDBSCAN.

In fun news, I think I can now describe HDBSCAN in the same primitives as UMAP, so the two may be more connected than one might think.
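One way this advice might translate into code. The specific numbers below are assumptions for illustration, not values taken from the thread; tune them against your own data.

```python
# Clustering-oriented settings (illustrative values only).
import umap
import hdbscan

reducer = umap.UMAP(
    n_neighbors=50,  # larger values emphasise global/cluster structure over fine local detail
    min_dist=0.0,    # allow tight packing so density information is retained
)

clusterer = hdbscan.HDBSCAN(
    min_samples=10,        # lower values make the clustering less conservative about noise
    min_cluster_size=100,  # smallest grouping you are willing to call a cluster
)

# embedding = reducer.fit_transform(X)       # X: your feature matrix
# labels = clusterer.fit_predict(embedding)
```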
Another question: what about the `n_components` parameter? Can we use values higher than 2, like 3, 4, ...?
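For reference, UMAP's `n_components` parameter does accept target dimensions above 2, which can be useful when the embedding feeds a clusterer rather than a plot. A small sketch with illustrative values:

```python
import umap

# Embed to more than two dimensions; 2D is only special for visualisation,
# and a density-based clusterer is happy to work in 3, 4, or more dimensions.
reducer_4d = umap.UMAP(n_components=4, n_neighbors=30, min_dist=0.0)
# embedding_4d = reducer_4d.fit_transform(X)  # shape: (n_samples, 4)
```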
edit: dumb question: once I have cluster labels from the embedding, how do I map them back to the original samples so I can interpret the clusters in terms of the original features?
Since the order of samples is preserved under UMAP and then clustering, you can assign cluster labels directly to the original source data and interpret the clusters there -- this would really be the recommended approach.
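A minimal sketch of that point, with a synthetic DataFrame standing in for the original source data (all names and values here are hypothetical):

```python
import pandas as pd
import umap
import hdbscan
from sklearn.datasets import make_blobs

# Hypothetical stand-in for "the original source data".
X, _ = make_blobs(n_samples=500, centers=4, n_features=10, random_state=0)
df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(X.shape[1])])

embedding = umap.UMAP(n_neighbors=30, min_dist=0.0).fit_transform(df.values)
labels = hdbscan.HDBSCAN(min_cluster_size=20).fit_predict(embedding)

# Row order is preserved through both steps, so labels line up with the original rows.
df["cluster"] = labels

# Interpret clusters in terms of the original features, not the embedding coordinates.
print(df.groupby("cluster").mean())
```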
Hi,
when using t-SNE, it is usually not recommended to perform clustering on the "reduced space" with algorithms such as k-means or DBSCAN (and HDBSCAN?), because the dimensionality reduction applied by t-SNE doesn't preserve properties like relative distances and density (see https://stats.stackexchange.com/questions/263539/k-means-clustering-on-the-output-of-t-sne).
Would it make sense to perform such clustering (with k-means, DBSCAN, HDBSCAN etc.) on the UMAP output?
Thank you very much.