[BUG] AgglomerativeClustering: Duplicate data samples cause incorrect cluster assignments #3801
Comments
I believe I know the cause of this bug. We assume that a distance of exactly 0 occurs only on the diagonal of the pairwise distance matrix, so we set every zero distance to the max value for the MST to converge to the correct solution. A reasonable fix would be to do this only for the diagonal elements, so that duplicate data samples are supported. Here's a small example of getting the correct solution by making the first two data samples slightly different:
>>> features = np.array([[0.0, 0.0, 0.001], [0.0, 0.0, 0.002], [2.0, 2.0, 2.0]])
>>> AgglomerativeClustering(n_clusters=2).fit_predict(features)
Label prop iterations: 3
Iterations: 1
2068,40,24,6,66,134
n_edges: 2
Finished dendrogram
array([0, 0, 1], dtype=int32)
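To illustrate the suspected cause, here is a minimal NumPy-only sketch. It does not use the actual cuML or RAFT code: the helper names (`pairwise_sq_dists`, `mask_all_zeros`, `mask_diagonal_only`) and the duplicate input are assumptions made purely for illustration. It contrasts masking every zero distance with masking only the diagonal (self-loops).

```python
import numpy as np


def pairwise_sq_dists(X):
    """Dense squared Euclidean distance matrix (hypothetical helper)."""
    diff = X[:, None, :] - X[None, :, :]
    return (diff ** 2).sum(axis=-1)


def mask_all_zeros(D):
    """Suspected buggy behavior: every zero distance is treated as a self-loop."""
    out = D.copy()
    out[out == 0.0] = np.finfo(out.dtype).max
    return out


def mask_diagonal_only(D):
    """Proposed behavior: only the true self-loops on the diagonal are masked."""
    out = D.copy()
    np.fill_diagonal(out, np.finfo(out.dtype).max)
    return out


# Assumed input: the first two samples are exact duplicates.
features = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [2.0, 2.0, 2.0]])
D = pairwise_sq_dists(features)

buggy = mask_all_zeros(D)
fixed = mask_diagonal_only(D)

# Nearest neighbor of each sample after masking.
nearest_buggy = buggy.argmin(axis=1)
nearest_fixed = fixed.argmin(axis=1)

print(nearest_buggy)  # duplicates are pushed apart and attach to the far point
print(nearest_fixed)  # duplicates find each other at distance 0
```

With every zero masked, the two duplicates no longer see each other as neighbors, which is consistent with the incorrect cluster assignments reported here. Masking only the diagonal keeps the zero-weight edge between them.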
…istances from self-loops (#3824) Closes #3801 Closes #3802 Corresponding RAFT PR: rapidsai/raft#217 Authors: - Corey J. Nolet (https://github.com/cjnolet) Approvers: - Dante Gama Dessavre (https://github.com/dantegd) URL: #3824
Describe the bug
Getting unexpected clustering results from AgglomerativeClustering.
Steps/Code to reproduce bug
Expected behavior
Expecting outputs to be [0, 0, 1] or [1, 1, 0].
Environment details (please complete the following information):
conda install -c rapidsai-nightly -c nvidia -c conda-forge -c defaults rapids=0.19 python=3.7 cudatoolkit=10.2
as part of a docker build.