[BUG] cosine distance is broken on DBSCAN #4938

Open
royinx opened this issue Oct 20, 2022 · 1 comment
Labels: "? - Needs Triage" (Need team to review and classify), "bug" (Something isn't working)

royinx commented Oct 20, 2022

Describe the bug

cuML DBSCAN with metric="cosine" fails to cluster: every point is labelled as noise (-1).
The line print(DBSCAN_cosine.core_sample_indices_) returns [].
The issue can be reproduced with the code below.


Also, thanks to #212 for giving me the idea of how to debug this. Quoting from that issue:
"Since Euclidean distance is used currently, cosine distance can be supported today by normalizing your vectors to unit norm. In the meantime, we can certainly work to add cosine & L1 distance."

However, I don't know which eps is appropriate once the vectors are normalized to unit norm.
After several attempts, eps=0.7 with Euclidean distance seems to give the result closest to DBSCAN_cosine.
Can anyone tell me the exact value of eps when converting from cosine distance to Euclidean distance?
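
For reference, a minimal sketch of the conversion, assuming the usual definition of cosine distance as 1 - cos(u, v). For unit-norm vectors, ||u - v||^2 = 2 - 2*cos(u, v) = 2 * cosine_distance(u, v), so a cosine radius eps_cosine corresponds to a Euclidean radius of sqrt(2 * eps_cosine). For eps_cosine = 0.25 that is about 0.7071, which is consistent with the empirically found 0.7. The helper names below are hypothetical and not part of cuML or scikit-learn.

```python
import math
import random


def cosine_eps_to_euclidean_eps(eps_cosine):
    """Euclidean radius on unit-norm vectors matching a cosine-distance radius."""
    return math.sqrt(2.0 * eps_cosine)


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_distance(u, v):
    """Cosine distance defined as 1 - cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Plain L2 distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Numerical check of the identity on two random 512-dimensional vectors.
random.seed(0)
u = [random.gauss(0.0, 1.0) for _ in range(512)]
v = [random.gauss(0.0, 1.0) for _ in range(512)]

d_cos = cosine_distance(u, v)
d_euc = euclidean_distance(normalize(u), normalize(v))
assert math.isclose(d_euc ** 2, 2.0 * d_cos, rel_tol=1e-9, abs_tol=1e-12)

# Euclidean eps equivalent to cosine eps=0.25 on unit-norm data.
print(cosine_eps_to_euclidean_eps(0.25))
```

The mapping is monotonic, so thresholding cosine distance at eps_cosine and thresholding Euclidean distance on the normalized vectors at sqrt(2 * eps_cosine) select the same neighbours, up to floating-point rounding at the boundary.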


Steps/Code to reproduce bug
Download testing data
data.npy
data_(84,512)_3clusters.npy

from sklearn import cluster
import numpy as np
import cuml
import cupy as cp


def main():
    # init data
    arr= np.load("data.npy")
    arr_norm = arr/np.linalg.norm(arr, axis=1, keepdims=True)
    arr_norm[np.isnan(arr_norm)] = 0

    # =================================== DBSCAN ===================================
    DBSCAN_cosine = cluster.DBSCAN(min_samples=5, eps=0.25, metric="cosine").fit(arr)
    labels = DBSCAN_cosine.labels_.tolist()
    print("sklean \t DBSCAN \t cosine \t", labels)

    # =================================== Optics ===================================
    optics_cosine = cluster.OPTICS(min_samples=5, max_eps=0.25, metric="cosine", cluster_method="dbscan").fit(arr)
    labels = optics_cosine.labels_.tolist()
    print("sklean \t optics \t cosine \t", labels)

    optics_euclidean = cluster.OPTICS(min_samples=5, max_eps=0.7, metric="euclidean", cluster_method="dbscan").fit(arr_norm)
    labels = optics_euclidean.labels_.tolist()
    print("sklean \t optics \t euclidean \t", labels)

    # =================================== cuML ===================================

    arr_norm = cp.array(arr_norm)
    DBSCAN_euclidean = cuml.DBSCAN(eps=0.7, min_samples=5, metric="euclidean", output_type="cupy")
    DBSCAN_euclidean.fit(arr_norm)
    labels_ = DBSCAN_euclidean.labels_
    labels_ = list(cp.asnumpy(labels_))
    print("cuML \t DBSCAN \t euclidean \t", labels_)


    arr = cp.asarray(arr)
    DBSCAN_cosine = cuml.DBSCAN(eps=0.25, min_samples=5, metric="cosine", output_type="cupy")
    DBSCAN_cosine.fit(arr)
    labels_ = DBSCAN_cosine.labels_
    labels_ = list(cp.asnumpy(labels_))
    print("cuML \t DBSCAN \t cosine \t", labels_)

    print(DBSCAN_euclidean.core_sample_indices_)
    print(DBSCAN_cosine.core_sample_indices_)

if __name__ == "__main__":
    main()
@georgeliu95

Hey @royinx, I think this issue is caused by #5360.
I reproduced it as you described:

sklean 	 DBSCAN 	 cosine 	 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, -1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
sklean 	 optics 	 cosine 	 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, -1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
sklean 	 optics 	 euclidean 	 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, -1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
cuML 	 DBSCAN 	 euclidean 	 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, -1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
cuML 	 DBSCAN 	 cosine 	 [-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1]
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 20 21 22 23 24 25 26
 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
 51 52 53 54 55 56 57 58 59]
[]

With that fixed, the output becomes:

sklean 	 DBSCAN 	 cosine 	 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, -1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
sklean 	 optics 	 cosine 	 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, -1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
sklean 	 optics 	 euclidean 	 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, -1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
cuML 	 DBSCAN 	 euclidean 	 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, -1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
cuML 	 DBSCAN 	 cosine 	 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, -1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 20 21 22 23 24 25 26
 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
 51 52 53 54 55 56 57 58 59]
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 20 21 22 23 24 25 26
 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
 51 52 53 54 55 56 57 58 59]
