[QST] Why does scikit HDBSCAN return different results when we compare with CuML's HDBSCAN? #4723

Open
jcfaracco opened this issue May 4, 2022 · 8 comments
Labels
? - Needs Triage Need team to review and classify inactive-30d inactive-90d question Further information is requested


@jcfaracco
Contributor

jcfaracco commented May 4, 2022

What is your question?

Hello all,

I'm trying to validate the two HDBSCAN implementations against each other, and I'm getting a weird result.
To explain it better, here is a short script that reproduces the differences between them.
I really don't know whether I'm making a mistake, whether this is a bug or a missing feature, or whether it is working as designed.

import os
import pickle
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.datasets import make_blobs as make_blobs_cpu

from hdbscan import HDBSCAN as HDBSCAN_CPU
from cuml.cluster import HDBSCAN as HDBSCAN_GPU

np.random.seed(11)

sns.set_context('poster')
sns.set_style('white')
sns.set_color_codes()
plot_kwds = {'alpha' : 0.5, 's' : 80, 'linewidths':0}

blobs_file = 'blobs.pickle'

if not os.path.exists(blobs_file):
    blobs, _ = make_blobs_cpu(n_samples=4000, centers=[(-0.75,2.25), (1.0, 2.0), (1.0, 1.0), (2.0, -0.5), (-1.0, -1.0), (0.0, 0.0)], cluster_std=0.5)
    test_data = np.vstack([blobs])

    with open(blobs_file, 'wb') as handle:
        pickle.dump(test_data, handle, protocol=pickle.HIGHEST_PROTOCOL)
else:
    with open(blobs_file, 'rb') as handle:
        test_data = pickle.load(handle)
        
plt.scatter(test_data.T[0], test_data.T[1], color='b', **plot_kwds)

clusterer = HDBSCAN_CPU(min_samples=1, min_cluster_size=100)

clusterer.fit(test_data)

palette = sns.color_palette()
# Noise points (label -1) and labels beyond the palette are drawn in gray.
cluster_colors = [sns.desaturate(palette[col], sat)
                  if 0 <= col < len(palette) else (0.5, 0.5, 0.5) for col, sat in
                  zip(clusterer.labels_, clusterer.probabilities_)]
plt.scatter(test_data.T[0], test_data.T[1], c=cluster_colors, **plot_kwds)

clusterer_gpu = HDBSCAN_GPU(min_samples=1, min_cluster_size=100)

clusterer_gpu.fit(test_data)

palette = sns.color_palette()
# Noise points (label -1) and labels beyond the palette are drawn in gray.
cluster_colors = [sns.desaturate(palette[col], sat)
                  if 0 <= col < len(palette) else (0.5, 0.5, 0.5) for col, sat in
                  zip(clusterer_gpu.labels_, clusterer_gpu.probabilities_)]
plt.scatter(test_data.T[0], test_data.T[1], c=cluster_colors, **plot_kwds)

I would love to share the plots I'm getting, but I cannot attach images here.

I read the paragraph in the API documentation that mentions some variance between the two versions, but it describes small differences, not significant ones like those I'm seeing:

Note that while the algorithm is generally deterministic and should provide matching results between RAPIDS and the Scikit-learn Contrib versions, the construction of the k-nearest neighbors graph and minimum spanning tree can introduce differences between the two algorithms, especially when several nearest neighbors around a point might have the same distance. While the differences in the minimum spanning trees alone might be subtle, they can (and often will) lead to some points being assigned different cluster labels between the two implementations.

I also read the HDBSCAN feature request, which explains some points of the implementation: #1783

If you have any recommendation or guideline for avoiding this variation, I would be glad to hear it.
I think we should be able to validate the two versions against each other, even if cuML's HDBSCAN has fewer features than the scikit version.

@jcfaracco jcfaracco added ? - Needs Triage Need team to review and classify question Further information is requested labels May 4, 2022
@divyegala
Member

@jcfaracco my first intuition is that your min_samples is really low. Can you try increasing it? If your data is really dense, it is possible that the first neighbor (because min_samples=1) is found differently in the kNN step purely because of floating-point error.
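
To make the floating-point point concrete, here is a minimal sketch using only the standard library. It assumes, purely for illustration, that one implementation computes distances in single precision and the other in double precision, which this thread does not confirm. The distances and the `nearest` helper are hypothetical. Two neighbors whose distances differ by 1e-9 are distinct in float64 but become an exact tie once rounded to float32, and from then on the "first neighbor" depends only on how ties are broken.

```python
import struct


def to_float32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def nearest(distances, prefer_last=False):
    """Index of the smallest distance; ties go to the first or the last index."""
    best = min(distances)
    candidates = [i for i, d in enumerate(distances) if d == best]
    return candidates[-1] if prefer_last else candidates[0]


# Hypothetical distances from one query point to two candidate neighbors.
d64 = [1.0 + 1e-9, 1.0]
d32 = [to_float32(d) for d in d64]

# In float64 the answer does not depend on the tie-breaking rule.
print(nearest(d64), nearest(d64, prefer_last=True))  # 1 1

# In float32 both distances round to 1.0, so the tie-breaking rule decides.
print(nearest(d32), nearest(d32, prefer_last=True))  # 0 1
```

With a larger min_samples, the core distance is taken from a farther neighbor, so a single swapped nearest neighbor is less likely to change it.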

@github-actions

github-actions bot commented Jun 6, 2022

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@github-actions

github-actions bot commented Sep 4, 2022

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@garyhsu29

garyhsu29 commented Dec 19, 2022

I have the same question. In my case, min_samples=2 and min_cluster_size=2.
cuML's HDBSCAN yields a very different result from the CPU version of HDBSCAN.

@beckernick
Member

@garyhsu29, does this still happen if you increase the value of your hyperparameters? Are you using the same data as in the example above?

@jcfaracco
Contributor Author

jcfaracco commented Jul 16, 2024

@beckernick today I ran an experiment with the HDBSCAN in scikit-learn and got the same results. I see inconsistencies with at least 3 datasets (I used the same ones as in the example). I clearly see how the hyperparameters matter, but the point is that the same hyperparameters produce different results when we fit scikit's HDBSCAN and RAPIDS' HDBSCAN. For me, it is fine to have some inconsistencies between the CPU and GPU versions depending on how the algorithm was implemented, but I would like to understand the technical reason.

@cjnolet
Member

cjnolet commented Jul 16, 2024

@jcfaracco can you share some more information about the differences you are seeing? Are you seeing completely different clusterings or are there specific points that are showing up in some clusters? Are points being grouped together similarly but with different cluster labels assigned to them?
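
One way to separate "same groups, different label ids" from "different groups" is a permutation-invariant score such as the adjusted Rand index. The sketch below is a dependency-free version written for illustration (scikit-learn's `sklearn.metrics.adjusted_rand_score` computes the same quantity). It treats the noise label -1 as just another group, which is an assumption you may want to change.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same points.

    1.0 means identical groupings (label ids may differ); values near 0
    mean agreement no better than chance.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have the same length")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0
    # Pairs of points placed together by both, by A only, and by B only.
    pairs_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = pairs_a * pairs_b / total_pairs
    maximum = (pairs_a + pairs_b) / 2
    if maximum == expected:
        return 1.0
    return (pairs_both - expected) / (maximum - expected)


# Same grouping with swapped ids: the score is 1.0.
print(adjusted_rand_index([0, 0, 1, 1, -1], [1, 1, 0, 0, -1]))
# Groupings that cut across each other: the score is -0.5.
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```

Calling it on `clusterer.labels_` and `clusterer_gpu.labels_` from the script above would show whether the two fits disagree on the grouping itself or only on the numbering.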

Several implementation differences can cause two implementations to yield results that are both correct yet still different. First, the minimum spanning trees themselves can be approximate, and I would not expect an approximate algorithm to yield exactly the same results in two different implementations.

You should be able to drag and drop images into the comment window on GitHub. It would be great if you could share some images, or at least a rough description of the differences you are seeing.
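
Even an exact minimum spanning tree is only unique when all edge weights are distinct. The sketch below is a plain Kruskal implementation written for illustration, not the code of either library. It builds a spanning tree on a hypothetical four-node cycle whose edges all have weight 1.0, once with the edges in one order and once reversed. Both trees have the same total weight but contain different edges.

```python
def kruskal(n_nodes, edges):
    """Kruskal's algorithm. `edges` is a list of (weight, u, v).

    The sort is stable, so edges with equal weight are considered in
    their input order; that order is the only tie-breaking rule here.
    """
    parent = list(range(n_nodes))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for weight, u, v in sorted(edges, key=lambda e: e[0]):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # the edge joins two components: keep it
            parent[root_u] = root_v
            tree.append((weight, u, v))
    return tree


# Hypothetical graph: a 4-cycle in which every edge has the same weight.
edges = [(1.0, 0, 1), (1.0, 1, 2), (1.0, 2, 3), (1.0, 3, 0)]
tree_a = kruskal(4, edges)
tree_b = kruskal(4, list(reversed(edges)))

print(sum(w for w, _, _ in tree_a), sum(w for w, _, _ in tree_b))  # 3.0 3.0
print(sorted((u, v) for _, u, v in tree_a))  # [(0, 1), (1, 2), (2, 3)]
print(sorted((u, v) for _, u, v in tree_b))  # [(1, 2), (2, 3), (3, 0)]
```

Ties are plausible in HDBSCAN because the mutual reachability distance is the maximum of two core distances and the pairwise distance, so many edges around one point can share the same value.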

@jcfaracco
Contributor Author

jcfaracco commented Jul 16, 2024

@cjnolet here is a visual overview of the two versions (including the original dataset and the diff):

[image: original dataset, CPU clustering, GPU clustering, and Diff plots]

In the Diff plot, the classes in yellow, orange, and light blue mark where the CPU and GPU versions differ. Regular blue marks points that received the same classification.
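
For anyone reproducing a diff like this, note that the two fits may number the same cluster differently, so comparing raw label ids can overstate the disagreement. The sketch below relabels one result to match the other before comparing. The helper `align_labels` is hypothetical, and it uses a simple greedy overlap match rather than an optimal assignment, so treat it as a rough check under those assumptions.

```python
from collections import Counter


def align_labels(reference, other, noise=-1):
    """Relabel `other` with the ids of `reference` by greedy overlap matching."""
    # Count how often each (other id, reference id) pair occurs, ignoring noise.
    overlap = Counter(
        (o, r) for r, o in zip(reference, other) if r != noise and o != noise
    )
    mapping, used = {}, set()
    for (o, r), _count in overlap.most_common():
        if o not in mapping and r not in used:
            mapping[o] = r
            used.add(r)
    # Clusters of `other` with no partner get fresh ids past the reference ids.
    next_id = max((r for r in reference if r != noise), default=-1) + 1
    aligned = []
    for o in other:
        if o == noise:
            aligned.append(noise)
            continue
        if o not in mapping:
            mapping[o] = next_id
            next_id += 1
        aligned.append(mapping[o])
    return aligned


# Toy labels: the second result swaps ids 0 and 1 and moves one point.
cpu_labels = [0, 0, 0, 1, 1, -1]
gpu_labels = [1, 1, 0, 0, 0, -1]

aligned = align_labels(cpu_labels, gpu_labels)
differs = [a != b for a, b in zip(cpu_labels, aligned)]
print(aligned)  # [0, 0, 1, 1, 1, -1]
print(differs)  # [False, False, True, False, False, False]
```

Coloring only the points where `differs` is true gives a diff that ignores label renumbering.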
