[QST] Why does scikit HDBSCAN return different results when we compare with CuML's HDBSCAN? #4723
Comments
@jcfaracco the first intuition I have is that your
This issue has been labeled
I have the same question here. In my case, min_samples=2 and min_cluster_size=2.
@garyhsu29, does this still happen if you increase the values of your hyperparameters? Are you using the same data as in the example above?
@beckernick today I ran an experiment with scikit-learn's HDBSCAN and got the same results: I see inconsistencies on at least 3 datasets (I used the same data as in the example). I can clearly see how the hyperparameters matter, but the point is that the same hyperparameters produce different results when we fit scikit's HDBSCAN and RAPIDS' HDBSCAN. It is fine for the CPU and GPU versions to have some inconsistencies depending on how the algorithm was implemented, but I would like to understand the technical reason.
@jcfaracco can you share some more information about the differences you are seeing? Are you seeing completely different clusterings, or are there specific points that show up in different clusters? Are points being grouped together similarly but with different cluster labels assigned to them? There are several implementation choices that can cause two different implementations to yield results which are both correct yet still different. First, the minimum spanning trees themselves can be approximate, and I would not expect an approximate algorithm to yield exactly the same results in two different implementations. You should be able to drag and drop images into the comment window on GitHub. It would be great if you could share some images, or at least a rough description of the differences you are seeing.
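One way to check whether two clusterings are "grouped together similarly but with different cluster labels" is a permutation-invariant score such as the adjusted Rand index. A minimal sketch (the two label arrays below are made up for illustration; in practice they would come from the scikit and cuML `fit_predict` calls):

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

# Two labelings that group the points identically but use different
# label ids (e.g. one implementation calls a cluster "0", the other "2").
labels_cpu = np.array([0, 0, 1, 1, 2, 2])
labels_gpu = np.array([2, 2, 0, 0, 1, 1])

# ARI is invariant to relabeling: identical partitions score 1.0,
# so any score below 1.0 indicates a genuine difference in grouping.
print(adjusted_rand_score(labels_cpu, labels_gpu))  # 1.0
```

A score close to 1.0 would mean the two implementations essentially agree and only the arbitrary label ids differ; a noticeably lower score would point to real differences in which points get grouped together.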
@cjnolet here is a visual overview of the two versions (including the original dataset and the diff). The diff plot shows some classes in yellow, orange, and light blue where the CPU and GPU versions disagree; in regular blue, the classification is the same.
What is your question?
Hello all,
I'm trying to validate both HDBSCAN implementations and I'm getting a weird result.
To explain it better, I'm going to show you some simple code that demonstrates the differences between them.
I really don't know if I'm making a mistake, if it is a bug or a missing feature, or if it is working as designed.
I would love to share the plots I'm getting, but I cannot attach images here.
I read the API paragraph that mentions some variance between the two versions, but it describes small variances, not significant ones like I'm seeing:
Note that while the algorithm is generally deterministic and should provide matching results between RAPIDS and the Scikit-learn Contrib versions, the construction of the k-nearest neighbors graph and minimum spanning tree can introduce differences between the two algorithms, especially when several nearest neighbors around a point might have the same distance. While the differences in the minimum spanning trees alone might be subtle, they can (and often will) lead to some points being assigned different cluster labels between the two implementations.
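The tie situation that paragraph describes is easy to reproduce: when several edges of the graph have exactly the same length, more than one minimum spanning tree exists, and which one an implementation picks is an internal detail. A tiny sketch (the square layout is just an illustrative example, not taken from the datasets above):

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

# Four points on a unit square: all four sides have length 1, so
# several distinct spanning trees share the minimal total weight 3.0.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
dist = squareform(pdist(pts))

mst = minimum_spanning_tree(dist).toarray()

# The total weight is determined by the data...
print(mst.sum())  # 3.0
# ...but *which* three sides are chosen is an implementation detail,
# so two correct implementations can build different trees here.
```

Since HDBSCAN derives its cluster hierarchy from the minimum spanning tree, two equally valid trees can propagate into different cluster labels for some points, which is exactly the behavior the documentation warns about.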
I also read the HDBSCAN feature request here, which explains some details of the implementation: #1783
If you have any recommendations or guidelines for avoiding this variation, I would be glad to hear them.
I think we should be able to validate both versions even if cuML's HDBSCAN has fewer features than the scikit version.