[QST] Why does scikit HDBSCAN return different results when we compare with CuML's HDBSCAN? #4723
Comments
@jcfaracco the first intuition I have is that your
This issue has been labeled
I have the same question here. In my case, min_samples=2 and min_cluster_size=2.
@garyhsu29, does this still happen if you increase the values of your hyperparameters? Are you using the same data as in the example above?
@beckernick today I ran an experiment with scikit-learn's HDBSCAN and got the same results: I see inconsistencies on at least 3 datasets (I used the same data as in the example). I can clearly see how the hyperparameters matter, but the point is that the same hyperparameters produce different results when we fit scikit's HDBSCAN and RAPIDS' HDBSCAN. It is fine for the CPU and GPU versions to have some inconsistencies depending on how the algorithm was implemented, but I would like to understand the technical reason.
@jcfaracco can you share some more information about the differences you are seeing? Are you seeing completely different clusterings, or are there specific points that show up in different clusters? Are points being grouped together similarly but with different cluster labels assigned to them? There are several implementation choices that can cause two different implementations to yield results which are both correct yet still different. First, the minimum spanning trees themselves can be approximate, and I would not expect an approximate algorithm to yield exactly the same results in two different implementations. You should be able to drag and drop images into the comment window on GitHub. It would be great if you could share some images, or at least a rough description of the differences you are seeing.
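One way to check whether two clusterings are "grouped together similarly but with different cluster labels" is a permutation-invariant score such as the adjusted Rand index. A minimal sketch (the two label arrays below are made up for illustration; in practice they would come from the scikit and cuML `fit_predict` calls):

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

# Two labelings that group the points identically but use different
# label ids (e.g. one implementation calls a cluster "0", the other "2").
labels_cpu = np.array([0, 0, 1, 1, 2, 2])
labels_gpu = np.array([2, 2, 0, 0, 1, 1])

# ARI is invariant to relabeling: identical partitions score 1.0,
# so any score below 1.0 indicates a genuine difference in grouping.
print(adjusted_rand_score(labels_cpu, labels_gpu))  # 1.0
```

A score close to 1.0 would mean the two implementations essentially agree and only the arbitrary label ids differ; a noticeably lower score would point to real differences in which points get grouped together.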
@cjnolet here is a visual overview of the two versions (including the original dataset and the diff). The diff plot shows some classes in yellow, orange, and light blue where the CPU and GPU versions disagree; in regular blue, the classification is the same.
What is your question?
Hello all,
I'm trying to validate both HDBSCAN implementations and I'm getting a weird result.
To explain it better, I'm going to show you some simple code that demonstrates the differences between them.
I really don't know if I'm making a mistake, if it is a bug or a missing feature, or if it is working as designed.
I would love to share the plots I'm getting, but I cannot attach images here.
I read the API paragraph that mentions some variance between the two versions, but it describes small variances, not significant ones like I'm seeing:
Note that while the algorithm is generally deterministic and should provide matching results between RAPIDS and the Scikit-learn Contrib versions, the construction of the k-nearest neighbors graph and minimum spanning tree can introduce differences between the two algorithms, especially when several nearest neighbors around a point might have the same distance. While the differences in the minimum spanning trees alone might be subtle, they can (and often will) lead to some points being assigned different cluster labels between the two implementations.
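The tie situation that paragraph describes is easy to reproduce: when several edges of the graph have exactly the same length, more than one minimum spanning tree exists, and which one an implementation picks is an internal detail. A tiny sketch (the square layout is just an illustrative example, not taken from the datasets above):

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

# Four points on a unit square: all four sides have length 1, so
# several distinct spanning trees share the minimal total weight 3.0.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
dist = squareform(pdist(pts))

mst = minimum_spanning_tree(dist).toarray()

# The total weight is determined by the data...
print(mst.sum())  # 3.0
# ...but *which* three sides are chosen is an implementation detail,
# so two correct implementations can build different trees here.
```

Since HDBSCAN derives its cluster hierarchy from the minimum spanning tree, two equally valid trees can propagate into different cluster labels for some points, which is exactly the behavior the documentation warns about.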
I also read the HDBSCAN feature request here, which explains some details of the implementation: #1783
If you have any recommendations or guidelines for avoiding this variation, I would be glad to hear them.
I think we should be able to validate both versions even if cuML's HDBSCAN has fewer features than the scikit version.