[FEA] Need an approximate_predict function for cuml HDBSCAN #4448
Comments
@cjnolet Do we have any updates on this feature? When can we expect this to be released? |
(commenting to maintain the issue as active) |
I was also looking for this feature. I assume the models aren't binary-compatible, so we can't use a model created by cuml with, say, scikit-learn's approximate_predict?
Technically you can extract the required data structures from RAPIDS and inject them into SKLearn's HierarchicalLabelTree, but you will have to implement a lot of it yourself.
Could you illustrate how? I'm currently trying to figure this out.
I've looked at SKLearn's implementation and it seems to use a brute-force approach, calculating the distance to each centroid one by one. On a GPU, I think you could parallelise the distance calculations, but you would still need to check the results one by one. In the best case you would spawn a "binary tree" of checking threads. I believe this task isn't very parallelizable, and maybe that's why it was de-prioritized? If so, we only need to extract the centroids from RAPIDS and use them in whatever code we want, say a small Go HTTP server for inference on new vectors.
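To make the idea in the comment above concrete, here is a minimal sketch of brute-force nearest-centroid assignment in plain Python. It assumes the cluster centroids have already been extracted somehow; the function name `nearest_centroid` and the sample centroids are hypothetical. It only illustrates the approach the commenter describes and is not how hdbscan or cuML implement approximate_predict.

```python
import math

def nearest_centroid(point, centroids):
    """Return (index, distance) of the closest centroid, checked one by one."""
    best_index, best_dist = -1, math.inf
    for i, centroid in enumerate(centroids):
        d = math.dist(point, centroid)  # Euclidean distance (Python 3.8+)
        if d < best_dist:
            best_index, best_dist = i, d
    return best_index, best_dist

# Hypothetical centroids, one per cluster label 0, 1, 2.
centroids = [(0.0, 0.0), (10.0, 10.0), (-5.0, 5.0)]
label, dist = nearest_centroid((9.0, 9.5), centroids)
```

The distance computations are independent and could run in parallel; only the final minimum is a reduction. Note that this sketch never labels a point as noise, which a density-based method would need to handle separately.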
Hi @cjnolet @divyegala, are there any updates on this feature, or an approximate timeline for when it will roll out? We would really appreciate the work.
Does this pull request not add this feature? I haven't dived in deep to see but just glancing it looks like it does. At the very least, this pull request is needed before the approximate_predict feature can be implemented. |
@sudhanshu-shukla-git @RaiAmanRai
That pull request implements the needed pieces for fuzzy clustering, which is a stepping stone towards out of sample prediction (approximate_predict). We're working towards the approximate predict. |
I second the need for this feature. It would really help in my project.
PR for HDBSCAN approximate_predict
- [x] Building cluster_map
- [x] Modifying PredictionData class
- [x] Obtaining nearest neighbor in MR space
- [x] Computing probability
- [x] Tests

Closes #4877
Closes #4448

Authors:
- Tarang Jain (https://github.com/tarang-jain)
- Corey J. Nolet (https://github.com/cjnolet)

Approvers:
- Corey J. Nolet (https://github.com/cjnolet)

URL: #4872
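The PR's code is not shown in this thread, but the "nearest neighbor in MR space" step can be sketched in plain Python using the standard definition of mutual reachability distance, max(core(a), core(b), d(a, b)). The function names, the choice of `k`, and the sample data below are assumptions for illustration only. The sketch omits the cluster-tree lookup and the probability computation listed in the checklist.

```python
import math

def core_distances(points, k):
    """Core distance of each point: distance to its k-th nearest other point."""
    cores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        cores.append(dists[k - 1])
    return cores

def nearest_in_mr_space(query, points, cores, k):
    """Return (index, distance) of the training point closest to `query`
    under mutual reachability distance."""
    dists = [math.dist(query, p) for p in points]
    # Core distance of the new point: distance to its k-th nearest training point.
    query_core = sorted(dists)[k - 1]
    best_index, best_mr = -1, math.inf
    for i, d in enumerate(dists):
        mr = max(query_core, cores[i], d)
        if mr < best_mr:
            best_index, best_mr = i, mr
    return best_index, best_mr

# Hypothetical training data: two well-separated groups with known labels.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = [0, 0, 0, 1, 1, 1]
k = 2
cores = core_distances(points, k)
index, mr_dist = nearest_in_mr_space((9.5, 10), points, cores, k)
predicted = labels[index]  # label borrowed from the nearest neighbor
```

Borrowing the neighbor's label directly is a simplification: a full implementation would also compare the new point's density against the cluster hierarchy before assigning a label, so that sparse points can be marked as noise.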
Is your feature request related to a problem? Please describe.
I wish I could use cuML HDBSCAN to predict cluster labels for new points from an existing model, similar to scikit-learn's approximate_predict.
Describe the solution you'd like
Similar to scikit-learn HDBSCAN's approximate_predict
https://hdbscan.readthedocs.io/en/latest/api.html#hdbscan.prediction.approximate_predict
Predict the cluster label of new points. The returned labels will be those of the original clustering found by clusterer, and therefore are not (necessarily) the cluster labels that would be found by clustering the original data combined with points_to_predict, hence the ‘approximate’ label.
Describe alternatives you've considered
A CPU-based solution is already available from Scikit, but I need a GPU-based one.
https://hdbscan.readthedocs.io/en/latest/api.html#hdbscan.prediction.approximate_predict