[FEA] Need a approximate_predict function for cuml HDBSCAN #4448

Closed
sudhanshu-shukla-git opened this issue Dec 14, 2021 · 14 comments · Fixed by #4872
Labels: "? - Needs Triage" (Need team to review and classify), "feature request" (New feature or request)

Comments

@sudhanshu-shukla-git

sudhanshu-shukla-git commented Dec 14, 2021

Is your feature request related to a problem? Please describe.
I wish I could use cuML HDBSCAN to predict clusters for new points from an existing model, similar to approximate_predict in the scikit-learn-contrib hdbscan library.

Describe the solution you'd like
Similar to scikit-learn HDBSCAN's approximate_predict
https://hdbscan.readthedocs.io/en/latest/api.html#hdbscan.prediction.approximate_predict

Predict the cluster label of new points. The returned labels will be those of the original clustering found by clusterer, and therefore are not (necessarily) the cluster labels that would be found by clustering the original data combined with points_to_predict, hence the ‘approximate’ label.

Describe alternatives you've considered

A CPU-based solution is already available from scikit-learn-contrib (the hdbscan library), but I need a GPU-based one.

https://hdbscan.readthedocs.io/en/latest/api.html#hdbscan.prediction.approximate_predict

sudhanshu-shukla-git added the "? - Needs Triage" and "feature request" labels Dec 14, 2021
@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@sudhanshu-shukla-git
Author

@cjnolet Do we have any updates on this feature? When can we expect this to be released?

@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@whymauri

whymauri commented May 2, 2022

(commenting to maintain the issue as active)

@github-actions

github-actions bot commented Jun 2, 2022

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@sudhanshu-shukla-git
Author

(commenting to maintain the issue as active)

@cedivad

cedivad commented Jun 17, 2022

I was also looking for this feature. I assume the models aren't binary-compatible and we can't use a model created by cuml for say scikit-learn's approximate_predict?

@whymauri

Technically, you can extract the required data structures from RAPIDS and inject them into SKLearn's HierarchicalLabelTree.

But you will have to do a lot of implementation on your own end.

@osalem-l

> Technically you can extract the required data structures from RAPIDS and inject them into SKLearn's HierarchicalLabelTree.
>
> But you will have to do a lot of implementation on your own end.

Could you illustrate how? I'm currently trying to figure this out.

@cedivad

cedivad commented Jun 20, 2022

I've looked at SKLearn's implementation and it seems they are using a brute force approach, calculating distances to each centroid one by one. On a GPU, I'm thinking yes, you could parallelise the distance calculations but you would still need to check the results one by one. Best case you would spawn a "binary tree" of checking threads. I believe this is a task that isn't very parallelizable, and maybe that's why it was de-prioritized?

If so, we only need to extract the centroids from RAPIDS and use them in whatever code we want, say a small Go HTTP server for inference on new vectors.
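
A minimal sketch of the centroid idea described above, in plain Python with no GPU or library dependencies. It is a simplification for illustration only and is not claimed to match how cuML or the hdbscan library predict labels. The function names, the `max_distance` cutoff, and the toy data are all hypothetical.

```python
import math


def compute_centroids(points, labels):
    """Average the points of each cluster; label -1 (noise) is skipped."""
    sums, counts = {}, {}
    for point, label in zip(points, labels):
        if label == -1:
            continue
        acc = sums.setdefault(label, [0.0] * len(point))
        for i, value in enumerate(point):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict_nearest_centroid(centroids, point, max_distance=None):
    """Return the label of the closest centroid.

    If max_distance is given and the closest centroid is farther away
    than that, the point is reported as noise (-1). This cutoff is an
    assumption of this sketch, not part of any library API.
    """
    best_label, best_dist = -1, math.inf
    for label, centroid in centroids.items():
        d = math.dist(point, centroid)
        if d < best_dist:
            best_label, best_dist = label, d
    if max_distance is not None and best_dist > max_distance:
        return -1
    return best_label


# Toy training data: two small clusters and one noise point.
train_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0), (50.0, 50.0)]
train_labels = [0, 0, 1, 1, -1]
centroids = compute_centroids(train_points, train_labels)

print(predict_nearest_centroid(centroids, (1.0, 1.0)))
print(predict_nearest_centroid(centroids, (9.0, 10.0)))
print(predict_nearest_centroid(centroids, (100.0, 100.0), max_distance=5.0))
```

The loop over centroids is the brute-force step; once the centroids are exported, the same logic could live in any language or service.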

@RaiAmanRai

RaiAmanRai commented Aug 1, 2022

Hi @cjnolet @divyegala, are there any updates on this feature, or an approximate timeline for when it will roll out?

Would really appreciate the work.

@ldsands

ldsands commented Aug 3, 2022

> Hi @cjnolet @divyegala any updates on this feature, or any approximate timeline when this will roll out.
>
> Would really appreciate the work.

Does this pull request not add this feature? I haven't dived in deep to see but just glancing it looks like it does. At the very least, this pull request is needed before the approximate_predict feature can be implemented.

@cjnolet
Member

cjnolet commented Aug 3, 2022

@sudhanshu-shukla-git @RaiAmanRai

> Does #4800 not add this feature? I haven't dived in deep to see but just glancing it looks like it does. At the very least, this pull request is needed before the approximate_predict feature can be implemented.

That pull request implements the pieces needed for fuzzy clustering, which is a stepping stone towards out-of-sample prediction (approximate_predict). We're working towards approximate_predict.

@DeepTitan

I second the need for this feature; it would really help in my project.

rapids-bot bot pushed a commit that referenced this issue Sep 3, 2022
PR for HDBSCAN approximate_predict

- [x] Building cluster_map
- [x] Modifying PredictionData class
- [x] Obtaining nearest neighbor in MR space
- [x] Computing probability
- [x] Tests

Closes #4877
Closes #4448

Authors:
  - Tarang Jain (https://github.com/tarang-jain)
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: #4872
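
For readers unfamiliar with the "nearest neighbor in MR space" step in the checklist above, here is a rough plain-Python sketch of the idea: compute a core distance for each point, define the mutual reachability (MR) distance as the maximum of the two core distances and the plain distance, and give a new point the label of its MR-nearest training point. This is an illustrative assumption about the general technique, not the code from the pull request; the function names and toy data are hypothetical, and the cluster_map and probability steps from the checklist are omitted.

```python
import math


def core_distances(points, k):
    """Distance from each point to its k-th nearest other point."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return result


def predict_label(train_points, train_labels, train_core, new_point, k):
    """Label new_point by its nearest training point in MR distance.

    MR distance between a and b is max(core(a), core(b), dist(a, b)).
    The new point's core distance is taken as the distance to its k-th
    nearest training point (an assumption of this sketch).
    Returns (label, mutual_reachability_distance).
    """
    dists = [math.dist(new_point, q) for q in train_points]
    new_core = sorted(dists)[k - 1]
    best_index, best_mr = None, math.inf
    for i, d in enumerate(dists):
        mr = max(new_core, train_core[i], d)
        if mr < best_mr:  # ties keep the earliest training point
            best_index, best_mr = i, mr
    return train_labels[best_index], best_mr


# Toy training data: two well-separated square clusters.
train_points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0),
]
train_labels = [0, 0, 0, 0, 1, 1, 1, 1]
train_core = core_distances(train_points, k=2)

label_a, mr_a = predict_label(train_points, train_labels, train_core, (0.5, 0.4), k=2)
label_b, mr_b = predict_label(train_points, train_labels, train_core, (10.2, 10.9), k=2)
print(label_a, label_b)
```

The training labels are never recomputed here, which is the sense in which the prediction is "approximate": new points are attached to the existing clustering rather than re-clustered with it.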
jakirkham pushed a commit to jakirkham/cuml that referenced this issue Feb 27, 2023
PR for HDBSCAN approximate_predict

- [x] Building cluster_map
- [x] Modifying PredictionData class
- [x] Obtaining nearest neighbor in MR space
- [x] Computing probability
- [x] Tests

Closes rapidsai#4877
Closes rapidsai#4448

Authors:
  - Tarang Jain (https://github.com/tarang-jain)
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#4872
8 participants