[FEA] Kmeans auto-find K #825

Closed
cjnolet opened this issue Sep 14, 2022 · 2 comments · Fixed by #1070
Labels
feature request New feature or request inactive-30d

Comments

@cjnolet
Member

cjnolet commented Sep 14, 2022

Some important workflows need k-means to find k automatically, using a measure of residual (the spread of point distances across all centroids) and dispersion (the spread of centroids relative to each other).

This requires an objective that maximizes cluster-to-cluster distances while minimizing the point-to-cluster spread as much as possible. We should be able to do this fairly easily, especially with our new consolidated k-means implementations.
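To make the two quantities concrete, here is a minimal pure-Python sketch. It assumes one common reading of the terms: residual as the sum of squared distances from each point to its own centroid, and dispersion as the size-weighted sum of squared distances from each centroid to the global mean. The function names are hypothetical and this is not the RAFT code.

```python
from math import fsum

def centroid(points):
    # Component-wise mean of a non-empty list of equal-length points.
    n = len(points)
    return [fsum(p[d] for p in points) / n for d in range(len(points[0]))]

def sq_dist(a, b):
    # Squared Euclidean distance between two points.
    return fsum((x - y) ** 2 for x, y in zip(a, b))

def residual_and_dispersion(points, labels):
    """Return (residual, dispersion) for a labelled clustering.

    residual:   sum of squared point-to-own-centroid distances (assumed definition)
    dispersion: size-weighted sum of squared centroid-to-global-mean distances
                (assumed definition)
    """
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    overall = centroid(points)
    residual = 0.0
    dispersion = 0.0
    for members in clusters.values():
        c = centroid(members)
        residual += fsum(sq_dist(p, c) for p in members)
        dispersion += len(members) * sq_dist(c, overall)
    return residual, dispersion

# Two tight groups far apart along the first axis.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good = residual_and_dispersion(points, [0, 0, 1, 1])  # split along the wide axis
bad = residual_and_dispersion(points, [0, 1, 0, 1])   # split along the narrow axis
```

On this toy data the sensible split gives a small residual and a large dispersion, and the poor split gives the reverse, which is the trade-off the objective is meant to capture.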

@cjnolet cjnolet added the feature request New feature or request label Sep 14, 2022
@dantegd
Member

dantegd commented Sep 14, 2022

Note: there was an issue opened by @jeaton32 a while ago in cuML with some code: rapidsai/cuml#818

@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

rapids-bot bot pushed a commit that referenced this issue Feb 21, 2023
This is a port of rapidsai/cuml#818 (originally from NVGraph) which uses the Calinski-Harabasz score to find the optimal value of k.
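For readers unfamiliar with the Calinski-Harabasz score, the sketch below shows how it can rank candidate values of k. It is a pure-Python illustration under stated assumptions, not the ported implementation: `calinski_harabasz` and `best_k` are hypothetical names, and the labelings are assumed to come from separate k-means runs. How the actual port searches over k is not described in this thread.

```python
from math import fsum

def _mean(points):
    # Component-wise mean of a non-empty list of equal-length points.
    n = len(points)
    return [fsum(p[d] for p in points) / n for d in range(len(points[0]))]

def _sq_dist(a, b):
    # Squared Euclidean distance between two points.
    return fsum((x - y) ** 2 for x, y in zip(a, b))

def calinski_harabasz(points, labels):
    """Ratio of between-cluster to within-cluster dispersion,
    each normalized by its degrees of freedom. Higher is better."""
    n = len(points)
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    k = len(clusters)
    if k < 2 or k >= n:
        raise ValueError("score is defined only for 2 <= k < n")

    overall = _mean(points)
    between = 0.0
    within = 0.0
    for members in clusters.values():
        c = _mean(members)
        between += len(members) * _sq_dist(c, overall)
        within += fsum(_sq_dist(p, c) for p in members)

    if within == 0.0:
        return float("inf")  # every point sits exactly on its centroid
    return (between / (k - 1)) / (within / (n - k))

def best_k(points, labelings):
    # labelings maps each candidate k to the labels produced for that k.
    return max(labelings, key=lambda k: calinski_harabasz(points, labelings[k]))

# Three well-separated pairs of points; the labelings are hand-written
# stand-ins for the output of k-means at k = 2, 3, 4.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0), (20.0, 0.0), (20.0, 1.0)]
labelings = {
    2: [0, 0, 0, 0, 1, 1],
    3: [0, 0, 1, 1, 2, 2],
    4: [0, 0, 1, 1, 2, 3],
}
chosen = best_k(points, labelings)
```

Because the score divides by the degrees of freedom, splitting a tight cluster further (k = 4 here) lowers it even though the within-cluster spread shrinks, so the maximum lands on k = 3 for this toy data.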

Todo:
- [x] create histogram of cluster sizes
- [x] add googletests 
- [x] expose public API

Closes #825


cc @jeaton32

Authors:
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - Ben Frederickson (https://github.com/benfred)

URL: #1070

2 participants