[FEA] Kmeans auto-find K #825
Comments
Note: there was an issue opened by @jeaton32 a while ago in cuML with some code: rapidsai/cuml#818
This is a port of rapidsai/cuml#818 (originally from NVGraph) which uses the Calinski-Harabasz score to find the optimal value of k.

Todo:
- [x] create histogram of cluster sizes
- [x] add googletests
- [x] expose public API

Closes #825

cc @jeaton32

Authors:
- Corey J. Nolet (https://github.com/cjnolet)

Approvers:
- Ben Frederickson (https://github.com/benfred)

URL: #1070
Some important workflows require the ability to find k automatically, using a measure of residual (the spread of point distances across all centroids) and dispersion (the spread of the centroids in relation to each other).
This requires an objective that maximizes the cluster-to-cluster distances while minimizing the point-to-cluster spread as much as possible. We should be able to do this fairly easily, especially with our new consolidated k-means implementations.
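One common way to combine these two quantities into a single number is the Calinski-Harabasz score, which is the score used in the rapidsai/cuml#818 code. As a rough sketch, with notation that is assumed here rather than taken from that code ($n$ points, $k$ clusters $C_j$ of size $n_j$ with centroids $c_j$, and overall mean $c$):

$$
\mathrm{CH}(k) = \frac{\operatorname{tr}(B_k)/(k-1)}{\operatorname{tr}(W_k)/(n-k)},
\qquad
\operatorname{tr}(B_k) = \sum_{j=1}^{k} n_j \,\lVert c_j - c \rVert^2,
\qquad
\operatorname{tr}(W_k) = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - c_j \rVert^2
$$

Here $\operatorname{tr}(B_k)$ plays the role of the dispersion and $\operatorname{tr}(W_k)$ the role of the residual; a larger score means better-separated, tighter clusters.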
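To make the idea concrete, here is a minimal pure-Python sketch of such a search: run k-means for each candidate k, score the result with the Calinski-Harabasz ratio (between-cluster dispersion over within-cluster residual), and keep the best k. This is only an illustration, not the RAFT or cuML implementation; every function name here (`kmeans`, `calinski_harabasz`, `find_k`) is hypothetical, and the deterministic farthest-point initialization is an assumption chosen to keep the example reproducible.

```python
# Illustrative sketch only: not the RAFT/cuML API. All names are hypothetical.

def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def farthest_first_init(points, k):
    """Deterministic init (an assumption): start from the first point, then
    repeatedly add the point farthest from the centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    return centroids

def kmeans(points, k, max_iter=100):
    """Plain Lloyd iterations; returns one cluster label per point."""
    centroids = farthest_first_init(points, k)
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: nearest centroid for every point.
        labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                  for p in points]
        # Update step: recompute means, keeping the old centroid if a cluster is empty.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            new_centroids.append(mean(members) if members else centroids[j])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels

def calinski_harabasz(points, labels):
    """Ratio of between-cluster dispersion to within-cluster residual,
    each normalized by its degrees of freedom."""
    n = len(points)
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    k = len(clusters)
    if k < 2 or k >= n:
        return 0.0  # score is undefined for these cases; treat as worst
    overall = mean(points)
    between = 0.0
    within = 0.0
    for members in clusters.values():
        c = mean(members)
        between += len(members) * sq_dist(c, overall)
        within += sum(sq_dist(p, c) for p in members)
    if within == 0.0:
        return float("inf")
    return (between / (k - 1)) / (within / (n - k))

def find_k(points, k_values):
    """Return (best_k, best_score) over the candidate values of k."""
    best_k, best_score = None, float("-inf")
    for k in k_values:
        score = calinski_harabasz(points, kmeans(points, k))
        if score > best_score:
            best_k, best_score = k, score
    return best_k, best_score

# Toy data: three tight 5x5 grids of points around well-separated centers.
offsets = [-0.2, -0.1, 0.0, 0.1, 0.2]
points = [(cx + dx, cy + dy)
          for cx, cy in [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
          for dx in offsets for dy in offsets]

best_k, best_score = find_k(points, range(2, 7))
```

On this toy data the search selects k = 3, since adding a fourth cluster only shaves a little off the residual while the score's `k - 1` normalization penalizes the extra cluster. A production version would more likely bracket or bisect the range of k instead of scanning every candidate, and would reuse the distances already computed by the k-means solver.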