-
Notifications
You must be signed in to change notification settings - Fork 540
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enhance cuML benchmark utility and refactor hdbscan import utilities #5242
Enhance cuML benchmark utility and refactor hdbscan import utilities #5242
Conversation
python/cuml/benchmark/algorithms.py
Outdated
@@ -69,6 +69,10 @@ | |||
import umap | |||
|
|||
|
|||
if has_hdbscan_prediction(raise_if_unavailable=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Our import utilities have two hdbscan availability checks. I don't believe the prediction
namespace is optional in HDBSCAN, so I've opted to use this as a placeholder. If neither the prediction
or plots
namespaces are optional, we can probably unify these utilities into a single has_hdbscan
like we have for other libraries (and customize the raised error in hdbscan.pyx).
Alternatively, I can add a has_hdbscan
in this PR and use it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think has_hdbscan makes more sense. Initially we only cared that the plotting package was available so it was named accordingly but since then we've added the prediction and we only really care that hdbscan itself is available.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. Would you prefer I open a separate PR to refactor or throw it into this one? Will add the utility and refactor hdbscan.pyx here.
rerun tests |
/merge |
rerun tests |
This PR makes several small changes:
LinearSVC
andLinearSVR
to the cuML benchmarks. Currently, we run SVC/SVR(linear) to benchmark a linear SVM. The scikit-learn documentation recommends using LinearSVC for large datasets instead for performance reasons. For even 10,000 records, the performance difference is quite significant. As the model quality can differ slightly between SVC(linear) and LinearSVC, we add LinearSVC rather than replace SVC(linear).Adds HDBSCAN to the benchmarks
Updates
RandomForest{Classifier, Regressor}
to use all CPU cores on the machine and to train more than 10 trees. The scikit-learn implementation benefits significantly from using multiple cores, but the benefit is capped by the number of trees. On large machines, using only 10 trees will bias toward slower performance relative to what's possible. As it's rare for people to train Random Forests with only 10 trees, this is changed to a more reasonable (but small) number of 50 trees.Updates RandomForestClassifier to use
max_features="sqrt"
rather than 1.0. This is generally regarded as the appropriate default setting (used in scikit-learn and noted in Hastie's ESL). Using 1.0 as max features takes significantly longer to train on the CPU and results in more correlated trees, which is not expected to improve results. As a result, it's not the ideal "default" characterization of performance.Refactors the HDBSCAN import utilities into a single
has_hdbscan
utility now that we use more of the CPU library in different areas.This replaces #5165