
Enhance cuML benchmark utility and refactor hdbscan import utilities #5242

Merged: 11 commits merged into rapidsai:branch-23.04 on Mar 6, 2023

Conversation

beckernick (Member) commented Feb 17, 2023

This PR makes several small changes:

  • Adds LinearSVC and LinearSVR to the cuML benchmarks. Currently, we run SVC/SVR(linear) to benchmark a linear SVM. The scikit-learn documentation instead recommends LinearSVC for large datasets, for performance reasons. Even at 10,000 records, the performance difference is quite significant. Because model quality can differ slightly between SVC(linear) and LinearSVC, we add LinearSVC rather than replace SVC(linear). The timings below show the classification case; a regression-side sketch for LinearSVR follows them.
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC, SVC

X, y = make_classification(n_samples=30000, n_features=10)

# liblinear-based linear SVM
clf = LinearSVC()
%time clf.fit(X, y)
print(clf.score(X, y))
# CPU times: user 529 ms, sys: 4.09 ms, total: 534 ms
# Wall time: 534 ms
# 0.9278

# libsvm-based SVM with a linear kernel
clf = SVC(kernel="linear")
%time clf.fit(X, y)
print(clf.score(X, y))
# CPU times: user 5.23 s, sys: 115 ms, total: 5.35 s
# Wall time: 5.35 s
# 0.9278
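The PR also adds LinearSVR. For illustration only, here is a rough regression-side analogue of the comparison above; it is not taken from the PR, and the make_regression dataset and its size are assumptions. It is run the same way in IPython:

from sklearn.datasets import make_regression
from sklearn.svm import LinearSVR, SVR

X_r, y_r = make_regression(n_samples=10000, n_features=10, noise=0.1)

# liblinear-based linear SVR
reg = LinearSVR()
%time reg.fit(X_r, y_r)

# libsvm-based SVR with a linear kernel; expected to be substantially slower at this size
reg = SVR(kernel="linear")
%time reg.fit(X_r, y_r)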
  • Adds HDBSCAN to the benchmarks

  • Updates RandomForest{Classifier, Regressor} to use all CPU cores on the machine and to train more than 10 trees. The scikit-learn implementation benefits significantly from using multiple cores, but the benefit is capped by the number of trees. On large machines, using only 10 trees biases the benchmark toward slower performance than is achievable, as the timings below illustrate. Since it's rare to train Random Forests with only 10 trees, this is changed to a more reasonable (but still small) 50 trees.

from sklearn.ensemble import RandomForestClassifier

# single core
clf = RandomForestClassifier(n_estimators=2, n_jobs=1)
%time clf.fit(X, y)
# CPU times: user 3.09 s, sys: 20.9 ms, total: 3.11 s
# Wall time: 3.1 s

# all cores
clf = RandomForestClassifier(n_estimators=2, n_jobs=-1)
%time clf.fit(X, y)
# CPU times: user 3.14 s, sys: 8.51 ms, total: 3.14 s
# Wall time: 1.76 s

# three times as many trees, same wall time
clf = RandomForestClassifier(n_estimators=6, n_jobs=-1)
%time clf.fit(X, y)
# CPU times: user 8.74 s, sys: 19.3 ms, total: 8.76 s
# Wall time: 1.68 s
  • Updates RandomForestClassifier to use max_features="sqrt" rather than 1.0. This is generally regarded as the appropriate default for classification (used in scikit-learn and noted in Hastie et al.'s The Elements of Statistical Learning). Using max_features=1.0 takes significantly longer to train on the CPU and produces more correlated trees, which is not expected to improve results, so it's not the ideal "default" characterization of performance. A short illustrative comparison follows this item.
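For illustration only, a minimal sketch of the effect; this is not from the PR, and the dataset and parameters are assumptions. It is run in IPython like the snippets above:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_c, y_c = make_classification(n_samples=30000, n_features=20)

# sqrt(n_features) candidate features per split: faster, less correlated trees
clf = RandomForestClassifier(n_estimators=50, max_features="sqrt", n_jobs=-1)
%time clf.fit(X_c, y_c)

# every feature considered at every split: slower, more correlated trees
clf = RandomForestClassifier(n_estimators=50, max_features=1.0, n_jobs=-1)
%time clf.fit(X_c, y_c)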

  • Refactors the HDBSCAN import utilities into a single has_hdbscan utility, now that we use more of the CPU library in different areas. A rough sketch of the shape of such a utility follows.
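For context only, a minimal sketch of what a unified availability check can look like. This is an illustration, not the actual cuML implementation; only the raise_if_unavailable keyword is taken from the diff discussed below, and the error message is an assumption.

def has_hdbscan(raise_if_unavailable=True):
    # Report whether the CPU `hdbscan` package can be imported.
    try:
        import hdbscan  # noqa: F401
        return True
    except ImportError:
        if raise_if_unavailable:
            raise ImportError(
                "The 'hdbscan' package is required for this functionality "
                "but could not be imported."
            )
        return False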

This replaces #5165

@beckernick beckernick added Cython / Python Cython or Python issue benchmarking non-breaking Non-breaking change labels Feb 17, 2023
@beckernick beckernick self-assigned this Feb 17, 2023
Review thread on the diff hunk:

@@ -69,6 +69,10 @@
import umap


if has_hdbscan_prediction(raise_if_unavailable=False):
beckernick (Member, Author) commented Feb 17, 2023:

Our import utilities have two hdbscan availability checks. I don't believe the prediction namespace is optional in HDBSCAN, so I've opted to use this one as a placeholder. If neither the prediction nor the plots namespace is optional, we can probably unify these utilities into a single has_hdbscan like we have for other libraries (and customize the raised error in hdbscan.pyx).

Alternatively, I can add a has_hdbscan in this PR and use it.

Reviewer (Member):

I think has_hdbscan makes more sense. Initially we only cared that the plotting package was available, so it was named accordingly, but since then we've added the prediction check, and we only really care that hdbscan itself is available.

beckernick (Member, Author) commented Feb 21, 2023:

Sounds good. Would you prefer I open a separate PR to refactor, or throw it into this one? Will add the utility and refactor hdbscan.pyx here.

@beckernick beckernick added the improvement Improvement / enhancement to an existing function label Feb 17, 2023
@beckernick beckernick marked this pull request as ready for review February 17, 2023 18:44
@beckernick beckernick requested a review from a team as a code owner February 17, 2023 18:44
@beckernick beckernick changed the title Update cuML benchmark utility (add Linear{SVC, SVR} and HDBSCAN and improve Random Forest fairness) Enhance cuML benchmark utility and refactor hdbscan import utilities Feb 21, 2023
beckernick (Member, Author) commented:

rerun tests

cjnolet (Member) commented Feb 25, 2023:

/merge

beckernick (Member, Author) commented:

rerun tests

@rapids-bot rapids-bot bot merged commit 96d9ffe into rapidsai:branch-23.04 Mar 6, 2023
@beckernick beckernick deleted the benchmark-updates branch March 16, 2023 15:31
Labels: benchmarking; Cython / Python (Cython or Python issue); improvement (Improvement / enhancement to an existing function); non-breaking (Non-breaking change)