
Enhance cuML benchmark utility and refactor hdbscan import utilities #5242

Merged: 11 commits merged into rapidsai:branch-23.04 on Mar 6, 2023

Conversation

beckernick (Member) commented Feb 17, 2023

This PR makes several small changes:

  • Adds LinearSVC and LinearSVR to the cuML benchmarks. Currently, we run SVC/SVR(linear) to benchmark a linear SVM. The scikit-learn documentation instead recommends LinearSVC for large datasets, for performance reasons. Even at 10,000 records, the performance difference is quite significant. Because model quality can differ slightly between SVC(linear) and LinearSVC, we add LinearSVC rather than replace SVC(linear). The timings below show the classification case; a regression-side sketch for LinearSVR follows them.
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC, SVC

X, y = make_classification(n_samples=30000, n_features=10)

# liblinear-based linear SVM
clf = LinearSVC()
%time clf.fit(X, y)
print(clf.score(X, y))
# CPU times: user 529 ms, sys: 4.09 ms, total: 534 ms
# Wall time: 534 ms
# 0.9278

# libsvm-based SVM with a linear kernel
clf = SVC(kernel="linear")
%time clf.fit(X, y)
print(clf.score(X, y))
# CPU times: user 5.23 s, sys: 115 ms, total: 5.35 s
# Wall time: 5.35 s
# 0.9278
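The PR also adds LinearSVR. For illustration only, here is a rough regression-side analogue of the comparison above; it is not taken from the PR, and the make_regression dataset and its size are assumptions. It is run the same way in IPython:

from sklearn.datasets import make_regression
from sklearn.svm import LinearSVR, SVR

X_r, y_r = make_regression(n_samples=10000, n_features=10, noise=0.1)

# liblinear-based linear SVR
reg = LinearSVR()
%time reg.fit(X_r, y_r)

# libsvm-based SVR with a linear kernel; expected to be substantially slower at this size
reg = SVR(kernel="linear")
%time reg.fit(X_r, y_r)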
  • Adds HDBSCAN to the benchmarks

  • Updates RandomForest{Classifier, Regressor} to use all CPU cores on the machine and to train more than 10 trees. The scikit-learn implementation benefits significantly from using multiple cores, but the benefit is capped by the number of trees. On large machines, using only 10 trees biases the benchmark toward slower performance than is achievable, as the timings below illustrate. Since it's rare to train Random Forests with only 10 trees, this is changed to a more reasonable (but still small) 50 trees.

from sklearn.ensemble import RandomForestClassifier

# single core
clf = RandomForestClassifier(n_estimators=2, n_jobs=1)
%time clf.fit(X, y)
# CPU times: user 3.09 s, sys: 20.9 ms, total: 3.11 s
# Wall time: 3.1 s

# all cores
clf = RandomForestClassifier(n_estimators=2, n_jobs=-1)
%time clf.fit(X, y)
# CPU times: user 3.14 s, sys: 8.51 ms, total: 3.14 s
# Wall time: 1.76 s

# three times as many trees, same wall time
clf = RandomForestClassifier(n_estimators=6, n_jobs=-1)
%time clf.fit(X, y)
# CPU times: user 8.74 s, sys: 19.3 ms, total: 8.76 s
# Wall time: 1.68 s
  • Updates RandomForestClassifier to use max_features="sqrt" rather than 1.0. This is generally regarded as the appropriate default for classification (used in scikit-learn and noted in Hastie et al.'s The Elements of Statistical Learning). Using max_features=1.0 takes significantly longer to train on the CPU and produces more correlated trees, which is not expected to improve results, so it's not the ideal "default" characterization of performance. A short illustrative comparison follows this item.
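For illustration only, a minimal sketch of the effect; this is not from the PR, and the dataset and parameters are assumptions. It is run in IPython like the snippets above:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_c, y_c = make_classification(n_samples=30000, n_features=20)

# sqrt(n_features) candidate features per split: faster, less correlated trees
clf = RandomForestClassifier(n_estimators=50, max_features="sqrt", n_jobs=-1)
%time clf.fit(X_c, y_c)

# every feature considered at every split: slower, more correlated trees
clf = RandomForestClassifier(n_estimators=50, max_features=1.0, n_jobs=-1)
%time clf.fit(X_c, y_c)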

  • Refactors the HDBSCAN import utilities into a single has_hdbscan utility, now that we use more of the CPU library in different areas. A rough sketch of the shape of such a utility follows.
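For context only, a minimal sketch of what a unified availability check can look like. This is an illustration, not the actual cuML implementation; only the raise_if_unavailable keyword is taken from the diff discussed below, and the error message is an assumption.

def has_hdbscan(raise_if_unavailable=True):
    # Report whether the CPU `hdbscan` package can be imported.
    try:
        import hdbscan  # noqa: F401
        return True
    except ImportError:
        if raise_if_unavailable:
            raise ImportError(
                "The 'hdbscan' package is required for this functionality "
                "but could not be imported."
            )
        return False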

This replaces #5165

@beckernick beckernick added Cython / Python Cython or Python issue benchmarking non-breaking Non-breaking change labels Feb 17, 2023
@beckernick beckernick self-assigned this Feb 17, 2023
Review thread on the diff hunk:

@@ -69,6 +69,10 @@
import umap


if has_hdbscan_prediction(raise_if_unavailable=False):
beckernick (Member, Author) commented Feb 17, 2023:

Our import utilities have two hdbscan availability checks. I don't believe the prediction namespace is optional in HDBSCAN, so I've opted to use this one as a placeholder. If neither the prediction nor the plots namespace is optional, we can probably unify these utilities into a single has_hdbscan like we have for other libraries (and customize the raised error in hdbscan.pyx).

Alternatively, I can add a has_hdbscan in this PR and use it.

Reviewer (Member):

I think has_hdbscan makes more sense. Initially we only cared that the plotting package was available, so it was named accordingly, but since then we've added the prediction check, and we only really care that hdbscan itself is available.

beckernick (Member, Author) commented Feb 21, 2023:

Sounds good. Would you prefer I open a separate PR to refactor, or throw it into this one? Will add the utility and refactor hdbscan.pyx here.

@beckernick beckernick added the improvement Improvement / enhancement to an existing function label Feb 17, 2023
@beckernick beckernick marked this pull request as ready for review February 17, 2023 18:44
@beckernick beckernick requested a review from a team as a code owner February 17, 2023 18:44
@beckernick beckernick changed the title Update cuML benchmark utility (add Linear{SVC, SVR} and HDBSCAN and improve Random Forest fairness) Enhance cuML benchmark utility and refactor hdbscan import utilities Feb 21, 2023
beckernick (Member, Author) commented:

rerun tests

cjnolet (Member) commented Feb 25, 2023:

/merge

beckernick (Member, Author) commented:

rerun tests

@rapids-bot rapids-bot bot merged commit 96d9ffe into rapidsai:branch-23.04 Mar 6, 2023
@beckernick beckernick deleted the benchmark-updates branch March 16, 2023 15:31
Labels: benchmarking; Cython / Python (Cython or Python issue); improvement (Improvement / enhancement to an existing function); non-breaking (Non-breaking change)