[REVIEW] RF: Add Poisson deviance impurity criterion #4156
Conversation
Perf and accuracy checks for a randomly generated poisson dataset. The expectation is that tree models trained with the `poisson` criterion perform better on the poisson loss metric than those trained with `mse`.

Script used:

```python
from cuml import RandomForestRegressor as cuRF
from sklearn.tree import DecisionTreeRegressor as sklDT
from sklearn.metrics import mean_poisson_deviance
import numpy as np
import pandas as pd
import matplotlib

matplotlib.use("Agg")  # select the non-interactive backend before pyplot is imported

import matplotlib.pyplot as plt
import seaborn as sns
import time

sns.set()


def poisson_random_dataset(lam=0.1, n_datapoints=100000):
    np.random.seed(33)
    X = np.random.random((n_datapoints, 4)).astype(np.float32)
    y = np.random.poisson(lam=lam, size=n_datapoints).astype(np.float32)
    return X, y


rs = np.random.RandomState(92)
depths = range(1, 8)
bootstrap = None
max_features = 1.0
n_estimators = 1
min_impurity_decrease = 1e-5
n_datapoints = 100000

algo = {
    "skl_dt_poisson": sklDT(
        random_state=rs,
        min_impurity_decrease=min_impurity_decrease,
        criterion="poisson",
    ),
    "cuml_dt_poisson": cuRF(
        n_estimators=n_estimators,
        random_state=rs.randint(0, 1 << 32),
        bootstrap=bootstrap,
        min_impurity_decrease=min_impurity_decrease,
        split_criterion=4,  # poisson
    ),
    "skl_dt_mse": sklDT(
        random_state=rs,
        min_impurity_decrease=min_impurity_decrease,
        criterion="mse",
    ),
    "cuml_dt_mse": cuRF(
        n_estimators=n_estimators,
        random_state=rs.randint(0, 1 << 32),
        bootstrap=bootstrap,
        min_impurity_decrease=min_impurity_decrease,
        split_criterion=2,  # mse
    ),
}

datasets = {
    "poisson-0.1": poisson_random_dataset(0.1, n_datapoints),
}

figs, axes = plt.subplots(nrows=len(datasets), ncols=2, squeeze=False, figsize=(12, 7))
for score_ax, time_ax, (data_name, (X, y)) in zip(axes[:, 0], axes[:, 1], datasets.items()):
    X = X.astype(np.float32)
    y = y.astype(np.float32)
    df = pd.DataFrame(columns=["algorithm", "accuracy", "depth", "time"])
    for d in depths:
        for name, alg in algo.items():
            alg.set_params(max_depth=d)
            start = time.time()
            alg.fit(X, y)
            end = time.time()
            pred = alg.predict(X)
            # mean_poisson_deviance requires strictly positive predictions,
            # so score only on the positive ones when any are non-positive
            mask = pred > 0
            if (~mask).any():
                n_masked, n_samples = (~mask).sum(), mask.shape[0]
                print(f"masked {n_masked} of {n_samples} non-positive predictions")
                accuracy = mean_poisson_deviance(y[mask], pred[mask])
            else:
                accuracy = mean_poisson_deviance(y, pred)
            df = df.append(
                {"algorithm": name, "accuracy": accuracy, "depth": d, "time": end - start},
                ignore_index=True,
            )
    print(df)
    ### score
    sns.lineplot(data=df, x="depth", y="accuracy", hue="algorithm", ax=score_ax)
    score_ax.set_title(f"{data_name} poisson loss on {n_datapoints} data points")
    score_ax.set_ylabel("train poisson deviance")
    score_ax.set_xlabel("tree depth")
    ### timing (the first run is warmup, so skip depth 1)
    sns.barplot(data=df[df["depth"] > 1], x="depth", y="time", hue="algorithm", ax=time_ax)
    time_ax.set_title(f"{data_name} times (s) on {n_datapoints} data points")
    time_ax.set_ylabel("fit time (s)")
    time_ax.set_xlabel("tree depth")

plt.tight_layout()
plt.savefig("poisson-0.1-skl-vs-cuml.png")
plt.clf()
```
Poisson implementation looks good. I'd like to see the comment updated as best you can, explaining how we got our formula. Also, some C++ unit tests just for the objective class, and at least one Python-level test (probably you can just add poisson to the parameters of some existing test).
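For reference, scikit-learn defines its Poisson criterion via the half Poisson deviance of a node; a minimal sketch of that impurity follows (illustrative only: `poisson_half_deviance` is our name, not the merged API, and the cuML kernel is assumed to match this up to constant factors, which do not change which split wins).

```python
import numpy as np

def poisson_half_deviance(y):
    """Half Poisson deviance impurity of a node with targets y.

    With node mean m: mean(y * log(y / m) - y + m), where the
    y * log(y / m) term is taken as 0 when y == 0 (its limit).
    Requires y >= 0 and at least one positive target so that m > 0.
    """
    m = y.mean()
    with np.errstate(divide="ignore", invalid="ignore"):
        log_term = np.where(y > 0, y * np.log(y / m), 0.0)
    return float(np.mean(log_term - y + m))

# A split's gain is the parent impurity minus the weighted child impurities.
y = np.array([0.0, 0.0, 1.0, 3.0], dtype=np.float32)
print(poisson_half_deviance(y))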
Changes all look good. Can we get at least one Python-level test? It can be very simple, or an extension of existing tests. We can be confident that the C++ code is working correctly, but we want to check that the interface is correctly plumbed into this code.
Sure Rory, I added it yesterday night, but building and testing took time so I called it a day; will push them in a bit 👍
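(For context, the kind of Python-level test being asked for could look roughly like the sketch below. The names, data recipe, and thresholds are hypothetical, not necessarily the test that was actually pushed: fit with the poisson criterion, fit with mse, and assert the poisson-trained model scores better on mean Poisson deviance.)

```python
import numpy as np
from sklearn.metrics import mean_poisson_deviance
from cuml import RandomForestRegressor as cuRF

def test_poisson_beats_mse_on_poisson_loss():
    # same synthetic data recipe as the benchmark script above
    np.random.seed(33)
    X = np.random.random((10000, 4)).astype(np.float32)
    y = np.random.poisson(lam=0.1, size=10000).astype(np.float32)

    def fit_predict(criterion):
        model = cuRF(n_estimators=1, bootstrap=False, max_depth=8,
                     split_criterion=criterion, random_state=92)
        return model.fit(X, y).predict(X)

    pred_poisson = fit_predict(4)  # poisson
    pred_mse = fit_predict(2)      # mse
    # mean_poisson_deviance needs strictly positive predictions
    mask = (pred_poisson > 0) & (pred_mse > 0)
    assert (mean_poisson_deviance(y[mask], pred_poisson[mask])
            <= mean_poisson_deviance(y[mask], pred_mse[mask]))
```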
rerun tests
Just had another look at the failing tests - as they are only regression tests, I would take a careful look at your changes to the MSE loss function for any subtle changes in behaviour, e.g. handling of edge cases like having only one or two data points.
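(To make the edge case concrete, a tiny illustration of why one- and two-point nodes are where rearranged MSE formulas tend to diverge; the helper name is ours, not the cuML code.)

```python
import numpy as np

# A one-point node has zero variance, and a two-point node's MSE reduces to
# (y0 - y1)^2 / 4; any refactored accumulation of mean / sum-of-squares in
# the MSE objective must reproduce these exactly, or split gains shift.
def mse_impurity(y):
    return float(np.mean((y - y.mean()) ** 2))

assert mse_impurity(np.array([3.0])) == 0.0
assert mse_impurity(np.array([1.0, 2.0])) == 0.25
```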
rerun tests
```diff
@@ -74,16 +74,12 @@ class RandomForestClassifier(BaseRandomForestModel, DelayedPredictionMixin,
         run different models concurrently in different streams by creating
         handles in several streams.
         If it is None, a new one is created.
-    split_criterion : The criterion used to split nodes.
-        0 for GINI, 1 for ENTROPY, 4 for CRITERION_END.
+    split_criterion : int (default = 2)
```
Default 2?
Have fixed it in this commit, Rory.
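(For reference, the integer codes in play here, collected from the benchmark script and the old docstring above - which is why 0, not 2, is the sensible classifier default.)

```python
# split_criterion codes referenced in this PR (values as used in the
# benchmark script above; 0/1 are classification criteria, 2/4 regression).
SPLIT_CRITERION = {
    0: "GINI",
    1: "ENTROPY",
    2: "MSE",
    4: "POISSON",  # added by this PR
}
```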
approving pending conflict resolution
Codecov Report

```
@@           Coverage Diff            @@
##       branch-21.10    #4156  +/-  ##
=========================================
  Coverage          ?    86.07%
=========================================
  Files             ?       231
  Lines             ?     18634
  Branches          ?         0
=========================================
  Hits              ?     16040
  Misses            ?      2594
  Partials          ?         0
```
@gpucibot merge
This PR adds the Gamma and Inverse Gaussian criteria to train decision trees, along with modifications to RF unit tests.

---

checklist:
- [x] Add Gamma and Inverse Gaussian Objective classes
- [x] Add C++ tests for above
- [x] Add remaining C++ tests for other objective functions: entropy and mean squared error
- [x] Add Python-level convergence tests for gamma and inverse gaussian (just like the one added for poisson loss in #4156)
- [x] Check for regressions by benchmarking on gbm-bench
- [x] Convergence plots showing a model trained on a particular criterion performs better on its own loss metric than a baseline (`mse`)

Authors:
- Venkat (https://github.com/venkywonka)

Approvers:
- Rory Mitchell (https://github.com/RAMitchell)
- William Hicks (https://github.com/wphicks)
- Dante Gama Dessavre (https://github.com/dantegd)

URL: #4216
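For intuition about the new criteria, a sketch of the corresponding half deviances follows (Tweedie powers 2 and 3; constant factors are dropped since they do not change which split wins, both require strictly positive targets, and the function names are ours, not the merged API).

```python
import numpy as np

def gamma_half_deviance(y):
    # Tweedie power p = 2: log(m / y) + y / m - 1, with node mean m
    m = y.mean()
    return float(np.mean(np.log(m / y) + y / m - 1.0))

def inverse_gaussian_half_deviance(y):
    # Tweedie power p = 3: (y - m)^2 / (y * m^2), with node mean m
    m = y.mean()
    return float(np.mean((y - m) ** 2 / (y * m ** 2)))
```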
* Adds the poisson impurity criterion to RF, in parity with scikit-learn's RF regressor [[here](https://scikit-learn.org/stable/modules/tree.html#regression-criteria)]

EDIT:
* Also adds C++-level testing for RF objective function gains of Poisson and Gini.

Authors:
- Venkat (https://github.com/venkywonka)

Approvers:
- Rory Mitchell (https://github.com/RAMitchell)
- Dante Gama Dessavre (https://github.com/dantegd)

URL: rapidsai#4156
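(A reference version of the Gini impurity those C++ objective tests exercise - illustrative Python, not the merged CUDA code.)

```python
import numpy as np

# Gini impurity of a classification node: 1 - sum_k p_k^2 over class
# frequencies p_k; 0 for a pure node, maximal for a uniform label mix.
def gini_impurity(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

assert gini_impurity(np.array([0, 0, 1, 1])) == 0.5
```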