[REVIEW] RF: Add Gamma and Inverse Gaussian loss criteria #4216
Conversation
Love to see the use of C++17 features and good use of the STL for clean code.
Is this PR significantly impacting compile times due to the new template instantiations?
Could you post a couple of Python plots in the comments of this PR, of the same type you did for Poisson, showing that these objectives reduce their accompanying loss function better than an MSE objective does? That is, train two models, one with the MSE objective and one with gamma, then evaluate the gamma training loss for both; the model trained with the gamma objective should have the better loss.
Edit: I guess your tests are doing this already, but it would be useful to verify it visually.
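For reference, a condensed sketch of that comparison (assuming the split_criterion strings added by this PR and sklearn's mean_tweedie_deviance as the metric; the full script posted further down in this thread does the same across all three criteria):

import numpy as np
from cuml.ensemble import RandomForestRegressor as curfr
from sklearn.metrics import mean_tweedie_deviance

rng = np.random.default_rng(33)
X = rng.random((1000, 4)).astype(np.float32)
y = rng.gamma(shape=2.0, size=1000).astype(np.float32)  # gamma-distributed target

# one model trained with the gamma criterion, one with plain MSE (criterion enum 2)
gamma_preds = curfr(split_criterion="gamma", n_estimators=1).fit(X, y).predict(X)
mse_preds = curfr(split_criterion=2, n_estimators=1).fit(X, y).predict(X)

# evaluate both on the gamma deviance (power=2); predictions must be strictly positive
mask = (gamma_preds > 0) & (mse_preds > 0)
print("gamma-trained:", mean_tweedie_deviance(y[mask], gamma_preds[mask], power=2))
print("mse-trained:  ", mean_tweedie_deviance(y[mask], mse_preds[mask], power=2))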
@pytest.mark.parametrize("split_criterion",
                         ["poisson", "gamma", "inverse_gaussian"])
def test_tweedie_convergence(max_depth, split_criterion):
    np.random.seed(33)
    bootstrap = None
    max_features = 1.0
    n_estimators = 1
    min_impurity_decrease = 1e-5
    n_datapoints = 100000
This amount of data is a bit excessive; you could dial it back to 1000.
It is impacting compile time due to more template instantiations of the builder class... when I used ccache with […]
Convergence plots of the tweedies w.r.t. MSE. Script used:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import itertools
from cuml.ensemble import RandomForestRegressor as curfr
from sklearn.metrics import mean_tweedie_deviance
split_criterion_list = ["poisson", "gamma", "inverse_gaussian"]
max_depth_list = [2, 4, 6, 8, 12]
df = pd.DataFrame(columns=["loss", "mse_tweedie_deviance", "tweedie_tweedie_deviance", "depth"])
for split_criterion, max_depth in itertools.product(split_criterion_list, max_depth_list):
    np.random.seed(33)
    bootstrap = None
    max_features = 1.0
    n_estimators = 1
    min_impurity_decrease = 1e-5
    n_datapoints = 100000
    tweedie = {
        "poisson":
            {"power": 1,
             "gen": np.random.poisson, "args": [0.01]},
        "gamma":
            {"power": 2,
             "gen": np.random.gamma, "args": [2.0]},
        "inverse_gaussian":
            {"power": 3,
             "gen": np.random.wald, "args": [0.1, 2.0]}
    }
    # generating random dataset with tweedie distribution
    X = np.random.random((n_datapoints, 4)).astype(np.float32)
    y = tweedie[split_criterion]["gen"](*tweedie[split_criterion]["args"],
                                        size=n_datapoints).astype(np.float32)
    tweedie_preds = curfr(
        split_criterion=split_criterion,
        max_depth=max_depth,
        n_estimators=n_estimators,
        bootstrap=bootstrap,
        max_features=max_features,
        min_impurity_decrease=min_impurity_decrease).fit(X, y).predict(X)
    mse_preds = curfr(
        split_criterion=2,
        max_depth=max_depth,
        n_estimators=n_estimators,
        bootstrap=bootstrap,
        max_features=max_features,
        min_impurity_decrease=min_impurity_decrease).fit(X, y).predict(X)
    # predictions must be strictly positive for the tweedie deviance
    mask = mse_preds > 0
    mse_tweedie_deviance = mean_tweedie_deviance(
        y[mask], mse_preds[mask], power=tweedie[split_criterion]["power"])
    tweedie_tweedie_deviance = mean_tweedie_deviance(
        y[mask], tweedie_preds[mask], power=tweedie[split_criterion]["power"])
    # pd.DataFrame.append was removed in pandas 2.0; build the row and concat instead
    df = pd.concat([df, pd.DataFrame([{
        "loss": split_criterion,
        "mse_tweedie_deviance": mse_tweedie_deviance,
        "tweedie_tweedie_deviance": tweedie_tweedie_deviance,
        "depth": max_depth,
    }])], ignore_index=True)
    # a model trained on tweedie data with the matching tweedie criterion
    # must perform better on the tweedie loss than the mse-trained model
    assert mse_tweedie_deviance >= tweedie_tweedie_deviance
matplotlib.use("Agg")
sns.set()
tweedies = ["poisson", "gamma", "inverse_gaussian"]
figs, axes = plt.subplots(nrows=len(tweedies), ncols=1, squeeze=True, figsize=(7, 12))
for loss, ax in zip(tweedies, axes):
    sns.lineplot(data=df[df["loss"] == loss], x="depth",
                 y="tweedie_tweedie_deviance", ax=ax, label=f"trained on {loss}")
    sns.lineplot(data=df[df["loss"] == loss], x="depth",
                 y="mse_tweedie_deviance", ax=ax, label="trained on mse")
    ax.set_ylabel(f"{loss} deviance")
figs.tight_layout()
plt.savefig('convergence.png')
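If the criteria behave as intended, the tweedie_tweedie_deviance curve should sit at or below the mse_tweedie_deviance curve at every depth; that is exactly what the in-loop assert checks, and the plots make it visible.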
rerun tests
Looks great! Just one missing doc block, but I didn't notice any other issues.
Codecov Report
@@           Coverage Diff            @@
##        branch-21.12     #4216   +/- ##
==========================================
  Coverage         ?      86.07%
==========================================
  Files            ?         231
  Lines            ?       18694
  Branches         ?           0
==========================================
  Hits             ?       16090
  Misses           ?        2604
  Partials         ?           0
==========================================
Flags with carried forward coverage won't be shown.
@gpucibot merge
* Some updates to RF documentation
* to be merged after #4216

Authors:
  - Venkat (https://github.com/venkywonka)

Approvers:
  - Rory Mitchell (https://github.com/RAMitchell)
  - Vinay Deshpande (https://github.com/vinaydes)
  - Dante Gama Dessavre (https://github.com/dantegd)

URL: #4138
This PR adds the Gamma and Inverse Gaussian criteria to train decision trees, along with modifications to the RF unit tests.

---

checklist:
- [x] Add Gamma and Inverse Gaussian Objective classes
- [x] Add C++ tests for above
- [x] Add remaining C++ tests for other objective functions: entropy and mean squared error
- [x] Add python level convergence tests for gamma and inverse gaussian (just like the one added for poisson loss in rapidsai#4156)
- [x] Check for regressions by benchmarking on gbm-bench
- [x] Convergence plots showing a model trained on a particular criterion performs better on its own loss metric than a baseline (`mse`)

Authors:
  - Venkat (https://github.com/venkywonka)

Approvers:
  - Rory Mitchell (https://github.com/RAMitchell)
  - William Hicks (https://github.com/wphicks)
  - Dante Gama Dessavre (https://github.com/dantegd)

URL: rapidsai#4216
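A rough sketch of what these deviance-based criteria optimize (Python for brevity; this is illustrative, not the PR's actual C++ implementation, and the function names are made up). Each node predicts its own mean, and a split is scored by how much it reduces the summed half-deviance:

import numpy as np

def node_half_deviance(y, power):
    """Summed Tweedie half-deviance of a node that predicts its own mean.

    power=1: Poisson, power=2: gamma, power=3: inverse Gaussian.
    Assumes y > 0, as the gamma and inverse Gaussian deviances require.
    """
    mu = y.mean()
    if power == 1:   # Poisson: y*log(y/mu) - (y - mu)
        return np.sum(y * np.log(y / mu) - (y - mu))
    if power == 2:   # gamma: (y - mu)/mu - log(y/mu)
        return np.sum((y - mu) / mu - np.log(y / mu))
    if power == 3:   # inverse Gaussian: (y - mu)^2 / (2 * mu^2 * y)
        return np.sum((y - mu) ** 2 / (2.0 * mu ** 2 * y))
    raise ValueError("power must be 1, 2, or 3")

def split_gain(y_left, y_right, power):
    """Deviance reduction achieved by splitting a node into two children."""
    parent = node_half_deviance(np.concatenate([y_left, y_right]), power)
    return (parent - node_half_deviance(y_left, power)
            - node_half_deviance(y_right, power))

Because the node prediction is always the node mean, terms that depend only on y (for example the sum of log(y) in the gamma deviance) cancel between parent and children, so an implementation can score splits with a cheaper per-node proxy built from running sums.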