[REVIEW] RF: Add Gamma and Inverse Gaussian loss criteria #4216
Conversation
Love to see the use of C++17 features and good use of the STL for clean code.
Is this PR significantly impacting compile times due to the new template instantiations?
Could you post a couple of Python plots in the comments of this PR, of the same type you did for Poisson, showing that these objectives reduce their accompanying loss function better than an MSE objective does? That is, train two models, one with the MSE objective and one with gamma, then evaluate the gamma training loss for both; the model trained with the gamma objective should have the better loss.
Edit: I guess your tests are doing this already, but it would be useful to verify it visually.
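For reference, a condensed sketch of that comparison (assuming the split_criterion strings added by this PR and sklearn's mean_tweedie_deviance as the metric; the full script posted further down in this thread does the same across all three criteria):

import numpy as np
from cuml.ensemble import RandomForestRegressor as curfr
from sklearn.metrics import mean_tweedie_deviance

rng = np.random.default_rng(33)
X = rng.random((1000, 4)).astype(np.float32)
y = rng.gamma(shape=2.0, size=1000).astype(np.float32)  # gamma-distributed target

# one model trained with the gamma criterion, one with plain MSE (criterion enum 2)
gamma_preds = curfr(split_criterion="gamma", n_estimators=1).fit(X, y).predict(X)
mse_preds = curfr(split_criterion=2, n_estimators=1).fit(X, y).predict(X)

# evaluate both on the gamma deviance (power=2); predictions must be strictly positive
mask = (gamma_preds > 0) & (mse_preds > 0)
print("gamma-trained:", mean_tweedie_deviance(y[mask], gamma_preds[mask], power=2))
print("mse-trained:  ", mean_tweedie_deviance(y[mask], mse_preds[mask], power=2))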
@pytest.mark.parametrize("split_criterion",
                         ["poisson", "gamma", "inverse_gaussian"])
def test_tweedie_convergence(max_depth, split_criterion):
    np.random.seed(33)
    bootstrap = None
    max_features = 1.0
    n_estimators = 1
    min_impurity_decrease = 1e-5
    n_datapoints = 100000
This amount of data is a bit excessive; you could dial it back to 1000.
It is impacting compile time due to more template instantiations of the builder class... when I used ccache with […]
Convergence plots of the tweedies w.r.t. MSE. Script used:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import itertools
from cuml.ensemble import RandomForestRegressor as curfr
from sklearn.metrics import mean_tweedie_deviance
split_criterion_list = ["poisson", "gamma", "inverse_gaussian"]
max_depth_list = [2, 4, 6, 8, 12]
df = pd.DataFrame(columns=["loss", "mse_tweedie_deviance", "tweedie_tweedie_deviance", "depth"])
for split_criterion, max_depth in itertools.product(split_criterion_list, max_depth_list):
    np.random.seed(33)
    bootstrap = None
    max_features = 1.0
    n_estimators = 1
    min_impurity_decrease = 1e-5
    n_datapoints = 100000
    tweedie = {
        "poisson":
            {"power": 1,
             "gen": np.random.poisson, "args": [0.01]},
        "gamma":
            {"power": 2,
             "gen": np.random.gamma, "args": [2.0]},
        "inverse_gaussian":
            {"power": 3,
             "gen": np.random.wald, "args": [0.1, 2.0]}
    }
    # generating random dataset with tweedie distribution
    X = np.random.random((n_datapoints, 4)).astype(np.float32)
    y = tweedie[split_criterion]["gen"](*tweedie[split_criterion]["args"],
                                        size=n_datapoints).astype(np.float32)
    tweedie_preds = curfr(
        split_criterion=split_criterion,
        max_depth=max_depth,
        n_estimators=n_estimators,
        bootstrap=bootstrap,
        max_features=max_features,
        min_impurity_decrease=min_impurity_decrease).fit(X, y).predict(X)
    mse_preds = curfr(
        split_criterion=2,
        max_depth=max_depth,
        n_estimators=n_estimators,
        bootstrap=bootstrap,
        max_features=max_features,
        min_impurity_decrease=min_impurity_decrease).fit(X, y).predict(X)
    # predictions must be strictly positive for the tweedie deviance
    mask = mse_preds > 0
    mse_tweedie_deviance = mean_tweedie_deviance(
        y[mask], mse_preds[mask], power=tweedie[split_criterion]["power"])
    tweedie_tweedie_deviance = mean_tweedie_deviance(
        y[mask], tweedie_preds[mask], power=tweedie[split_criterion]["power"])
    # pd.DataFrame.append was removed in pandas 2.0; build the row and concat instead
    df = pd.concat([df, pd.DataFrame([{
        "loss": split_criterion,
        "mse_tweedie_deviance": mse_tweedie_deviance,
        "tweedie_tweedie_deviance": tweedie_tweedie_deviance,
        "depth": max_depth,
    }])], ignore_index=True)
    # a model trained on tweedie data with the matching tweedie criterion
    # must perform better on the tweedie loss than the mse-trained model
    assert mse_tweedie_deviance >= tweedie_tweedie_deviance
matplotlib.use("Agg")
sns.set()
tweedies = ["poisson", "gamma", "inverse_gaussian"]
figs, axes = plt.subplots(nrows=len(tweedies), ncols=1, squeeze=True, figsize=(7, 12))
for loss, ax in zip(tweedies, axes):
    sns.lineplot(data=df[df["loss"] == loss], x="depth",
                 y="tweedie_tweedie_deviance", ax=ax, label=f"trained on {loss}")
    sns.lineplot(data=df[df["loss"] == loss], x="depth",
                 y="mse_tweedie_deviance", ax=ax, label="trained on mse")
    ax.set_ylabel(f"{loss} deviance")
figs.tight_layout()
plt.savefig('convergence.png')
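If the criteria behave as intended, the tweedie_tweedie_deviance curve should sit at or below the mse_tweedie_deviance curve at every depth; that is exactly what the in-loop assert checks, and the plots make it visible.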
rerun tests
Looks great! Just one missing doc block, but I didn't notice any other issues.
Codecov Report
@@           Coverage Diff            @@
##        branch-21.12     #4216   +/- ##
==========================================
  Coverage         ?      86.07%
==========================================
  Files            ?         231
  Lines            ?       18694
  Branches         ?           0
==========================================
  Hits             ?       16090
  Misses           ?        2604
  Partials         ?           0
==========================================
Flags with carried forward coverage won't be shown.
@gpucibot merge
* Some updates to RF documentation
* to be merged after #4216

Authors:
  - Venkat (https://github.com/venkywonka)

Approvers:
  - Rory Mitchell (https://github.com/RAMitchell)
  - Vinay Deshpande (https://github.com/vinaydes)
  - Dante Gama Dessavre (https://github.com/dantegd)

URL: #4138
This PR adds the Gamma and Inverse Gaussian criteria to train decision trees, along with modifications to the RF unit tests.

---

checklist:
- [x] Add Gamma and Inverse Gaussian Objective classes
- [x] Add C++ tests for above
- [x] Add remaining C++ tests for other objective functions: entropy and mean squared error
- [x] Add python level convergence tests for gamma and inverse gaussian (just like the one added for poisson loss in rapidsai#4156)
- [x] Check for regressions by benchmarking on gbm-bench
- [x] Convergence plots showing a model trained on a particular criterion performs better on its own loss metric than a baseline (`mse`)

Authors:
  - Venkat (https://github.com/venkywonka)

Approvers:
  - Rory Mitchell (https://github.com/RAMitchell)
  - William Hicks (https://github.com/wphicks)
  - Dante Gama Dessavre (https://github.com/dantegd)

URL: rapidsai#4216
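A rough sketch of what these deviance-based criteria optimize (Python for brevity; this is illustrative, not the PR's actual C++ implementation, and the function names are made up). Each node predicts its own mean, and a split is scored by how much it reduces the summed half-deviance:

import numpy as np

def node_half_deviance(y, power):
    """Summed Tweedie half-deviance of a node that predicts its own mean.

    power=1: Poisson, power=2: gamma, power=3: inverse Gaussian.
    Assumes y > 0, as the gamma and inverse Gaussian deviances require.
    """
    mu = y.mean()
    if power == 1:   # Poisson: y*log(y/mu) - (y - mu)
        return np.sum(y * np.log(y / mu) - (y - mu))
    if power == 2:   # gamma: (y - mu)/mu - log(y/mu)
        return np.sum((y - mu) / mu - np.log(y / mu))
    if power == 3:   # inverse Gaussian: (y - mu)^2 / (2 * mu^2 * y)
        return np.sum((y - mu) ** 2 / (2.0 * mu ** 2 * y))
    raise ValueError("power must be 1, 2, or 3")

def split_gain(y_left, y_right, power):
    """Deviance reduction achieved by splitting a node into two children."""
    parent = node_half_deviance(np.concatenate([y_left, y_right]), power)
    return (parent - node_half_deviance(y_left, power)
            - node_half_deviance(y_right, power))

Because the node prediction is always the node mean, terms that depend only on y (for example the sum of log(y) in the gamma deviance) cancel between parent and children, so an implementation can score splits with a cheaper per-node proxy built from running sums.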