[REVIEW] RF: Add Poisson deviance impurity criterion #4156
Conversation
Perf and accuracy checks for a randomly generated poisson dataset. The expectation is that tree models trained with the `poisson` criterion perform better on the poisson loss metric than those trained with `mse`.

Script used:

```python
from cuml import RandomForestRegressor as cuRF
from sklearn.tree import DecisionTreeRegressor as sklDT
from sklearn.metrics import mean_poisson_deviance
import numpy as np
import pandas as pd
import matplotlib

matplotlib.use("Agg")  # select the non-interactive backend before pyplot is imported

import matplotlib.pyplot as plt
import seaborn as sns
import time

sns.set()


def poisson_random_dataset(lam=0.1, n_datapoints=100000):
    np.random.seed(33)
    X = np.random.random((n_datapoints, 4)).astype(np.float32)
    y = np.random.poisson(lam=lam, size=n_datapoints).astype(np.float32)
    return X, y


rs = np.random.RandomState(92)
depths = range(1, 8)
bootstrap = None
max_features = 1.0
n_estimators = 1
min_impurity_decrease = 1e-5
n_datapoints = 100000

algo = {
    "skl_dt_poisson": sklDT(
        random_state=rs,
        min_impurity_decrease=min_impurity_decrease,
        criterion="poisson",
    ),
    "cuml_dt_poisson": cuRF(
        n_estimators=n_estimators,
        random_state=rs.randint(0, 1 << 32),
        bootstrap=bootstrap,
        min_impurity_decrease=min_impurity_decrease,
        split_criterion=4,  # poisson
    ),
    "skl_dt_mse": sklDT(
        random_state=rs,
        min_impurity_decrease=min_impurity_decrease,
        criterion="mse",
    ),
    "cuml_dt_mse": cuRF(
        n_estimators=n_estimators,
        random_state=rs.randint(0, 1 << 32),
        bootstrap=bootstrap,
        min_impurity_decrease=min_impurity_decrease,
        split_criterion=2,  # mse
    ),
}

datasets = {
    "poisson-0.1": poisson_random_dataset(0.1, n_datapoints),
}

figs, axes = plt.subplots(nrows=len(datasets), ncols=2, squeeze=False, figsize=(12, 7))
for score_ax, time_ax, (data_name, (X, y)) in zip(axes[:, 0], axes[:, 1], datasets.items()):
    X = X.astype(np.float32)
    y = y.astype(np.float32)
    df = pd.DataFrame(columns=["algorithm", "accuracy", "depth", "time"])
    for d in depths:
        for name, alg in algo.items():
            alg.set_params(max_depth=d)
            start = time.time()
            alg.fit(X, y)
            end = time.time()
            pred = alg.predict(X)
            # mean_poisson_deviance requires strictly positive predictions,
            # so score only on the positive ones when any are non-positive
            mask = pred > 0
            if (~mask).any():
                n_masked, n_samples = (~mask).sum(), mask.shape[0]
                print(f"masked {n_masked} of {n_samples} non-positive predictions")
                accuracy = mean_poisson_deviance(y[mask], pred[mask])
            else:
                accuracy = mean_poisson_deviance(y, pred)
            df = df.append(
                {"algorithm": name, "accuracy": accuracy, "depth": d, "time": end - start},
                ignore_index=True,
            )
    print(df)
    ### score
    sns.lineplot(data=df, x="depth", y="accuracy", hue="algorithm", ax=score_ax)
    score_ax.set_title(f"{data_name} poisson loss on {n_datapoints} data points")
    score_ax.set_ylabel("train poisson deviance")
    score_ax.set_xlabel("tree depth")
    ### timing (the first run is warmup, so skip depth 1)
    sns.barplot(data=df[df["depth"] > 1], x="depth", y="time", hue="algorithm", ax=time_ax)
    time_ax.set_title(f"{data_name} times (s) on {n_datapoints} data points")
    time_ax.set_ylabel("fit time (s)")
    time_ax.set_xlabel("tree depth")

plt.tight_layout()
plt.savefig("poisson-0.1-skl-vs-cuml.png")
plt.clf()
```
Poisson implementation looks good. I'd like to see the comment updated as best you can, explaining how we got our formula. Also, some C++ unit tests just for the objective class, and at least one Python-level test (probably you can just add poisson to the parameters of some existing test).
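For reference, scikit-learn defines its Poisson criterion via the half Poisson deviance of a node; a minimal sketch of that impurity follows (illustrative only: `poisson_half_deviance` is our name, not the merged API, and the cuML kernel is assumed to match this up to constant factors, which do not change which split wins).

```python
import numpy as np

def poisson_half_deviance(y):
    """Half Poisson deviance impurity of a node with targets y.

    With node mean m: mean(y * log(y / m) - y + m), where the
    y * log(y / m) term is taken as 0 when y == 0 (its limit).
    Requires y >= 0 and at least one positive target so that m > 0.
    """
    m = y.mean()
    with np.errstate(divide="ignore", invalid="ignore"):
        log_term = np.where(y > 0, y * np.log(y / m), 0.0)
    return float(np.mean(log_term - y + m))

# A split's gain is the parent impurity minus the weighted child impurities.
y = np.array([0.0, 0.0, 1.0, 3.0], dtype=np.float32)
print(poisson_half_deviance(y))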
Changes all look good. Can we get at least one Python-level test? It can be very simple, or an extension of existing tests. We can be confident that the C++ code is working correctly, but we want to check that the interface is correctly plumbed into this code.
Sure Rory, I added it yesterday night, but building and testing took time so I called it a day; will push them in a bit 👍
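(For context, the kind of Python-level test being asked for could look roughly like the sketch below. The names, data recipe, and thresholds are hypothetical, not necessarily the test that was actually pushed: fit with the poisson criterion, fit with mse, and assert the poisson-trained model scores better on mean Poisson deviance.)

```python
import numpy as np
from sklearn.metrics import mean_poisson_deviance
from cuml import RandomForestRegressor as cuRF

def test_poisson_beats_mse_on_poisson_loss():
    # same synthetic data recipe as the benchmark script above
    np.random.seed(33)
    X = np.random.random((10000, 4)).astype(np.float32)
    y = np.random.poisson(lam=0.1, size=10000).astype(np.float32)

    def fit_predict(criterion):
        model = cuRF(n_estimators=1, bootstrap=False, max_depth=8,
                     split_criterion=criterion, random_state=92)
        return model.fit(X, y).predict(X)

    pred_poisson = fit_predict(4)  # poisson
    pred_mse = fit_predict(2)      # mse
    # mean_poisson_deviance needs strictly positive predictions
    mask = (pred_poisson > 0) & (pred_mse > 0)
    assert (mean_poisson_deviance(y[mask], pred_poisson[mask])
            <= mean_poisson_deviance(y[mask], pred_mse[mask]))
```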
rerun tests
Just had another look at the failing tests - as they are only regression tests, I would take a careful look at your changes to the MSE loss function for any subtle changes in behaviour, e.g. handling of edge cases like having only one or two data points.
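(To make the edge case concrete, a tiny illustration of why one- and two-point nodes are where rearranged MSE formulas tend to diverge; the helper name is ours, not the cuML code.)

```python
import numpy as np

# A one-point node has zero variance, and a two-point node's MSE reduces to
# (y0 - y1)^2 / 4; any refactored accumulation of mean / sum-of-squares in
# the MSE objective must reproduce these exactly, or split gains shift.
def mse_impurity(y):
    return float(np.mean((y - y.mean()) ** 2))

assert mse_impurity(np.array([3.0])) == 0.0
assert mse_impurity(np.array([1.0, 2.0])) == 0.25
```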
rerun tests
```diff
@@ -74,16 +74,12 @@ class RandomForestClassifier(BaseRandomForestModel, DelayedPredictionMixin,
         run different models concurrently in different streams by creating
         handles in several streams.
         If it is None, a new one is created.
-    split_criterion : The criterion used to split nodes.
-        0 for GINI, 1 for ENTROPY, 4 for CRITERION_END.
+    split_criterion : int (default = 2)
```
Default 2?
Have fixed it in this commit, Rory.
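(For reference, the integer codes in play here, collected from the benchmark script and the old docstring above - which is why 0, not 2, is the sensible classifier default.)

```python
# split_criterion codes referenced in this PR (values as used in the
# benchmark script above; 0/1 are classification criteria, 2/4 regression).
SPLIT_CRITERION = {
    0: "GINI",
    1: "ENTROPY",
    2: "MSE",
    4: "POISSON",  # added by this PR
}
```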
approving pending conflict resolution
Codecov Report

```
@@           Coverage Diff            @@
##       branch-21.10    #4156  +/-  ##
=========================================
  Coverage          ?    86.07%
=========================================
  Files             ?       231
  Lines             ?     18634
  Branches          ?         0
=========================================
  Hits              ?     16040
  Misses            ?      2594
  Partials          ?         0
```
@gpucibot merge
This PR adds the Gamma and Inverse Gaussian criteria to train decision trees, along with modifications to RF unit tests.

---

checklist:
- [x] Add Gamma and Inverse Gaussian Objective classes
- [x] Add C++ tests for above
- [x] Add remaining C++ tests for other objective functions: entropy and mean squared error
- [x] Add Python-level convergence tests for gamma and inverse gaussian (just like the one added for poisson loss in #4156)
- [x] Check for regressions by benchmarking on gbm-bench
- [x] Convergence plots showing a model trained on a particular criterion performs better on its own loss metric than a baseline (`mse`)

Authors:
- Venkat (https://github.com/venkywonka)

Approvers:
- Rory Mitchell (https://github.com/RAMitchell)
- William Hicks (https://github.com/wphicks)
- Dante Gama Dessavre (https://github.com/dantegd)

URL: #4216
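For intuition about the new criteria, a sketch of the corresponding half deviances follows (Tweedie powers 2 and 3; constant factors are dropped since they do not change which split wins, both require strictly positive targets, and the function names are ours, not the merged API).

```python
import numpy as np

def gamma_half_deviance(y):
    # Tweedie power p = 2: log(m / y) + y / m - 1, with node mean m
    m = y.mean()
    return float(np.mean(np.log(m / y) + y / m - 1.0))

def inverse_gaussian_half_deviance(y):
    # Tweedie power p = 3: (y - m)^2 / (y * m^2), with node mean m
    m = y.mean()
    return float(np.mean((y - m) ** 2 / (y * m ** 2)))
```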
* Adds the poisson impurity criterion to RF, in parity with scikit-learn's RF regressor [[here](https://scikit-learn.org/stable/modules/tree.html#regression-criteria)]

EDIT:
* Also adds C++-level testing for RF objective function gains of Poisson and Gini.

Authors:
- Venkat (https://github.com/venkywonka)

Approvers:
- Rory Mitchell (https://github.com/RAMitchell)
- Dante Gama Dessavre (https://github.com/dantegd)

URL: rapidsai#4156
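(A reference version of the Gini impurity those C++ objective tests exercise - illustrative Python, not the merged CUDA code.)

```python
import numpy as np

# Gini impurity of a classification node: 1 - sum_k p_k^2 over class
# frequencies p_k; 0 for a pure node, maximal for a uniform label mix.
def gini_impurity(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

assert gini_impurity(np.array([0, 0, 1, 1])) == 0.5
```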