Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW] RF: Add Poisson deviance impurity criterion #4156

Merged

Conversation

venkywonka
Copy link
Contributor

@venkywonka venkywonka commented Aug 11, 2021

  • Adds the poisson impurity criterion to RF, in parity with scikit learn's RF regressor [here]
    EDIT:
  • Also adds C++ level testing for RF Objective function gains of Poisson and Gini.

@github-actions github-actions bot added CUDA/C++ Cython / Python Cython or Python issue labels Aug 11, 2021
@venkywonka
Copy link
Contributor Author

venkywonka commented Aug 19, 2021

Perf and accuracy checks for randomly generated poisson dataset. The expectation is that tree models trained with criterion "poisson" converge better with lower loss on mean_poisson_deviance(y, y_pred).


Script used

from cuml import RandomForestRegressor as cuRF
from sklearn.tree import DecisionTreeRegressor as sklDT
from sklearn.metrics import mean_poisson_deviance
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import time

matplotlib.use("Agg")
sns.set()

def poisson_random_dataset(lam=0.1, n_datapoints=100000):
    np.random.seed(33)
    X = np.random.random((n_datapoints, 4)).astype(np.float32)
    y = np.random.poisson(lam=lam, size=n_datapoints).astype(np.float32)
    return X, y

rs = np.random.RandomState(92)
depths = range(1, 8)
bootstrap = None
max_features = 1.0
n_estimators = 1
min_impurity_decrease = 1e-5
n_datapoints = 100000

algo = {
    "skl_dt_poisson": sklDT(
        random_state=rs,
        min_impurity_decrease=min_impurity_decrease,
        criterion="poisson",
    ),
    "cuml_dt_poisson": cuRF(
        n_estimators=n_estimators,
        random_state=rs.randint(0, 1 << 32),
        bootstrap=bootstrap,
        min_impurity_decrease=min_impurity_decrease,
        split_criterion=4, # poisson
    ),
    "skl_dt_mse": sklDT(
        random_state=rs,
        min_impurity_decrease=min_impurity_decrease,
        criterion="mse",
    ),
    "cuml_dt_mse": cuRF(
        n_estimators=n_estimators,
        random_state=rs.randint(0, 1 << 32),
        bootstrap=bootstrap,
        min_impurity_decrease=min_impurity_decrease,
        split_criterion=2, # mse
    ),
}

datasets = {
    "poisson-0.1": poisson_random_dataset(0.1, n_datapoints),
}

figs, axes = plt.subplots(nrows=len(datasets.items()), ncols=2, squeeze=False, figsize=(12, 7))

for score_ax, time_ax, (data_name, (X, y)) in zip(axes[:,0], axes[:, 1], datasets.items()):
    X=X.astype(np.float32)
    y=y.astype(np.float32)
    df = pd.DataFrame(columns=["algorithm", "accuracy", "depth", "time"])
    df_cuml = pd.DataFrame(columns=["algorithm", "accuracy", "depth", "time"])
    for d in depths:
        for name, alg in algo.items():
            alg.set_params(max_depth=d)
            start = time.time()
            alg.fit(X, y)
            end = time.time()
            pred = alg.predict(X)
            # we only want the positive predictions for mean_poisson_deviance
            mask = pred > 0
            accuracy = 0.0
            if (~mask).any():
                n_masked, n_samples = (~mask).sum(), mask.shape[0]
                ic(n_masked, n_samples)
                accuracy = mean_poisson_deviance(y[mask], pred[mask])
            else:
                accuracy = mean_poisson_deviance(y, pred)
            df = df.append(
                {"algorithm": name, "accuracy": accuracy, "depth": d, "time": end - start},
                ignore_index=True,
            )
    print(df)
    ### score
    sns.lineplot(data=df, x="depth", y="accuracy", hue="algorithm", ax=score_ax)
    score_ax.set_title(f'{data_name} poisson loss on {n_datapoints} data points')
    score_ax.set_ylabel("train poisson")
    score_ax.set_xlabel("tree depth")
    ### timing
    sns.barplot(data=df[df["depth"]>1], x="depth", y="time", hue="algorithm", ax=time_ax) # the first run is warmup
    time_ax.set_title(f'{data_name} times (s) on {n_datapoints} data points')
    time_ax.set_ylabel("timing poisson")
    time_ax.set_xlabel("tree depth")

plt.tight_layout()
plt.savefig("poisson-0.1-skl-vs-cuml.png")
plt.clf()

poisson-0 1-skl-vs-cuml

@venkywonka venkywonka marked this pull request as ready for review August 19, 2021 16:40
@venkywonka venkywonka requested review from a team as code owners August 19, 2021 16:40
@venkywonka venkywonka changed the title [WIP] RF: Add Poisson deviance impurity criterion [REVIEW] RF: Add Poisson deviance impurity criterion Aug 19, 2021
Copy link
Contributor

@RAMitchell RAMitchell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Poisson implementation looks good. I'd like to see the comment updated as best you can explaining how we got our formula. Also some c++ unit tests just for the objective class and at least one python level test (probably you can just add poisson to the parameters of some existing test).

cpp/src/decisiontree/batched-levelalgo/metrics.cuh Outdated Show resolved Hide resolved
python/cuml/dask/ensemble/randomforestregressor.py Outdated Show resolved Hide resolved
cpp/src/decisiontree/batched-levelalgo/metrics.cuh Outdated Show resolved Hide resolved
cpp/src/decisiontree/batched-levelalgo/metrics.cuh Outdated Show resolved Hide resolved
cpp/src/decisiontree/batched-levelalgo/metrics.cuh Outdated Show resolved Hide resolved
cpp/test/sg/rf_test.cu Outdated Show resolved Hide resolved
cpp/test/sg/rf_test.cu Outdated Show resolved Hide resolved
cpp/test/sg/rf_test.cu Outdated Show resolved Hide resolved
Copy link
Contributor

@RAMitchell RAMitchell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes all look good. Can we get at least one python level test? It can be very simple or an extension of existing tests. We can be confident that the C++ code is working correctly, but we want to check that the interface is correctly plumbed into this code.

@venkywonka
Copy link
Contributor Author

sure Rory, I added it yesterday night, but building and testing took time so called it a day, would push them in a bit 👍

@venkywonka venkywonka added breaking Breaking change improvement Improvement / enhancement to an existing function labels Sep 11, 2021
@venkywonka
Copy link
Contributor Author

rerun tests

@venkywonka venkywonka added 5 - Ready to Merge Testing and reviews complete, ready to merge 3 - Ready for Review Ready for review by team and removed 5 - Ready to Merge Testing and reviews complete, ready to merge labels Sep 14, 2021
@RAMitchell
Copy link
Contributor

Just had another look at the failing tests - as they are only regression, I would take a careful look at your changes to the MSE loss function for any subtle changes in behaviour, e.g. handling of edge cases like having only one or two data points.

@venkywonka
Copy link
Contributor Author

rerun tests

@@ -74,16 +74,12 @@ class RandomForestClassifier(BaseRandomForestModel, DelayedPredictionMixin,
run different models concurrently in different streams by creating
handles in several streams.
If it is None, a new one is created.
split_criterion : The criterion used to split nodes.
0 for GINI, 1 for ENTROPY, 4 for CRITERION_END.
split_criterion : int (default = 2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Default 2?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

have fixed it in this commit rory.

Copy link
Member

@dantegd dantegd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

approving pending conflict resolution

@dantegd dantegd added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Sep 21, 2021
@codecov-commenter
Copy link

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.10@9f9e16f). Click here to learn what that means.
The diff coverage is n/a.

Impacted file tree graph

@@               Coverage Diff               @@
##             branch-21.10    #4156   +/-   ##
===============================================
  Coverage                ?   86.07%           
===============================================
  Files                   ?      231           
  Lines                   ?    18634           
  Branches                ?        0           
===============================================
  Hits                    ?    16040           
  Misses                  ?     2594           
  Partials                ?        0           
Flag Coverage Δ
dask 47.05% <0.00%> (?)
non-dask 78.74% <0.00%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.


Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9f9e16f...a3fa800. Read the comment docs.

@dantegd
Copy link
Member

dantegd commented Sep 22, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 45a5a54 into rapidsai:branch-21.10 Sep 22, 2021
rapids-bot bot pushed a commit that referenced this pull request Oct 12, 2021
This PR adds the Gamma and Inverse Gaussian Criteria to train decision trees, along with modifications to rf unit tests.


---


checklist:
- [x] Add Gamma and Inverse Gaussian Objective classes
- [x] Add C++ tests for above
- [x] Add remaining C++ tests for other objective functions: entropy and mean squared error
- [x] Add python level convergence tests for gamma and inverse gaussian ( just like the one added for poison loss in #4156 )
- [x] Check for regressions by benchmarking on gbm-bench
- [x] Convergence plots showing model trained on particular criteria performs better on it's own loss metric than a baseline (`mse`)

Authors:
  - Venkat (https://github.com/venkywonka)

Approvers:
  - Rory Mitchell (https://github.com/RAMitchell)
  - William Hicks (https://github.com/wphicks)
  - Dante Gama Dessavre (https://github.com/dantegd)

URL: #4216
vimarsh6739 pushed a commit to vimarsh6739/cuml that referenced this pull request Oct 9, 2023
* Adds the poisson impurity criterion to RF, in parity with scikit learn's RF regressor [[here](https://scikit-learn.org/stable/modules/tree.html#regression-criteria)]
EDIT:
* Also adds C++ level testing for RF Objective function gains of Poisson and Gini.

Authors:
  - Venkat (https://github.com/venkywonka)

Approvers:
  - Rory Mitchell (https://github.com/RAMitchell)
  - Dante Gama Dessavre (https://github.com/dantegd)

URL: rapidsai#4156
vimarsh6739 pushed a commit to vimarsh6739/cuml that referenced this pull request Oct 9, 2023
This PR adds the Gamma and Inverse Gaussian Criteria to train decision trees, along with modifications to rf unit tests.


---


checklist:
- [x] Add Gamma and Inverse Gaussian Objective classes
- [x] Add C++ tests for above
- [x] Add remaining C++ tests for other objective functions: entropy and mean squared error
- [x] Add python level convergence tests for gamma and inverse gaussian ( just like the one added for poison loss in rapidsai#4156 )
- [x] Check for regressions by benchmarking on gbm-bench
- [x] Convergence plots showing model trained on particular criteria performs better on it's own loss metric than a baseline (`mse`)

Authors:
  - Venkat (https://github.com/venkywonka)

Approvers:
  - Rory Mitchell (https://github.com/RAMitchell)
  - William Hicks (https://github.com/wphicks)
  - Dante Gama Dessavre (https://github.com/dantegd)

URL: rapidsai#4216
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge breaking Breaking change CUDA/C++ Cython / Python Cython or Python issue improvement Improvement / enhancement to an existing function
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants