[REVIEW] update RF docs #4138

Merged Oct 29, 2021 (56 commits)
Commits
d62d032
update docsstrings and add std::round
venkywonka Jul 30, 2021
994c10e
suggest alternatives for GPU inference
venkywonka Aug 2, 2021
efb1773
update previous commit for regressor docs
venkywonka Aug 2, 2021
7de63e5
copyright fix
venkywonka Aug 2, 2021
d298d30
flake8 fix
venkywonka Aug 5, 2021
c1bf494
change default estimators in dask RF, consmetics changes
venkywonka Aug 5, 2021
ceee023
add poisson deviance loss
venkywonka Aug 11, 2021
a40c323
sign bug fix
venkywonka Aug 12, 2021
8cd1ce1
modify proxy impurity, refactor tests, clang fix
venkywonka Aug 19, 2021
c185c80
Merge branch 'branch-21.10' of https://github.com/rapidsai/cuml into …
venkywonka Aug 24, 2021
dca32f9
add tests for poisson & gini objectives, bug fixes and other refactors
venkywonka Aug 31, 2021
6039045
Merge branch 'branch-21.10' of https://github.com/rapidsai/cuml into …
venkywonka Aug 31, 2021
925116d
FIX clang format
venkywonka Aug 31, 2021
3142caf
FIX clang format
venkywonka Aug 31, 2021
9676818
remove debug code
venkywonka Aug 31, 2021
c52c29f
address review comments
venkywonka Sep 2, 2021
36615c3
Merge branch 'branch-21.10' of https://github.com/rapidsai/cuml into …
venkywonka Sep 3, 2021
c0c5948
Merge branch 'branch-21.10' of https://github.com/rapidsai/cuml into …
venkywonka Sep 7, 2021
79f00b8
add python level test
venkywonka Sep 11, 2021
13c3386
FIX clang format
venkywonka Sep 13, 2021
0332cc6
flake fix, reduce test load
venkywonka Sep 13, 2021
0a5d52a
fix tests, remove artifacts
venkywonka Sep 13, 2021
3255323
Merge branch 'branch-21.10' of https://github.com/rapidsai/cuml into …
venkywonka Sep 13, 2021
959ee2c
purge artifacts
venkywonka Sep 13, 2021
5a5410e
decrease tolerance
venkywonka Sep 13, 2021
59caf11
remove min_impurity_decrease member
venkywonka Sep 16, 2021
fd42fb7
fix accuracy bug and dask docstring duplication
venkywonka Sep 17, 2021
9247988
Merge branch 'branch-21.10' of https://github.com/rapidsai/cuml into …
venkywonka Sep 17, 2021
a31512d
fix doctring slip
venkywonka Sep 17, 2021
493f847
merge resolution
venkywonka Sep 17, 2021
aec9d26
merge with poisson branch
venkywonka Sep 20, 2021
db09e0f
add tweedie losses
venkywonka Sep 21, 2021
e63754a
refactor unit tests
venkywonka Sep 22, 2021
2e14991
Merge branch 'branch-21.10' of https://github.com/rapidsai/cuml into …
venkywonka Sep 23, 2021
78b0ffd
add tests for entropy and mse
venkywonka Sep 24, 2021
1fbff95
Merge branch 'branch-21.10' of https://github.com/rapidsai/cuml into …
venkywonka Sep 30, 2021
5f22047
Merge branch 'branch-21.10' of https://github.com/rapidsai/cuml into …
venkywonka Sep 30, 2021
43e5b71
Merge branch 'branch-21.12' of https://github.com/rapidsai/cuml into …
venkywonka Oct 4, 2021
11b2f4e
add python tests and refactor objectives
venkywonka Oct 4, 2021
2fa43d7
FIX clang format
venkywonka Oct 4, 2021
87395ff
reduce division operations
venkywonka Oct 5, 2021
8464628
flake fix and change criterion_dict
venkywonka Oct 5, 2021
d764562
make objective data members private
venkywonka Oct 5, 2021
68ecabb
refactor declaration
venkywonka Oct 6, 2021
6eeeac0
Merge branch 'branch-21.12' of https://github.com/rapidsai/cuml into …
venkywonka Oct 6, 2021
b1be698
fix improper merge
venkywonka Oct 6, 2021
a7bc7fe
Merge branch 'fea-ext-tweedie-loss' into enh-ext-rf-docs-update
venkywonka Oct 6, 2021
d1e369d
refactor new changes to docs
venkywonka Oct 11, 2021
16dfafb
prune artifacts
venkywonka Oct 11, 2021
dfed483
Merge branch 'branch-21.12' of https://github.com/rapidsai/cuml into …
venkywonka Oct 18, 2021
7872e23
flake fix
venkywonka Oct 19, 2021
9ed7072
Delete artifact
venkywonka Oct 19, 2021
4ec326e
undo extra backtick causing test-fail
venkywonka Oct 19, 2021
6da3a84
Merge branch 'enh-ext-rf-docs-update' of https://github.com/venkywonk…
venkywonka Oct 19, 2021
9e24756
undo a cosmetic change due to a pytest dependence
venkywonka Oct 19, 2021
3f8af46
address review comments
venkywonka Oct 20, 2021
4 changes: 2 additions & 2 deletions cpp/include/cuml/tree/decisiontree.hpp
@@ -28,11 +28,11 @@ namespace DT {

struct DecisionTreeParams {
/**
* Maximum tree depth. Unlimited (e.g., until leaves are pure), if -1.
* Maximum tree depth. Unlimited (e.g., until leaves are pure), If `-1`.
*/
int max_depth;
/**
* Maximum leaf nodes per tree. Soft constraint. Unlimited, if -1.
* Maximum leaf nodes per tree. Soft constraint. Unlimited, If `-1`.
*/
int max_leaves;
/**
3 changes: 1 addition & 2 deletions python/cuml/benchmark/ci_benchmark.py
@@ -1,5 +1,5 @@
#
# Copyright (c) 2019, NVIDIA CORPORATION.
# Copyright (c) 2019-2021, NVIDIA CORPORATION.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -173,7 +173,6 @@ def make_bench_configs(long_config):
bench_dims=default_dims,
cuml_param_override_list=[
{"n_bins": [8, 32]},
{"split_algo": [0, 1]},
{"max_features": ['sqrt', 1.0]},
],
)
86 changes: 48 additions & 38 deletions python/cuml/dask/ensemble/randomforestclassifier.py
@@ -49,14 +49,14 @@ class RandomForestClassifier(BaseRandomForestModel, DelayedPredictionMixin,
Future versions of the API will support more flexible data
distribution and additional input types.

The distributed algorithm uses an embarrassingly-parallel
approach. For a forest with N trees being built on w workers, each
worker simply builds N/w trees on the data it has available
The distributed algorithm uses an *embarrassingly-parallel*
approach. For a forest with `N` trees being built on `w` workers, each
worker simply builds `N/w` trees on the data it has available
locally. In many cases, partitioning the data so that each worker
builds trees on a subset of the total dataset works well, but
it generally requires the data to be well-shuffled in advance.
Alternatively, callers can replicate all of the data across
workers so that rf.fit receives w partitions, each containing the
workers so that ``rf.fit`` receives `w` partitions, each containing the
same data. This would produce results approximately identical to
single-GPU fitting.

@@ -65,7 +65,7 @@ class RandomForestClassifier(BaseRandomForestModel, DelayedPredictionMixin,

Parameters
-----------
n_estimators : int (default = 10)
n_estimators : int (default = 100)
total number of trees in the forest (not per-worker)
handle : cuml.Handle
Specifies the cuml.handle that holds internal CUDA state for
@@ -74,43 +74,54 @@ class RandomForestClassifier(BaseRandomForestModel, DelayedPredictionMixin,
run different models concurrently in different streams by creating
handles in several streams.
If it is None, a new one is created.
split_criterion : int or string (default = 0 ('gini'))
The criterion used to split nodes.
0 or 'gini' for GINI, 1 or 'entropy' for ENTROPY,
2 or 'mse' for MSE,
4 or 'poisson' for POISSON,
5 or 'gamma' for GAMMA,
6 or 'inverse_gaussian' for INVERSE_GAUSSIAN,
2, 'mse', 4, 'poisson', 5, 'gamma', 6, 'inverse_gaussian' not valid
for classification
split_criterion : int or string (default = ``0`` (``'gini'``))
The criterion used to split nodes.\n
* ``0`` or ``'gini'`` for gini impurity
* ``1`` or ``'entropy'`` for information gain (entropy)
* ``2`` or ``'mse'`` for mean squared error
* ``4`` or ``'poisson'`` for poisson half deviance
* ``5`` or ``'gamma'`` for gamma half deviance
* ``6`` or ``'inverse_gaussian'`` for inverse gaussian deviance
``2``, ``'mse'``, ``4``, ``'poisson'``, ``5``, ``'gamma'``, ``6``,
``'inverse_gaussian'`` not valid for classification
bootstrap : boolean (default = True)
Control bootstrapping.
If set, each tree in the forest is built
on a bootstrapped sample with replacement.
If False, the whole dataset is used to build each tree.
Control bootstrapping.\n
* If ``True``, each tree in the forest is built on a bootstrapped
sample with replacement.
* If ``False``, the whole dataset is used to build each tree.
max_samples : float (default = 1.0)
Ratio of dataset rows used while fitting each tree.
max_depth : int (default = -1)
Maximum tree depth. Unlimited (i.e, until leaves are pure), if -1.
Maximum tree depth. Unlimited (i.e, until leaves are pure), If ``-1``.
max_leaves : int (default = -1)
Maximum leaf nodes per tree. Soft constraint. Unlimited, if -1.
Maximum leaf nodes per tree. Soft constraint. Unlimited, If ``-1``.
max_features : float (default = 'auto')
Ratio of number of features (columns) to consider
per node split.
n_bins : int (default = 8)
per node split.\n
* If type ``int`` then ``max_features`` is the absolute count of
features to be used.
* If type ``float`` then ``max_features`` is a fraction.
* If ``'auto'`` then ``max_features=n_features = 1.0``.
* If ``'sqrt'`` then ``max_features=1/sqrt(n_features)``.
* If ``'log2'`` then ``max_features=log2(n_features)/n_features``.
* If ``None``, then ``max_features = 1.0``.
n_bins : int (default = 128)
Number of bins used by the split algorithm.
min_samples_leaf : int or float (default = 1)
The minimum number of samples (rows) in each leaf node.
If int, then min_samples_leaf represents the minimum number.
If float, then min_samples_leaf represents a fraction and
ceil(min_samples_leaf * n_rows) is the minimum number of samples
for each leaf node.
The minimum number of samples (rows) in each leaf node.\n
* If type ``int``, then ``min_samples_leaf`` represents the minimum
number.
* If ``float``, then ``min_samples_leaf`` represents a fraction
and ``ceil(min_samples_leaf * n_rows)`` is the minimum number of
samples for each leaf node.
min_samples_split : int or float (default = 2)
The minimum number of samples required to split an internal node.
If int, then min_samples_split represents the minimum number.
If float, then min_samples_split represents a fraction and
ceil(min_samples_split * n_rows) is the minimum number of samples
for each split.
The minimum number of samples required to split an internal
node.\n
* If type ``int``, then ``min_samples_split`` represents the minimum
number.
* If type ``float``, then ``min_samples_split`` represents a fraction
and ``ceil(min_samples_split * n_rows)`` is the minimum number of
samples for each split.
n_streams : int (default = 4 )
Number of parallel streams used for forest building
workers : optional, list of strings
@@ -139,7 +150,7 @@ def __init__(
workers=None,
client=None,
verbose=False,
n_estimators=10,
n_estimators=100,
random_state=None,
ignore_empty_partitions=False,
**kwargs
@@ -330,7 +341,7 @@ def predict(self, X, algo='auto', threshold=0.5,
for inference.

Returns
----------
-------
y : Dask cuDF dataframe or CuPy backed Dask Array (n_rows, 1)

"""
@@ -404,8 +415,9 @@ def predict_model_on_cpu(self, X, convert_dtype=True):
When set to True, the predict method will, when necessary, convert
the input to the data type which was used to train the model. This
will increase memory used for the method.

Returns
----------
-------
y : Dask cuDF dataframe or CuPy backed Dask Array (n_rows, 1)
"""
c = default_client()
@@ -501,9 +513,7 @@ def predict_proba(self, X,

Returns
-------
y : NumPy
Dask cuDF dataframe or CuPy backed Dask Array (n_rows, n_classes)

y : Dask cuDF dataframe or CuPy backed Dask Array (n_rows, n_classes)
"""
if self._get_internal_model() is None:
self._set_internal_model(self._concat_treelite_models())
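
The class docstring in this file describes the distributed algorithm as each of `w` workers building `N/w` trees. A minimal sketch of that split follows, in plain Python with no cuML dependency. The helper name `trees_per_worker` and the way a remainder is spread over the first workers are assumptions made for illustration; they are not taken from the cuML source.

```python
def trees_per_worker(n_estimators, n_workers):
    """Hypothetical helper: split n_estimators trees across n_workers.

    Assumption: when the division is not exact, the first
    (n_estimators % n_workers) workers each build one extra tree.
    """
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    base, extra = divmod(n_estimators, n_workers)
    return [base + 1 if i < extra else base for i in range(n_workers)]


# The new default of 100 estimators, first on 4 workers and then on 8.
print(trees_per_worker(100, 4))
print(trees_per_worker(100, 8))
```

Whatever the split, the per-worker counts sum to `n_estimators`, which matches the docstring's note that the parameter is the total for the forest and not a per-worker value.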
92 changes: 48 additions & 44 deletions python/cuml/dask/ensemble/randomforestregressor.py
@@ -42,14 +42,14 @@ class RandomForestRegressor(BaseRandomForestModel, DelayedPredictionMixin,
distribution and additional input types. User-facing APIs are
expected to change in upcoming versions.

The distributed algorithm uses an embarrassingly-parallel
approach. For a forest with N trees being built on w workers, each
worker simply builds N/w trees on the data it has available
The distributed algorithm uses an *embarrassingly-parallel*
approach. For a forest with `N` trees being built on `w` workers, each
worker simply builds `N/w` trees on the data it has available
locally. In many cases, partitioning the data so that each worker
builds trees on a subset of the total dataset works well, but
it generally requires the data to be well-shuffled in advance.
Alternatively, callers can replicate all of the data across
workers so that rf.fit receives w partitions, each containing the
workers so that ``rf.fit`` receives `w` partitions, each containing the
same data. This would produce results approximately identical to
single-GPU fitting.

@@ -58,7 +58,7 @@ class RandomForestRegressor(BaseRandomForestModel, DelayedPredictionMixin,

Parameters
-----------
n_estimators : int (default = 10)
n_estimators : int (default = 100)
total number of trees in the forest (not per-worker)
handle : cuml.Handle
Specifies the cuml.handle that holds internal CUDA state for
@@ -67,56 +67,60 @@ class RandomForestRegressor(BaseRandomForestModel, DelayedPredictionMixin,
run different models concurrently in different streams by creating
handles in several streams.
If it is None, a new one is created.
split_criterion : int or string (default = 2 ('mse'))
The criterion used to split nodes.
0 or 'gini' for GINI, 1 or 'entropy' for ENTROPY,
2 or 'mse' for MSE,
4 or 'poisson' for POISSON,
5 or 'gamma' for GAMMA,
6 or 'inverse_gaussian' for INVERSE_GAUSSIAN,
0, 'gini', 1, 'entropy' not valid for regression
split_criterion : int or string (default = ``2`` (``'mse'``))
The criterion used to split nodes.\n
* ``0`` or ``'gini'`` for gini impurity
* ``1`` or ``'entropy'`` for information gain (entropy)
* ``2`` or ``'mse'`` for mean squared error
* ``4`` or ``'poisson'`` for poisson half deviance
* ``5`` or ``'gamma'`` for gamma half deviance
* ``6`` or ``'inverse_gaussian'`` for inverse gaussian deviance
``0``, ``'gini'``, ``1``, ``'entropy'`` not valid for regression
bootstrap : boolean (default = True)
Control bootstrapping.
If set, each tree in the forest is built
on a bootstrapped sample with replacement.
If False, the whole dataset is used to build each tree.
Control bootstrapping.\n
* If ``True``, each tree in the forest is built on a bootstrapped
sample with replacement.
* If ``False``, the whole dataset is used to build each tree.
max_samples : float (default = 1.0)
Ratio of dataset rows used while fitting each tree.
max_depth : int (default = -1)
Maximum tree depth. Unlimited (i.e, until leaves are pure), if -1.
Maximum tree depth. Unlimited (i.e, until leaves are pure), If ``-1``.
max_leaves : int (default = -1)
Maximum leaf nodes per tree. Soft constraint. Unlimited, if -1.
max_features : int or float or string or None (default = 'auto')
Maximum leaf nodes per tree. Soft constraint. Unlimited, If ``-1``.
max_features : float (default = 'auto')
Ratio of number of features (columns) to consider
per node split.
If int then max_features/n_features.
If float then max_features is a fraction.
If 'auto' then max_features=n_features which is 1.0.
If 'sqrt' then max_features=1/sqrt(n_features).
If 'log2' then max_features=log2(n_features)/n_features.
If None, then max_features=n_features which is 1.0.
n_bins : int (default = 8)
per node split.\n
* If type ``int`` then ``max_features`` is the absolute count of
features to be used.
* If type ``float`` then ``max_features`` is a fraction.
* If ``'auto'`` then ``max_features=n_features = 1.0``.
* If ``'sqrt'`` then ``max_features=1/sqrt(n_features)``.
* If ``'log2'`` then ``max_features=log2(n_features)/n_features``.
* If ``None``, then ``max_features = 1.0``.
n_bins : int (default = 128)
Number of bins used by the split algorithm.
min_samples_leaf : int or float (default = 1)
The minimum number of samples (rows) in each leaf node.
If int, then min_samples_leaf represents the minimum number.
If float, then min_samples_leaf represents a fraction and
ceil(min_samples_leaf * n_rows) is the minimum number of samples
for each leaf node.
The minimum number of samples (rows) in each leaf node.\n
* If type ``int``, then ``min_samples_leaf`` represents the minimum
number.
* If ``float``, then ``min_samples_leaf`` represents a fraction and
``ceil(min_samples_leaf * n_rows)`` is the minimum number of
samples for each leaf node.
min_samples_split : int or float (default = 2)
The minimum number of samples required to split an internal node.
If int, then min_samples_split represents the minimum number.
If float, then min_samples_split represents a fraction and
ceil(min_samples_split * n_rows) is the minimum number of samples
for each split.
The minimum number of samples required to split an internal node.\n
* If type ``int``, then ``min_samples_split`` represents the minimum
number.
* If type ``float``, then ``min_samples_split`` represents a fraction
and ``ceil(min_samples_split * n_rows)`` is the minimum number of
samples for each split.
accuracy_metric : string (default = 'r2')
Decides the metric used to evaluate the performance of the model.
In the 0.16 release, the default scoring metric was changed
from mean squared error to r-squared.
for r-squared : 'r2'
for median of abs error : 'median_ae'
for mean of abs error : 'mean_ae'
for mean square error' : 'mse'
from mean squared error to r-squared.\n
* for r-squared : ``'r2'``
* for median of abs error : ``'median_ae'``
* for mean of abs error : ``'mean_ae'``
* for mean square error' : ``'mse'``
n_streams : int (default = 4 )
Number of parallel streams used for forest building
workers : optional, list of strings
@@ -141,7 +145,7 @@ def __init__(
workers=None,
client=None,
verbose=False,
n_estimators=10,
n_estimators=100,
random_state=None,
ignore_empty_partitions=False,
**kwargs
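
The updated docstrings list how `max_features`, `min_samples_leaf` and `min_samples_split` are interpreted depending on the type of the value passed. The sketch below restates those bullet lists as plain Python so the arithmetic can be checked by hand. Both function names are hypothetical and the code only mirrors the documented rules; it is not the cuML implementation.

```python
import math


def resolve_max_features(max_features, n_features):
    """Hypothetical helper: map a max_features setting to a column fraction,
    following the bullet list in the updated docstring."""
    if max_features is None or max_features == 'auto':
        return 1.0
    if max_features == 'sqrt':
        return 1.0 / math.sqrt(n_features)
    if max_features == 'log2':
        return math.log2(n_features) / n_features
    if isinstance(max_features, bool):
        raise TypeError("max_features must not be a bool")
    if isinstance(max_features, int):
        # An int is an absolute feature count; express it as a fraction.
        return max_features / n_features
    if isinstance(max_features, float):
        return max_features
    raise ValueError(f"unsupported max_features: {max_features!r}")


def resolve_min_samples(value, n_rows):
    """Hypothetical helper: turn min_samples_leaf or min_samples_split
    into a row count, using ceil(value * n_rows) for floats."""
    if isinstance(value, bool):
        raise TypeError("value must not be a bool")
    if isinstance(value, int):
        return value
    if isinstance(value, float):
        return math.ceil(value * n_rows)
    raise TypeError(f"unsupported type: {type(value).__name__}")


# With 16 features, 'sqrt', 'log2' and the int 4 all select a quarter
# of the columns.
print(resolve_max_features('sqrt', 16))
print(resolve_max_features('log2', 16))
print(resolve_max_features(4, 16))

# A float threshold is a fraction of the rows, rounded up.
print(resolve_min_samples(0.25, 1000))
print(resolve_min_samples(0.0015, 1000))
```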
6 changes: 3 additions & 3 deletions python/cuml/ensemble/randomforest_common.pyx
@@ -46,8 +46,8 @@ class BaseRandomForestModel(Base):
'bootstrap',
'verbose', 'max_samples',
'max_leaves',
'accuracy_metric',
'max_batch_size', 'n_streams', 'dtype',
'accuracy_metric', 'max_batch_size',
'n_streams', 'dtype',
'output_type', 'min_weight_fraction_leaf', 'n_jobs',
'max_leaf_nodes', 'min_impurity_split', 'oob_score',
'random_state', 'warm_start', 'class_weight',
@@ -106,7 +106,7 @@ class BaseRandomForestModel(Base):
if ((random_state is not None) and (n_streams != 1)):
warnings.warn("For reproducible results in Random Forest"
" Classifier or for almost reproducible results"
" in Random Forest Regressor, n_streams==1 is "
" in Random Forest Regressor, n_streams=1 is "
"recommended. If n_streams is > 1, results may vary "
"due to stream/thread timing differences, even when "
"random_state is set")
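
The hunk above only rewords the reproducibility warning, but the condition that triggers it is easy to exercise in isolation. The following is a minimal sketch using only the standard `warnings` module. The function `warn_if_not_reproducible` is a hypothetical stand-in that copies the condition from the diff and shortens the message; it does not call into cuML.

```python
import warnings


def warn_if_not_reproducible(random_state, n_streams):
    """Hypothetical stand-in for the check shown in the diff: warn when a
    seed is set but more than one stream is used."""
    if (random_state is not None) and (n_streams != 1):
        warnings.warn("n_streams=1 is recommended when random_state is set; "
                      "with n_streams > 1 results may vary between runs")


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warn_if_not_reproducible(random_state=42, n_streams=4)    # warns
    warn_if_not_reproducible(random_state=42, n_streams=1)    # silent
    warn_if_not_reproducible(random_state=None, n_streams=4)  # silent

print(len(caught))
```

Only the first call meets both conditions, so a single warning is recorded.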