
Add criterion to sksurv.ensemble.RandomSurvivalForest #108

Open
arturomoncadatorres opened this issue Apr 16, 2020 · 4 comments
@arturomoncadatorres

arturomoncadatorres commented Apr 16, 2020

It would be fantastic to have criterion (i.e., the function to measure the quality of a split) as a parameter of RandomSurvivalForest. I know that currently only the log-rank splitting rule is supported. For now, this could be set as the default (and only option). In the future, this could be expanded to cover other options from the original paper (for example, conservation, log_rank_score_rule, log_rank_random), changing the corresponding splitting code as well. This would also make RandomSurvivalForest more similar to its scikit-learn counterparts (e.g., RandomForestRegressor), making it (even) more compatible with other packages that build on scikit-learn's standard structure.

I think this could be done easily in forest.py:

    def __init__(self,
                 n_estimators=100,
                 #-->
                 criterion="log_rank",
                 #-->
                 max_depth=None,
                 min_samples_split=6,
                 min_samples_leaf=3,
                 min_weight_fraction_leaf=0.,
                 max_features="auto",
                 max_leaf_nodes=None,
                 bootstrap=True,
                 oob_score=False,
                 n_jobs=None,
                 random_state=None,
                 verbose=0,
                 warm_start=False):
        super().__init__(
            base_estimator=SurvivalTree(),
            n_estimators=n_estimators,
            #--> pass criterion down to each SurvivalTree via estimator_params
            #    (the base forest __init__ does not accept a criterion keyword;
            #    this mirrors how RandomForestRegressor handles it)
            estimator_params=("criterion",
                              "max_depth",
                              "min_samples_split",
                              "min_samples_leaf",
                              "min_weight_fraction_leaf",
                              "max_features",
                              "max_leaf_nodes",
                              "random_state"),
            #-->
            bootstrap=bootstrap,
            oob_score=oob_score,
            n_jobs=n_jobs,
            random_state=random_state,
            verbose=verbose,
            warm_start=warm_start)

        #-->
        self.criterion = criterion
        #-->
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.min_samples_leaf = min_samples_leaf
        self.min_weight_fraction_leaf = min_weight_fraction_leaf
        self.max_features = max_features
        self.max_leaf_nodes = max_leaf_nodes

If this is something you think might be interesting, I would be more than happy to help with a proper PR.
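For context, the default log_rank rule scores a candidate split by the two-sample log-rank statistic between the two child nodes. A minimal pure-Python sketch of that statistic (illustrative only; the function name and signature here are hypothetical, not part of scikit-survival):

```python
def logrank_statistic(times, events, groups):
    """Two-sample log-rank chi-squared statistic.

    times  : observed event/censoring times
    events : 1 if the event occurred, 0 if censored
    groups : 0 (left child) or 1 (right child) for each sample
    """
    # Distinct times at which at least one event occurred
    event_times = sorted({t for t, e in zip(times, events) if e})
    num, var = 0.0, 0.0
    for tj in event_times:
        # Risk set: subjects still under observation at time tj
        at_risk = [(t, e, g) for t, e, g in zip(times, events, groups) if t >= tj]
        n = len(at_risk)
        n1 = sum(1 for _, _, g in at_risk if g == 1)
        d = sum(1 for t, e, _ in at_risk if t == tj and e)            # events at tj
        d1 = sum(1 for t, e, g in at_risk if t == tj and e and g == 1)
        num += d1 - d * n1 / n                                        # observed - expected
        if n > 1:
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return num ** 2 / var if var > 0 else 0.0
```

A splitter maximizing this statistic prefers splits whose children have well-separated survival curves, e.g. early events on one side and late events on the other.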

@james-sexton96

Since this was posted, there is a growing literature suggesting that the time-varying nature of some features would necessitate alternative splitting strategies in RSFs.

Having only a single strategy (log-rank), which is subject to some of the same proportionality assumptions as a Cox regression, might defeat the purpose of a model designed for non-linear problems.

Having at least one alternative option, such as a Poisson regression log-likelihood, could offer an intermediate solution before open-ended splitting strategies become available.
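To make the suggestion concrete, here is a minimal pure-Python sketch of what a Poisson log-likelihood split score could look like, treating each node as a constant-hazard Poisson model for event counts with total follow-up time as exposure. All names here are hypothetical, not existing scikit-survival or sklearn API:

```python
import math

def poisson_node_loglik(n_events, total_time):
    """Maximized Poisson log-likelihood of a node under a constant
    event rate = n_events / total_time (split-independent terms dropped)."""
    if n_events == 0 or total_time == 0:
        return 0.0
    rate = n_events / total_time
    return n_events * math.log(rate) - rate * total_time

def poisson_split_gain(left, right):
    """Gain of splitting a node into (left, right), where each side is an
    (n_events, total_time) pair. Larger gain indicates a better split."""
    d_l, t_l = left
    d_r, t_r = right
    parent = (d_l + d_r, t_l + t_r)
    return (poisson_node_loglik(d_l, t_l)
            + poisson_node_loglik(d_r, t_r)
            - poisson_node_loglik(*parent))
```

Because each child's rate is fitted by maximum likelihood, the gain is non-negative: it is zero when both children share the parent's event rate and grows as their rates diverge, which is exactly the behavior a split criterion needs.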

See the following examples of varying splitting strategies:

@sebp
Owner

sebp commented Apr 14, 2023

@james-sexton96 The range of splitting rules proposed in the literature is quite large. I haven't followed it closely over the last couple of years, so I'm not sure whether a consensus has emerged by now. Conditional Inference Forests would definitely be interesting (see #341).

Do you have a reference for the Poisson regression log-likelihood you mentioned?

@james-sexton96

james-sexton96 commented Apr 21, 2023

@sebp
Sure thing. See references below.

A Poisson regression log-likelihood is well suited for real-world data, as opposed to data with structured follow-up.
There was an attempt to branch the R package randomForestSRC's survival functionality (the RF-SLAM paper by Wongvibulsin below). However, both this branch and the original package appear to be unsupported.

It would be nice to mirror the parameters of sklearn's random forest regressor by including a kwarg for criterion, and if I have time, I can draft an implementation of a Poisson split criterion!

Crowther et al. 2012
Austin P. 2017
Wongvibulsin et al. 2019

@james-sexton96

See also: the Poisson criterion added to scikit-learn.
