[R-package] add support for specifying training indices in lgb.cv() #3924

Closed
julioasotodv opened this issue Feb 8, 2021 · 3 comments

@julioasotodv
Contributor

Summary

Adding a train_folds argument to lgb.cv() would allow more fine-grained fold generation, which is useful in scenarios such as time series forecasting. The xgboost R package already supports this.

Motivation

The R function lgb.cv() currently lets you specify manual folds through the folds argument. This argument expects a list of indices that should go to the test set for each fold; all other indices go to the train set.

However, for some datasets and tasks (such as time series), you may want folds in which some indices are not used at all, in neither the train set nor the test set, to avoid leaking information from the future.

Description

The xgboost R package added this feature a while back: xgb.cv() takes one more argument, train_folds. If specified, only those indices go to the train set in each fold. If it is not specified, the train indices are the complement of the ones in the folds argument, which is how lgb.cv() works right now.
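
The fallback rule described above can be sketched in base R. This is only an illustration of the proposed semantics: `resolve_train_indices()` is a hypothetical helper, not a function in {lightgbm} or {xgboost}, and the fold sizes are made up.

```r
# Hypothetical helper (not part of any package): returns the training indices
# for each fold, given the test indices in `folds` and an optional `train_folds`.
resolve_train_indices <- function(nrows, folds, train_folds = NULL) {
  if (is.null(train_folds)) {
    # Default behaviour: train on every row that is not in the fold's test set
    return(lapply(folds, function(test_idx) setdiff(seq_len(nrows), test_idx)))
  }
  stopifnot(length(train_folds) == length(folds))
  # Explicit training indices: rows in neither list are simply left unused
  train_folds
}

folds <- list(11:20, 21:30)
default_train <- resolve_train_indices(30L, folds)
explicit_train <- resolve_train_indices(30L, folds, train_folds = list(1:10, 1:20))
```

With the default, the first fold trains on rows 1 to 10 and 21 to 30; with the explicit lists it trains on rows 1 to 10 only, so rows 21 to 30 are unused in that fold.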

References

Please see the train_folds argument in xgb.cv() here; the relevant code in xgboost can be found here.

Thank you!

@jameslamb
Collaborator

jameslamb commented Feb 13, 2021

Hi @julioasotodv. Thanks for using LightGBM, and for taking the time to write up this excellent feature request!

Adding a few more details for whenever this is picked up.

I think this example from that Stack Overflow question describes the situation well:

# assume i have 100 strips of time-series data, where each strip is X_i
# validate only on 10 points after training
fold1:  train on X_1-X_10, validate on X_11-X_20
fold2:  train on X_1-X_20, validate on X_21-X_30
fold3:  train on X_1-X_30, validate on X_31-X_40
...
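
A minimal sketch of how such expanding-window folds could be built as index lists in base R. It assumes each strip X_i corresponds to one row, ordered in time; the variable names are illustrative, and `train_folds` here is just a list, not an existing lgb.cv() argument.

```r
# Assumption: 100 rows ordered in time, validation blocks of 10 rows each
n_rows <- 100L
block <- 10L
n_folds <- 3L

# Validation (test) indices for fold k: the block right after the training window
folds <- lapply(seq_len(n_folds), function(k) seq(k * block + 1L, (k + 1L) * block))
# Training indices for fold k: every row from the start up to the validation block
train_folds <- lapply(seq_len(n_folds), function(k) seq_len(k * block))

# With only `folds`, training falls back to the complement of the test indices,
# which for fold 1 also contains rows that come after the validation block
complement_train <- setdiff(seq_len(n_rows), folds[[1]])
leaked <- setdiff(complement_train, train_folds[[1]])
```

Here `leaked` holds the future rows that the complement rule would put into the first fold's training set, which is the leakage the feature request wants to avoid.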

I've added this feature to #2302, where we organize the list of feature requests. I've also edited the title so that it's understandable for people who don't have prior knowledge of the train_folds argument in {xgboost}.

@julioasotodv are you interested in contributing this feature?

@jameslamb jameslamb changed the title [R package] train_folds argument in lgb.cv? [R package] add support for specifying training indices in lgb.cv() Feb 13, 2021
@jameslamb jameslamb changed the title [R package] add support for specifying training indices in lgb.cv() [R-package] add support for specifying training indices in lgb.cv() Feb 13, 2021
julioasotodv pushed a commit to julioasotodv/LightGBM that referenced this issue Feb 16, 2021
@julioasotodv
Contributor Author

@jameslamb sure! I just did. I will create a PR soon!

@jameslamb
Collaborator

At the moment, this feature is not being actively worked on (see #3989 (comment)).

Per the current policy in this repository, I'm going to close this GitHub issue but keep the feature marked as open in #2302. Anyone who is interested in contributing this feature or who wants to add something to this discussion is encouraged to comment here, and the issue can be re-opened.
