[R-package] add support for specifying training indices in lgb.cv() #3924

julioasotodv · 2021-02-08T19:12:38Z

Summary

Adding a train_folds argument to lgb.cv() would allow finer-grained fold generation, which is useful in scenarios such as time series forecasting. The xgboost R package already offers this.

Motivation

The R function lgb.cv() currently lets you specify folds manually through the folds argument. This argument expects a list of index vectors, one per fold, giving the rows that should go to the test set for that fold; all other rows go to the train set.

However, for some datasets and tasks (such as time series), you may want folds in which some indices are not used at all, in neither the train set nor the test set, to avoid leaking information from the future.

Description

The xgboost R package added this feature a while back. It consists of one more argument to xgb.cv(), called train_folds. If train_folds is specified, only the indices it lists go to the train set in each fold. If it is not specified, the train indices for each fold are the complement of the indices given in the folds argument, which is how lgb.cv() works right now.
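
The fallback described above can be sketched in base R. This is only an illustration of the intended semantics: the helper name `resolve_train_indices` is hypothetical and is not part of {lightgbm} or {xgboost}.

```r
# Hypothetical helper (not part of any package): for each fold, return the
# row indices that would be used for training.
resolve_train_indices <- function(n_rows, folds, train_folds = NULL) {
    all_indices <- seq_len(n_rows)
    lapply(seq_along(folds), function(k) {
        if (is.null(train_folds)) {
            # behaviour described for lgb.cv() today:
            # train on everything that is not in the test fold
            setdiff(all_indices, folds[[k]])
        } else {
            # proposed behaviour: train only on the explicitly listed indices
            train_folds[[k]]
        }
    })
}

folds <- list(11:20, 21:30)       # test indices per fold
train_folds <- list(1:10, 1:20)   # explicit train indices per fold

default_train <- resolve_train_indices(30L, folds)
explicit_train <- resolve_train_indices(30L, folds, train_folds)
```

In the default case the first fold trains on rows 1-10 and 21-30, so later observations leak into training. With explicit train indices it trains on rows 1-10 only.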

References

Please see the train_folds argument in xgb.cv() here; the relevant code in xgboost can be found here.

Thank you!


jameslamb · 2021-02-13T05:50:32Z

Hi @julioasotodv . Thanks for using LightGBM, and for taking the time to write up this excellent feature request!

Adding a few more details for whenever this is picked up:

PRs where {xgboost} added this feature: Added new train_folds parameter dmlc/xgboost#5114, Added new train_folds parameter for xgb.cv() dmlc/xgboost#5064
Stack Overflow question that prompted the train_folds feature in {xgboost}: https://stackoverflow.com/questions/32433458/how-to-specify-train-and-test-indices-for-xgb-cv-in-r-package-xgboost/51412073

I think this example from that Stack Overflow question describes the situation well:

# assume i have 100 strips of time-series data, where each strip is X_i
# validate only on 10 points after training
fold1:  train on X_1-X_10, validate on X_11-X_20
fold2:  train on X_1-X_20, validate on X_21-X_30
fold3:  train on X_1-X_30, validate on X_31-X_40
...
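
For reference, the expanding-window scheme above could be generated in base R roughly as follows. This is a sketch under assumptions: the helper name `make_expanding_folds` is hypothetical, rows are assumed to be ordered in time, and each "strip" is treated as one row index.

```r
# Hypothetical helper: build expanding-window folds over n_rows time-ordered
# observations, validating on the `window` points that follow the training span.
make_expanding_folds <- function(n_rows, window = 10L) {
    n_folds <- n_rows %/% window - 1L
    # last training index of each fold: window, 2 * window, ...
    train_ends <- seq_len(n_folds) * window
    list(
        train_folds = lapply(train_ends, function(e) seq_len(e)),
        folds = lapply(train_ends, function(e) seq(e + 1L, e + window))
    )
}

splits <- make_expanding_folds(100L, window = 10L)

# fold 1: train on 1-10, validate on 11-20; rows 21-100 are unused
first_train <- splits$train_folds[[1]]
first_valid <- splits$folds[[1]]
```

With the requested feature, `splits$folds` and `splits$train_folds` would be the kind of lists passed as the folds and train_folds arguments.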

I've added this feature to #2302, where we organize the list of feature requests. I've also edited the title so that it's understandable for people who don't have prior knowledge of the train_folds argument in {xgboost}.

@julioasotodv are you interested in contributing this feature?


julioasotodv · 2021-02-16T12:25:45Z

@jameslamb sure! I just did. I will create a PR soon!

jameslamb · 2021-09-01T05:24:42Z

At the moment, this feature is not being actively worked on (see #3989 (comment)).

Per the current policy in this repository, I'm going to close this GitHub issue but keep the feature marked as open in #2302. Anyone who is interested in contributing this feature or who wants to add something to this discussion is encouraged to comment here, and the issue can be re-opened.

jameslamb added feature request r-package labels Feb 8, 2021

jameslamb changed the title ~~[R package] train_folds argument in lgb.cv?~~ [R package] add support for specifying training indices in lgb.cv() Feb 13, 2021

jameslamb changed the title ~~[R package] add support for specifying training indices in lgb.cv()~~ [R-package] add support for specifying training indices in lgb.cv() Feb 13, 2021

jameslamb mentioned this issue Feb 13, 2021

Feature Requests & Voting Hub #2302

Open

julioasotodv pushed a commit to julioasotodv/LightGBM that referenced this issue Feb 16, 2021

Add support for specifying training indices in lgb.cv()

37888e0

As seen in issue microsoft#3924

julioasotodv mentioned this issue Feb 16, 2021

WIP: [R-package] Add support for specifying training indices in lgb.cv() #3989

Closed

jameslamb closed this as completed Sep 1, 2021
