
[R] Add evaluation set and early stopping for xgboost() #11065

Merged
13 commits merged into dmlc:master on Dec 11, 2024

Conversation

david-cortes (Contributor):

ref #9810

This PR adds the evaluation set and early stopping functionality to the new xgboost() function.

Due to the data processing that this new interface performs (e.g. encoding of factors for both 'x' and 'y'), it only allows specifying the evaluation data as a subset of the x/y data, either as a random fraction or through selected row indices; taking separate lists of data would otherwise require a lot of hassle with encodings, variable reorderings, and so on.
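For concreteness, a hedged usage sketch based on the behavior described above. The argument names `eval_set` and `early_stopping_rounds` appear in this PR; the exact call shapes and defaults are whatever the merged interface defines, so treat this as an illustration rather than authoritative documentation.

```r
library(xgboost)

# Hypothetical illustration: 'eval_set' carves the evaluation data out of
# x/y rather than accepting a separate dataset.
data(mtcars)
x <- as.matrix(mtcars[, -1])
y <- mtcars$mpg

# As a random fraction of the rows:
model <- xgboost(
  x, y,
  nrounds = 100,
  eval_set = 0.2,             # hold out 20% of rows for evaluation
  early_stopping_rounds = 5   # stop if no improvement for 5 rounds
)

# Or as explicit row indices:
model <- xgboost(
  x, y,
  nrounds = 100,
  eval_set = seq(1, nrow(x), by = 4),
  early_stopping_rounds = 5
)
```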

Review comment on the documentation for `early_stopping_rounds`:

#' @param early_stopping_rounds If `NULL`, the early stopping function is not triggered.
#' If set to an integer `k`, training with a validation set will stop if the performance
#' doesn't improve for `k` rounds. Setting this parameter engages the [xgb.cb.early.stop()] callback.
#' @param early_stopping_rounds Number of boosting rounds after which training will be stopped

trivialfis (Member):

Is there a defined behavior for using multiple metrics or multiple evals in R? In Python, the last metric and the last validation dataset are used for early stopping.

david-cortes (Contributor, Author):

It's the same in R. Updated the docs.
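For illustration, a hedged sketch of that rule using the lower-level xgb.train() interface with an old-style params list (the evaluation-set argument is named `evals` in recent versions and `watchlist` in older ones; `dtrain` and `dvalid` are assumed pre-built xgb.DMatrix objects):

```r
# Hedged sketch: with multiple metrics and multiple evaluation sets, early
# stopping monitors the LAST metric ("logloss") on the LAST set ("valid").
params <- list(
  objective = "binary:logistic",
  eval_metric = "auc",
  eval_metric = "logloss"   # duplicate names are allowed in R lists
)
bst <- xgb.train(
  params = params,
  data = dtrain,
  nrounds = 500,
  evals = list(train = dtrain, valid = dvalid),
  early_stopping_rounds = 10
)
```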

Review comment on check.can.use.qdm():

@@ -512,6 +516,9 @@ check.can.use.qdm <- function(x, params) {
      return(FALSE)
    }
  }
  if (NROW(eval_set)) {

trivialfis (Member):

What does this imply? If eval_set has a valid number of rows then we can't use qdm?

david-cortes (Contributor, Author):

Yes, because it then slices the DMatrix that gets created.
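To make the sequence concrete, a hedged sketch of the order of operations, using `xgb.slice.DMatrix()`, which works on a regular DMatrix but not on a QuantileDMatrix (hence the `FALSE` above); `x` and `y` stand in for the already-processed inputs:

```r
# Build one DMatrix from the fully processed x/y, then slice it into
# train/eval parts. A QuantileDMatrix cannot be sliced this way.
dm <- xgb.DMatrix(x, label = y)
idx_eval  <- sample(nrow(dm), floor(0.2 * nrow(dm)))
idx_train <- setdiff(seq_len(nrow(dm)), idx_eval)
dtrain <- xgb.slice.DMatrix(dm, idx_train)
deval  <- xgb.slice.DMatrix(dm, idx_eval)
```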

trivialfis (Member):

Does the slicing need to happen after the DMatrix is created?

david-cortes (Contributor, Author):

Yes, because otherwise there'd be issues with things like needing to make sure that a categorical 'y' and categorical features have the same encodings between the two sets, objects from the Matrix package returning a different class when they are sliced, and so on.
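The Matrix point can be seen directly (illustrative only; exact coercions depend on the Matrix version installed):

```r
library(Matrix)

# A CSR matrix from the Matrix package...
m <- as(Matrix(c(0, 1, 0, 2, 0, 3), nrow = 3, sparse = TRUE), "RsparseMatrix")
class(m)          # "dgRMatrix" (CSR)

# ...changes class when sliced:
class(m[1:2, ])   # a CSC "dgCMatrix", not CSR
class(m[1, ])     # drops to a plain numeric vector by default
```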

trivialfis (Member):

Hmm, so from your perspective the DMatrix is more suitable for slicing than built-in classes...

trivialfis (Member):

We are planning to work on CV with shared quantile cuts for improved computational performance (sharing the quantiles between QDM folds). It's a minor information leak but can significantly increase performance, especially with external memory.

As a result, I have to consider how this can be implemented. If we double down on DMatrix slicing, it will prevent us from applying that optimization. It's very unlikely that we can slice an external-memory DMatrix. Also, the slice method in XGBoost is quite slow and memory-inefficient.

I can merge this PR as it is, but I think we might have more trouble when applying the optimization for CV.

david-cortes (Contributor, Author):

Well, the alternative would be to:

  • Restructure the code such that 'x' and 'y' get processed earlier.
  • Make it so that characters in 'x' get converted to factor before passing them to xgb.DMatrix, and so that data.frames get their subclasses (like data.table) removed beforehand.
  • Add a custom slicer for Matrix classes; either that, or additional casts of storage format, or pulling in an extra dependency for efficient slicing.

But it'd end up being inefficient either way, all the more so considering that on the R side, the slicing would happen with a vector of random indices on one of the following:

  • A column-major matrix of 8-byte elements (likely slower to slice than a CSR).
  • A list of vectors (which is what a data.frame is behind the scenes).
  • A CSC matrix, which would first get converted to that format from CSR (Matrix doesn't slice CSR directly), with the slice then getting converted to COO.

It could in theory be more efficient to do the slicing in R for base matrix classes, but probably not so much for the others (a rough sketch of the three paths follows).
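A rough, runnable illustration of those three slicing paths on toy data (not the package's internal code; the Matrix coercion behavior is as of recent Matrix versions):

```r
library(Matrix)
set.seed(1)

x   <- matrix(rnorm(50 * 4), nrow = 50)               # base column-major matrix
df  <- as.data.frame(x)                                # a list of column vectors
csr <- as(Matrix(x, sparse = TRUE), "RsparseMatrix")   # CSR via Matrix

idx <- sample(nrow(x), 10L)     # random row indices

x[idx, , drop = FALSE]   # column-major gather across all columns
df[idx, ]                # one gather per column in the underlying list
csr[idx, ]               # coerced to CSC by Matrix before subsetting
```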

trivialfis (Member), Dec 10, 2024:

Any suggestions for the future implementation of the CV optimization mentioned earlier? It's designed for the QDM and the external-memory version of the QDM.

david-cortes (Contributor, Author):

Sounds good, and quite helpful for the CV function indeed. But I don't think it's very relevant here, since unlike xgb.cv, (a) it only does two slicing operations and a single quantization/binning, and (b) it doesn't accept an arbitrary xgb.DMatrix; instead, it creates one internally from a small set of allowed classes.

Hence, it doesn't need to consider special cases like external memory or distributed mode, and there isn't much room for improvement in terms of speed savings.

trivialfis (Member):

Makes sense. We will have a second CV function in the future for high-level inputs (like data.table, iterators, etc.).

david-cortes (Contributor, Author):

Just realized that verbosity wasn't being passed from xgboost() to params. Fixed it here, so that there's only one verbosity parameter to control.

trivialfis (Member) left a review:

Will merge after the CI is back online.

trivialfis merged commit b202ebe into dmlc:master on Dec 11, 2024 (57 of 58 checks passed).