
[R] Add evaluation set and early stopping for xgboost() #11065

Merged
13 commits merged into dmlc:master on Dec 11, 2024

Conversation

david-cortes (Contributor):

ref #9810

This PR adds the evaluation set and early stopping functionality to the new xgboost() function.

Due to the data processing that this new interface performs (e.g. encoding of factors for both 'x' and 'y'), it only allows specifying the evaluation data as a subset of the x/y data, either as a random fraction or through selected row indices; taking separate lists of data would otherwise require a lot of hassle with encodings, variable reorderings, and so on.
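For concreteness, a hedged usage sketch based on the behavior described above. The argument names `eval_set` and `early_stopping_rounds` appear in this PR; the exact call shapes and defaults are whatever the merged interface defines, so treat this as an illustration rather than authoritative documentation.

```r
library(xgboost)

# Hypothetical illustration: 'eval_set' carves the evaluation data out of
# x/y rather than accepting a separate dataset.
data(mtcars)
x <- as.matrix(mtcars[, -1])
y <- mtcars$mpg

# As a random fraction of the rows:
model <- xgboost(
  x, y,
  nrounds = 100,
  eval_set = 0.2,             # hold out 20% of rows for evaluation
  early_stopping_rounds = 5   # stop if no improvement for 5 rounds
)

# Or as explicit row indices:
model <- xgboost(
  x, y,
  nrounds = 100,
  eval_set = seq(1, nrow(x), by = 4),
  early_stopping_rounds = 5
)
```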

Review comment on the documentation for `early_stopping_rounds`:

#' @param early_stopping_rounds If `NULL`, the early stopping function is not triggered.
#' If set to an integer `k`, training with a validation set will stop if the performance
#' doesn't improve for `k` rounds. Setting this parameter engages the [xgb.cb.early.stop()] callback.
#' @param early_stopping_rounds Number of boosting rounds after which training will be stopped

trivialfis (Member):

Is there a defined behavior for using multiple metrics or multiple evals in R? In Python, the last metric and the last validation dataset are used for early stopping.

david-cortes (Contributor, Author):

It's the same in R. Updated the docs.
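For illustration, a hedged sketch of that rule using the lower-level xgb.train() interface with an old-style params list (the evaluation-set argument is named `evals` in recent versions and `watchlist` in older ones; `dtrain` and `dvalid` are assumed pre-built xgb.DMatrix objects):

```r
# Hedged sketch: with multiple metrics and multiple evaluation sets, early
# stopping monitors the LAST metric ("logloss") on the LAST set ("valid").
params <- list(
  objective = "binary:logistic",
  eval_metric = "auc",
  eval_metric = "logloss"   # duplicate names are allowed in R lists
)
bst <- xgb.train(
  params = params,
  data = dtrain,
  nrounds = 500,
  evals = list(train = dtrain, valid = dvalid),
  early_stopping_rounds = 10
)
```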

Review comment on check.can.use.qdm():

@@ -512,6 +516,9 @@ check.can.use.qdm <- function(x, params) {
      return(FALSE)
    }
  }
  if (NROW(eval_set)) {

trivialfis (Member):

What does this imply? If eval_set has a valid number of rows then we can't use qdm?

david-cortes (Contributor, Author):

Yes, because it then slices the DMatrix that gets created.
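To make the sequence concrete, a hedged sketch of the order of operations, using `xgb.slice.DMatrix()`, which works on a regular DMatrix but not on a QuantileDMatrix (hence the `FALSE` above); `x` and `y` stand in for the already-processed inputs:

```r
# Build one DMatrix from the fully processed x/y, then slice it into
# train/eval parts. A QuantileDMatrix cannot be sliced this way.
dm <- xgb.DMatrix(x, label = y)
idx_eval  <- sample(nrow(dm), floor(0.2 * nrow(dm)))
idx_train <- setdiff(seq_len(nrow(dm)), idx_eval)
dtrain <- xgb.slice.DMatrix(dm, idx_train)
deval  <- xgb.slice.DMatrix(dm, idx_eval)
```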

trivialfis (Member):

Does the slicing need to happen after the DMatrix is created?

david-cortes (Contributor, Author):

Yes, because otherwise there'd be issues with things like needing to make sure that a categorical 'y' and categorical features have the same encodings between the two sets, objects from the Matrix package returning a different class when they are sliced, and so on.
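The Matrix point can be seen directly (illustrative only; exact coercions depend on the Matrix version installed):

```r
library(Matrix)

# A CSR matrix from the Matrix package...
m <- as(Matrix(c(0, 1, 0, 2, 0, 3), nrow = 3, sparse = TRUE), "RsparseMatrix")
class(m)          # "dgRMatrix" (CSR)

# ...changes class when sliced:
class(m[1:2, ])   # a CSC "dgCMatrix", not CSR
class(m[1, ])     # drops to a plain numeric vector by default
```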

trivialfis (Member):

Hmm, so from your perspective the DMatrix is more suitable for slicing than built-in classes...

trivialfis (Member):

We are planning to work on CV with shared quantile cuts for improved computational performance (sharing the quantiles between QDM folds). It's a minor information leak but can significantly increase performance, especially with external memory.

As a result, I have to consider how this can be implemented. If we double down on DMatrix slicing, it will prevent us from applying that optimization. It's very unlikely that we can slice an external-memory DMatrix. Also, the slice method in XGBoost is quite slow and memory-inefficient.

I can merge this PR as it is, but I think we might have more trouble when applying the optimization for CV.

david-cortes (Contributor, Author):

Well, the alternative would be to:

  • Restructure the code such that 'x' and 'y' get processed earlier.
  • Make it so that characters in 'x' get converted to factor before passing them to xgb.DMatrix, and so that data.frames get their subclasses (like data.table) removed beforehand.
  • Add a custom slicer for Matrix classes; either that, or additional casts of storage format, or pulling in an extra dependency for efficient slicing.

But it'd end up being inefficient either way, all the more so considering that on the R side, the slicing would happen with a vector of random indices on one of the following:

  • A column-major matrix of 8-byte elements (likely slower to slice than a CSR).
  • A list of vectors (which is what a data.frame is behind the scenes).
  • A CSC matrix, which would first get converted to that format from CSR (Matrix doesn't slice CSR directly), with the slice then getting converted to COO.

It could in theory be more efficient to do the slicing in R for base matrix classes, but probably not so much for the others (a rough sketch of the three paths follows).
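A rough, runnable illustration of those three slicing paths on toy data (not the package's internal code; the Matrix coercion behavior is as of recent Matrix versions):

```r
library(Matrix)
set.seed(1)

x   <- matrix(rnorm(50 * 4), nrow = 50)               # base column-major matrix
df  <- as.data.frame(x)                                # a list of column vectors
csr <- as(Matrix(x, sparse = TRUE), "RsparseMatrix")   # CSR via Matrix

idx <- sample(nrow(x), 10L)     # random row indices

x[idx, , drop = FALSE]   # column-major gather across all columns
df[idx, ]                # one gather per column in the underlying list
csr[idx, ]               # coerced to CSC by Matrix before subsetting
```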

trivialfis (Member), Dec 10, 2024:

Any suggestions for the future implementation of the CV optimization mentioned earlier? It's designed for the QDM and the external-memory version of the QDM.

david-cortes (Contributor, Author):

Sounds good, and quite helpful for the CV function indeed. But I don't think it's very relevant here, since unlike xgb.cv, (a) it only does two slicing operations and a single quantization/binning, and (b) it doesn't accept an arbitrary xgb.DMatrix; instead, it creates one internally from a small set of allowed classes.

Hence, it doesn't need to consider special cases like external memory or distributed mode, and there isn't much room for improvement in terms of speed savings.

trivialfis (Member):

Makes sense. We will have a second CV function in the future for high-level inputs (like data.table, iterators, etc.).

david-cortes (Contributor, Author):

Just realized that verbosity wasn't being passed from xgboost() to params. Fixed it here, so that there's only one verbosity parameter to control.

trivialfis (Member) left a review:

Will merge after the CI is back online.

trivialfis merged commit b202ebe into dmlc:master on Dec 11, 2024 (57 of 58 checks passed).