[R] Add evaluation set and early stopping for xgboost() #11065

Merged: 13 commits, Dec 11, 2024
12 changes: 9 additions & 3 deletions R-package/R/xgb.train.R
@@ -147,9 +147,15 @@
#' @param print_every_n Print each nth iteration evaluation messages when `verbose>0`.
#' Default is 1 which means all messages are printed. This parameter is passed to the
#' [xgb.cb.print.evaluation()] callback.
#' @param early_stopping_rounds If `NULL`, the early stopping function is not triggered.
#' If set to an integer `k`, training with a validation set will stop if the performance
#' doesn't improve for `k` rounds. Setting this parameter engages the [xgb.cb.early.stop()] callback.
#' @param early_stopping_rounds Number of boosting rounds after which training will be stopped
Member:
Is there a defined behavior for using multiple metrics or multiple evals in R? In Python, the last metric and the last validation dataset are used for early stopping.

Contributor Author:
It's the same in R. Updated the docs.

#' if there is no improvement in performance (as measured by the evaluation metric that is
#' supplied or selected by default for the objective) on the evaluation data passed under
#' `evals`.
#'
#' Must pass `evals` in order to use this functionality. Setting this parameter adds the
#' [xgb.cb.early.stop()] callback.
#'
#' If `NULL`, early stopping will not be used.
#' @param maximize If `feval` and `early_stopping_rounds` are set, then this parameter must be set as well.
#' When it is `TRUE`, it means the larger the evaluation score the better.
#' This parameter is passed to the [xgb.cb.early.stop()] callback.
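As a point of reference for the behavior discussed in the thread above, here is a minimal sketch of using `evals` together with `early_stopping_rounds` in `xgb.train()` (the dataset and hyperparameter values are illustrative only); when several metrics or evaluation sets are supplied, the last one listed is the one early stopping monitors:

library(xgboost)
data(agaricus.train, package = "xgboost")
data(agaricus.test, package = "xgboost")
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
dtest <- xgb.DMatrix(agaricus.test$data, label = agaricus.test$label)

# 'test' is listed last in 'evals', so it is the set monitored for early stopping
model <- xgb.train(
  params = list(objective = "binary:logistic", max_depth = 2),
  data = dtrain,
  nrounds = 100,
  evals = list(train = dtrain, test = dtest),
  early_stopping_rounds = 3
)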
172 changes: 167 additions & 5 deletions R-package/R/xgboost.R
@@ -22,6 +22,10 @@ prescreen.parameters <- function(params) {

prescreen.objective <- function(objective) {
if (!is.null(objective)) {
if (!is.character(objective) || length(objective) != 1L || is.na(objective)) {
stop("'objective' must be a single character/string variable.")
}

if (objective %in% .OBJECTIVES_NON_DEFAULT_MODE()) {
stop(
"Objectives with non-default prediction mode (",
@@ -30,8 +34,8 @@ prescreen.objective <- function(objective) {
)
}

if (!is.character(objective) || length(objective) != 1L || is.na(objective)) {
stop("'objective' must be a single character/string variable.")
if (objective %in% .RANKING_OBJECTIVES()) {
stop("Ranking objectives are not supported in 'xgboost()'. Try 'xgb.train()'.")
}
}
}
@@ -501,7 +505,7 @@ check.nthreads <- function(nthreads) {
return(as.integer(nthreads))
}

check.can.use.qdm <- function(x, params) {
check.can.use.qdm <- function(x, params, eval_set) {
if ("booster" %in% names(params)) {
if (params$booster == "gblinear") {
return(FALSE)
@@ -512,6 +516,9 @@ check.can.use.qdm <- function(x, params) {
return(FALSE)
}
}
if (NROW(eval_set)) {
Member:
What does this imply? If eval_set has a valid number of rows then we can't use qdm?

Contributor Author:
Yes, because it then slices the DMatrix that gets created.

Member:
Does the slicing need to happen after the DMatrix is created?

Contributor Author:
Yes, because otherwise there'd be issues with things like needing to make sure categorical 'y' and features have the same encodings between the two sets, objects from package Matrix returning a different class when they are sliced, and so on.
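To make the Matrix point concrete, a small sketch (assuming the 'Matrix' package is available; the exact returned class can vary across Matrix versions, but row-major sparse inputs generally come back column-major after subsetting):

library(Matrix)
x <- as(matrix(c(1, 0, 0, 2, 0, 3), nrow = 3L), "RsparseMatrix")
class(x)          # "dgRMatrix" (row-major / CSR)
class(x[1:2, ])   # typically "dgCMatrix" (column-major / CSC) after slicing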

Member:
Hmm, so from your perspective the DMatrix is more suitable for slicing than built-in classes...

Member:
We are planning to work on CV with shared quantile cuts for improved computational performance (sharing the quantiles between QDM folds). It's a minor information leak but can significantly increase performance, especially with external memory.

As a result, I have to consider how this can be implemented. If we double down on the DMatrix slicing, it will prevent us from applying that optimization. It's very unlikely that we can slice an external-memory DMatrix. Also, the slice method in XGBoost is quite slow and memory-inefficient.

I can merge this PR as it is, but I think we might have more trouble when applying the optimization for CV.
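For reference, the optimization described here (building the evaluation QuantileDMatrix against the training fold's quantile cuts instead of recomputing them) would look roughly like the sketch below. It is not part of this PR, `x_train`/`y_train`/`x_eval`/`y_eval` are placeholders, and it assumes `xgb.QuantileDMatrix()` accepts a `ref` argument analogous to Python's `QuantileDMatrix(ref=...)`:

# assumed API: 'ref' re-uses the quantile cuts computed for dtrain
dtrain <- xgb.QuantileDMatrix(x_train, label = y_train, max_bin = 256)
deval <- xgb.QuantileDMatrix(x_eval, label = y_eval, ref = dtrain)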

Contributor Author:
Well, the alternative would be to:

  • Restructure the code such that 'x' and 'y' get processed earlier.
  • Make it so that characters in 'x' get converted to factor before passing them to xgb.DMatrix, and so that data.frames get their subclasses (like data.table) removed beforehand.
  • Add a custom slicer for Matrix classes. Either that, or additional castings of storage format, or pull an extra dependency for efficient slicing.

But it'd end up being inefficient either way, all the more so considering that on the R side, the slicing would happen with a vector of random indices on one of the following:

  • A column-major matrix of 8-byte elements (likely slower than a CSR).
  • A list of vectors (what a dataframe is behind the scenes).
  • A CSC matrix, which will first get converted to that format from CSR (Matrix doesn't slice CSR directly), with the slice then being converted to COO.

It could in theory be more efficient to do the slicing in R for base matrix classes, but probably not so much for the others.

Member (@trivialfis, Dec 10, 2024):
Any suggestions for the future implementation of the CV optimization mentioned previously? It's designed for the QDM and the external-memory version of QDM.

Contributor Author:
Sounds good, and quite helpful for the CV function indeed. But I don't think it's very relevant here, since unlike xgb.cv, (a) it only does two slicing operations and a single quantization/binning, and (b) it doesn't accept an arbitrary xgb.DMatrix; instead, it creates one internally from a small subset of allowed classes.

Hence, it doesn't need to consider special cases like external memory or distributed mode, and there isn't too much room for improvement in terms of speed savings.
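Concretely, the flow being described (paraphrasing the code added further down in this diff) is: construct a single DMatrix from the full data, quantize once, then carve out the two subsets with `xgb.slice.DMatrix()`:

dm <- do.call(fn_dm, lst_args$dmatrix_args)          # one construction, one binning if QDM
dm_eval <- xgb.slice.DMatrix(dm, eval_set$idx_eval)  # slice 1: evaluation rows
dm <- xgb.slice.DMatrix(dm, eval_set$idx_train)      # slice 2: training rows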

Member:
Makes sense. We will have a second cv function in the future for high-level inputs (like data.table, iterators, etc.).

return(FALSE)
}
return(TRUE)
}

@@ -717,6 +724,129 @@ process.x.and.col.args <- function(
return(lst_args)
}

process.eval.set <- function(eval_set, lst_args) {
if (!NROW(eval_set)) {
return(NULL)
}
nrows <- nrow(lst_args$dmatrix_args$data)
is_classif <- hasName(lst_args$metadata, "y_levels")
processed_y <- lst_args$dmatrix_args$label
eval_set <- as.vector(eval_set)
if (length(eval_set) == 1L) {

eval_set <- as.numeric(eval_set)
if (is.na(eval_set) || eval_set < 0 || eval_set >= 1) {
stop("'eval_set' as a fraction must be a number between zero and one (non-inclusive).")
}
if (eval_set == 0) {
return(NULL)
}
nrow_eval <- as.integer(round(nrows * eval_set, 0))
if (nrow_eval < 1) {
warning(
"Desired 'eval_set' fraction amounts to zero observations.",
" Will not create evaluation set."
)
return(NULL)
}
nrow_train <- nrows - nrow_eval
if (nrow_train < 2L) {
stop("Desired 'eval_set' fraction would leave less than 2 observations for training data.")
}
if (is_classif && nrow_train < length(lst_args$metadata$y_levels)) {
stop("Desired 'eval_set' fraction would not leave enough samples for each class of 'y'.")
}

seed <- lst_args$params$seed
if (!is.null(seed)) {
set.seed(seed)
}

idx_shuffled <- sample(nrows, nrows, replace = FALSE)
idx_eval <- idx_shuffled[seq(1L, nrow_eval)]
idx_train <- idx_shuffled[seq(nrow_eval + 1L, nrows)]
# Here we want the training set to include all of the classes of 'y' for classification
# objectives. If that condition doesn't hold with the random sample, then it forcibly
# makes a new random selection in such a way that the condition would always hold, by
# first sampling one random example of 'y' for training and then choosing the evaluation
# set from the remaining rows. The procedure here is quite inefficient, but there aren't
# enough random-related functions in base R to be able to construct an efficient version.
if (is_classif && length(unique(processed_y[idx_train])) < length(lst_args$metadata$y_levels)) {
# These are defined in order to avoid NOTEs from CRAN checks
# when using non-standard data.table evaluation with column names.
idx <- NULL
y <- NULL
ranked_idx <- NULL
chosen <- NULL

dt <- data.table::data.table(y = processed_y, idx = seq(1L, nrows))[
, .(
ranked_idx = seq(1L, .N),
chosen = rep(sample(.N, 1L), .N),
idx
)
, by = y
]
min_idx_train <- dt[ranked_idx == chosen, idx]
rem_idx <- dt[ranked_idx != chosen, idx]
if (length(rem_idx) == nrow_eval) {
idx_train <- min_idx_train
idx_eval <- rem_idx
} else {
rem_idx <- rem_idx[sample(length(rem_idx), length(rem_idx), replace = FALSE)]
idx_eval <- rem_idx[seq(1L, nrow_eval)]
idx_train <- c(min_idx_train, rem_idx[seq(nrow_eval + 1L, length(rem_idx))])
}
}

} else {

if (any(eval_set != floor(eval_set))) {
stop("'eval_set' as indices must contain only integers.")
}
eval_set <- as.integer(eval_set)
idx_min <- min(eval_set)
if (is.na(idx_min) || idx_min < 1L) {
stop("'eval_set' contains invalid indices.")
}
idx_max <- max(eval_set)
if (is.na(idx_max) || idx_max > nrows) {
stop("'eval_set' contains row indices beyond the size of the input data.")
}
idx_train <- seq(1L, nrows)[-eval_set]
if (is_classif && length(unique(processed_y[idx_train])) < length(lst_args$metadata$y_levels)) {
warning("'eval_set' indices will leave some classes of 'y' outside of the training data.")
}
idx_eval <- eval_set

}

# Note: slicing is done in the constructed DMatrix object instead of in the
# original input, because objects from 'Matrix' might change class after
# being sliced (e.g. 'dgRMatrix' turns into 'dgCMatrix').
return(list(idx_train = idx_train, idx_eval = idx_eval))
}

check.early.stopping.rounds <- function(early_stopping_rounds, eval_set) {
if (is.null(early_stopping_rounds)) {
return(NULL)
}
if (is.null(eval_set)) {
stop("'early_stopping_rounds' requires passing 'eval_set'.")
}
if (NROW(early_stopping_rounds) != 1L) {
stop("'early_stopping_rounds' must be NULL or an integer greater than zero.")
}
early_stopping_rounds <- as.integer(early_stopping_rounds)
if (is.na(early_stopping_rounds) || early_stopping_rounds <= 0L) {
stop(
"'early_stopping_rounds' must be NULL or an integer greater than zero. Got: ",
early_stopping_rounds
)
}
return(early_stopping_rounds)
}

#' Fit XGBoost Model
#'
#' @export
@@ -808,6 +938,28 @@ process.x.and.col.args <- function(
#' 2 (info), and 3 (debug).
#' @param monitor_training Whether to monitor objective optimization progress on the input data.
#' Note that same 'x' and 'y' data are used for both model fitting and evaluation.
#' @param eval_set Subset of the data to use as evaluation set. Can be passed as:
#' - A vector of row indices (base-1 numeration) indicating the observations that are to be designated
#' as evaluation data.
#' - A number between zero and one indicating a random fraction of the input data to use as
#' evaluation data. Note that the selection will be done uniformly at random, regardless of
#' argument `weights`.
#'
#' If passed, this subset of the data will be excluded from the training procedure, and a default
#' metric for the selected objective will be calculated on this dataset after each boosting
#' iteration (pass `verbosity>0` to have these metrics printed during training).
#'
#' If passing a fraction, in classification problems, the evaluation set will be chosen in such a
#' way that at least one observation of each class will be kept in the training data.
#'
#' For more elaborate evaluation variants (e.g. custom metrics, multiple evaluation sets, etc.),
#' one might want to use [xgb.train()] instead.
#' @param early_stopping_rounds Number of boosting rounds after which training will be stopped
#' if there is no improvement in performance (as measured by the default evaluation metric
#' for the objective) on the evaluation data from `eval_set`. Must pass `eval_set` in order to use
#' this functionality.
#'
#' If `NULL`, early stopping will not be used.
#' @param nthreads Number of parallel threads to use. If passing zero, will use all CPU threads.
#' @param seed Seed to use for random number generation. If passing `NULL`, will draw a random
#' number using R's PRNG system to use as seed.
@@ -895,6 +1047,8 @@ xgboost <- function(
weights = NULL,
verbosity = 0L,
monitor_training = verbosity > 0,
eval_set = NULL,
early_stopping_rounds = NULL,
nthreads = parallel::detectCores(),
seed = 0L,
monotone_constraints = NULL,
@@ -907,7 +1061,7 @@
params <- list(...)
params <- prescreen.parameters(params)
prescreen.objective(objective)
use_qdm <- check.can.use.qdm(x, params)
use_qdm <- check.can.use.qdm(x, params, eval_set)
lst_args <- process.y.margin.and.objective(y, base_margin, objective, params)
lst_args <- process.row.weights(weights, lst_args)
lst_args <- process.x.and.col.args(
@@ -918,8 +1072,9 @@
lst_args,
use_qdm
)
eval_set <- process.eval.set(eval_set, lst_args)

if (use_qdm && "max_bin" %in% names(params)) {
if (use_qdm && hasName(params, "max_bin")) {
lst_args$dmatrix_args$max_bin <- params$max_bin
}

@@ -932,10 +1087,17 @@

fn_dm <- if (use_qdm) xgb.QuantileDMatrix else xgb.DMatrix
dm <- do.call(fn_dm, lst_args$dmatrix_args)
if (!is.null(eval_set)) {
dm_eval <- xgb.slice.DMatrix(dm, eval_set$idx_eval)
dm <- xgb.slice.DMatrix(dm, eval_set$idx_train)
}
evals <- list()
if (monitor_training) {
evals <- list(train = dm)
}
if (!is.null(eval_set)) {
evals <- c(evals, list(eval = dm_eval))
}
model <- xgb.train(
params = params,
data = dm,
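Putting the new arguments together, end-user usage looks like the following (a sketch based on the documentation and tests added in this PR; the dataset and hyperparameter values are illustrative):

library(xgboost)

y <- iris$Species
x <- iris[, -5L]

model <- xgboost(
  x, y,
  eval_set = 0.2,              # hold out a random 20% of rows as evaluation data
  early_stopping_rounds = 3L,  # stop if the 'eval' metric fails to improve for 3 rounds
  nrounds = 100L,
  max_depth = 2L,
  nthreads = 1L
)

attributes(model)$evaluation_log  # per-iteration metrics on the held-out 'eval' set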
12 changes: 9 additions & 3 deletions R-package/man/xgb.train.Rd


28 changes: 28 additions & 0 deletions R-package/man/xgboost.Rd


67 changes: 67 additions & 0 deletions R-package/tests/testthat/test_xgboost.R
@@ -945,3 +945,70 @@
expect_equal(dim(pred), c(nrow(x), 1L, 3L))
expect_equal(dimnames(pred)[[3L]], c("q0.25", "q0.5", "q0.75"))
})

test_that("Evaluation fraction leaves examples of all classes for training", {
# With minimal sample leave no remainder
lst_args <- list(
dmatrix_args = list(
data = matrix(seq(1, 4), ncol = 1L),
label = c(0, 0, 1, 1)
),
metadata = list(
y_levels = c("a", "b")
),
params = list(
seed = 123
)
)
for (retry in seq_len(10)) {
lst_args$params$seed <- retry
res <- process.eval.set(0.5, lst_args)
expect_equal(length(intersect(res$idx_train, res$idx_eval)), 0)
expect_equal(length(res$idx_train), 2L)
expect_equal(length(res$idx_eval), 2L)
expect_true(length(intersect(c(1L, 2L), res$idx_train)) >= 1L)
expect_true(length(intersect(c(3L, 4L), res$idx_train)) >= 1L)
}

# With minimal sample leaving some remainder
lst_args <- list(
dmatrix_args = list(
data = matrix(seq(1, 5), ncol = 1L),
label = c(0, 0, 1, 1, 1)
),
metadata = list(
y_levels = c("a", "b")
),
params = list(
seed = 123
)
)
for (retry in seq_len(20)) {
lst_args$params$seed <- retry
res <- process.eval.set(0.4, lst_args)
expect_equal(length(intersect(res$idx_train, res$idx_eval)), 0)
expect_equal(length(res$idx_train), 3L)
expect_equal(length(res$idx_eval), 2L)
expect_true(length(intersect(c(1L, 2L), res$idx_train)) >= 1L)
expect_true(length(intersect(c(3L, 4L, 5L), res$idx_train)) >= 1L)
}
})

test_that("'eval_set' as fraction works", {
y <- iris$Species
x <- iris[, -5L]
model <- xgboost(
x,
y,
base_margin = matrix(0.1, nrow = nrow(x), ncol = 3L),
eval_set = 0.2,
nthreads = 1L,
nrounds = 4L,
max_depth = 2L
)
expect_true(hasName(attributes(model), "evaluation_log"))
evaluation_log <- attributes(model)$evaluation_log
expect_equal(nrow(evaluation_log), 4L)
expect_true(hasName(evaluation_log, "eval_mlogloss"))
expect_equal(length(attributes(model)$metadata$y_levels), 3L)
})
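A further variant one might exercise, not part of this PR's test additions, is `eval_set` passed as row indices; a sketch:

test_that("'eval_set' as indices works", {
  y <- iris$Species
  x <- iris[, -5L]
  idx_eval <- seq(5L, nrow(x), by = 5L)  # hold out every fifth row
  model <- xgboost(
    x,
    y,
    eval_set = idx_eval,
    nthreads = 1L,
    nrounds = 4L,
    max_depth = 2L
  )
  evaluation_log <- attributes(model)$evaluation_log
  expect_equal(nrow(evaluation_log), 4L)
  expect_true(hasName(evaluation_log, "eval_mlogloss"))
})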