[R] Add predict method for new xgboost()
#11041
Conversation
R-package/R/xgboost.R
Outdated
#' Output will be a numeric matrix of shape `[nrows, nfeatures+1, nfeatures+1]`, or shape
#' `[nrows, nscores, nfeatures+1, nfeatures+1]` (for objectives that produce more than one score
#' per observation).
#' - `"approxinteraction"`: similar to `"interaction"`, but uses a fast approximation for the
I think the `approxcontrib` method was added at the time to demonstrate that the traditional approach has a bias, whereas SHAP doesn't. It's not implemented as an optimization. If I am not mistaken, it exists mostly for academic reasons, as a point of comparison.
We are considering whether it should continue to exist, and we definitely don't want to "advertise" it as a faster alternative.
Removed the approx options.
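For reference, a minimal sketch of what the remaining `"interaction"` prediction type is documented to return. The output shape comes from the docs in the diff above; the call signature of the redesigned `xgboost()` interface and of its predict method is assumed here for illustration, not confirmed by this thread.

```r
library(xgboost)

# Small regression model via the redesigned interface (x/y signature assumed).
data(mtcars)
model <- xgboost(mtcars[, -1], mtcars$mpg, nrounds = 5)

# SHAP interaction values: per the documentation above, the result is an
# array of shape [nrows, nfeatures+1, nfeatures+1], where the extra slot
# holds the bias term.
shap_int <- predict(model, mtcars[, -1], type = "interaction")
dim(shap_int)  # expected: 32 x 11 x 11 for mtcars' 10 predictors
```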
R-package/R/xgb.Booster.R
Outdated
@@ -226,7 +234,7 @@ xgb.get.handle <- function(object) {
 #' - For normal predictions, the dimension is `[nrows, ngroups]`.
 #' - For `predcontrib=TRUE`, the dimension is `[nrows, ngroups, nfeats+1]`.
 #' - For `predinteraction=TRUE`, the dimension is `[nrows, ngroups, nfeats+1, nfeats+1]`.
-#' - For `predleaf=TRUE`, the dimension is `[nrows, niter, ngroups, num_parallel_tree]`.
+#' - For `predleaf=TRUE`, the dimension is `[nrows, niter * ngroups * num_parallel_tree]`.
Is this changed from array to matrix?
It's not reshaped internally anymore: the output now takes the shape as it comes from the C interface, and that flattened matrix is what it currently produces. Before those changes (made in a previous PR), the R code applied its own reshaping logic, which is where the old shape came from, but the docs had not been updated.
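To illustrate, a small sketch of the flattened leaf-index output as documented in the diff above, using `predict.xgb.Booster`. The `[nrows, niter * ngroups * num_parallel_tree]` shape comes from the updated docs; the rest of the setup is illustrative.

```r
library(xgboost)

# Multiclass model so that ngroups > 1.
data(iris)
dtrain <- xgb.DMatrix(as.matrix(iris[, -5]), label = as.numeric(iris$Species) - 1)
bst <- xgb.train(
  params = list(objective = "multi:softprob", num_class = 3),
  data = dtrain,
  nrounds = 4
)

# Leaf indices, in the shape the C interface returns them:
# a matrix of dimension [nrows, niter * ngroups * num_parallel_tree],
# i.e. 150 x (4 * 3 * 1) = 150 x 12 here.
leaf_idx <- predict(bst, dtrain, predleaf = TRUE)
dim(leaf_idx)
```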
@david-cortes Do you think the `strict_shape` parameter is better? (It makes the prediction output a 4-dim array from the C API for `predleaf`.)
Yes, I think it'd be nicer to have the output with 3 or more dimensions for leaf predictions from vector-valued trees in `predict.xgboost` - will update it - but I wouldn't say that having arrays with empty dimensions would be preferable for cases where the output is actually 2D, like binary classification.
For `predict.xgb.Booster`, I'd nevertheless prefer it to stick to what the C interface outputs, so that it behaves the same as the other language bindings and so that changes on the C side would not require changes in the R logic.
Considering that the next release will be a major version update (3.0), perhaps the logic could be changed on the C side to output a 3D array for vector-valued tree leaves instead? It currently does add those extra dimensions for contributions and interactions, for example.
It's hidden in the `strict_shape` parameter of the `predict` method. It's available for all prediction inputs (CSR, DMatrix, ...). We can make it the default for the R package if needed.
But yes, it might output an empty dimension. From my perspective it's easier to index: one doesn't have to worry about the details of the model (I have a fear of array indexing with conditional checks and calls to `np.squeeze`, `reshape(-1)`, `np.newaxis`, etc., and always wish the input to my code had a deterministic shape).
This is opinionated, but I'm open to change. I would love to learn from your experience about when varying the dimensions based on the model makes things easier or less confusing.
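As an illustration of the trade-off under discussion, a sketch contrasting the default output with `strict_shape = TRUE` on a multiclass booster. The `[nrows, ngroups]` default shape is from the docs above; how the strict output orders its axes is not stated in this thread, so only its dimensionality is inspected here.

```r
library(xgboost)

data(iris)
dtrain <- xgb.DMatrix(as.matrix(iris[, -5]), label = as.numeric(iris$Species) - 1)
bst <- xgb.train(
  params = list(objective = "multi:softprob", num_class = 3),
  data = dtrain,
  nrounds = 4
)

# Default: the shape depends on the model, here [nrows, ngroups] = 150 x 3.
dim(predict(bst, dtrain))

# strict_shape = TRUE: every dimension is always present, even dimensions
# of length 1, so the shape no longer depends on the model type.
dim(predict(bst, dtrain, strict_shape = TRUE))
```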
I would prefer to keep `strict_shape=FALSE` as the default. I find it confusing to get arrays with dimensions that are not applicable to the model type that one is using.
`strict_shape=TRUE` would also not be in line with how other packages behave - for example, VGAM determines the output dimensions conditionally on the `family` argument that the model used, returning 1D for binary classification and 2D for multi-class classification.
Although it looks like reshaping higher-dim arrays to 2D might actually be more common - taking VGAM again as a reference, it squeezes to 2D when predicting `type="terms"` for multinomial logistic regression, for example, and the same goes for other packages like MGLM - I'd nevertheless still prefer to keep XGBoost's outputs as higher-dim arrays when they represent multiple dimensions.
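To make the model-dependent shapes concrete, a sketch of what the proposed default (`strict_shape = FALSE`) returns for binary vs. multiclass models. The `[nrows, ngroups]` rule is from the documented behavior earlier in this thread; the rest is illustrative.

```r
library(xgboost)

data(iris)
X <- as.matrix(iris[, -5])

# Binary model: predictions come back as a plain numeric vector of length nrows.
bst_bin <- xgb.train(
  params = list(objective = "binary:logistic"),
  data = xgb.DMatrix(X, label = as.numeric(iris$Species == "setosa")),
  nrounds = 4
)
length(predict(bst_bin, X))  # 150, with no trailing dimension of size 1

# Multiclass model: predictions come back as an [nrows, ngroups] matrix.
bst_mc <- xgb.train(
  params = list(objective = "multi:softprob", num_class = 3),
  data = xgb.DMatrix(X, label = as.numeric(iris$Species) - 1),
  nrounds = 4
)
dim(predict(bst_mc, X))  # 150 x 3
```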
Changed it to produce multi-dimensional outputs for leaf predictions and updated the docs accordingly, for both `predict.xgboost` and `predict.xgb.Booster`.
ref #9810
closes #7947
closes #7935
closes #7906
This PR adds a predict method for the `xgboost` model class as produced by the redesigned `xgboost()` function. It's mostly just a wrapper over the old `predict.xgb.Booster`, with the difference that it takes a `type` argument to control which prediction type to make.
Along the way, it also adds docs for `print.xgboost`, which were missing, and makes some corrections to the docs of the outputs from `predict.xgb.Booster` to reflect current behavior.
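A short usage sketch of the new method. The `type = "interaction"` value appears in the documentation diff above; the exact `xgboost()` call signature and accepted `newdata` types are assumptions for illustration.

```r
library(xgboost)

# Fit a model with the redesigned interface (x/y signature assumed).
data(mtcars)
model <- xgboost(mtcars[, -1], mtcars$mpg, nrounds = 10)

# The new predict method uses a single `type` argument to select the kind of
# prediction, rather than the boolean flags of predict.xgb.Booster.
p_default  <- predict(model, mtcars[, -1])                        # regular predictions
p_interact <- predict(model, mtcars[, -1], type = "interaction")  # SHAP interaction values
```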