
[R] Add predict method for new xgboost() #11041

Merged

merged 7 commits into dmlc:master on Dec 4, 2024

Conversation

david-cortes (Contributor) commented on Dec 1, 2024

ref #9810

closes #7947
closes #7935
closes #7906

This PR adds a predict method for the xgboost model class as produced by the redesigned xgboost() function.

It's mostly a wrapper over the old predict.xgb.Booster, with the following differences:

  • It adds more metadata to the outputs.
  • It uses the more idiomatic type argument to control which type of prediction to produce.
  • It works only with R objects as inputs for data.
  • It leaves out prediction arguments that aren't meant for a higher-level interface, like the "training" argument which is only used for internal calls.

Along the way, it also adds docs for print.xgboost, which was previously undocumented, and corrects the documented output shapes of predict.xgb.Booster to reflect current behavior.
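
A minimal sketch of the intended calling pattern, assuming the redesigned xgboost() interface accepts matrix or data-frame inputs and that "response" is among the accepted type values (names here are illustrative; the merged docs are authoritative):

```r
library(xgboost)

## Fit a model through the redesigned high-level interface
## (plain R objects as inputs rather than an xgb.DMatrix).
x <- as.matrix(mtcars[, -1])
y <- mtcars$mpg
model <- xgboost(x, y, nrounds = 10)

## The new method exposes a single `type` argument instead of the
## pred* flags of predict.xgb.Booster.
pred <- predict(model, x, type = "response")
head(pred)
```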

#' Output will be a numeric matrix of shape `[nrows, nfeatures+1, nfeatures+1]`, or shape
#' `[nrows, nscores, nfeatures+1, nfeatures+1]` (for objectives that produce more than one score
#' per observation).
#' - `"approxinteraction"`: similar to `"interaction"`, but uses a fast approximation for the
trivialfis (Member) commented on Dec 2, 2024


I think the approxcontrib method was added at the time to demonstrate that the traditional approach has a bias, whereas SHAP doesn't. It's not implemented as an optimization. If I am not mistaken, it's mostly for academic reasons to have a comparison.
We are considering whether it should continue to exist, and we definitely don't want to "advertise" it as a faster alternative.
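
For reference, the comparison described here can be reproduced through the lower-level predict.xgb.Booster interface, which documents both predcontrib and approxcontrib; the dataset and parameters below are only illustrative:

```r
library(xgboost)

data(agaricus.train, package = "xgboost")
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
bst <- xgb.train(params = list(objective = "binary:logistic"),
                 data = dtrain, nrounds = 5)

## SHAP-based feature contributions (one column per feature plus a bias column)
shap_contrib <- predict(bst, dtrain, predcontrib = TRUE)

## The approximate attribution discussed above
approx_contrib <- predict(bst, dtrain, predcontrib = TRUE, approxcontrib = TRUE)

## Discrepancy between the two attributions
summary(as.vector(shap_contrib - approx_contrib))
```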

Contributor Author

Removed the approx options.

@@ -226,7 +234,7 @@ xgb.get.handle <- function(object) {
#' - For normal predictions, the dimension is `[nrows, ngroups]`.
#' - For `predcontrib=TRUE`, the dimension is `[nrows, ngroups, nfeats+1]`.
#' - For `predinteraction=TRUE`, the dimension is `[nrows, ngroups, nfeats+1, nfeats+1]`.
-#' - For `predleaf=TRUE`, the dimension is `[nrows, niter, ngroups, num_parallel_tree]`.
+#' - For `predleaf=TRUE`, the dimension is `[nrows, niter * ngroups * num_parallel_tree]`.
Member

Is this changed from array to matrix?

Contributor Author

It's not reshaped internally anymore: the output now keeps whatever shape comes from the C interface, which is currently the flattened one. Before those changes (made in a previous PR), the R wrapper applied its own reshaping logic, which is what the old docs described; the docs just hadn't been updated.

Member

@david-cortes Do you think the strict_shape parameter is better? (It makes the prediction output a 4-dim array from the C API for predleaf.)

Contributor Author

Yes, I think it'd be nicer to have the output with 3 or more dimensions for leaf predictions from vector-valued trees in predict.xgboost - will update it - but I wouldn't say that having arrays with empty dimensions would be preferable for cases where the output is actually 2D, like binary classification.

For predict.xgb.Booster, I'd nevertheless prefer it to stick to what the C interface outputs, so that it behaves the same as the other language bindings and so that changes on the C side don't require changes in the R logic.

Considering that the next release will be a major version update (3.0), perhaps the C side could be changed to output a 3D array for the leaves of vector-valued trees instead? It currently does add those extra dimensions for contributions and interactions, for example.
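
As a rough illustration of the shapes under discussion (using the documented strict_shape argument of predict.xgb.Booster; the actual dimension sizes depend on the model's niter, ngroups and num_parallel_tree):

```r
library(xgboost)

data(agaricus.train, package = "xgboost")
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
bst <- xgb.train(params = list(objective = "binary:logistic"),
                 data = dtrain, nrounds = 5)

## Default: leaf indices come back in the flattened layout
## [nrows, niter * ngroups * num_parallel_tree]
leaf_flat <- predict(bst, dtrain, predleaf = TRUE)
dim(leaf_flat)

## strict_shape = TRUE: the full multi-dimensional array as reported by the C API
leaf_strict <- predict(bst, dtrain, predleaf = TRUE, strict_shape = TRUE)
dim(leaf_strict)
```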

Member

It's hidden behind the strict_shape parameter of the predict method, and it's available for all of the predict variants (CSR, DMatrix, ...). We can make it the default for the R package if needed.

But yes, it might output an empty dimension. From my perspective that's easier to index, since you don't have to worry about the details of the model (I'm wary of array indexing that requires conditionally calling np.squeeze, reshape(-1), np.newaxis, etc.; I always wish the inputs to my code had a deterministic shape).

This is opinionated, but I'm open to change. I'd love to hear about your experience of when varying the dimensions based on the model makes things easier or less confusing.

Contributor Author

I would prefer to keep strict_shape=FALSE as the default. I find it confusing to get arrays with dimensions that are not applicable to the given model type that one is using.

strict_shape=TRUE would also not be in line with how other packages behave - for example, VGAM determines the output dimensions conditionally on the family argument that the model used, returning 1D for binary classification and 2D for multi-class classification.

Although it looks like reshaping higher-dimensional arrays to 2D might actually be more common - taking VGAM again as a reference, it squeezes to 2D when predicting type="terms" for multinomial logistic, for example, and other packages like MGLM do the same. I'd nevertheless still prefer to keep XGBoost's outputs as higher-dimensional arrays when they genuinely represent multiple dimensions.
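
As a small sketch of the shape-varies-with-model behaviour being argued for here (default strict_shape = FALSE; objectives and data are chosen only for illustration):

```r
library(xgboost)

x <- as.matrix(iris[, 1:4])

## Binary objective: no separate class dimension in the output
d_bin <- xgb.DMatrix(x, label = as.numeric(iris$Species == "setosa"))
bst_bin <- xgb.train(params = list(objective = "binary:logistic"),
                     data = d_bin, nrounds = 5)
str(predict(bst_bin, d_bin))

## Multi-class objective: the output gains a class dimension ([nrows, ngroups])
d_multi <- xgb.DMatrix(x, label = as.integer(iris$Species) - 1L)
bst_multi <- xgb.train(params = list(objective = "multi:softprob",
                                     num_class = 3),
                       data = d_multi, nrounds = 5)
str(predict(bst_multi, d_multi))
```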

david-cortes (Contributor Author)

Changed it to produce multi-dimensional outputs for leaf predictions and updated the docs accordingly, for both predict.xgboost and predict.xgb.Booster.

trivialfis merged commit d5693bd into dmlc:master on Dec 4, 2024.
27 of 31 checks passed.
Successfully merging this pull request may close these issues:

  • [R] predict doesn't return probabilities for multi:softmax
  • [RFC] Making R interface more idiomatic