Support dataframe data format in native XGBoost. #9828
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -19,7 +19,8 @@ | |
#' @param missing a float value to represents missing values in data (used only when input is a dense matrix). | ||
#' It is useful when a 0 or some other extreme value represents missing values in data. | ||
#' @param silent whether to suppress printing an informational message after loading from a file. | ||
#' @param feature_names Set names for features. | ||
#' @param feature_names Set names for features. Overrides column names in data | ||
#' frame and matrix. | ||
#' @param nthread Number of threads used for creating DMatrix. | ||
#' @param group Group size for all ranking group. | ||
#' @param qid Query ID for data samples, used for ranking. | ||
|
@@ -32,6 +33,8 @@ | |
#' If a DMatrix gets serialized and then de-serialized (for example, when saving data in an R session or caching | ||
#' chunks in an Rmd file), the resulting object will not be usable anymore and will need to be reconstructed | ||
#' from the original source of data. | ||
#' @param enable_categorical Experimental support of specializing for | ||
#' categorical features. JSON/UBJSON serialization format is required. | ||
#' | ||
#' @examples | ||
#' data(agaricus.train, package='xgboost') | ||
|
@@ -58,19 +61,28 @@ xgb.DMatrix <- function( | |
qid = NULL, | ||
label_lower_bound = NULL, | ||
label_upper_bound = NULL, | ||
feature_weights = NULL | ||
feature_weights = NULL, | ||
enable_categorical = FALSE | ||
) { | ||
if (!is.null(group) && !is.null(qid)) { | ||
stop("Either one of 'group' or 'qid' should be NULL") | ||
} | ||
cnames <- NULL | ||
ctypes <- NULL | ||
if (typeof(data) == "character") { | ||
if (length(data) > 1) | ||
stop("'data' has class 'character' and length ", length(data), | ||
".\n 'data' accepts either a numeric matrix or a single filename.") | ||
if (length(data) > 1) { | ||
stop( | ||
"'data' has class 'character' and length ", length(data), | ||
".\n 'data' accepts either a numeric matrix or a single filename." | ||
) | ||
} | ||
data <- path.expand(data) | ||
handle <- .Call(XGDMatrixCreateFromFile_R, data, as.integer(silent)) | ||
} else if (is.matrix(data)) { | ||
handle <- .Call(XGDMatrixCreateFromMat_R, data, missing, as.integer(NVL(nthread, -1))) | ||
handle <- .Call( | ||
XGDMatrixCreateFromMat_R, data, missing, as.integer(NVL(nthread, -1)) | ||
) | ||
cnames <- colnames(data) | ||
|
||
} else if (inherits(data, "dgCMatrix")) { | ||
handle <- .Call( | ||
XGDMatrixCreateFromCSC_R, | ||
|
@@ -103,6 +115,40 @@ xgb.DMatrix <- function( | |
missing, | ||
as.integer(NVL(nthread, -1)) | ||
) | ||
} else if (is.data.frame(data)) { | ||
ctypes <- sapply(data, function(x) { | ||
if (is.factor(x)) { | ||
if (!enable_categorical) { | ||
stop( | ||
"When factor type is used, the parameter `enable_categorical`", | ||
" must be set to TRUE." | ||
) | ||
} | ||
"c" | ||
} else if (is.integer(x)) { | ||
"int" | ||
} else if (is.logical(x)) { | ||
"i" | ||
Question there: I see this is also being used in the pandas adapter. In R, boolean (logical) types are represented as C I see you mention in a comment later that these get converted to numeric type, but the C++ code still checks for integer/logical-typed columns. What would happen with these missing values encoded as -INT_MAX if the columns are supplied in their original types?
As suggested by the comments in C++, those C++ handling code is not used but is more or less a reminder that we should try to avoid data transformation in R. I think the previous reply might help with the
I can remove the code if it's hindering readability
I actually was thinking something along the lines that using
Thank you for the suggestion, I removed the
||
} else { | ||
if (!is.numeric(x)) { | ||
stop("Invalid type in dataframe.") | ||
} | ||
"float" | ||
} | ||
}) | ||
## as.data.frame somehow converts integer/logical into real. | ||
data <- as.data.frame(sapply(data, function(x) { | ||
I think
Ideally, we would like to handle different types of columns independently without any coercion, and hence without any data copying. However, at the moment only cuDF input can be consumed in this way, due to missing value handling. R uses sentinel values to indicate missing/NA, while XGBoost can't have more than one missing value indicator at the moment. As a result, a DF containing a float column and an integer column with NAs can confuse XGBoost about which value it should eliminate. Is it
cuDF uses the arrow IPC format as its memory layout and exposes it as part of the API. Missing values are represented by a bitmask, so we can handle all the columns without any transformation (except for categorical encoding).
||
if (is.factor(x)) { | ||
## XGBoost uses 0-based indexing. | ||
as.numeric(x) - 1 | ||
} else { | ||
x | ||
} | ||
})) | ||
handle <- .Call( | ||
XGDMatrixCreateFromDF_R, data, missing, as.integer(NVL(nthread, -1)) | ||
) | ||
cnames <- colnames(data) | ||
} else { | ||
stop("xgb.DMatrix does not support construction from ", typeof(data)) | ||
} | ||
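The missing-value point raised in the review comments above can be sketched in C. This is only an illustration, not the code path the PR uses: the function name and the `SKETCH_NA_INTEGER` macro are hypothetical, and the sketch assumes R's convention of storing an integer NA as `INT_MIN`. It shows why an integer column with NAs is copied into a floating-point column, so that NaN becomes the single missing-value indicator.

```c
#include <assert.h>
#include <limits.h>
#include <math.h>
#include <stddef.h>

/* Assumption: R stores NA_integer_ as INT_MIN. The macro name is local to this sketch. */
#define SKETCH_NA_INTEGER INT_MIN

/* Copy an integer column into a double column, mapping the NA sentinel to NaN.
 * Hypothetical helper; it is not part of XGBoost or of the R API. */
void int_column_to_double(const int *src, double *dst, size_t n) {
  assert(src != NULL && dst != NULL);
  for (size_t i = 0; i < n; ++i) {
    dst[i] = (src[i] == SKETCH_NA_INTEGER) ? NAN : (double)src[i];
  }
}
```

After this conversion every column shares NaN as its missing marker, which is the trade-off described above: one indicator for the whole table, at the cost of a copy.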
|
@@ -119,7 +165,11 @@ xgb.DMatrix <- function( | |
if (!is.null(base_margin)) { | ||
setinfo(dmat, "base_margin", base_margin) | ||
} | ||
if (!is.null(cnames)) { | ||
setinfo(dmat, "feature_name", cnames) | ||
} | ||
if (!is.null(feature_names)) { | ||
## override cnames | ||
setinfo(dmat, "feature_name", feature_names) | ||
} | ||
if (!is.null(group)) { | ||
|
@@ -137,6 +187,9 @@ xgb.DMatrix <- function( | |
if (!is.null(feature_weights)) { | ||
setinfo(dmat, "feature_weights", feature_weights) | ||
} | ||
if (!is.null(ctypes)) { | ||
setinfo(dmat, "feature_type", ctypes) | ||
|
||
} | ||
|
||
return(dmat) | ||
} | ||
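The column-type handling in the data frame branch of this function can be restated as a small C sketch. The type codes ("c", "int", "i", "float"), the `enable_categorical` guard, and the subtraction of 1 from factor codes come from the diff; the enum and function names are hypothetical and exist only for this illustration. A `NULL` return stands in for the `stop()` calls in the R code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical column kinds for this sketch; not part of the XGBoost API. */
typedef enum { COL_FACTOR, COL_INTEGER, COL_LOGICAL, COL_NUMERIC, COL_OTHER } ColumnKind;

/* Return the feature type code used in the diff, or NULL when the column is rejected. */
const char *feature_type_code(ColumnKind kind, bool enable_categorical) {
  switch (kind) {
    case COL_FACTOR:  return enable_categorical ? "c" : NULL; /* guarded by the flag */
    case COL_INTEGER: return "int";
    case COL_LOGICAL: return "i";
    case COL_NUMERIC: return "float";
    default:          return NULL; /* "Invalid type in dataframe." in the R code */
  }
}

/* R factor codes start at 1; the diff subtracts 1 to obtain 0-based category codes. */
double zero_based_code(int r_factor_code) {
  assert(r_factor_code >= 1);
  return (double)(r_factor_code - 1);
}
```

The guard on the factor case mirrors the intent stated in the review: a factor column is rejected unless the caller opts in, so categorical handling never switches on silently.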
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -159,6 +159,16 @@ XGB_DLL int XGDMatrixCreateFromURI(char const *config, DMatrixHandle *out); | |
XGB_DLL int XGDMatrixCreateFromCSREx(const size_t *indptr, const unsigned *indices, | ||
const float *data, size_t nindptr, size_t nelem, | ||
size_t num_col, DMatrixHandle *out); | ||
/** | ||
* @brief Create a DMatrix from columnar data. (table) | ||
* | ||
* @param data See @ref XGBoosterPredictFromColumnar for details. | ||
* @param config See @ref XGDMatrixCreateFromDense for details. | ||
Something I'm wondering here: if this config already conveys the information about whether a column has integer type, is it actually needed to make a distinction between
The columns don't convey the information accurately since we need to do some transformations before passing them into XGBoost. For instance, if a column is integer with missing values, we have to use float with NaN as an approximation.
||
* @param out The created dmatrix. | ||
* | ||
* @return 0 when success, -1 when failure happens | ||
*/ | ||
XGB_DLL int XGDMatrixCreateFromColumnar(char const *data, char const *config, DMatrixHandle *out); | ||
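One way to read "approximate" in the reply above, offered here as an assumption rather than as the author's stated meaning: once integer values reach 32-bit float storage, integers above 2^24 can no longer all be represented exactly. The R path in this diff passes doubles, which hold every 32-bit integer exactly, so the loss below only applies to single-precision storage. The helper name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Round-trip a 32-bit integer through a 32-bit float.
 * Hypothetical helper for illustration only. */
int32_t through_float(int32_t v) {
  /* Stay clear of values that would round up to 2^31 and overflow on the way back. */
  assert(v <= 2147483520);
  volatile float f = (float)v; /* volatile forces a real single-precision store */
  return (int32_t)f;
}
```

Values up to 2^24 (16777216) survive the round trip unchanged, while 16777217 does not, which is why the declared column type and the stored values can disagree after the transformation.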
|
||
/** | ||
* @example c-api-demo.c | ||
|
@@ -514,6 +524,16 @@ XGB_DLL int | |
XGProxyDMatrixSetDataCudaArrayInterface(DMatrixHandle handle, | ||
const char *c_interface_str); | ||
|
||
/** | ||
* @brief Set columnar (table) data on a DMatrix proxy. | ||
* | ||
* @param handle A DMatrix proxy created by @ref XGProxyDMatrixCreate | ||
* @param c_interface_str See @ref XGBoosterPredictFromColumnar for details. | ||
* | ||
* @return 0 when success, -1 when failure happens | ||
*/ | ||
XGB_DLL int XGProxyDMatrixSetDataCudaColumnar(DMatrixHandle handle, char const *c_interface_str); | ||
|
||
/*! | ||
* \brief Set data on a DMatrix proxy. | ||
* | ||
|
@@ -1113,6 +1133,31 @@ XGB_DLL int XGBoosterPredictFromDense(BoosterHandle handle, char const *values, | |
* @example inference.c | ||
*/ | ||
|
||
/** | ||
* @brief Inplace prediction from CPU columnar data. (Table) | ||
* | ||
* @note If the booster is configured to run on a CUDA device, XGBoost falls back to run | ||
* prediction with DMatrix with a performance warning. | ||
* | ||
* @param handle Booster handle. | ||
* @param values An JSON array of __array_interface__ for each column. | ||
* @param config See @ref XGBoosterPredictFromDMatrix for more info. | ||
* Additional fields for inplace prediction are: | ||
* - "missing": float | ||
* @param m An optional (NULL if not available) proxy DMatrix instance | ||
* storing meta info. | ||
* | ||
* @param out_shape See @ref XGBoosterPredictFromDMatrix for more info. | ||
* @param out_dim See @ref XGBoosterPredictFromDMatrix for more info. | ||
* @param out_result See @ref XGBoosterPredictFromDMatrix for more info. | ||
* | ||
* @return 0 when success, -1 when failure happens | ||
*/ | ||
XGB_DLL int XGBoosterPredictFromColumnar(BoosterHandle handle, char const *array_interface, | ||
char const *c_json_config, DMatrixHandle m, | ||
bst_ulong const **out_shape, bst_ulong *out_dim, | ||
const float **out_result); | ||
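The `values` argument documented above is a JSON array holding one `__array_interface__` object per column. The following is a minimal sketch of how a caller might build that string in C. It assumes the field layout of NumPy's array interface version 3 and contiguous little-endian `double` columns ("<f8"); the function name is hypothetical, and the diff does not say whether XGBoost expects any further fields.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Write a JSON array with one __array_interface__ object per column into buf.
 * Returns the number of characters written, or -1 if buf is too small.
 * Assumption: each column is a contiguous little-endian double array ("<f8").
 * Hypothetical helper; not part of the XGBoost C API. */
int columns_to_json(const double *const *cols, size_t n_cols, size_t n_rows,
                    char *buf, size_t buf_len) {
  assert(cols != NULL && buf != NULL);
  size_t used = 0;
  int n = snprintf(buf, buf_len, "[");
  if (n < 0 || (size_t)n >= buf_len) return -1;
  used += (size_t)n;
  for (size_t j = 0; j < n_cols; ++j) {
    /* "data" holds the column address and a read-only flag. */
    n = snprintf(buf + used, buf_len - used,
                 "%s{\"data\": [%zu, true], \"shape\": [%zu], "
                 "\"typestr\": \"<f8\", \"version\": 3}",
                 j == 0 ? "" : ", ", (size_t)(uintptr_t)cols[j], n_rows);
    if (n < 0 || (size_t)n >= buf_len - used) return -1;
    used += (size_t)n;
  }
  n = snprintf(buf + used, buf_len - used, "]");
  if (n < 0 || (size_t)n >= buf_len - used) return -1;
  used += (size_t)n;
  return (int)used;
}
```

The resulting string only carries pointers and shapes, so the column buffers must stay alive and unchanged for as long as the library reads them.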
|
||
/** | ||
* \brief Inplace prediction from CPU CSR matrix. | ||
* | ||
|
Question: do I understand it correctly that this parameter is only used to auto-detect categorical features from data frames, but would otherwise play no role if e.g. the user were to manually set this field in the DMatrix later through setinfo, for example?
If so, how about renaming it to 'autodetect_categorical' or something along those lines? (both in the R and Python interfaces) It would also be ideal to describe it a bit more in the docs (e.g. that it's only for data frames).
Correct. It's more or less a guard to prevent surprises, since XGBoost didn't accept categorical data before, and suddenly accepting it might silently cause issues.
I don't have a strong preference on the naming. We have an introductory document for categorical data in the tutorials; feel free to add additional explanation.