XGBoost - Preprocessing Support #69
Comments
It's an XGBoost limitation. You could emulate the formula interface like this:
xgb.formula = as.formula(..)
# Transform data.frame
Xt = apply_formula_to_data_frame(X, xgb.formula)
# Train XGBoost using the transformed data frame
xgb.model = xgb(x = Xt, y = label, ...)
# Attach formula to the model
xgb.model$formula = xgb.formula
# Convert to PMML
r2pmml(xgb.model, "xgboost.pmml")
This is the idea behind #36.
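The `apply_formula_to_data_frame` step above is pseudocode. As a minimal sketch of what it could look like, assuming base R's `model.matrix()` is an acceptable stand-in (the data frame, its column names, and the formula are made up for illustration, and nothing here is confirmed to be what R2PMML expects):

```r
# Hypothetical toy data; column names are invented for this sketch.
X <- data.frame(
  age    = c(23, 41, 35, 58),
  income = c(1200, 3400, 2800, 4100),
  city   = factor(c("A", "B", "A", "C"))
)

# One-sided formula describing the feature transformations.
# "- 1" drops the intercept column.
xgb.formula <- as.formula("~ age + log(income) + city - 1")

# model.matrix() evaluates the formula against the data frame and
# returns a numeric matrix with dummy-coded factor levels.
Xt <- model.matrix(xgb.formula, data = X)

print(colnames(Xt))
print(dim(Xt))
```

The resulting numeric matrix is the kind of input XGBoost can train on, while the formula object itself remains available to attach to the model.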
You must be extremely sharp/skilled. I never managed to figure out how to use the legacy
There needs to be something that the R runtime environment can execute (apply to a data frame) and that can also be serialized as an RDS data format file, so that the R2PMML converter can see it. If you do free-form feature engineering in an R script, then it cannot be dumped as a single R object. However, if you do feature engineering using Tidyverse recipes, then that could be dumped in RDS data format.
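To illustrate the "single R object" requirement, here is a minimal sketch using only base R. It assumes a plain list as the container, and `lm()` stands in for XGBoost purely to keep the example dependency-free; the names `bundle` and `restored` are hypothetical.

```r
# Bundle a fitted model together with the formula that produced its inputs.
df <- data.frame(x = c(1, 2, 3, 4, 5), y = c(2.1, 3.9, 6.2, 7.8, 10.1))
bundle <- list(
  formula = y ~ log(x),
  model   = lm(y ~ log(x), data = df)
)

# Dump the whole bundle into one RDS file.
path <- tempfile(fileext = ".rds")
saveRDS(bundle, path)

# Restore it, as a clean session would, and reapply the stored formula.
restored <- readRDS(path)
Xt <- model.matrix(restored$formula, data = df)
print(colnames(Xt))
```

Because the formula travels inside the serialized object, the transformation can be replayed after loading without rerunning the original script.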
@psxmc6 If the conversion to PMML weren't a problem, then how would you do feature engineering in R? Which package, which functions (for continuous and categorical features)? The only "limitation" is that the solution must be dumpable into a file in RDS data format and, when loaded back into a clean R environment from the RDS file, must be "complete": executable without much R scripting effort.
So I don't have anything specific in mind. Please correct me if I am wrong, but in the end, are we not limited to what can be expressed with the list of PMML built-in functions? I found that substring, replace, isIn, matches, if, and, or allow you to express quite a broad range of transformations, and these will likely be available in many different packages.
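For orientation, the built-in functions named above have rough base R counterparts. The pairing below is an assumption made for illustration only; it is not a statement of what any converter actually maps, and the sample values are invented.

```r
# Hypothetical raw column; values are made up for this sketch.
code <- c("AB-123", "CD-456", "AB-789", "zz")

prefix   <- substr(code, 1, 2)                 # cf. PMML "substring"
cleaned  <- gsub("-", "", code, fixed = TRUE)  # cf. PMML "replace"
is_known <- prefix %in% c("AB", "CD")          # cf. PMML "isIn"
has_num  <- grepl("[0-9]+$", code)             # cf. PMML "matches"

# cf. PMML "if" combined with "and"
label <- ifelse(is_known & has_num, "valid", "other")
print(label)
```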
Not exactly. PMML has three functionality/markup layers:
We should focus on the middle layer, which are elements dedicated to representing feature transformations (the classification is based on operational type):
Only when you cannot solve your problem in the middle layer using the above four elements should you fall back to the lowest level and start using PMML built-in functions via the Apply element.
@psxmc6 Challenge rephrased: if you're starting with a raw dataset and need to perform these middle-level transformations on your data (before sending it to XGBoost), how would you do it in the R language? For example, you want to bin a continuous feature to a categorical one. The R2PMML package currently supports
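As a concrete picture of the binning example raised in this comment, one plain base R option is `cut()`. This is only a sketch of the transformation itself, with made-up values, break points, and labels; it makes no claim about which functions the converter recognises.

```r
# Hypothetical continuous feature.
age <- c(5, 17, 18, 42, 65, 90)

# Bin into three categories using left-closed intervals:
# [0, 18), [18, 65), [65, Inf)
age_band <- cut(
  age,
  breaks = c(0, 18, 65, Inf),
  labels = c("minor", "adult", "senior"),
  right  = FALSE
)

print(table(age_band))
```

Conceptually this is the continuous-to-categorical mapping that PMML represents with its Discretize element.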
Hi Villu,
I would like to ask for advice on the best way to enrich r2pmml-generated xgboost PMML with data preprocessing steps.
As you pointed out in this thread, the model formula interface can't be used in combination with an xgboost model.
So far, I've been using the legacy pmml library to produce PMML snippets for the necessary transformations (using e.g. xform_function, xform_norm_discrete), and then merging the resulting transformation-only PMML with the model-only PMML. Ideally, though, I would like to rely on the r2pmml package exclusively.
The aforementioned pmml package does not support all PMML built-in functions, but provides a way to define missing functions' logic in R environment so that they will be recognised when called in xform_function (see section PMML functions not supported by xform_function).
I would see the following components:
adding R -> PMML mappers to r2pmml/JPMML-R for all supported built-in functions
adding an intermediate step that injects the result of applying the transformations into the r2pmml() function, so that the converter incorporates it in the final PMML representation
Could you elaborate on how this could be solved?
Kind regards