XGBoost - Preprocessing Support #69

Open
psxmc6 opened this issue May 20, 2021 · 5 comments

@psxmc6

psxmc6 commented May 20, 2021

Hi Villu,

I would like to ask for advice on the best way to enrich r2pmml-generated xgboost PMML with data preprocessing steps.

As you pointed out in this thread, the model formula interface can't be used in combination with an xgboost model.

So far, I've been leveraging the legacy pmml library to produce PMML snippets for the necessary transformations (using e.g. xform_function, xform_norm_discrete) and then merging the resulting transformation-only PMML with the model-only PMML. Ideally, though, I would like to rely on the r2pmml package exclusively.

The aforementioned pmml package does not support all PMML built-in functions, but it provides a way to define the logic of missing functions in the R environment so that they are recognised when called in xform_function (see section PMML functions not supported by xform_function).

I see the following components:

  1. adding R -> PMML mappers to r2pmml/JPMML-R for all supported built-in functions

  2. adding some intermediate step to inject the result of applying transformations into r2pmml() function so that the converter would incorporate it in the final PMML representation

Could you elaborate on how this could be solved?

Kind regards

@vruusmann
Member

Related to #35, #36

> As you pointed out in this thread, the model formula interface can't be used in combination with an xgboost model.

It's an XGBoost limitation.

You could emulate formula interface like this:

xgb.formula = as.formula(..)
# Transform data.frame
Xt = apply_formula_to_data_frame(X, xgb.formula)
# Train XGBoost using the transformed data frame
xgb.model = xgb(x = Xt, y = label, ...)
# Attach formula to the model
xgb.model$formula = xgb.formula
# Convert to PMML
r2pmml(xgb.model, "xgboost.pmml")

This is the idea behind #36.
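
To make the emulation above more concrete, here is a minimal base-R sketch of what a helper like apply_formula_to_data_frame might do. The helper body, the sample data and the formula are assumptions made for illustration; the thread only names the helper. XGBoost training and the r2pmml() call are left out so that the snippet runs without extra packages.

```r
# Hypothetical helper (the body is an assumption; the thread only names it):
# evaluates a one-sided formula against a data frame and returns a numeric matrix.
apply_formula_to_data_frame <- function(X, formula) {
  mm <- model.matrix(formula, data = X)
  # Drop the intercept column that model.matrix() adds by default
  mm[, colnames(mm) != "(Intercept)", drop = FALSE]
}

# Made-up sample data
X <- data.frame(age = c(23, 45, 67, 31), income = c(10, 20, 30, 40))

# Bin 'age' with base::cut() and pass 'income' through unchanged
xgb.formula <- ~ cut(age, breaks = c(0, 30, 60, 100)) + income

Xt <- apply_formula_to_data_frame(X, xgb.formula)
print(dim(Xt))
```

The resulting numeric matrix is the kind of input that could then be handed to XGBoost, with the formula kept alongside the model as described above.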

> So far, I've been leveraging the legacy pmml library to produce PMML snippets for the necessary transformations

You must be extremely sharp/skilled. I never managed to figure out how to use the legacy pmml package for feature transformations (for integration testing purposes).

> Could you elaborate on how this could be solved?

There needs to be something that the R runtime environment can execute (apply to a data frame) and that can also be serialized to a file in RDS data format, so that the R2PMML converter can see it.

If you do free-form feature engineering in an R script, then it cannot be dumped as a single R object.

However, if you do feature engineering using Tidyverse recipes, then that could be dumped in RDS data format.
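
The requirement described here (one object that R can both apply to a data frame and write to an RDS file) can be sketched in base R. A formula is used below only because it needs no extra packages; that choice, the column names and the data are assumptions, not something the thread prescribes.

```r
# A formula is one example of an object that R can both execute and serialize
prep <- ~ cut(age, breaks = c(0, 30, 60, 100)) + log(income)

# Dump the object to a file in RDS data format, then load it back
path <- tempfile(fileext = ".rds")
saveRDS(prep, path)
restored <- readRDS(path)

# Apply the restored object to a (made-up) data frame
X <- data.frame(age = c(23, 45, 67), income = c(10, 20, 30))
Xt <- model.frame(restored, data = X)
print(Xt)
```

By contrast, a sequence of ad hoc assignments in a script leaves no single object behind that could be saved and re-applied this way.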

@vruusmann
Member

@psxmc6 If the conversion to PMML weren't a problem, then how would you do feature engineering in R? Which package, which functions (for continuous and categorical features)?

The only "limitation" is that the solution must be dumpable into a file in RDS data format, and when loaded back into a clean R environment from the RDS file, must be "complete" - should be executable without much R scripting effort.

@psxmc6
Author

psxmc6 commented May 22, 2021

I don't have anything specific in mind. Please correct me if I am wrong, but in the end, we are limited to what can be used by the list of PMML built-in functions? I found that substring, replace, isIn, matches, if, and, or allow you to express quite a broad range of transformations, and these will likely be available in many different packages.
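
As a rough illustration of the functions listed above, here is how similar operations look in base R. The pairing of each R function with a PMML built-in is an assumption made for this sketch, not a mapping documented by r2pmml, and the semantics are not guaranteed to match exactly (for example, argument conventions and regular expression dialects may differ).

```r
# Made-up sample values
codes <- c("AB-123", "CD-456", "AB-789")

prefix <- substr(codes, 1, 2)            # cf. PMML substring
digits <- gsub("^[A-Z]+-", "", codes)    # cf. PMML replace
is_ab  <- prefix %in% c("AB")            # cf. PMML isIn
has_7  <- grepl("7", codes)              # cf. PMML matches

# cf. PMML if / and / or
flag <- ifelse(is_ab & has_7 | digits == "456", "keep", "drop")
print(flag)
```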

@vruusmann
Member

vruusmann commented May 22, 2021

> .. but in the end, we are limited to what can be used by the list of PMML built-in functions?

Not exactly.

PMML has three functionality/markup layers:

  1. Model
  2. Transformation
  3. Function

We should focus on the middle layer, which consists of elements dedicated to representing feature transformations (the classification is based on operational type):

Only when you cannot solve your problem in the middle layer using the above four elements should you fall back to the lowest layer and start using PMML built-in functions via the Apply element.

@vruusmann
Member

@psxmc6 Challenge rephrased - if you're starting with a raw dataset, and need to perform these middle-level transformations on your data (before sending it to XGBoost), how would you do it in the R language?

For example, you want to bin a continuous feature to categorical. The R2PMML package currently supports the base::cut() function via the formula interface. But as we know, the XGBoost package does not have formula support. What's the alternative?
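
For reference, this is what such a binning step looks like when done by hand, outside any formula interface. The data and break points are made up, and the sketch does not answer the question above: a manual step like this lives only in the script, which is exactly the kind of free-form feature engineering the converter cannot see.

```r
# Made-up sample data and break points
X <- data.frame(age = c(23, 45, 67, 31))
age_breaks <- c(0, 30, 60, 100)

# Continuous -> categorical: bin 'age' into intervals
X$age_bin <- cut(X$age, breaks = age_breaks)

# XGBoost expects numeric input, so expand the factor into indicator columns
onehot <- model.matrix(~ age_bin - 1, data = X)
print(onehot)
```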
