Error on a pipeline with OneHotEncoder and xgboost #22

Open
Hao-Jiang opened this issue Jun 29, 2022 · 2 comments

Hao-Jiang commented Jun 29, 2022

Hello,

I trained a PMMLPipeline with OneHotEncoder and XGBClassifier using the following code snippet.

from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import OneHotEncoder
from sklearn2pmml import sklearn2pmml, PMMLPipeline
from xgboost.sklearn import XGBClassifier


mapper = DataFrameMapper(
    [(col, None) for col in numerical_cols] +
    [([col], OneHotEncoder(handle_unknown='ignore')) for col in categorical_cols]
)

pipeline = PMMLPipeline(
    steps=[
        ('mapper', mapper),
        ('classifier', XGBClassifier())
    ]
)

pipeline.fit(X, y)

The pipeline seemed to work, and I was able to use it to make predictions.
But I got an error when I tried to convert the pipeline into a PMML file:
sklearn2pmml(pipeline, "testing.pmml", with_repr=True)

Standard error:
Exception in thread "main" org.jpmml.model.MissingAttributeException: Required attribute Value@value is not defined
	at org.dmg.pmml.Value.requireValue(Value.java:67)
	at org.jpmml.converter.PMMLUtil.getValues(PMMLUtil.java:139)
	at org.jpmml.converter.PMMLUtil.getValues(PMMLUtil.java:124)
	at org.jpmml.converter.CategoricalFeature.<init>(CategoricalFeature.java:35)
	at org.jpmml.converter.WildcardFeature.toCategoricalFeature(WildcardFeature.java:61)
	at sklearn.preprocessing.MultiOneHotEncoder.encodeFeatures(MultiOneHotEncoder.java:118)
	at sklearn.Transformer.encode(Transformer.java:69)
	at sklearn_pandas.DataFrameMapper.encodeFeatures(DataFrameMapper.java:67)
	at sklearn.Transformer.encode(Transformer.java:69)
	at sklearn.Composite.encodeFeatures(Composite.java:119)
	at sklearn2pmml.pipeline.PMMLPipeline.encodePMML(PMMLPipeline.java:212)
	at com.sklearn2pmml.Main.run(Main.java:84)
	at com.sklearn2pmml.Main.main(Main.java:62)

Can someone give me some advice on what I might have done wrong? Thanks.


vruusmann commented Jun 30, 2022

> I trained a PMMLPipeline with OneHotEncoder and XGBClassifier using the following code snippet.

First of all - what is your XGBoost package version?

If you upgrade to XGBoost 1.5.X or newer, you will be able to use XGBoost's new native One-Hot-Encoding (OHE) support. It is much more memory efficient than an external OneHotEncoder step, especially when dealing with sparse features.

Even better, consider upgrading to XGBoost 1.6.X or newer, which lets you use XGBoost's new native multi-category categorical splits.

So, please upgrade your XGBoost package (and the SkLearn2PMML package as well!) to the latest, and simplify your Scikit-Learn pipeline to the following:

mapper = DataFrameMapper(
    [(col, None) for col in numerical_cols] +
    [([col], None) for col in categorical_cols]
)

> The pipeline seemed to work and I was able to use it to do predictions.

Just a sidenote: Scikit-Learn is willing to fit all kinds of pipelines, without checking whether the sequence of computational steps makes any sense. As long as the number of columns matches, you'll get predictions.

However, the Scikit-Learn to PMML converter tries to understand the logic of each computational step. If something does not make sense to it, it will complain (e.g. by raising an exception). You should heed those complaints, and try to make your pipeline more information-rich.

> I got an error when I tried to turn the pipeline into a pmml file

Exception in thread "main" org.jpmml.model.MissingAttributeException: Required attribute Value@value is not defined
  at org.dmg.pmml.Value.requireValue(Value.java:67)

Looks like the converter was unable to figure out the list of category values for some categorical feature.

Internal note - it's interesting that the converter is complaining about a missing DataField/Value@value attribute, and not about a missing DataField/Value element itself.

Could it be that your dataset contains a column with a None or float("NaN") category level? This is one plausible scenario in which a DataField/Value element ends up with its @value attribute omitted (filtered out as a placeholder for a missing value).
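
One way to test this hypothesis before running the converter is to count None and NaN entries in each categorical column. The following is a minimal sketch, not part of SkLearn2PMML: the helper name `find_missing_levels` and the toy data are hypothetical. It only relies on `columns[col]` being iterable, so it should accept a plain dict of lists as well as a pandas DataFrame. It does not detect other missing-value markers such as `pandas.NA`.

```python
def find_missing_levels(columns, categorical_cols):
    """Return {column name: number of None/NaN entries} for affected columns.

    Hypothetical helper for illustration; `columns` is any mapping from
    column name to an iterable of values.
    """
    def is_missing(value):
        # NaN is the only float value that is not equal to itself.
        return value is None or (isinstance(value, float) and value != value)

    report = {}
    for col in categorical_cols:
        count = sum(1 for value in columns[col] if is_missing(value))
        if count:
            report[col] = count
    return report


# Toy data standing in for the X used in the issue (an assumption).
X_toy = {
    "color": ["red", None, "blue", float("nan")],
    "size": ["S", "M", "L", "M"],
}
print(find_missing_levels(X_toy, ["color", "size"]))  # {'color': 2}
```

Any column that shows up in the report is a candidate for the category level that reaches the converter without a value.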

You can make your pipeline more robust by collecting and storing category values using SkLearn2PMML domain decorator classes:

from sklearn2pmml.decoration import CategoricalDomain, ContinuousDomain

mapper = DataFrameMapper(
    [(col, ContinuousDomain()) for col in numerical_cols] +
    [([col], CategoricalDomain()) for col in categorical_cols]
)

At minimum, this should give you a different, more informative error.
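
To illustrate what "collecting and storing category values" means in general, here is a conceptual sketch in plain Python. It is not the implementation of `CategoricalDomain`; the class `CategoryRecorder` and its methods are hypothetical. It records the distinct non-missing levels seen during fitting, and later rejects values that were never seen, which is the kind of explicit information a converter can use instead of guessing.

```python
def _is_missing(value):
    # Treat None and NaN as missing; NaN is the only float unequal to itself.
    return value is None or (isinstance(value, float) and value != value)


class CategoryRecorder:
    """Hypothetical helper that remembers the category levels of one column."""

    def fit(self, values):
        # Store the distinct non-missing levels in a stable order.
        self.levels_ = sorted({v for v in values if not _is_missing(v)}, key=str)
        return self

    def check(self, values):
        # Raise a descriptive error when a value was not seen during fit.
        seen = set(self.levels_)
        unseen = sorted(
            {v for v in values if not _is_missing(v) and v not in seen}, key=str
        )
        if unseen:
            raise ValueError(f"Unseen category levels: {unseen}")
        return list(values)


recorder = CategoryRecorder().fit(["red", None, "blue", "red"])
print(recorder.levels_)  # ['blue', 'red']
```

Calling `recorder.check(["green"])` on this toy data raises a `ValueError` that names the offending level, in contrast to the uninformative stack trace above.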

@vruusmann

Leaving this issue open as a reminder to improve error diagnostics in this area.

The current Java exception is devoid of any debugging information, because it is raised for a condition that is never supposed to trigger (a required attribute has not been set in the JPMML-Converter library stack).

vruusmann transferred this issue from jpmml/sklearn2pmml on Sep 11, 2022