
Detect invalid value treatment policy based on the "transformer composition" of SkLearn pipeline #436

Open
fritshermans opened this issue Dec 3, 2024 · 12 comments

Comments

@fritshermans

When I train a sklearn pipeline containing a TargetEncoder and convert it to a PMML file using sklearn2pmml, I get an error whenever new data contains a categorical value that was not seen during training. The desired behavior is that the default value is returned. When I instead create the pipeline using the PMMLPipeline object and define the categorical variable via a CategoricalDomain with invalid_value_treatment = "as_is", it works well on unseen categorical data.

Is there a way to avoid this problem when I want to convert an existing trained sklearn pipeline?

@vruusmann
Member

When I train a sklearn pipeline containing a TargetEncoder ...

Are you talking about category_encoders.target_encoder.TargetEncoder or sklearn.preprocessing.TargetEncoder here?

@fritshermans
Author

The sklearn version :-)

@vruusmann
Member

Is there a way to avoid this problem when I want to convert an existing trained sklearn pipeline?

I read the initial comment wrongly - I got the impression that the TargetEncoder converter was doing a bad job. However, that cannot be the case, because the MapValues@defaultValue attribute is correctly set to the mean value (this is the value that gets returned when the MapValues table does not contain a mapping for the input value):
https://github.com/jpmml/jpmml-sklearn/blob/1.8.6/pmml-sklearn/src/main/java/sklearn/preprocessing/TargetEncoder.java#L78-L80

So, the question is really about "retrofitting" an existing SkLearn pipeline - making it "invalid value aware" a long time after it was trained and saved?

The trouble is that Scikit-Learn lacks consistent support for invalid values in the first place. It has been added sporadically, to different estimator classes at different times.

There are really two options here:

  1. Modify the SkLearn pipeline, by prepending a meta-transformer to it that filters all problematic columns through appropriate ContinuousDomain, CategoricalDomain or OrdinalDomain decorators. These decorators allow you to set the desired invalid value treatment using the Domain.invalid_value_treatment attribute (you already got this).
  2. Modify the resulting PMML document, by visiting all MiningField elements, and adding a MiningField@invalidValueTreatment="asIs" attribute to them. Please note that the default value for this attribute is returnInvalid (which also takes effect when the attribute is not defined): https://dmg.org/pmml/v4-4-1/MiningSchema.html#xsdType_INVALID-VALUE-TREATMENT-METHOD

Please indicate which pathway (of the above two) you are likely to consider, so that we can keep brainstorming in the right direction.

@fritshermans
Author

I fixed it for now using the second option. You could consider emitting all categorical fields with invalidValueTreatment="asIs" by default, but I can understand you wouldn't like that...
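
The second option can also be scripted with the standard library instead of a regex. A sketch that sets the attribute on every MiningField (the namespace URI assumes a PMML 4.4 document; adjust it to match the xmlns of the generated file):

```python
import xml.etree.ElementTree as ET

PMML_NS = "http://www.dmg.org/PMML-4_4"  # assumption: PMML 4.4 document

def relax_invalid_values(in_path, out_path):
    """Set invalidValueTreatment="asIs" on every MiningField element."""
    ET.register_namespace("", PMML_NS)  # serialize without ns0: prefixes
    tree = ET.parse(in_path)
    for field in tree.iter("{%s}MiningField" % PMML_NS):
        field.set("invalidValueTreatment", "asIs")
    tree.write(out_path, xml_declaration=True, encoding="UTF-8")
```

Restricting the loop to fields without usageType="target" would leave the target field untouched.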

@vruusmann
Member

There are really two options here:

My bad - there is a third option, which may qualify as a SkLearn2PMML/JPMML-SkLearn bug.

Any time the converter sets an <Expression>@defaultValue attribute, it should perform an internal sanity check that the input field has "invalid values enabled".

Right now, the MapValues@defaultValue attribute is set, but invalid values are actually prevented from reaching it, because there is a blocking MiningField@invalidValueTreatment="returnInvalid" declaration in the way.

@vruusmann
Member

This issue reminds me of another issue: #428

@vruusmann
Member

vruusmann commented Dec 3, 2024

Any time the converter sets a @defaultValue attribute, it should perform an internal sanity check that the input field has "invalid values enabled".

The converter currently knows whether the input column had an explicit Domain decorator assigned to it or not.

If the decorator was set, then its stated invalid value treatment should prevail. However, when it was not set, then a flexible default should be applied.

Can you point me to official documentation about Scikit-Learn's invalid value (aka unknown value) handling policy? I assume that they were not allowed in the past (e.g. SkLearn 0.X versions), but have been gradually enabled in recent versions (esp. 1.3.X and newer). The "flexible default" should try to match this evolution.

@fritshermans
Author

I'm not sure where to find that. I think checking for invalid values is done at the transformer or estimator level. E.g. the sklearn OneHotEncoder has the option handle_unknown='error', so if an unseen value is presented to the trained OneHotEncoder, it will throw an error. In a pipeline, this is the place where the value is checked.
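
To illustrate, the policy is indeed selected per transformer via its constructor (both handle_unknown values shown here are real OneHotEncoder options):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([["red"], ["green"]], dtype=object)

# handle_unknown="error" (the default) raises a ValueError on unseen values
strict = OneHotEncoder(handle_unknown="error").fit(X)

# handle_unknown="ignore" encodes an unseen value as an all-zeros row instead
lenient = OneHotEncoder(handle_unknown="ignore").fit(X)
row = lenient.transform(np.array([["blue"]], dtype=object)).toarray()
assert row.sum() == 0.0
```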

@vruusmann vruusmann changed the title Converted sklearn pipeline with TargetEncoder does not work for unseen categorical values Detect invalid value treatment policy based on the "transformer composition" of SkLearn pipeline Dec 3, 2024
@vruusmann
Member

vruusmann commented Dec 3, 2024

Thinking out loud for my future self.

The MiningSchema element (together with all MiningField elements) is generated automatically based on the model body (that's why it often appears "pruned" - if a model does not need some input, it is not listed in model's input schema).

The right place for detecting the correct MiningField@invalidValueTreatment (and possibly MiningField@missingValueTreatment) attribute values would be around the same time/place.

Manual detection by each transformer converter seems too complex and fragile in comparison. Also, the automated detection component should land in the JPMML-Converter library, and would be easily reusable in other PMML production libraries such as JPMML-R, JPMML-SparkML, etc. as well.

@vruusmann
Member

@fritshermans Thanks for raising the issue! However, a proper fix looks like a major change in another library, which may take an unspecified amount of time (ie. it can't be fixed quickly at the SkLearn2PMML package level). You can keep running your manual PMML post-processing workflow in the meantime.

@fritshermans
Author

Thanks a lot for your quick response! I'm creating a small regex replace to fix the PMML :-)

@vruusmann
Member

vruusmann commented Dec 3, 2024

i'm creating a small regex-replace to fix the pmml

That should do the job.

But since we're dealing with XML documents, you may also consider using XSL Transformations (XSLT), applied using a small Java or Python application.
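
A sketch of that XSLT route (assumes the third-party lxml package and a PMML 4.4 namespace): an identity transform, plus one template that overrides the attribute on every MiningField.

```python
from lxml import etree

# Identity transform, plus one template that forces
# invalidValueTreatment="asIs" onto every MiningField
XSLT = b"""\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:pmml="http://www.dmg.org/PMML-4_4">
  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>
  <xsl:template match="pmml:MiningField">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <xsl:attribute name="invalidValueTreatment">asIs</xsl:attribute>
      <xsl:apply-templates select="node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
"""

transform = etree.XSLT(etree.XML(XSLT))

def fix_pmml(in_path, out_path):
    result = transform(etree.parse(in_path))
    result.write(out_path, xml_declaration=True, encoding="UTF-8")
```

Unlike a regex, this stays robust against attribute reordering, whitespace changes and namespace prefixes.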
