Using MultiLabelBinarizer #79

IdoZehori · 2018-01-17T14:58:20Z

Hey,

I ran into the following error when trying to perform k-hot encoding with sklearn's MultiLabelBinarizer.

How do you suggest dealing with columns with multiple categorical features?

Jan 17, 2018 4:52:08 PM org.jpmml.sklearn.Main run
INFO: Parsing PKL..
Jan 17, 2018 4:52:08 PM org.jpmml.sklearn.Main run
INFO: Parsed PKL in 95 ms.
Jan 17, 2018 4:52:08 PM org.jpmml.sklearn.Main run
INFO: Converting..
Jan 17, 2018 4:52:08 PM sklearn2pmml.PMMLPipeline encodePMML
WARNING: Attribute 'sklearn2pmml.PMMLPipeline.target_fields' is not set. Assuming y as the name of the target field
Jan 17, 2018 4:52:08 PM org.jpmml.sklearn.Main run
SEVERE: Failed to convert
java.lang.IllegalArgumentException: The value object (Python class sklearn.preprocessing.label.MultiLabelBinarizer) is not a supported Transformer
	at org.jpmml.sklearn.CastFunction.apply(CastFunction.java:43)
	at com.google.common.collect.Lists$TransformingRandomAccessList$1.transform(Lists.java:638)
	at com.google.common.collect.TransformedIterator.next(TransformedIterator.java:47)
	at sklearn_pandas.DataFrameMapper.initializeFeatures(DataFrameMapper.java:72)
	at sklearn.Initializer.encodeFeatures(Initializer.java:53)
	at sklearn.pipeline.Pipeline.encodeFeatures(Pipeline.java:81)
	at sklearn2pmml.PMMLPipeline.encodePMML(PMMLPipeline.java:147)
	at org.jpmml.sklearn.Main.run(Main.java:145)
	at org.jpmml.sklearn.Main.main(Main.java:94)
Caused by: java.lang.ClassCastException: Cannot cast net.razorvine.pickle.objects.ClassDict to sklearn.Transformer
	at java.lang.Class.cast(Class.java:3369)
	at org.jpmml.sklearn.CastFunction.apply(CastFunction.java:41)
	... 8 more

Exception in thread "main" java.lang.IllegalArgumentException: The value object (Python class sklearn.preprocessing.label.MultiLabelBinarizer) is not a supported Transformer
	at org.jpmml.sklearn.CastFunction.apply(CastFunction.java:43)
	at com.google.common.collect.Lists$TransformingRandomAccessList$1.transform(Lists.java:638)
	at com.google.common.collect.TransformedIterator.next(TransformedIterator.java:47)
	at sklearn_pandas.DataFrameMapper.initializeFeatures(DataFrameMapper.java:72)
	at sklearn.Initializer.encodeFeatures(Initializer.java:53)
	at sklearn.pipeline.Pipeline.encodeFeatures(Pipeline.java:81)
	at sklearn2pmml.PMMLPipeline.encodePMML(PMMLPipeline.java:147)
	at org.jpmml.sklearn.Main.run(Main.java:145)
	at org.jpmml.sklearn.Main.main(Main.java:94)
Caused by: java.lang.ClassCastException: Cannot cast net.razorvine.pickle.objects.ClassDict to sklearn.Transformer
	at java.lang.Class.cast(Class.java:3369)
	at org.jpmml.sklearn.CastFunction.apply(CastFunction.java:41)
	... 8 more
Process failed: The JPMML-SkLearn conversion application has failed. The Java process should have printed more information about the failure into its standard output and/or error streams


vruusmann · 2018-01-17T15:21:50Z

how do you suggest dealing with columns with multiple categorical features?

I don't quite understand the inner workings of MultiLabelBinarizer. Can you 1) explain what's the main functional difference between LabelBinarizer and MultiLabelBinarizer, and 2) share a code example where MultiLabelBinarizer has legit use? For the latter, you could use the Audit dataset (binary classification problem Adjusted ~ .).

I'd be happy to introduce MultiLabelBinarizer support into SkLearn2PMML/JPMML-SkLearn after that.

IdoZehori · 2018-01-17T15:53:08Z

Basically, what I need is to transform a column that holds an iterable in each row into a k-hot-encoding type mapping.
Here is a toy example:

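    import pandas as pd
    from sklearn.preprocessing import MultiLabelBinarizer

    # Each row holds a collection of labels rather than a single scalar value
    data = pd.DataFrame()
    iterColumn = [['a', 'b'], ['a', 'c'], ['b', 'c']]

    data['iterColumn'] = iterColumn
    data['y'] = 1
    print(data)
    # k-hot encode: one output column per distinct label
    print(MultiLabelBinarizer().fit_transform(data['iterColumn']))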

That prints:

  iterColumn  y
0     [a, b]  1
1     [a, c]  1
2     [b, c]  1

[[1 1 0]
 [1 0 1]
 [0 1 1]]

And you can then easily use some sklearn classifier from there.

vruusmann · 2018-01-17T16:42:59Z

Thanks - I think I've got the basic idea of MultiLabelBinarizer now.

In a nutshell, "iterColumn" is a collection-type feature/column, and the MultiLabelBinarizer transformation performs a "collection contains"-query on it (the first column of transformation results corresponds to "collection contains a?", the second to "collection contains b?", etc).
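To make the "collection contains" reading concrete, here is a minimal sketch that reuses the toy rows from the example above. It compares the MultiLabelBinarizer output with a hand-written membership test per class; the variable names (`rows`, `manual`) are only illustrative.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Toy rows taken from the example earlier in this thread
rows = [['a', 'b'], ['a', 'c'], ['b', 'c']]

mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(rows)

# One "collection contains <label>?" query per learned class, per row
manual = [[int(label in row) for label in mlb.classes_] for row in rows]

print(list(mlb.classes_))          # column order of the encoded matrix
print(encoded.tolist() == manual)  # do both approaches agree?
```

The learned `classes_` attribute gives the column order, so column i answers the question "does the collection contain `classes_[i]`?".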

Collection-type features are a bit problematic from the PMML perspective, because PMML typically operates with scalar-type features only.

I guess the same "features should be scalars" limitation applies to the Scikit-Learn framework as well. You can have collection-type features in the incoming dataset, but you must transform them to scalar-type features in the very beginning of your Scikit-Learn pipeline.

I will need to think about possible technical solutions. I could probably introduce collection-type feature support into the JPMML family of software pretty easily, but it would be pretty difficult to get it approved by DMG.org (which is responsible for maintaining the PMML standard).

vruusmann · 2018-01-17T16:54:23Z

Coming back to your original question - how to deal with columns with multiple categorical features - then the temporary workaround would be to employ the following two-stage workflow:

1) Take the original dataset, and "explode" single collection-type columns to multiple scalar-type columns (eg. using the MultiLabelBinarizer transformation). Do not do any other feature engineering in this step.
2) Take the "exploded" dataset, and work with it as usual (feature transformation, estimation).

SkLearn2PMML/JPMML-SkLearn is currently able to handle the second stage. You would need to maintain a separate Python/Java solution for handling the first stage.
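A rough sketch of the two-stage workflow, assuming pandas and plain scikit-learn. The toy data, the `iterColumn_` column prefix and the choice of LogisticRegression are illustrative assumptions, not something prescribed by SkLearn2PMML. In a real setup the second stage would presumably be wrapped in a PMMLPipeline and converted, while the first stage stays outside of the PMML document.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up toy dataset with one collection-type column
raw = pd.DataFrame({
    'iterColumn': [['a', 'b'], ['a', 'c'], ['b', 'c'], ['a']],
    'y': [1, 0, 1, 0],
})

# Stage 1: "explode" the collection-type column into scalar-type columns.
# No other feature engineering happens here.
mlb = MultiLabelBinarizer()
flags = mlb.fit_transform(raw['iterColumn'])
flag_columns = ['iterColumn_' + str(label) for label in mlb.classes_]
exploded = pd.DataFrame(flags, columns=flag_columns, index=raw.index)
exploded['y'] = raw['y']

# Stage 2: work with the exploded dataset as usual.
# Only this part would go into the pipeline that gets converted to PMML.
classifier = LogisticRegression()
classifier.fit(exploded[flag_columns], exploded['y'])

print(exploded)
```

The same fitted `mlb` object (or an equivalent Java routine) has to be applied to incoming records at scoring time, so that the model always sees the same scalar columns in the same order.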

Despite the bad situation/outlook, let's keep this issue open - it will remind me to think more about it.

vruusmann · 2018-01-17T16:55:33Z

Another issue, where the original dataset contains collection-type features: jpmml/jpmml-sklearn#62

IdoZehori · 2018-01-18T10:01:16Z

Thanks for the quick response!
Can you think of some workaround I can try? Maybe changing the input type to a dictionary, so that instead of having [1, 2, 3] I would have {1:1, 2:1, 3:1}, or something of that nature?

mathlf2015 · 2020-05-21T08:09:43Z

@IdoZehori I ran into the same problem. Could you tell me how you finally dealt with it?
