Problem Description
I am working with a home-grown synthesizer that is able to synthesize relatively rare categorical values (i.e., a value that occurs maybe 3 or 4 times in a table of thousands of rows).
This is all fine and good, but when I run a model (say sdmetrics.single_table.LinearRegression.compute()) on the synthetic data, it can occasionally happen that no instances of that value show up in the test data (randomly sampled from the original data), while some instances of that value do show up in the training data (randomly sampled from the synthesized data).
This in turn causes the ML Efficacy measures to fail with a message like this:
ValueError: Found unknown categories ['fake'] in column 0 during transform
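The failure mode can be reproduced with plain sklearn (the column name and values below are invented for illustration): fitting an encoder on data that lacks the rare category, then transforming data that contains it, raises exactly this ValueError.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Illustrative data: the rare category 'fake' appears in one frame but not the other.
has_rare = pd.DataFrame({"cat": ["a", "b", "fake"]})
no_rare = pd.DataFrame({"cat": ["a", "b", "a"]})

enc = OneHotEncoder()                  # handle_unknown defaults to 'error'
enc.fit(no_rare[["cat"]])              # fitted without the rare category
err = None
try:
    enc.transform(has_rare[["cat"]])
except ValueError as e:
    err = str(e)                       # "Found unknown categories ['fake'] ..."
```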
This can be avoided by setting handle_unknown='ignore' on the sklearn encoders (i.e., enc = OneHotEncoder(handle_unknown='ignore') inside the fit(self, data) method of the HyperTransformer class).
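A minimal sketch of what that setting does (data invented for illustration): with handle_unknown='ignore', a category unseen at fit time is encoded as an all-zeros row instead of raising.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

fit_data = pd.DataFrame({"cat": ["a", "b"]})
new_data = pd.DataFrame({"cat": ["a", "fake"]})  # 'fake' was not seen at fit time

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(fit_data[["cat"]])
out = enc.transform(new_data[["cat"]]).toarray()
# Row for 'fake' is all zeros; no ValueError is raised.
```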
Unfortunately there is no way to set the handle_unknown parameter from sdmetrics. As a result, there is no way for me to complete these measures (short of hard-coding the parameter in sklearn itself). I could probably wrap the efficacy measure in a try-except, but that still doesn't allow the measure itself to complete.
Expected behavior
Allow the handle_unknown flag to be specified in the model.compute() calls (either explicitly, or by allowing some kind of parameter pass-through to sklearn).
Hi @yoid2000, I transferred this issue into SDMetrics as this is the underlying library that implements the metric.
I can replicate this error and will classify this as a bug.
The expectation is that the training data does contain all possible values, since this is crucial information for forming the Linear Regression model. I agree that it should be ok if the test data does not contain all possible category values.
Root Cause
This error seems to be related to #291. It appears that the transformation (preprocessing) is using the wrong dataset to fit.
Observed: The code is fitting the transformers on the test_data and then applying this to the train_data. That's why it's expecting all categories to be in the test data.
Expected: The code should fit on the train_data and then apply it to the test_data. We expect all categories to be present during training but it does not matter during test.
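The expected fit/transform order can be sketched with plain sklearn (variable names mirror the description above; the data is invented for illustration): fitting on the training data, which contains all categories, makes transforming a test set that is missing a rare category safe.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train_data = pd.DataFrame({"cat": ["a", "b", "fake"]})  # all categories present
test_data = pd.DataFrame({"cat": ["a", "b"]})           # rare category absent

enc = OneHotEncoder()
enc.fit(train_data[["cat"]])                  # fit on train_data ...
test_enc = enc.transform(test_data[["cat"]])  # ... then transform test_data: no error
```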