OneHotEncodingTransformer fails if there is only one category #119

Closed
jpcoblentz opened this issue Sep 10, 2020 · 6 comments · Fixed by #120

@jpcoblentz

  • SDV version: most recent
  • Python version: 3.6
  • Operating system: RHEL

Description

I'm trying to generate synthetic data from a proprietary dataset and running into SciPy runtime errors.

What I Did

model = GaussianCopula(field_transformers=typemap)
/home/jacob/.local/lib/python3.6/site-packages/scipy/stats/_continuous_distns.py:621: RuntimeWarning: invalid value encountered in sqrt
  sk = 2*(b-a)*np.sqrt(a + b + 1) / (a + b + 2) / np.sqrt(a*b)
/home/jacob/.local/lib/python3.6/site-packages/scipy/optimize/minpack.py:175: RuntimeWarning: The number of calls to function has reached maxfev = 600.
  warnings.warn(msg, RuntimeWarning)
/home/jacob/.local/lib/python3.6/site-packages/copulas/univariate/truncated_gaussian.py:43: RuntimeWarning: divide by zero encountered in double_scalars
  a = (self.min - loc) / scale
/home/jacob/.local/lib/python3.6/site-packages/copulas/univariate/truncated_gaussian.py:44: RuntimeWarning: invalid value encountered in double_scalars

@csala
Contributor

csala commented Sep 10, 2020

Hello @jpcoblentz

These messages are not really errors, but rather warnings. They inform you about a situation that looks unusual but does not actually prevent the software from running properly.
In this case they are expected: there are a few numerical optimization processes within the copulas library that sometimes hit invalid scenarios along the way (such as a division by zero), so you can safely ignore them.

If you want to go a bit further and silence them, you can execute the following two lines in your Python session (or Jupyter notebook) and they should go away:

import warnings
warnings.simplefilter('ignore')

Of course, if the software is not working, in the sense that it is not producing results, please let us know and we'll try to help figure out the problem.
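
A narrower alternative to the global filter above is to silence only `RuntimeWarning`, and only around the call that triggers it. The following is a minimal sketch using the standard library; `call_without_runtime_warnings` and `noisy_step` are hypothetical names introduced for illustration and are not part of SDV, Copulas, or RDT.

```python
import warnings


def call_without_runtime_warnings(func, *args, **kwargs):
    """Run func with RuntimeWarning silenced, restoring the previous filters afterwards."""
    with warnings.catch_warnings():
        # The filter only applies inside this block; other warning
        # categories are still reported as usual.
        warnings.simplefilter("ignore", category=RuntimeWarning)
        return func(*args, **kwargs)


def noisy_step():
    """Hypothetical stand-in for a fitting call that emits a RuntimeWarning."""
    warnings.warn("invalid value encountered in sqrt", RuntimeWarning)
    return 42


result = call_without_runtime_warnings(noisy_step)
```

Scoping the filter this way keeps unrelated warnings visible in the rest of the session, which can help when a warning does turn out to point at a real problem.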

@jpcoblentz
Author

It's not producing results after running. I thought those warnings would be indicative of a larger problem.

model.sample(1000)

Produces




AxisError                                 Traceback (most recent call last)
<ipython-input-58-afde9c2645be> in <module>()
----> 1 model.sample(1000)

~/.local/lib/python3.6/site-packages/sdv/tabular/base.py in sample(self, num_rows, max_retries)
    143         num_to_sample = num_rows
    144         sampled = self._sample(num_to_sample)
--> 145         sampled = self._metadata.reverse_transform(sampled)
    146         sampled = self._metadata.filter_valid(sampled)
    147         num_valid = len(sampled)

~/.local/lib/python3.6/site-packages/sdv/metadata/table.py in reverse_transform(self, data)
    485             raise MetadataNotFittedError()
    486 
--> 487         reversed_data = self._hyper_transformer.reverse_transform(data)
    488 
    489         for constraint in self._constraints:

~/.local/lib/python3.6/site-packages/rdt/hyper_transformer.py in reverse_transform(self, data)
    225         for column_name, transformer in self._transformers.items():
    226             columns = self._get_columns(data, column_name)
--> 227             data[column_name] = transformer.reverse_transform(columns)
    228 
    229         return data

~/.local/lib/python3.6/site-packages/rdt/transformers/categorical.py in reverse_transform(self, data)
    257             pandas.Series
    258         """
--> 259         indices = np.argmax(data, axis=1)
    260         return pd.Series(indices).map(self.dummies)
    261 

<__array_function__ internals> in argmax(*args, **kwargs)

~/.local/lib/python3.6/site-packages/numpy/core/fromnumeric.py in argmax(a, axis, out)
   1186 
   1187     """
-> 1188     return _wrapfunc(a, 'argmax', axis=axis, out=out)
   1189 
   1190 

~/.local/lib/python3.6/site-packages/numpy/core/fromnumeric.py in _wrapfunc(obj, method, *args, **kwds)
     56 
     57     try:
---> 58         return bound(*args, **kwds)
     59     except TypeError:
     60         # A TypeError occurs if the object does have such a method in its

AxisError: axis 1 is out of bounds for array of dimension 1
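
The last line of the traceback can be reproduced with NumPy alone: `np.argmax(..., axis=1)` needs an array with at least two dimensions, and it raises this error when given a 1-D array. The sketch below uses made-up values and only illustrates the error message, not what the transformer actually receives.

```python
import numpy as np

one_dim = np.array([1.0, 1.0])    # shape (2,): only axis 0 exists
two_dim = one_dim.reshape(-1, 1)  # shape (2, 1): one row per sample, one column

try:
    np.argmax(one_dim, axis=1)
except ValueError as error:
    # NumPy's AxisError is a subclass of ValueError (and IndexError).
    message = str(error)

# With an explicit second axis the same call succeeds.
indices = np.argmax(two_dim, axis=1)
```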

@csala
Contributor

csala commented Sep 10, 2020

Wow, yes, that's more serious. But it's hard to tell what's going on from the traceback alone.

Would you be able to post a code snippet with the steps that you executed and, if possible, the data that produces the error? Please feel free to join our Slack workspace and contact me directly if that makes sharing the data easier.

@csala
Contributor

csala commented Sep 10, 2020

I also realized now that you are actually using SDV and the error seems to be raised from RDT, so I'm transferring the issue to SDV to follow up there.

@csala csala transferred this issue from sdv-dev/Copulas Sep 10, 2020
@csala
Contributor

csala commented Sep 10, 2020

It seems that the error could be caused by a categorical variable that has only one unique value in it.
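
One way to check a dataset for such columns before fitting is to count the distinct values per column with pandas. This is only a sketch on hypothetical data; the column names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical data: 'a' holds a single category, 'b' holds two.
df = pd.DataFrame({
    'a': ['a', 'a'],
    'b': ['x', 'y'],
})

# nunique() counts distinct values per column (NaN is excluded by default).
unique_counts = df.nunique()
constant_columns = unique_counts[unique_counts == 1].index.tolist()
```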

Here's a reproducible example:

In [13]: import pandas as pd
    ...:
    ...: df = pd.DataFrame({
    ...:     'a': ['a', 'a']
    ...: })

In [14]: from rdt.hyper_transformer import HyperTransformer
    ...: from rdt.transformers import OneHotEncodingTransformer
    ...:
    ...: ht = HyperTransformer({'a': OneHotEncodingTransformer()})
    ...:
    ...: ht.reverse_transform(ht.fit_transform(df))
---------------------------------------------------------------------------
AxisError                                 Traceback (most recent call last)
<ipython-input-14-749be688644d> in <module>
      4 ht = HyperTransformer({'a': OneHotEncodingTransformer()})
      5 
----> 6 ht.reverse_transform(ht.fit_transform(df))

~/.virtualenvs/SDV.clean/lib/python3.6/site-packages/rdt/hyper_transformer.py in reverse_transform(self, data)
    225         for column_name, transformer in self._transformers.items():
    226             columns = self._get_columns(data, column_name)
--> 227             data[column_name] = transformer.reverse_transform(columns)
    228 
    229         return data

~/.virtualenvs/SDV.clean/lib/python3.6/site-packages/rdt/transformers/categorical.py in reverse_transform(self, data)
    257             pandas.Series
    258         """
--> 259         indices = np.argmax(data, axis=1)
    260         return pd.Series(indices).map(self.dummies)
    261 

<__array_function__ internals> in argmax(*args, **kwargs)

~/.virtualenvs/SDV.clean/lib/python3.6/site-packages/numpy/core/fromnumeric.py in argmax(a, axis, out)
   1186 
   1187     """
-> 1188     return _wrapfunc(a, 'argmax', axis=axis, out=out)
   1189 
   1190 

~/.virtualenvs/SDV.clean/lib/python3.6/site-packages/numpy/core/fromnumeric.py in _wrapfunc(obj, method, *args, **kwds)
     56 
     57     try:
---> 58         return bound(*args, **kwds)
     59     except TypeError:
     60         # A TypeError occurs if the object does have such a method in its

AxisError: axis 1 is out of bounds for array of dimension 1
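
A defensive way to decode one-hot data that may arrive as a single column can be sketched as follows. This is an illustration only, not necessarily the change made in the linked fix; `decode_one_hot` and its `categories` parameter are hypothetical names that do not belong to RDT.

```python
import numpy as np
import pandas as pd


def decode_one_hot(data, categories):
    """Map one-hot rows back to category labels (hypothetical helper, not RDT's code)."""
    data = np.asarray(data)
    if data.ndim == 1:
        # Assumption: a 1-D input is a single one-hot column (one category),
        # so it is reshaped to (n_rows, 1) before taking argmax over columns.
        data = data.reshape(-1, 1)

    indices = np.argmax(data, axis=1)
    # Translate each column index into its category label.
    return pd.Series(indices).map(dict(enumerate(categories)))


single = decode_one_hot(np.array([1.0, 1.0]), ['a'])
several = decode_one_hot(np.array([[1, 0], [0, 1]]), ['a', 'b'])
```

The reshape treats a 1-D array as many rows of one category. A 1-D array could in principle also be one row of several categories, so the sketch is only safe when the number of categories is known to be one in that case.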

@csala
Contributor

csala commented Sep 10, 2020

Also, since the error actually comes from RDT, I'm transferring the issue again, this time to RDT, where it will stay until it is fixed.

@csala csala transferred this issue from sdv-dev/SDV Sep 10, 2020
@csala csala added the bug Something isn't working label Sep 10, 2020
@csala csala self-assigned this Sep 10, 2020
@csala csala added this to the 0.2.5 milestone Sep 11, 2020
@csala csala closed this as completed Sep 11, 2020
@csala csala changed the title from "Scipy Runtime Errors" to "OneHotEncodingTransformer fails if there is only one category" Sep 18, 2020