Allow the HyperTransformer to be used on a subset of the columns #152

Closed
csala opened this issue Feb 19, 2021 · 0 comments · Fixed by #153

csala commented Feb 19, 2021

  • Reversible Data Transforms version: v0.3.0

Description

The transform and reverse_transform methods of the current HyperTransformer accept a DataFrame that contains additional columns not seen in the training data. When that happens, the HyperTransformer ignores those columns and leaves them unmodified.

The opposite case, passing a DataFrame that contains only a subset of the columns seen during training, is not supported and makes the HyperTransformer crash. This should also be supported.
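
As a rough illustration of the requested behavior, the sketch below skips any fitted column that is absent from the input instead of failing on it. It uses only pandas. `DoublingTransformer` and `transform_available` are hypothetical names made up for this example; this is not RDT's code and not necessarily how the fix is implemented.

```python
import pandas as pd


class DoublingTransformer:
    """Hypothetical stand-in for a fitted per-column transformer."""

    def transform(self, column):
        return column * 2


def transform_available(data, transformers):
    """Apply each transformer only if its column is present in ``data``.

    ``transformers`` maps column names to fitted transformer objects
    (an assumption made for this sketch).
    """
    data = data.copy()  # avoid mutating the caller's DataFrame
    for column_name, transformer in transformers.items():
        if column_name not in data:
            continue  # column was seen during fit but is missing here: skip it
        data[column_name] = transformer.transform(data[column_name])
    return data


# Two columns were "fitted", but only one of them is passed in.
transformers = {'float': DoublingTransformer(), 'other': DoublingTransformer()}
subset = pd.DataFrame({'float': [1., 2., 3.]})
result = transform_available(subset, transformers)
```

Checking for membership before accessing the column keeps the handling symmetric with the existing behavior for extra columns: anything without a matching counterpart is simply left alone.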

Example

This example shows how the HyperTransformer currently crashes when it is passed a DataFrame that contains only a subset of the training columns.

In [1]: import pandas as pd

In [2]: data = pd.DataFrame({
   ...:     'category': ['a', 'b', 'c'],
   ...:     'float': [1., 2., 3.],
   ...: })

In [3]: import rdt

In [4]: ht = rdt.HyperTransformer()

In [5]: ht.fit(data)

In [6]: ht.transform(data[['category']])
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/mnt/nvme0n1p2/xals/.virtualenvs/RDT/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2894             try:
-> 2895                 return self._engine.get_loc(casted_key)
   2896             except KeyError as err:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'float'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
<ipython-input-6-b9f4d352f138> in <module>
----> 1 ht.transform(data[['category']])

/mnt/nvme0n1p2/xals/Projects/MIT/RDT/rdt/hyper_transformer.py in transform(self, data)
    187 
    188         for column_name, transformer in self._transformers.items():
--> 189             column = data.pop(column_name)
    190             transformed = transformer.transform(column)
    191 

/mnt/nvme0n1p2/xals/.virtualenvs/RDT/lib/python3.8/site-packages/pandas/core/frame.py in pop(self, item)
   4369         3  monkey        NaN
   4370         """
-> 4371         return super().pop(item=item)
   4372 
   4373     @doc(NDFrame.replace, **_shared_doc_kwargs)

/mnt/nvme0n1p2/xals/.virtualenvs/RDT/lib/python3.8/site-packages/pandas/core/generic.py in pop(self, item)
    659 
    660     def pop(self, item: Label) -> Union["Series", Any]:
--> 661         result = self[item]
    662         del self[item]
    663         if self.ndim == 2:

/mnt/nvme0n1p2/xals/.virtualenvs/RDT/lib/python3.8/site-packages/pandas/core/frame.py in __getitem__(self, key)
   2904             if self.columns.nlevels > 1:
   2905                 return self._getitem_multilevel(key)
-> 2906             indexer = self.columns.get_loc(key)
   2907             if is_integer(indexer):
   2908                 indexer = [indexer]

/mnt/nvme0n1p2/xals/.virtualenvs/RDT/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2895                 return self._engine.get_loc(casted_key)
   2896             except KeyError as err:
-> 2897                 raise KeyError(key) from err
   2898 
   2899         if tolerance is not None:

KeyError: 'float'
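
The traceback above ends in `DataFrame.pop`, which raises a `KeyError` when the requested column does not exist. A minimal pandas-only reproduction of that root cause, independent of RDT, could look like this (the variable names `raised` and `missing` are just for illustration):

```python
import pandas as pd

# A frame that lacks the 'float' column, like data[['category']] above.
data = pd.DataFrame({'category': ['a', 'b', 'c']})

try:
    data.pop('float')  # same call as in hyper_transformer.py, line 189
    raised = False
except KeyError as error:
    raised = True
    missing = error.args[0]  # name of the column that was not found
```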