Allow the HyperTransformer to be used on a subset of the columns #152

csala · 2021-02-19T11:49:43Z

Reversible Data Transforms version: v0.3.0

Description

The HyperTransformer's transform and reverse_transform methods currently accept a DataFrame that contains additional columns not seen during fitting. When that happens, the HyperTransformer ignores those columns and leaves them unmodified.

The opposite case, passing a DataFrame that contains only a subset of the columns seen during fitting, is not supported and makes the HyperTransformer crash. This should also be supported.

Example

This example shows how the HyperTransformer currently crashes when it is passed only a subset of the columns it was fitted on.

In [1]: import pandas as pd

In [2]: data = pd.DataFrame({
   ...:     'category': ['a', 'b', 'c'],
   ...:     'float': [1., 2., 3.],
   ...: })

In [3]: import rdt

In [4]: ht = rdt.HyperTransformer()

In [5]: ht.fit(data)

In [6]: ht.transform(data[['category']])
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/mnt/nvme0n1p2/xals/.virtualenvs/RDT/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2894             try:
-> 2895                 return self._engine.get_loc(casted_key)
   2896             except KeyError as err:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'float'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
<ipython-input-6-b9f4d352f138> in <module>
----> 1 ht.transform(data[['category']])

/mnt/nvme0n1p2/xals/Projects/MIT/RDT/rdt/hyper_transformer.py in transform(self, data)
    187 
    188         for column_name, transformer in self._transformers.items():
--> 189             column = data.pop(column_name)
    190             transformed = transformer.transform(column)
    191 

/mnt/nvme0n1p2/xals/.virtualenvs/RDT/lib/python3.8/site-packages/pandas/core/frame.py in pop(self, item)
   4369         3  monkey        NaN
   4370         """
-> 4371         return super().pop(item=item)
   4372 
   4373     @doc(NDFrame.replace, **_shared_doc_kwargs)

/mnt/nvme0n1p2/xals/.virtualenvs/RDT/lib/python3.8/site-packages/pandas/core/generic.py in pop(self, item)
    659 
    660     def pop(self, item: Label) -> Union["Series", Any]:
--> 661         result = self[item]
    662         del self[item]
    663         if self.ndim == 2:

/mnt/nvme0n1p2/xals/.virtualenvs/RDT/lib/python3.8/site-packages/pandas/core/frame.py in __getitem__(self, key)
   2904             if self.columns.nlevels > 1:
   2905                 return self._getitem_multilevel(key)
-> 2906             indexer = self.columns.get_loc(key)
   2907             if is_integer(indexer):
   2908                 indexer = [indexer]

/mnt/nvme0n1p2/xals/.virtualenvs/RDT/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2895                 return self._engine.get_loc(casted_key)
   2896             except KeyError as err:
-> 2897                 raise KeyError(key) from err
   2898 
   2899         if tolerance is not None:

KeyError: 'float'
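
The traceback points at the loop in hyper_transformer.py that calls data.pop(column_name) for every fitted column, whether or not it is present in the input. A minimal sketch of one possible fix is to skip fitted columns that are missing from the input. This is an illustration only: the transform_subset helper and the two toy transformer classes are hypothetical stand-ins, not part of RDT, and the change actually merged in #153 may differ.

```python
import pandas as pd


class UpperCaseTransformer:
    """Hypothetical stand-in for a fitted transformer (not an RDT class)."""

    def transform(self, column):
        return column.str.upper()


class DoubleTransformer:
    """Hypothetical stand-in for a fitted transformer (not an RDT class)."""

    def transform(self, column):
        return column * 2


def transform_subset(transformers, data):
    """Apply each fitted transformer only when its column is present.

    `transformers` maps column names to fitted transformers, mirroring the
    `self._transformers` dict that appears in the traceback.
    """
    data = data.copy()  # avoid mutating the caller's DataFrame
    for column_name, transformer in transformers.items():
        if column_name not in data:
            # Column was seen during fit but is absent now: skip it
            # instead of letting data.pop() raise a KeyError.
            continue
        column = data.pop(column_name)
        data[column_name] = transformer.transform(column)
    return data


transformers = {
    'category': UpperCaseTransformer(),
    'float': DoubleTransformer(),
}
data = pd.DataFrame({
    'category': ['a', 'b', 'c'],
    'float': [1., 2., 3.],
})

# Only 'category' is passed; 'float' is skipped rather than raising.
subset_result = transform_subset(transformers, data[['category']])
full_result = transform_subset(transformers, data)
```

With this guard, columns missing from the input are treated the same way as unseen extra columns already are: they are ignored, and the remaining columns are transformed as usual.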

csala self-assigned this Feb 19, 2021

csala added the enhancement label Feb 19, 2021

csala added this to the 0.4.0 milestone Feb 19, 2021

csala mentioned this issue Feb 19, 2021

Support transforming and reversing a subset of the training columns #153

Merged

csala closed this as completed in #153 Feb 19, 2021
