OneHotEncodingTransformer support for lists and lists of lists #137
Codecov Report
@@            Coverage Diff             @@
##           master     #137      +/-   ##
==========================================
+ Coverage   98.61%   98.65%   +0.04%
==========================================
  Files           9        9
  Lines         434      447      +13
==========================================
+ Hits          428      441      +13
  Misses          6        6
Looks good! Just a couple of comments.
rdt/transformers/categorical.py
Outdated
    """
    if isinstance(data, list):
        data = np.array(data)
        if len(data.shape) == 2:
Maybe we should put this block one level higher, outside of the isinstance(data, list) block?
This way we would also support 2D numpy arrays as input.
Sorry, I'm a little confused here, you mean just delete the isinstance line?
No, I mean dedenting the if len(data.shape) lines by one level, so they always run regardless of whether the input was a list or a numpy array.
This way, if the input was already a numpy array, its shape is validated too.
Oh, you mean this?
    if isinstance(data, list):
        data = np.array(data)
    if isinstance(data, np.ndarray):
        if len(data.shape) == 2:
            if data.shape[1] != 1:
                raise ValueError("Unexpected format.")
            data = data[:, 0]
        elif len(data.shape) > 2:
            raise ValueError("Unexpected format.")
    return data
At this point, maybe we might as well support anything np.array can convert (although I'll have to write a few test cases to make sure it actually works)? Like this:
    data = np.array(data)
    if len(data.shape) == 2:
        if data.shape[1] != 1:
            raise ValueError("Unexpected format.")
        data = data[:, 0]
    elif len(data.shape) > 2:
        raise ValueError("Unexpected format.")
    return data
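For reference, a minimal sketch of the kind of checks that second variant would need. It assumes the logic is pulled out into a stand-alone function; the name prepare_data and the sample inputs are hypothetical and not taken from the PR.

```python
import numpy as np


def prepare_data(data):
    """Hypothetical stand-alone copy of the second variant above."""
    # np.array converts lists and nested lists, and copies existing arrays.
    data = np.array(data)
    if len(data.shape) == 2:
        # Only a single column is accepted; flatten it to 1D.
        if data.shape[1] != 1:
            raise ValueError("Unexpected format.")
        data = data[:, 0]
    elif len(data.shape) > 2:
        raise ValueError("Unexpected format.")
    return data


# A flat list, a list of single-item lists and a 2D single-column array
# should all end up as the same 1D array.
flat = prepare_data(["a", "b", "c"])
nested = prepare_data([["a"], ["b"], ["c"]])
column = prepare_data(np.array([["a"], ["b"], ["c"]]))

assert flat.shape == (3,)
assert nested.tolist() == flat.tolist()
assert column.tolist() == flat.tolist()

# More than one column, or more than two dimensions, is rejected.
for bad in ([["a", "b"], ["c", "d"]], np.zeros((2, 1, 1))):
    try:
        prepare_data(bad)
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError")
```

One thing worth checking with this variant: np.array also converts a pandas.Series into a plain numpy array, so the function would no longer return a pandas.Series even though the docstring lists it as a return type.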
rdt/transformers/categorical.py
Outdated
@@ -236,15 +236,21 @@ def _prepare_data(data):

        Returns:
            pandas.Series or numpy.ndarray

        if isinstance(data, list):
I suspect these lines have made their way here by error :-)
Oof, my bad.
Resolves sdv-dev/CTGAN#87