Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OneHotEncodingTransformer support for lists and lists of lists #137

Merged
merged 5 commits into from
Nov 18, 2020

Conversation

fealho
Copy link
Member

@fealho fealho commented Nov 13, 2020

Resolves sdv-dev/CTGAN#87

@codecov-io
Copy link

codecov-io commented Nov 13, 2020

Codecov Report

Merging #137 (0d178e8) into master (3ee57fd) will increase coverage by 0.04%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #137      +/-   ##
==========================================
+ Coverage   98.61%   98.65%   +0.04%     
==========================================
  Files           9        9              
  Lines         434      447      +13     
==========================================
+ Hits          428      441      +13     
  Misses          6        6              
Impacted Files Coverage Δ
rdt/transformers/categorical.py 95.93% <100.00%> (+0.48%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3ee57fd...0d178e8. Read the comment docs.

@fealho fealho requested a review from csala November 13, 2020 06:25
Copy link
Contributor

@csala csala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Just a couple of comments.

"""
if isinstance(data, list):
data = np.array(data)
if len(data.shape) == 2:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should put this block one level higher, outside of the isinstance(data, list) block?

This way we would also support 2D numpy arrays as input.

Copy link
Member Author

@fealho fealho Nov 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I'm a little confused here, you mean just delete the isinstance line?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I mean removing the indentation of the if len(data.shape) lines one level, so they are always run independently on whether the input was a list or a numpy array.

This way, if the input was already a numpy array, its shape is validated too.

Copy link
Member Author

@fealho fealho Nov 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, you mean this?

if isinstance(data, list):
    data = np.array(data)

if isinstance(data, np.ndarray):
    if len(data.shape) == 2:
        if data.shape[1] != 1:
            raise ValueError("Unexpected format.")
    
        data = data[:, 0]

    elif len(data.shape) > 2:
        raise ValueError("Unexpected format.")
    
return data

Maybe at this point we might as well just support anything np.ndarray does (although I'll have to write a few test cases to make sure it actually works)? Like this:

data = np.array(data)
if len(data.shape) == 2:
    if data.shape[1] != 1:
        raise ValueError("Unexpected format.")

    data = data[:, 0]

elif len(data.shape) > 2:
    raise ValueError("Unexpected format.")

return data

rdt/transformers/categorical.py Outdated Show resolved Hide resolved
@fealho fealho requested a review from csala November 13, 2020 19:52
@@ -236,15 +236,21 @@ def _prepare_data(data):

Returns:
pandas.Series or numpy.ndarray

if isinstance(data, list):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect these lines have made their way here by error :-)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oof, my bad.

@fealho fealho merged commit c2842b6 into master Nov 18, 2020
@fealho fealho deleted the prepare-data branch November 18, 2020 02:46
@csala csala added the bug Something isn't working label Nov 20, 2020
@csala csala added this to the 0.2.8 milestone Nov 20, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Sample function doesn't work
3 participants