Variable types not preserved after call to normalize_entity() #10

Open
j-grover opened this issue Sep 26, 2019 · 7 comments · May be fixed by #12

Comments

@j-grover

Reproducible example:

import pandas as pd
import featuretools as ft

from featuretools.variable_types import IPAddress
from autonormalize import autonormalize as an

input_df = pd.DataFrame(
    {
        'ip_address': ['128.101.101.101', '1.120.0.0', '17.86.21.0', '23.1.23.255'],
        'length': [900, 60, 20, 30],
        'city': ['adl', 'syd', 'adl', 'syd'],
        'country': ['aus', 'aus', 'aus', 'aus'],
        'is_threat': [True, False, False, False]
    }
)

variable_types = {'ip_address': IPAddress}

es = ft.EntitySet()
es.entity_from_dataframe(entity_id='data',
                         dataframe=input_df,
                         index='index',
                         variable_types=variable_types,
                         make_index=True)

Column ip_address is set to dtype featuretools.variable_types.IPAddress:

print(es['data'].variables)

[<Variable: index (dtype = index)>, 
<Variable: length (dtype = numeric)>, 
<Variable: city (dtype = categorical)>, 
<Variable: country (dtype = categorical)>, 
<Variable: is_threat (dtype = boolean)>, 
<Variable: ip_address (dtype = ip)>]

After normalisation, ip_address resolves back to categorical:

normalized_es = an.normalize_entity(es)

for entity_id, entity in normalized_es.entity_dict.items():
    print('Entity:', entity_id)
    print(entity.variables)

Entity: index
[<Variable: index (dtype = index)>, 
<Variable: length (dtype = numeric)>, 
<Variable: city (dtype = id)>, 
<Variable: is_threat (dtype = boolean)>, 
<Variable: ip_address (dtype = categorical)>]
Entity: city
[<Variable: city (dtype = index)>, <Variable: country (dtype = categorical)>]

To get the desired features, the variable types need to be preserved so that the right primitives are applied when running dfs. Is this the intended behaviour, or do the variable types need to be set manually again after normalisation?
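
Until the types are preserved, one possible workaround is to re-apply them by hand after normalisation. The helper below is only a sketch: `reapply_variable_types` is a hypothetical name, and it assumes the pre-1.0 Featuretools API (`EntitySet.entities`, `Entity.df`, `Entity.index`, and `Entity.convert_variable_type`). It skips a column that has become the index of a new entity; columns that normalisation turned into foreign keys (dtype `id`) are not handled and would need the same care.

```python
def reapply_variable_types(entityset, variable_types):
    """Re-apply user-specified variable types to every entity that still
    contains the column (hypothetical helper, not part of autonormalize).

    Returns a list of (entity id, column) pairs that were converted.
    """
    converted = []
    for entity in entityset.entities:
        for column, vtype in variable_types.items():
            # Normalisation may have moved the column to another entity.
            if column not in entity.df.columns:
                continue
            # Leave index columns alone so the entity keeps a valid index.
            if column == entity.index:
                continue
            # convert_data=False changes only the variable type, not the data.
            entity.convert_variable_type(column, vtype, convert_data=False)
            converted.append((entity.id, column))
    return converted
```

With the example above, this would be called as `reapply_variable_types(normalized_es, variable_types)` right after `an.normalize_entity(es)`.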

@kmax12
Contributor

kmax12 commented Sep 26, 2019

@j-grover is this an issue with autonormalize or Featuretools? If Featuretools, please post it as an issue to that repo: https://github.com/featuretools/featuretools/

@j-grover
Author

@kmax12

For reference: autonormalize.py

Normalizing an EntitySet follows this call chain:
normalize_entity -> auto_entityset -> make_entityset

As far as I can tell, the variable types are not passed from normalize_entity to auto_entityset, so when the entities are created in make_entityset, no variable types are available:

if time_index in current.df.columns:
    entities[current.index[0]] = (current.df, current.index[0], time_index)
else:
    entities[current.index[0]] = (current.df, current.index[0])

Entities definition:

"""
entities (dict[str -> tuple(pd.DataFrame, str, str)]): Dictionary of
                    entities. Entries take the format
                    {entity id -> (dataframe, id column, (time_column), (variable_types))}.
                    Note that time_column and variable_types are optional.
"""

@kmax12
Contributor

kmax12 commented Sep 27, 2019

@j-grover thanks for clarification. I see the issue now.

you're right that we aren't carrying the variable types through. would you be interested in submitting a PR that does that?

@j-grover
Author

> @j-grover thanks for clarification. I see the issue now.
>
> you're right that we aren't carrying the variable types through. would you be interested in submitting a PR that does that?

Yeah sure, I'll give it a go.

@j-grover
Author

j-grover commented Sep 30, 2019

@kmax12 I have a branch ready, but I don't believe I have push access.

@kmax12
Contributor

kmax12 commented Oct 17, 2019

@j-grover can you create a fork to make the pull request?

@j-grover j-grover linked a pull request Oct 18, 2019 that will close this issue
@j-grover
Author

> @j-grover can you create a fork to make the pull request?

Thanks, created PR.

j-grover added a commit to j-grover/autonormalize that referenced this issue Mar 15, 2020
j-grover added a commit to j-grover/autonormalize that referenced this issue Mar 15, 2020
j-grover added a commit to j-grover/autonormalize that referenced this issue Mar 15, 2020
j-grover added a commit to j-grover/autonormalize that referenced this issue Apr 13, 2020
j-grover added a commit to j-grover/autonormalize that referenced this issue Apr 18, 2020