Fix compatibility issues with Featuretools #41

Merged
merged 13 commits into from
Mar 9, 2022
Changes from 12 commits
48 changes: 24 additions & 24 deletions README.md
@@ -8,9 +8,9 @@ AutoNormalize is a Python library for automated datatable normalization. It allo

 ## Getting Started

-* [Install](#install)
-* [Demos](#demos)
-* [API Reference](#api-reference)
+- [Install](#install)
+- [Demos](#demos)
+- [API Reference](#api-reference)

 ## Install

@@ -26,11 +26,11 @@ pip uninstall autonormalize

 ## Demos

-* [Blog Post](https://blog.featurelabs.com/automatic-dataset-normalization-for-feature-engineering-in-python/)
-* [Machine Learning Demo with Featuretools](https://github.com/FeatureLabs/autonormalize/blob/master/autonormalize/demos/AutoNormalize%20%2B%20FeatureTools%20Demo.ipynb)
-* [Kaggle Liquor Sales Dataset Demo](https://github.com/FeatureLabs/autonormalize/blob/master/autonormalize/demos/Kaggle%20Liquor%20Sales%20Dataset%20Demo.ipynb)
-* [Demo with Editing Dependencies](https://github.com/FeatureLabs/autonormalize/blob/master/autonormalize/demos/Editing%20Dependnecies%20Demo.ipynb)
-* [Kaggle Food Production Dataset Demo](https://github.com/FeatureLabs/autonormalize/blob/master/autonormalize/demos/Kaggle%20Food%20%20Dataset%20Demo.ipynb)
+- [Blog Post](https://blog.featurelabs.com/automatic-dataset-normalization-for-feature-engineering-in-python/)
+- [Machine Learning Demo with Featuretools](https://github.com/FeatureLabs/autonormalize/blob/master/autonormalize/demos/AutoNormalize%20%2B%20FeatureTools%20Demo.ipynb)
+- [Kaggle Liquor Sales Dataset Demo](https://github.com/FeatureLabs/autonormalize/blob/master/autonormalize/demos/Kaggle%20Liquor%20Sales%20Dataset%20Demo.ipynb)
+- [Demo with Editing Dependencies](https://github.com/FeatureLabs/autonormalize/blob/master/autonormalize/demos/Editing%20Dependnecies%20Demo.ipynb)
+- [Kaggle Food Production Dataset Demo](https://github.com/FeatureLabs/autonormalize/blob/master/autonormalize/demos/Kaggle%20Food%20%20Dataset%20Demo.ipynb)

 ## API Reference

@@ -44,19 +44,19 @@ Creates a normalized entityset from a dataframe.

 **Arguments:**

-* `df` (pd.Dataframe) : the dataframe containing data
+- `df` (pd.Dataframe) : the dataframe containing data

-* `accuracy` (0 < float <= 1.00; default = 0.98) : the accuracy threshold required in order to conclude a dependency (i.e. with accuracy = 0.98, 0.98 of the rows must hold true the dependency LHS --> RHS)
+- `accuracy` (0 < float <= 1.00; default = 0.98) : the accuracy threshold required in order to conclude a dependency (i.e. with accuracy = 0.98, 0.98 of the rows must hold true the dependency LHS --> RHS)

-* `index` (str, optional) : name of column that is intended index of df
+- `index` (str, optional) : name of column that is intended index of df

-* `name` (str, optional) : the name of created EntitySet
+- `name` (str, optional) : the name of created EntitySet

-* `time_index` (str, optional) : name of time column in the dataframe.
+- `time_index` (str, optional) : name of time column in the dataframe.

 **Returns:**

-* `entityset` (ft.EntitySet) : created entity set
+- `entityset` (ft.EntitySet) : created entity set

 ### `find_dependencies`
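
Note on the `accuracy` argument above: one way to read the threshold is as the fraction of rows that agree with the most common RHS value for their LHS value. The sketch below illustrates that reading in plain Python. The helper `dependency_holds` and the sample rows are hypothetical and are not part of AutoNormalize, whose DFD-based search may measure accuracy differently.

```python
from collections import Counter, defaultdict


def dependency_holds(rows, lhs, rhs, accuracy=0.98):
    """Return True if LHS --> RHS holds for at least `accuracy` of the rows.

    Hypothetical helper for illustration only; not AutoNormalize's code.
    """
    # Count how often each RHS value appears for each distinct LHS value.
    groups = defaultdict(Counter)
    for row in rows:
        key = tuple(row[col] for col in lhs)
        groups[key][row[rhs]] += 1

    # A row "holds" if it agrees with the most common RHS value in its group.
    holding = sum(counter.most_common(1)[0][1] for counter in groups.values())
    total = sum(sum(counter.values()) for counter in groups.values())
    return total > 0 and holding / total >= accuracy


rows = [
    {"zip": "02139", "city": "Cambridge"},
    {"zip": "02139", "city": "Cambridge"},
    {"zip": "02139", "city": "Cambrdge"},  # one noisy row
    {"zip": "10001", "city": "New York"},
]
exact = dependency_holds(rows, ["zip"], "city", accuracy=1.0)
loose = dependency_holds(rows, ["zip"], "city", accuracy=0.75)
```

With one noisy row out of four, the dependency `zip --> city` is rejected at `accuracy=1.0` but accepted at `accuracy=0.75`, which is the kind of tolerance the threshold is meant to provide.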

@@ -68,7 +68,7 @@ Finds dependencies within dataframe with the DFD search algorithm.

 **Returns:**

-* `dependencies` (Dependencies) : the dependencies found in the data within the contraints provided
+- `dependencies` (Dependencies) : the dependencies found in the data within the contraints provided

 ### `normalize_dataframe`

@@ -78,13 +78,13 @@ normalize_dataframe(df, dependencies)

 Normalizes dataframe based on the dependencies given. Keys for the newly created DataFrames can only be columns that are strings, ints, or categories. Keys are chosen according to the priority:

-1) shortest lenghts
-2) has "id" in some form in the name of an attribute
-3) has attribute furthest to left in the table
+1. shortest lenghts
+2. has "id" in some form in the name of an attribute
+3. has attribute furthest to left in the table

 **Returns:**

-* `new_dfs` (list[pd.DataFrame]) : list of new dataframes
+- `new_dfs` (list[pd.DataFrame]) : list of new dataframes

 <br />
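
Note on the key priority listed in the README hunk above: the three rules can be expressed as a single sort key. This is a minimal sketch under the assumption that each candidate key is a tuple of column names. `choose_key` and the sample columns are hypothetical and do not reproduce AutoNormalize's internal implementation.

```python
def choose_key(candidates, columns):
    """Pick one candidate key using the three-step priority from the README.

    `candidates` is a list of tuples of column names and `columns` is the
    table's column order. Hypothetical helper, not AutoNormalize's own code.
    """
    def priority(key):
        return (
            len(key),                                       # 1. fewest attributes
            not any("id" in attr.lower() for attr in key),  # 2. "id" in a name wins
            min(columns.index(attr) for attr in key),       # 3. leftmost attribute
        )

    # Tuples compare element by element, so earlier rules dominate later ones.
    return min(candidates, key=priority)


columns = ["name", "customer_id", "email", "city"]
candidates = [("name", "city"), ("email",), ("customer_id",)]
chosen = choose_key(candidates, columns)
```

Here `("customer_id",)` is chosen: it ties with `("email",)` on length and wins on the "id" rule, while the two-column candidate loses on the first rule.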

@@ -98,25 +98,25 @@ Creates a normalized EntitySet from dataframe based on the dependencies given. K

 **Returns:**

-* `entityset` (ft.EntitySet) : created EntitySet
+- `entityset` (ft.EntitySet) : created EntitySet

 <br />

-### `normalize_entity`
+### `normalize_entityset`

 ```shell
-normalize_entity(es, accuracy=0.98)
+normalize_entityset(es, accuracy=0.98)
 ```

 Returns a new normalized `EntitySet` from an `EntitySet` with a single entity.

 **Arguments:**

-* `es` (ft.EntitySet) : EntitySet with a single entity to normalize
+- `es` (ft.EntitySet) : EntitySet with a single entity to normalize

 **Returns:**

-* `new_es` (ft.EntitySet) : new normalized EntitySet
+- `new_es` (ft.EntitySet) : new normalized EntitySet

 <br />

34 changes: 21 additions & 13 deletions autonormalize/autonormalize.py
@@ -85,24 +85,31 @@ def make_entityset(df, dependencies, name=None, time_index=None):
     normalize.normalize_dataframe(depdf)
     normalize.make_indexes(depdf)

-    entities = {}
+    dataframes = {}
     relationships = []

     stack = [depdf]

     while stack != []:
         current = stack.pop()
+        if (current.df.ww.schema is None):
+            current.df.ww.init(index=current.index[0], name=current.index[0])
+
+        current_df_name = current.df.ww.name
         if time_index in current.df.columns:
-            entities[current.index[0]] = (current.df, current.index[0], time_index)
+            dataframes[current_df_name] = (current.df, current.index[0], time_index)
         else:
-            entities[current.index[0]] = (current.df, current.index[0])
+            dataframes[current_df_name] = (current.df, current.index[0])
         for child in current.children:
+            if (child.df.ww.schema is None):
+                child.df.ww.init(index=child.index[0], name=child.index[0])
+            child_df_name = child.df.ww.name
             # add to stack
             # add relationship
             stack.append(child)
-            relationships.append((child.index[0], child.index[0], current.index[0], child.index[0]))
+            relationships.append((child_df_name, child.index[0], current_df_name, child.index[0]))

-    return ft.EntitySet(name, entities, relationships)
+    return ft.EntitySet(name, dataframes, relationships)


 def auto_entityset(df, accuracy=0.98, index=None, name=None, time_index=None):
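
Note on the `make_entityset` hunk above: the loop is a stack-based depth-first walk over the tree of normalized tables, recording one relationship tuple per parent/child edge. The sketch below reproduces only that traversal with a hypothetical `Node` class standing in for the library's dependency-tree objects, so it runs without Featuretools or Woodwork. The tuple order mirrors the one built in the diff.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for one normalized table in the dependency tree."""
    name: str
    index: str
    children: list = field(default_factory=list)


def collect(root):
    """Walk the tree depth-first, returning table names and relationship tuples."""
    names = []
    relationships = []
    stack = [root]
    while stack:
        current = stack.pop()
        names.append(current.name)
        for child in current.children:
            stack.append(child)
            # Same order as in the diff:
            # (child table name, child index, current table name, child index)
            relationships.append((child.name, child.index, current.name, child.index))
    return names, relationships


root = Node(
    "transaction_id",
    "transaction_id",
    [
        Node("session_id", "session_id", [Node("customer_id", "customer_id")]),
        Node("product_id", "product_id"),
    ],
)
names, relationships = collect(root)
```

The sample tree is modeled on the three relationships asserted in `test_ft_mock_customer`, so `relationships` ends up with one tuple for each of them.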
@@ -141,9 +148,9 @@ def auto_normalize(df):
     return normalize_dataframe(df, find_dependencies(df))


-def normalize_entity(es, accuracy=0.98):
+def normalize_entityset(es, accuracy=0.98):
     """
-    Returns a new normalized EntitySet from an EntitySet with a single entity.
+    Returns a new normalized EntitySet from an EntitySet with a single dataframe.

     Arguments:
         es (ft.EntitySet) : EntitySet to normalize
@@ -152,13 +159,14 @@ def normalize_entity(es, accuracy=0.98):
     Returns:
         new_es (ft.EntitySet) : new normalized EntitySet
     """
-    # TO DO: add option to pass an EntitySet with more than one entity, and specify which one
+    # TO DO: add option to pass an EntitySet with more than one dataframe, and specify which one
     # to normalize while preserving existing relationships

-    if len(es.entities) > 1:
-        raise ValueError('There is more than one entity in this EntitySet')
-    if len(es.entities) == 0:
+    if len(es.dataframes) > 1:
+        raise ValueError('There is more than one dataframe in this EntitySet')
+    if len(es.dataframes) == 0:
         raise ValueError('This EntitySet is empty')
-    entity = es.entities[0]
-    new_es = auto_entityset(entity.df, accuracy, index=entity.index, name=es.id, time_index=entity.time_index)
+
+    df = es.dataframes[0]
+    new_es = auto_entityset(df, accuracy, index=df.ww.index, name=es.id, time_index=df.ww.time_index)
     return new_es
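
Note on the rename in the hunk above: because `normalize_entity` disappears from the public API, callers of the old name will break. One common way to soften such a rename is a thin alias that warns and forwards. The sketch below is hypothetical and is not part of this PR; the `normalize_entityset` body here is only a placeholder so the example runs on its own.

```python
import warnings


def normalize_entityset(es, accuracy=0.98):
    """Placeholder standing in for the renamed function in this PR."""
    return ("normalized", es, accuracy)


def normalize_entity(es, accuracy=0.98):
    """Hypothetical backward-compatible alias; not part of this PR."""
    warnings.warn(
        "normalize_entity is deprecated; use normalize_entityset instead",
        DeprecationWarning,
        stacklevel=2,  # point the warning at the caller, not at this wrapper
    )
    return normalize_entityset(es, accuracy)
```

Whether to ship such an alias or simply document the rename as a breaking change is a project decision; the PR as shown takes the second route.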
30 changes: 30 additions & 0 deletions autonormalize/tests/test_example.py
@@ -1,5 +1,8 @@
 import featuretools as ft
 import pandas as pd
+from unittest.mock import patch
+
+import pytest
 import autonormalize as an


@@ -21,3 +24,30 @@ def test_ft_mock_customer():
     assert set([str(rel) for rel in entityset.relationships]) == set(['<Relationship: transaction_id.session_id -> session_id.session_id>',
                                                                       '<Relationship: transaction_id.product_id -> product_id.product_id>',
                                                                       '<Relationship: session_id.customer_id -> customer_id.customer_id>'])
+
+
+@patch("autonormalize.autonormalize.auto_entityset")
+def test_normalize_entityset(auto_entityset):
+    df1 = pd.DataFrame({"test": [0, 1, 2]})
+    df2 = pd.DataFrame({"test": [0, 1, 2]})
+    accuracy = 0.98
+
+    es = ft.EntitySet()
+
+    error = "This EntitySet is empty"
+    with pytest.raises(ValueError, match=error):
+        an.normalize_entityset(es, accuracy)
+
+    es.add_dataframe(df1, "df")
+
+    df_out = es.dataframes[0]
+
+    an.normalize_entityset(es, accuracy)
+
+    auto_entityset.assert_called_with(df_out, accuracy, index=df_out.ww.index, name=es.id, time_index=df_out.ww.time_index)
+
+    es.add_dataframe(df2, "df2")
+
+    error = "There is more than one dataframe in this EntitySet"
+    with pytest.raises(ValueError, match=error):
+        an.normalize_entityset(es, accuracy)
6 changes: 3 additions & 3 deletions dev-requirements.txt
@@ -3,9 +3,9 @@ codecov==2.1.8
 flake8==3.7.8
 autopep8==1.4.4
 isort==4.3.21
-nbsphinx==0.8.5
-pydata-sphinx-theme==0.4.0
-Sphinx==3.2.1
+nbsphinx==0.8.7
+pydata-sphinx-theme==0.7.1
+Sphinx==4.2.0
 nbconvert==6.0.2
 ipython==7.16.3
 pygments==2.8.1
2 changes: 1 addition & 1 deletion docs/source/api_reference.rst
@@ -16,7 +16,7 @@ Autonormalize
 make_entityset
 auto_entityset
 auto_normalize
-normalize_entity
+normalize_entityset

 Dependencies
 ======================
23 changes: 13 additions & 10 deletions docs/source/release_notes.rst
@@ -3,34 +3,37 @@
 Release Notes
 -------------

-.. Future Release
-==============
+Future Release
+==============
 * Enhancements
 * Fixes
+* Fix compatibility issues with featuretools (:pr:`41`)
 * Changes
+* Rename `normalize_entity` to `normalize_entityset` (:pr:`41`)
Review comment (Contributor):

Since this is a breaking change to the API, we have typically included a Breaking Changes note in the release notes. Here is an example: https://github.com/alteryx/woodwork/blob/main/docs/source/release_notes.rst#breaking-changes

 * Documentation Changes
 * Testing Changes

-.. Thanks to the following people for contributing to this release:
+Thanks to the following people for contributing to this release:
+:user:`dvreed77`

 v1.0.1 Jan 7, 2022
 ==================
 * Documentation Changes
-* Update release notes and release format (:pr:`37`)
-* Updated sphinx documentation and guides (:pr:`35`)
+* Update release notes and release format (:pr:`37`)
+* Updated sphinx documentation and guides (:pr:`35`)
 * Testing Changes
-* Updated tests to work with featuretools 1.0 (:pr:`35`)
+* Updated tests to work with featuretools 1.0 (:pr:`35`)

-Thanks to the following people for contributing to this release:
-:user:`gsheni`, :user:`tuethan1999`
+Thanks to the following people for contributing to this release:
+:user:`gsheni`, :user:`tuethan1999`


 v1.0.0 Aug 15, 2019
 ===================
 * Initial Release

-Thanks to the following people for contributing to this release:
-:user:`allisonportis`
+Thanks to the following people for contributing to this release:
+:user:`allisonportis`

 .. command
 .. git log --pretty=oneline --abbrev-commit