diff --git a/README.md b/README.md
index ce4e4cc..07fbaa1 100755
--- a/README.md
+++ b/README.md
@@ -8,9 +8,9 @@ AutoNormalize is a Python library for automated datatable normalization. It allo
 
 ## Getting Started
 
-* [Install](#install)
-* [Demos](#demos)
-* [API Reference](#api-reference)
+- [Install](#install)
+- [Demos](#demos)
+- [API Reference](#api-reference)
 
 ## Install
 
@@ -26,11 +26,11 @@ pip uninstall autonormalize
 
 ## Demos
 
-* [Blog Post](https://blog.featurelabs.com/automatic-dataset-normalization-for-feature-engineering-in-python/)
-* [Machine Learning Demo with Featuretools](https://github.com/FeatureLabs/autonormalize/blob/master/autonormalize/demos/AutoNormalize%20%2B%20FeatureTools%20Demo.ipynb)
-* [Kaggle Liquor Sales Dataset Demo](https://github.com/FeatureLabs/autonormalize/blob/master/autonormalize/demos/Kaggle%20Liquor%20Sales%20Dataset%20Demo.ipynb)
-* [Demo with Editing Dependencies](https://github.com/FeatureLabs/autonormalize/blob/master/autonormalize/demos/Editing%20Dependnecies%20Demo.ipynb)
-* [Kaggle Food Production Dataset Demo](https://github.com/FeatureLabs/autonormalize/blob/master/autonormalize/demos/Kaggle%20Food%20%20Dataset%20Demo.ipynb)
+- [Blog Post](https://blog.featurelabs.com/automatic-dataset-normalization-for-feature-engineering-in-python/)
+- [Machine Learning Demo with Featuretools](https://github.com/FeatureLabs/autonormalize/blob/master/autonormalize/demos/AutoNormalize%20%2B%20FeatureTools%20Demo.ipynb)
+- [Kaggle Liquor Sales Dataset Demo](https://github.com/FeatureLabs/autonormalize/blob/master/autonormalize/demos/Kaggle%20Liquor%20Sales%20Dataset%20Demo.ipynb)
+- [Demo with Editing Dependencies](https://github.com/FeatureLabs/autonormalize/blob/master/autonormalize/demos/Editing%20Dependnecies%20Demo.ipynb)
+- [Kaggle Food Production Dataset Demo](https://github.com/FeatureLabs/autonormalize/blob/master/autonormalize/demos/Kaggle%20Food%20%20Dataset%20Demo.ipynb)
 
 ## API Reference
 
@@ -44,19 +44,19 @@ Creates a normalized entityset from a dataframe.
 
 **Arguments:**
 
-* `df` (pd.Dataframe) : the dataframe containing data
+- `df` (pd.DataFrame) : the dataframe containing data
 
-* `accuracy` (0 < float <= 1.00; default = 0.98) : the accuracy threshold required in order to conclude a dependency (i.e. with accuracy = 0.98, 0.98 of the rows must hold true the dependency LHS --> RHS)
+- `accuracy` (0 < float <= 1.00; default = 0.98) : the accuracy threshold required in order to conclude a dependency (i.e. with accuracy = 0.98, 0.98 of the rows must hold true the dependency LHS --> RHS)
 
-* `index` (str, optional) : name of column that is intended index of df
+- `index` (str, optional) : name of column that is intended index of df
 
-* `name` (str, optional) : the name of created EntitySet
+- `name` (str, optional) : the name of created EntitySet
 
-* `time_index` (str, optional) : name of time column in the dataframe.
+- `time_index` (str, optional) : name of time column in the dataframe.
 
 **Returns:**
 
-* `entityset` (ft.EntitySet) : created entity set
+- `entityset` (ft.EntitySet) : created entity set
 
 ### `find_dependencies`
 
@@ -68,7 +68,7 @@ Finds dependencies within dataframe with the DFD search algorithm.
 
 **Returns:**
 
-* `dependencies` (Dependencies) : the dependencies found in the data within the contraints provided
+- `dependencies` (Dependencies) : the dependencies found in the data within the constraints provided
 
 ### `normalize_dataframe`
 
@@ -78,13 +78,13 @@ normalize_dataframe(df, dependencies)
 
 Normalizes dataframe based on the dependencies given. Keys for the newly created DataFrames can only be columns that are strings, ints, or categories. Keys are chosen according to the priority:
 
-1) shortest lenghts
-2) has "id" in some form in the name of an attribute
-3) has attribute furthest to left in the table
+1. shortest lengths
+2. has "id" in some form in the name of an attribute
+3. has attribute furthest to left in the table
 
 **Returns:**
 
-* `new_dfs` (list[pd.DataFrame]) : list of new dataframes
+- `new_dfs` (list[pd.DataFrame]) : list of new dataframes
 
 
 
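Review note (not part of the patch): the three key-priority rules in the hunk above can be read as a lexicographic sort. The sketch below illustrates that reading only and is not the library's implementation. `choose_key` and its arguments are hypothetical names, "shortest" is assumed to mean fewest attributes, and the `"id"` rule is assumed to be a case-insensitive substring test.

```python
def choose_key(candidates, column_order):
    """Pick one candidate key using the README's three tie-breakers.

    candidates   -- list of candidate keys, each a list of column names
    column_order -- all column names, in left-to-right table order
    """
    position = {name: i for i, name in enumerate(column_order)}

    def rank(key):
        has_id = any("id" in attr.lower() for attr in key)
        leftmost = min(position[attr] for attr in key)
        # Tuples compare element by element, so earlier entries dominate:
        # 1) fewer attributes, 2) contains "id", 3) furthest to the left.
        return (len(key), not has_id, leftmost)

    return min(candidates, key=rank)


columns = ["name", "customer_id", "email", "zip"]
candidates = [["name", "zip"], ["email"], ["customer_id"]]
chosen = choose_key(candidates, columns)
print(chosen)
```

Here `["customer_id"]` and `["email"]` tie on length, and the `"id"` rule breaks the tie. A plain substring test would also match names such as `valid`, so a real implementation may need a stricter check.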
@@ -98,25 +98,25 @@ Creates a normalized EntitySet from dataframe based on the dependencies given. K
 
 **Returns:**
 
-* `entityset` (ft.EntitySet) : created EntitySet
+- `entityset` (ft.EntitySet) : created EntitySet
 
 
 
-### `normalize_entity`
+### `normalize_entityset`
 
 ```shell
-normalize_entity(es, accuracy=0.98)
+normalize_entityset(es, accuracy=0.98)
 ```
 
 Returns a new normalized `EntitySet` from an `EntitySet` with a single entity.
 
 **Arguments:**
 
-* `es` (ft.EntitySet) : EntitySet with a single entity to normalize
+- `es` (ft.EntitySet) : EntitySet with a single entity to normalize
 
 **Returns:**
 
-* `new_es` (ft.EntitySet) : new normalized EntitySet
+- `new_es` (ft.EntitySet) : new normalized EntitySet
 
 
 
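Review note (not part of the patch): `normalize_entityset` forwards `accuracy=0.98`, which the README describes as the share of rows for which LHS --> RHS must hold. The sketch below shows one common way to score such an approximate dependency in plain Python. Whether the library's DFD search computes exactly this measure is an assumption, and `dependency_accuracy` and `holds` are hypothetical helper names.

```python
from collections import Counter, defaultdict


def dependency_accuracy(rows, lhs, rhs):
    """Share of rows consistent with lhs --> rhs.

    For each distinct LHS value, the most frequent RHS value is treated as
    the expected one; rows carrying any other RHS value count as violations.
    """
    if not rows:
        raise ValueError("rows must not be empty")
    groups = defaultdict(Counter)
    for row in rows:
        key = tuple(row[col] for col in lhs)
        groups[key][row[rhs]] += 1
    kept = sum(counter.most_common(1)[0][1] for counter in groups.values())
    return kept / len(rows)


def holds(rows, lhs, rhs, accuracy=0.98):
    """True when the dependency reaches the accuracy threshold."""
    return dependency_accuracy(rows, lhs, rhs) >= accuracy


rows = [
    {"zip": "02139", "city": "Cambridge"},
    {"zip": "02139", "city": "Cambridge"},
    {"zip": "02139", "city": "Cambrige"},  # misspelling breaks the dependency
    {"zip": "10001", "city": "New York"},
]
score = dependency_accuracy(rows, ["zip"], "city")
print(score, holds(rows, ["zip"], "city"))
```

With three of four rows consistent, `zip --> city` scores 0.75 and is rejected at the default threshold of 0.98.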
diff --git a/autonormalize/autonormalize.py b/autonormalize/autonormalize.py
index 244bbc9..278b315 100644
--- a/autonormalize/autonormalize.py
+++ b/autonormalize/autonormalize.py
@@ -85,24 +85,31 @@ def make_entityset(df, dependencies, name=None, time_index=None):
     normalize.normalize_dataframe(depdf)
     normalize.make_indexes(depdf)
 
-    entities = {}
+    dataframes = {}
     relationships = []
 
     stack = [depdf]
 
     while stack != []:
         current = stack.pop()
+        if (current.df.ww.schema is None):
+            current.df.ww.init(index=current.index[0], name=current.index[0])
+
+        current_df_name = current.df.ww.name
         if time_index in current.df.columns:
-            entities[current.index[0]] = (current.df, current.index[0], time_index)
+            dataframes[current_df_name] = (current.df, current.index[0], time_index)
         else:
-            entities[current.index[0]] = (current.df, current.index[0])
+            dataframes[current_df_name] = (current.df, current.index[0])
         for child in current.children:
+            if (child.df.ww.schema is None):
+                child.df.ww.init(index=child.index[0], name=child.index[0])
+            child_df_name = child.df.ww.name
             # add to stack
             # add relationship
             stack.append(child)
-            relationships.append((child.index[0], child.index[0], current.index[0], child.index[0]))
+            relationships.append((child_df_name, child.index[0], current_df_name, child.index[0]))
 
-    return ft.EntitySet(name, entities, relationships)
+    return ft.EntitySet(name, dataframes, relationships)
 
 
 def auto_entityset(df, accuracy=0.98, index=None, name=None, time_index=None):
@@ -141,9 +148,9 @@ def auto_normalize(df):
     return normalize_dataframe(df, find_dependencies(df))
 
 
-def normalize_entity(es, accuracy=0.98):
+def normalize_entityset(es, accuracy=0.98):
     """
-    Returns a new normalized EntitySet from an EntitySet with a single entity.
+    Returns a new normalized EntitySet from an EntitySet with a single dataframe.
 
     Arguments:
         es (ft.EntitySet) : EntitySet to normalize
@@ -152,13 +159,14 @@ def normalize_entity(es, accuracy=0.98):
     Returns:
         new_es (ft.EntitySet) : new normalized EntitySet
     """
-    # TO DO: add option to pass an EntitySet with more than one entity, and specify which one
+    # TO DO: add option to pass an EntitySet with more than one dataframe, and specify which one
     # to normalize while preserving existing relationships
-    if len(es.entities) > 1:
-        raise ValueError('There is more than one entity in this EntitySet')
-    if len(es.entities) == 0:
+    if len(es.dataframes) > 1:
+        raise ValueError('There is more than one dataframe in this EntitySet')
+    if len(es.dataframes) == 0:
         raise ValueError('This EntitySet is empty')
-    entity = es.entities[0]
-    new_es = auto_entityset(entity.df, accuracy, index=entity.index, name=es.id, time_index=entity.time_index)
+
+    df = es.dataframes[0]
+    new_es = auto_entityset(df, accuracy, index=df.ww.index, name=es.id, time_index=df.ww.time_index)
 
     return new_es
diff --git a/autonormalize/tests/test_example.py b/autonormalize/tests/test_example.py
index d8664fd..ac42a63 100644
--- a/autonormalize/tests/test_example.py
+++ b/autonormalize/tests/test_example.py
@@ -1,5 +1,8 @@
 import featuretools as ft
 
+import pandas as pd
+from unittest.mock import patch
+import pytest
 import autonormalize as an
 
 
@@ -21,3 +24,30 @@ def test_ft_mock_customer():
     assert set([str(rel) for rel in entityset.relationships]) == set([' session_id.session_id>',
                                                                       ' product_id.product_id>',
                                                                       ' customer_id.customer_id>'])
+
+
+@patch("autonormalize.autonormalize.auto_entityset")
+def test_normalize_entityset(auto_entityset):
+    df1 = pd.DataFrame({"test": [0, 1, 2]})
+    df2 = pd.DataFrame({"test": [0, 1, 2]})
+    accuracy = 0.98
+
+    es = ft.EntitySet()
+
+    error = "This EntitySet is empty"
+    with pytest.raises(ValueError, match=error):
+        an.normalize_entityset(es, accuracy)
+
+    es.add_dataframe(df1, "df")
+
+    df_out = es.dataframes[0]
+
+    an.normalize_entityset(es, accuracy)
+
+    auto_entityset.assert_called_with(df_out, accuracy, index=df_out.ww.index, name=es.id, time_index=df_out.ww.time_index)
+
+    es.add_dataframe(df2, "df2")
+
+    error = "There is more than one dataframe in this EntitySet"
+    with pytest.raises(ValueError, match=error):
+        an.normalize_entityset(es, accuracy)
diff --git a/dev-requirements.txt b/dev-requirements.txt
index f26d883..2940ec9 100644
--- a/dev-requirements.txt
+++ b/dev-requirements.txt
@@ -3,9 +3,9 @@ codecov==2.1.8
 flake8==3.7.8
 autopep8==1.4.4
 isort==4.3.21
-nbsphinx==0.8.5
-pydata-sphinx-theme==0.4.0
-Sphinx==3.2.1
+nbsphinx==0.8.7
+pydata-sphinx-theme==0.7.1
+Sphinx==4.2.0
 nbconvert==6.0.2
 ipython==7.16.3
 pygments==2.8.1
diff --git a/docs/source/api_reference.rst b/docs/source/api_reference.rst
index 270fd48..88c3af9 100755
--- a/docs/source/api_reference.rst
+++ b/docs/source/api_reference.rst
@@ -16,7 +16,7 @@ Autonormalize
     make_entityset
     auto_entityset
     auto_normalize
-    normalize_entity
+    normalize_entityset
 
 Dependencies
 ======================
diff --git a/docs/source/release_notes.rst b/docs/source/release_notes.rst
index 11edc63..445182f 100755
--- a/docs/source/release_notes.rst
+++ b/docs/source/release_notes.rst
@@ -3,34 +3,41 @@
 Release Notes
 -------------
 
-.. Future Release
-    ==============
+Future Release
+==============
     * Enhancements
     * Fixes
+        * Fix compatibility issues with featuretools (:pr:`41`)
     * Changes
+        * Rename ``normalize_entity`` to ``normalize_entityset`` (:pr:`41`)
     * Documentation Changes
     * Testing Changes
 
-.. Thanks to the following people for contributing to this release:
+    Thanks to the following people for contributing to this release:
+    :user:`dvreed77`
+
+Breaking Changes
+++++++++++++++++
+    * :pr:`41`: The function ``normalize_entity`` has been renamed to ``normalize_entityset``.
 
 
 v1.0.1 Jan 7, 2022
 ==================
     * Documentation Changes
-      * Update release notes and release format (:pr:`37`)
-      * Updated sphinx documentation and guides (:pr:`35`)
+        * Update release notes and release format (:pr:`37`)
+        * Updated sphinx documentation and guides (:pr:`35`)
     * Testing Changes
-      * Updated tests to work with featuretools 1.0 (:pr:`35`)
+        * Updated tests to work with featuretools 1.0 (:pr:`35`)
 
-  Thanks to the following people for contributing to this release:
-  :user:`gsheni`, :user:`tuethan1999`
+    Thanks to the following people for contributing to this release:
+    :user:`gsheni`, :user:`tuethan1999`
 
 v1.0.0 Aug 15, 2019
 ===================
     * Initial Release
 
-  Thanks to the following people for contributing to this release:
-  :user:`allisonportis`
+    Thanks to the following people for contributing to this release:
+    :user:`allisonportis`
 
 .. command
 .. git log --pretty=oneline --abbrev-commit
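Review note (not part of the patch): the `make_entityset` change keys both `dataframes` and `relationships` by the Woodwork dataframe name (`df.ww.name`) instead of `index[0]`. The sketch below replays the same stack-based walk on a hypothetical `Node` stand-in, with no featuretools or Woodwork involved, to show what the loop collects. `Node`, `collect`, and the sample table names are illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for one node of the normalized dependency tree."""
    name: str    # plays the role of `df.ww.name` in the diff
    index: str   # plays the role of `index[0]` in the diff
    children: list = field(default_factory=list)


def collect(root):
    """Walk the tree with an explicit stack, as the loop in the diff does."""
    dataframes = {}
    relationships = []
    stack = [root]
    while stack:
        current = stack.pop()
        dataframes[current.name] = current.index
        for child in current.children:
            stack.append(child)
            # Same tuple layout as the diff: names identify the dataframes,
            # and the child's index column is the linking column on both sides.
            relationships.append((child.name, child.index, current.name, child.index))
    return dataframes, relationships


customers = Node("customers", "customer_id")
sessions = Node("sessions", "session_id", [customers])
products = Node("products", "product_id")
root = Node("transactions", "transaction_id", [sessions, products])

dataframes, relationships = collect(root)
print(sorted(dataframes))
print(relationships)
```

Because the patch initializes an untyped dataframe with `name=index[0]`, name and index coincide in that case. Keeping them as separate fields in the sketch makes it visible which slot of each tuple is a dataframe name and which is a column.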