Add validate to `BaseMultiTableSynthesizer` #1092

pvk-developer · 2022-11-04T15:45:39Z

Resolves #1071

codecov-commenter · 2022-11-04T15:52:10Z

Codecov Report

Base: 81.06% // Head: 81.20% // Increases project coverage by +0.13% 🎉

Coverage data is based on head (53424b2) compared to base (3dea525).
Patch coverage: 100.00% of modified lines in pull request are covered.

❗ Current head 53424b2 differs from pull request most recent head bd0f932. Consider uploading reports for the commit bd0f932 to get more accurate results

Additional details and impacted files

@@              Coverage Diff               @@
##           V1.0.0.dev    #1092      +/-   ##
==============================================
+ Coverage       81.06%   81.20%   +0.13%     
==============================================
  Files              57       57              
  Lines            4769     4804      +35     
==============================================
+ Hits             3866     3901      +35     
  Misses            903      903

Impacted Files	Coverage Δ
sdv/multi_table/base.py	`100.00% <100.00%> (ø)`

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

fealho · 2022-11-04T18:08:27Z

sdv/multi_table/base.py

+        """Validate data.
+
+        Args:
+            data (dict):


I'd call it tables. Data makes me think it's just a dataframe

I'm fine with tables, I just followed the issue specification, @amontanez24 green light on the tables rather than data?

@npatki I think this was set to data to keep consistency with the SingleTableSynthesizer method, but since it does take a different type as input maybe it would make sense to change?

I see the same issue with tables though -- it does not imply anything about this being a dictionary. In fact, I'd assume it's a list if I just saw the word.

To me, data does not imply a DataFrame but rather the general concept of any collected data. The nice thing about using data is that whenever we do a demo to future users, etc. it's very clear what the concept is (even if the exact type of argument is not clear)

sdv/multi_table/base.py

tests/unit/multi_table/test_base.py

fealho · 2022-11-04T18:25:33Z

tests/unit/multi_table/test_base.py

+
+        # Run and Assert
+        error_msg = re.escape(
+            'The provided data does not match the metadata:\n'


This error message is a little weird. Maybe the colons at the end are supposed to be dots?

This message comes from the InvalidDataError 🤔 I'm not sure if a new class is desired to be created with a different message. @amontanez24 any suggestions on that?

I think it's helpful to have the label for each individual table since that helps the user know where to look. Maybe we can get rid of the top line. @npatki do you have thoughts on this?

Is this for validating data against the metadata? The idea was to collect and print all the errors at once, which is why there was probably a colon?

>>> synthesizer.validate(data) InvalidDataError: The provided data does not match the metadata Error: Invalid values found for numerical column 'age': ('a', 'b', 'c', +more) Error: Key column 'user_id' contains missing values Error: Foreign key column 'purchaser_id' contains unknown references: ('Unknown', 'USER_999', 'ZZZ', +more). All the values in must reference a primary key.

The top line is for the overall error we throw (afaik, Python only allows you to throw 1 error at a time). The rest of the rows are the message.

Breaking up by table makes sense for cleanly summarizing where the errors happen. This could be similar to what we do in metadata.validate(): We split it up by table and then add a section for relationships.

>>> synthesizer.validate(data) InvalidDataError: The provided data does not match the metadata Table: 'sessions' Error: Invalid values found for numerical column 'age': ('a', 'b', 'c', +more) Table: 'users' Error: Key column 'user_id' contains missing values ... Relationships: Error: Foreign key column 'purchaser_id' contains unknown references: ('Unknown', 'USER_999', 'ZZZ', +more). All the values in must reference a primary key.

For context, I'm pasting below what we do for metadata.validate():

InvalidMetadataError: The metadata is not valid Table: 'users' Error: Invalid values ("pii") for datetime column "start_date". Error: A Unique constraint is being applied to column "age". This column is already a key for that table. Table: 'transactions' Error: Invalid regex format string "[A-{6}" for text column "transaction_id" Error: Unknown key value 'ttid'. Keys should be columns that exist in the table. Error: Invalid increment value (0.5) in a FixedIncrements constraint. Increments must be positive integers. Relationships: Error: Relationship between tables ('users', 'transactions') contains an unknown primary key 'userr_id'. Error: The relationships in the dataset are disjointed. Tables ('sessions') are not connected to any of the other tables.

sdv/multi_table/base.py

amontanez24 · 2022-11-07T20:08:28Z

sdv/multi_table/base.py

+        """Validate data.
+
+        Args:
+            data (dict):


@npatki I think this was set to data to keep consistency with the SingleTableSynthesizer method, but since it does take a different type as input maybe it would make sense to change?

sdv/multi_table/base.py

amontanez24 · 2022-11-07T20:16:17Z

tests/unit/multi_table/test_base.py

+
+        # Run and Assert
+        error_msg = re.escape(
+            'The provided data does not match the metadata:\n'


I think it's helpful to have the label for each individual table since that helps the user know where to look. Maybe we can get rid of the top line. @npatki do you have thoughts on this?

amontanez24

A question and some minor comments but otherwise LGTM!

amontanez24 · 2022-11-09T04:40:49Z

sdv/multi_table/base.py

-        return errors
+            child_table = data.get(relation['child_table_name'])
+            parent_table = data.get(relation['parent_table_name'])
+            if isinstance(child_table, pd.DataFrame) and isinstance(parent_table, pd.DataFrame):


why did we add the check for if they're dataframes?

Before I was raising the error if the check failed on the SingleTableSynthesizer , however, since we want to have all the errors raised at once, I have to check this otherwise it will fail as key error or something else.

sdv/multi_table/base.py

* Add fk validation * Curate message * Finish unit tests * Address comments * Fix error messaging * Address comments

pvk-developer added 3 commits November 3, 2022 19:07

Add fk validation

57cdb06

Curate message

fcec8da

Finish unit tests

53424b2

pvk-developer marked this pull request as ready for review November 4, 2022 17:11

pvk-developer requested a review from a team as a code owner November 4, 2022 17:11

pvk-developer requested review from fealho and amontanez24 and removed request for a team November 4, 2022 17:11

fealho reviewed Nov 4, 2022

View reviewed changes

amontanez24 requested changes Nov 7, 2022

View reviewed changes

pvk-developer added 2 commits November 8, 2022 14:48

Address comments

553b308

Fix error messaging

172a25f

amontanez24 approved these changes Nov 9, 2022

View reviewed changes

Address comments

bd0f932

fealho requested review from fealho and amontanez24 November 9, 2022 19:24

fealho approved these changes Nov 9, 2022

View reviewed changes

pvk-developer merged commit 8501f54 into V1.0.0.dev Nov 9, 2022

pvk-developer deleted the issue-1071-add-validate-to-basemultitablesynthesizer branch November 9, 2022 20:22

amontanez24 pushed a commit that referenced this pull request Nov 22, 2022

Add validate to BaseMultiTableSynthesizer (#1092)

fae3291

* Add fk validation * Curate message * Finish unit tests * Address comments * Fix error messaging * Address comments

amontanez24 pushed a commit that referenced this pull request Dec 21, 2022

Add validate to BaseMultiTableSynthesizer (#1092)

34005e5

* Add fk validation * Curate message * Finish unit tests * Address comments * Fix error messaging * Address comments

amontanez24 pushed a commit that referenced this pull request Feb 17, 2023

Add validate to BaseMultiTableSynthesizer (#1092)

0a0d041

* Add fk validation * Curate message * Finish unit tests * Address comments * Fix error messaging * Address comments

amontanez24 pushed a commit that referenced this pull request Mar 23, 2023

Add validate to BaseMultiTableSynthesizer (#1092)

db1c074

* Add fk validation * Curate message * Finish unit tests * Address comments * Fix error messaging * Address comments

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add validate to `BaseMultiTableSynthesizer` #1092

Add validate to `BaseMultiTableSynthesizer` #1092

pvk-developer commented Nov 4, 2022

codecov-commenter commented Nov 4, 2022 •

edited

Loading

fealho Nov 4, 2022

pvk-developer Nov 4, 2022

amontanez24 Nov 7, 2022

npatki Nov 7, 2022 •

edited

Loading

fealho Nov 4, 2022

pvk-developer Nov 4, 2022

amontanez24 Nov 7, 2022

npatki Nov 7, 2022

npatki Nov 7, 2022

amontanez24 Nov 7, 2022

amontanez24 Nov 7, 2022

amontanez24 left a comment

amontanez24 Nov 9, 2022

pvk-developer Nov 9, 2022

Add validate to BaseMultiTableSynthesizer #1092

Add validate to BaseMultiTableSynthesizer #1092

Conversation

pvk-developer commented Nov 4, 2022

codecov-commenter commented Nov 4, 2022 • edited Loading

Codecov Report

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

npatki Nov 7, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

amontanez24 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Add validate to `BaseMultiTableSynthesizer` #1092

Add validate to `BaseMultiTableSynthesizer` #1092

codecov-commenter commented Nov 4, 2022 •

edited

Loading

npatki Nov 7, 2022 •

edited

Loading