Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add validate to BaseMultiTableSynthesizer #1092

Merged

Conversation

pvk-developer
Copy link
Member

Resolves #1071

@codecov-commenter
Copy link

codecov-commenter commented Nov 4, 2022

Codecov Report

Base: 81.06% // Head: 81.20% // Increases project coverage by +0.13% 🎉

Coverage data is based on head (53424b2) compared to base (3dea525).
Patch coverage: 100.00% of modified lines in pull request are covered.

❗ Current head 53424b2 differs from pull request most recent head bd0f932. Consider uploading reports for the commit bd0f932 to get more accurate results

Additional details and impacted files
@@              Coverage Diff               @@
##           V1.0.0.dev    #1092      +/-   ##
==============================================
+ Coverage       81.06%   81.20%   +0.13%     
==============================================
  Files              57       57              
  Lines            4769     4804      +35     
==============================================
+ Hits             3866     3901      +35     
  Misses            903      903              
Impacted Files Coverage Δ
sdv/multi_table/base.py 100.00% <100.00%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@pvk-developer pvk-developer marked this pull request as ready for review November 4, 2022 17:11
@pvk-developer pvk-developer requested a review from a team as a code owner November 4, 2022 17:11
@pvk-developer pvk-developer requested review from fealho and amontanez24 and removed request for a team November 4, 2022 17:11
"""Validate data.

Args:
data (dict):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd call it tables. Data makes me think it's just a dataframe

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with tables, I just followed the issue specification, @amontanez24 green light on the tables rather than data?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@npatki I think this was set to data to keep consistency with the SingleTableSynthesizer method, but since it does take a different type as input maybe it would make sense to change?

Copy link
Contributor

@npatki npatki Nov 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the same issue with tables though -- it does not imply anything about this being a dictionary. In fact, I'd assume it's a list if I just saw the word.

To me, data does not imply a DataFrame but rather the general concept of any collected data. The nice thing about using data is that whenever we do a demo to future users, etc. it's very clear what the concept is (even if the exact type of argument is not clear)

sdv/multi_table/base.py Outdated Show resolved Hide resolved
tests/unit/multi_table/test_base.py Outdated Show resolved Hide resolved

# Run and Assert
error_msg = re.escape(
'The provided data does not match the metadata:\n'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This error message is a little weird. Maybe the colons at the end are supposed to be dots?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This message comes from the InvalidDataError 🤔 I'm not sure if a new class is desired to be created with a different message. @amontanez24 any suggestions on that?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's helpful to have the label for each individual table since that helps the user know where to look. Maybe we can get rid of the top line. @npatki do you have thoughts on this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this for validating data against the metadata? The idea was to collect and print all the errors at once, which is why there was probably a colon?

>>> synthesizer.validate(data)
InvalidDataError: The provided data does not match the metadata

Error: Invalid values found for numerical column 'age': ('a', 'b', 'c', +more)
Error: Key column 'user_id' contains missing values
Error: Foreign key column 'purchaser_id' contains unknown references: ('Unknown', 'USER_999', 'ZZZ', +more). All the values in must reference a primary key.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The top line is for the overall error we throw (afaik, Python only allows you to throw 1 error at a time). The rest of the rows are the message.

Breaking up by table makes sense for cleanly summarizing where the errors happen. This could be similar to what we do in metadata.validate(): We split it up by table and then add a section for relationships.

>>> synthesizer.validate(data)
InvalidDataError: The provided data does not match the metadata

Table: 'sessions'
Error: Invalid values found for numerical column 'age': ('a', 'b', 'c', +more)

Table: 'users'
Error: Key column 'user_id' contains missing values
...

Relationships:
Error: Foreign key column 'purchaser_id' contains unknown references: ('Unknown', 'USER_999', 'ZZZ', +more). All the values in must reference a primary key.

For context, I'm pasting below what we do for metadata.validate():

InvalidMetadataError: The metadata is not valid

Table: 'users'
Error: Invalid values ("pii") for datetime column "start_date".
Error: A Unique constraint is being applied to column "age". This column is already a key for that table.

Table: 'transactions'
Error: Invalid regex format string "[A-{6}" for text column "transaction_id"
Error: Unknown key value 'ttid'. Keys should be columns that exist in the table.
Error: Invalid increment value (0.5) in a FixedIncrements constraint. Increments must be positive integers.

Relationships:
Error: Relationship between tables ('users', 'transactions') contains an unknown primary key 'userr_id'.
Error: The relationships in the dataset are disjointed. Tables ('sessions') are not connected to any of the other tables.

sdv/multi_table/base.py Outdated Show resolved Hide resolved
"""Validate data.

Args:
data (dict):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@npatki I think this was set to data to keep consistency with the SingleTableSynthesizer method, but since it does take a different type as input maybe it would make sense to change?

sdv/multi_table/base.py Outdated Show resolved Hide resolved
sdv/multi_table/base.py Outdated Show resolved Hide resolved
sdv/multi_table/base.py Outdated Show resolved Hide resolved

# Run and Assert
error_msg = re.escape(
'The provided data does not match the metadata:\n'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's helpful to have the label for each individual table since that helps the user know where to look. Maybe we can get rid of the top line. @npatki do you have thoughts on this?

Copy link
Contributor

@amontanez24 amontanez24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A question and some minor comments but otherwise LGTM!

return errors
child_table = data.get(relation['child_table_name'])
parent_table = data.get(relation['parent_table_name'])
if isinstance(child_table, pd.DataFrame) and isinstance(parent_table, pd.DataFrame):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did we add the check for if they're dataframes?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before I was raising the error if the check failed on the SingleTableSynthesizer , however, since we want to have all the errors raised at once, I have to check this otherwise it will fail as key error or something else.

sdv/multi_table/base.py Outdated Show resolved Hide resolved
sdv/multi_table/base.py Outdated Show resolved Hide resolved
@fealho fealho requested review from fealho and amontanez24 November 9, 2022 19:24
@pvk-developer pvk-developer merged commit 8501f54 into V1.0.0.dev Nov 9, 2022
@pvk-developer pvk-developer deleted the issue-1071-add-validate-to-basemultitablesynthesizer branch November 9, 2022 20:22
amontanez24 pushed a commit that referenced this pull request Nov 22, 2022
* Add fk validation

* Curate message

* Finish unit tests

* Address comments

* Fix error messaging

* Address comments
amontanez24 pushed a commit that referenced this pull request Dec 21, 2022
* Add fk validation

* Curate message

* Finish unit tests

* Address comments

* Fix error messaging

* Address comments
amontanez24 pushed a commit that referenced this pull request Feb 17, 2023
* Add fk validation

* Curate message

* Finish unit tests

* Address comments

* Fix error messaging

* Address comments
amontanez24 pushed a commit that referenced this pull request Mar 23, 2023
* Add fk validation

* Curate message

* Finish unit tests

* Address comments

* Fix error messaging

* Address comments
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants