-
Notifications
You must be signed in to change notification settings - Fork 324
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add validate to BaseMultiTableSynthesizer
#1092
Add validate to BaseMultiTableSynthesizer
#1092
Conversation
Codecov ReportBase: 81.06% // Head: 81.20% // Increases project coverage by
Additional details and impacted files@@ Coverage Diff @@
## V1.0.0.dev #1092 +/- ##
==============================================
+ Coverage 81.06% 81.20% +0.13%
==============================================
Files 57 57
Lines 4769 4804 +35
==============================================
+ Hits 3866 3901 +35
Misses 903 903
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
"""Validate data. | ||
|
||
Args: | ||
data (dict): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd call it tables. Data makes me think it's just a dataframe
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with tables, I just followed the issue specification, @amontanez24 green light on the tables
rather than data
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@npatki I think this was set to data
to keep consistency with the SingleTableSynthesizer
method, but since it does take a different type as input maybe it would make sense to change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see the same issue with tables
though -- it does not imply anything about this being a dictionary. In fact, I'd assume it's a list if I just saw the word.
To me, data
does not imply a DataFrame
but rather the general concept of any collected data. The nice thing about using data
is that whenever we do a demo to future users, etc. it's very clear what the concept is (even if the exact type of argument is not clear)
|
||
# Run and Assert | ||
error_msg = re.escape( | ||
'The provided data does not match the metadata:\n' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This error message is a little weird. Maybe the colons at the end are supposed to be dots?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This message comes from the InvalidDataError
🤔 I'm not sure if a new class is desired to be created with a different message. @amontanez24 any suggestions on that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's helpful to have the label for each individual table since that helps the user know where to look. Maybe we can get rid of the top line. @npatki do you have thoughts on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this for validating data against the metadata? The idea was to collect and print all the errors at once, which is why there was probably a colon?
>>> synthesizer.validate(data)
InvalidDataError: The provided data does not match the metadata
Error: Invalid values found for numerical column 'age': ('a', 'b', 'c', +more)
Error: Key column 'user_id' contains missing values
Error: Foreign key column 'purchaser_id' contains unknown references: ('Unknown', 'USER_999', 'ZZZ', +more). All the values in must reference a primary key.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The top line is for the overall error we throw (afaik, Python only allows you to throw 1 error at a time). The rest of the rows are the message.
Breaking up by table makes sense for cleanly summarizing where the errors happen. This could be similar to what we do in metadata.validate()
: We split it up by table and then add a section for relationships.
>>> synthesizer.validate(data)
InvalidDataError: The provided data does not match the metadata
Table: 'sessions'
Error: Invalid values found for numerical column 'age': ('a', 'b', 'c', +more)
Table: 'users'
Error: Key column 'user_id' contains missing values
...
Relationships:
Error: Foreign key column 'purchaser_id' contains unknown references: ('Unknown', 'USER_999', 'ZZZ', +more). All the values in must reference a primary key.
For context, I'm pasting below what we do for metadata.validate()
:
InvalidMetadataError: The metadata is not valid
Table: 'users'
Error: Invalid values ("pii") for datetime column "start_date".
Error: A Unique constraint is being applied to column "age". This column is already a key for that table.
Table: 'transactions'
Error: Invalid regex format string "[A-{6}" for text column "transaction_id"
Error: Unknown key value 'ttid'. Keys should be columns that exist in the table.
Error: Invalid increment value (0.5) in a FixedIncrements constraint. Increments must be positive integers.
Relationships:
Error: Relationship between tables ('users', 'transactions') contains an unknown primary key 'userr_id'.
Error: The relationships in the dataset are disjointed. Tables ('sessions') are not connected to any of the other tables.
"""Validate data. | ||
|
||
Args: | ||
data (dict): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@npatki I think this was set to data
to keep consistency with the SingleTableSynthesizer
method, but since it does take a different type as input maybe it would make sense to change?
|
||
# Run and Assert | ||
error_msg = re.escape( | ||
'The provided data does not match the metadata:\n' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's helpful to have the label for each individual table since that helps the user know where to look. Maybe we can get rid of the top line. @npatki do you have thoughts on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A question and some minor comments but otherwise LGTM!
return errors | ||
child_table = data.get(relation['child_table_name']) | ||
parent_table = data.get(relation['parent_table_name']) | ||
if isinstance(child_table, pd.DataFrame) and isinstance(parent_table, pd.DataFrame): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why did we add the check for if they're dataframes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before I was raising the error if the check failed on the SingleTableSynthesizer
, however, since we want to have all the errors raised at once, I have to check this otherwise it will fail as key error or something else.
* Add fk validation * Curate message * Finish unit tests * Address comments * Fix error messaging * Address comments
* Add fk validation * Curate message * Finish unit tests * Address comments * Fix error messaging * Address comments
* Add fk validation * Curate message * Finish unit tests * Address comments * Fix error messaging * Address comments
* Add fk validation * Curate message * Finish unit tests * Address comments * Fix error messaging * Address comments
Resolves #1071