Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add validate to BaseMultiTableSynthesizer #1092

Merged
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
17 changes: 11 additions & 6 deletions sdv/multi_table/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -105,7 +105,8 @@ def _validate_foreign_keys(self, data):

errors.append(
f"Error: foreign key column '{relation['child_foreign_key']}' contains "
f'unknown references: {message}'
f'unknown references: {message} All the values in this column must '
amontanez24 marked this conversation as resolved.
Show resolved Hide resolved
'reference a primary key.'
)
if errors:
error_msg = 'Relationships:\n'
Expand Down Expand Up @@ -142,12 +143,16 @@ def validate(self, data):
self._table_synthesizers[table_name].validate(table_data)

except (InvalidDataError, ValueError) as error:
error_msg = str(error)
error_msg = error_msg.replace(
'The provided data does not match the metadata:',
f"Table: '{table_name}'"
)
if isinstance(error, InvalidDataError):
error_msg = f"Table: '{table_name}'"
for _error in error.errors:
error_msg += f'\nError: {_error}'

else:
error_msg = str(error)
amontanez24 marked this conversation as resolved.
Show resolved Hide resolved

errors.append(error_msg)

except KeyError:
continue

Expand Down
16 changes: 10 additions & 6 deletions tests/unit/multi_table/test_base.py
Original file line number Diff line number Diff line change
Expand Up @@ -166,7 +166,7 @@ def test__validate_foreign_keys(self):
result = instance._validate_foreign_keys(data)

# Assert
assert result == None
assert result is None

def test__validate_foreign_keys_missing_keys(self):
"""Test that errors are being returned.
Expand Down Expand Up @@ -198,8 +198,10 @@ def test__validate_foreign_keys_missing_keys(self):
missing_upravna_enota = (
'Relationships:\n'
"Error: foreign key column 'upravna_enota' contains unknown references: "
'(10, 11, 12, 13, 14, + more).\n'
'(10, 11, 12, 13, 14, + more). '
'All the values in this column must reference a primary key.\n'
"Error: foreign key column 'id_nesreca' contains unknown references: (1, 3, 5, 7, 9)."
' All the values in this column must reference a primary key.'
)
assert result == missing_upravna_enota

Expand Down Expand Up @@ -300,12 +302,13 @@ def test_validate_data_does_not_match(self):
error_msg = re.escape(
'The provided data does not match the metadata:\n'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This error message is a little weird. Maybe the colons at the end are supposed to be dots?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This message comes from the InvalidDataError 🤔 I'm not sure if a new class is desired to be created with a different message. @amontanez24 any suggestions on that?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's helpful to have the label for each individual table since that helps the user know where to look. Maybe we can get rid of the top line. @npatki do you have thoughts on this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this for validating data against the metadata? The idea was to collect and print all the errors at once, which is why there was probably a colon?

>>> synthesizer.validate(data)
InvalidDataError: The provided data does not match the metadata

Error: Invalid values found for numerical column 'age': ('a', 'b', 'c', +more)
Error: Key column 'user_id' contains missing values
Error: Foreign key column 'purchaser_id' contains unknown references: ('Unknown', 'USER_999', 'ZZZ', +more). All the values in must reference a primary key.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The top line is for the overall error we throw (afaik, Python only allows you to throw 1 error at a time). The rest of the rows are the message.

Breaking up by table makes sense for cleanly summarizing where the errors happen. This could be similar to what we do in metadata.validate(): We split it up by table and then add a section for relationships.

>>> synthesizer.validate(data)
InvalidDataError: The provided data does not match the metadata

Table: 'sessions'
Error: Invalid values found for numerical column 'age': ('a', 'b', 'c', +more)

Table: 'users'
Error: Key column 'user_id' contains missing values
...

Relationships:
Error: Foreign key column 'purchaser_id' contains unknown references: ('Unknown', 'USER_999', 'ZZZ', +more). All the values in must reference a primary key.

For context, I'm pasting below what we do for metadata.validate():

InvalidMetadataError: The metadata is not valid

Table: 'users'
Error: Invalid values ("pii") for datetime column "start_date".
Error: A Unique constraint is being applied to column "age". This column is already a key for that table.

Table: 'transactions'
Error: Invalid regex format string "[A-{6}" for text column "transaction_id"
Error: Unknown key value 'ttid'. Keys should be columns that exist in the table.
Error: Invalid increment value (0.5) in a FixedIncrements constraint. Increments must be positive integers.

Relationships:
Error: Relationship between tables ('users', 'transactions') contains an unknown primary key 'userr_id'.
Error: The relationships in the dataset are disjointed. Tables ('sessions') are not connected to any of the other tables.

"Table: 'nesreca'\n"
"Invalid values found for numerical column 'id_nesreca': ['0', '1', '2', '+ 7 more']."
"Error: Invalid values found for numerical column 'id_nesreca': ['0', '1', '2', "
"'+ 7 more']."
"\n\nTable: 'oseba'\n"
"Invalid values found for numerical column 'upravna_enota': ['0', '1', '2', "
"Error: Invalid values found for numerical column 'upravna_enota': ['0', '1', '2', "
"'+ 7 more']."
"\n\nTable: 'upravna_enota'\n"
"Invalid values found for numerical column 'id_upravna_enota': ['0', '1', '2', "
"Error: Invalid values found for numerical column 'id_upravna_enota': ['0', '1', '2', "
"'+ 7 more']."
)
with pytest.raises(InvalidDataError, match=error_msg):
Expand Down Expand Up @@ -334,7 +337,8 @@ def test_validate_missing_foreign_keys(self):
error_msg = re.escape(
'The provided data does not match the metadata:\n'
'Relationships:\n'
"Error: foreign key column 'id_nesreca' contains unknown references: (1, 3, 5, 7, 9)."
"Error: foreign key column 'id_nesreca' contains unknown references: (1, 3, 5, 7, 9). "
'All the values in this column must reference a primary key.'
)
with pytest.raises(InvalidDataError, match=error_msg):
instance.validate(data)