Add validate method to synthesizer #1014

Closed
amontanez24 opened this issue Sep 16, 2022 · 0 comments · Fixed by #1027
amontanez24 commented Sep 16, 2022

Problem Description

As a user, it would be helpful if I could check whether my data is valid according to my metadata.

Expected behavior

  • Add a validate method to the BaseSynthesizer
  • If the method finds any errors, it should raise an InvalidDataError in the following format:
synthesizer.validate(data)
InvalidDataError: The provided data does not match the metadata

Error: Invalid values found for numerical column 'age': ('a', 'b', 'c', +more)
  • The method should perform the following checks:
    • Columns described as sdtype=numerical should only have data that is numerical (or missing) – strings and other non-numerical values are not allowed
      Error: Invalid values found for numerical column 'age': ('a', 'b', 'c', +more)
    • Columns described as sdtype=datetime should only have data that is datetime or missing – try to convert everything to the datetime format and raise an error if the conversion fails
      Error: Invalid values found for datetime column 'start_date': (0.0, 30.0, 4.4, +more)
    • Columns described as sdtype=boolean should only have values that are: True, False or missing
      Error: Invalid values found for boolean column 'is_subscribed': (0.0, 30.0, 4.4, +more)
    • Columns marked as a key (primary, alternate, sequence, or foreign) should not have any missing values
      Error: Key column 'user_id' contains missing values
    • Columns marked as primary or alternate keys should be unique in the table
      Error: Primary key column 'user_id' contains repeating values: ('UID_000', 'UID_001', 'UID_002', +more)
    • (Sequential only) Context columns (stored in the model's parameters) should be fixed for each sequence – i.e., if you group by the sequence key, the context columns should not vary within a group
      Error: Context column 'patient_address' is changing inside sequence ('Patient_ID'='ID_004').

Additional context

  • For the sequential case, we should override the method in the PARSynthesizer
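
For the sequential check, a minimal sketch of the group-by logic is shown below. The function name find_changing_context and its arguments are hypothetical. An overriding validate method could run the base checks first and then append these messages, but how PARSynthesizer actually stores its context columns is not shown in this issue.

```python
import pandas as pd


def find_changing_context(data, sequence_key, context_columns):
    """Return one error message per (context column, sequence) that varies.

    Hypothetical helper: a context column is valid when it holds a single
    distinct value inside every group of rows sharing the same sequence key.
    """
    errors = []
    for column in context_columns:
        # dropna=False counts a missing value as a distinct value, so a
        # sequence mixing a real value with NaN is also reported.
        distinct = data.groupby(sequence_key)[column].nunique(dropna=False)
        for sequence_id in distinct[distinct > 1].index:
            errors.append(
                f"Context column '{column}' is changing inside sequence "
                f"('{sequence_key}'={sequence_id!r})."
            )
    return errors


if __name__ == '__main__':
    visits = pd.DataFrame({
        'Patient_ID': ['ID_003', 'ID_003', 'ID_004', 'ID_004'],
        'patient_address': ['1 Main St', '1 Main St', '5 Oak Ave', '9 Elm Rd'],
    })
    for message in find_changing_context(visits, 'Patient_ID', ['patient_address']):
        print(f'Error: {message}')
```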
@amontanez24 amontanez24 added the feature request Request for a new feature label Sep 16, 2022
@amontanez24 amontanez24 added this to the 1.0.0 milestone Sep 16, 2022