
Implement Quality Tests for Transformers #252

Closed
csala opened this issue Sep 24, 2021 · 2 comments
csala (Contributor) commented Sep 24, 2021

We should implement a new type of test that evaluates the quality of the Transformers:

  • compute the score mentioned in Create validate_transformer_quality function #253 by looking at how well the column distributions and the correlations with other columns are captured after using the transformer, as well as how good the data that the SDV models generate when using it is, and
  • make a test that fails if the obtained score is below a predefined threshold (a rough sketch of such a test is given at the end of this comment). We still need to decide how this threshold is chosen.

The details of how this will be achieved and implemented can be discussed below, and this description will be updated once everything is clear.
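
As a rough illustration of the second bullet, the quality test could look like the sketch below. `compute_quality_score` stands in for the function proposed in #253 and is only stubbed here, and the threshold value is a placeholder; how the cutoff is chosen is exactly the open question above.

```python
QUALITY_THRESHOLD = 0.8  # placeholder value; how to choose the cutoff is still open


def compute_quality_score(transformer, data):
    """Stub for the scoring function proposed in #253.

    It would measure how well column distributions and correlations with
    other columns are preserved after applying ``transformer`` to ``data``.
    """
    raise NotImplementedError


def test_transformer_quality(transformer, data):
    # Fail the test whenever the transformer's quality score falls below the cutoff.
    score = compute_quality_score(transformer, data)
    assert score >= QUALITY_THRESHOLD, (
        f'{type(transformer).__name__} scored {score:.3f}, '
        f'below the {QUALITY_THRESHOLD} threshold.'
    )
```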

csala added the internal (The issue doesn't change the API or functionality) and needs discussion labels on Sep 24, 2021
amontanez24 (Contributor) commented Oct 14, 2021

The approach agreed upon is as follows:

  1. We create the test cases by going through a collection of datasets and selecting the ones that have the desired data_types and are under a certain size
  2. Each test case consists of the following:
    • The name of the dataset and table to download
    • The data type to test against
  3. For each test case we do the following:
    • Get all the transformers for the test's data type.
    • For each transformer, transform all of the columns of that type using that transformer.
    • For each numerical column in the data, the transformed data is used as a set of features for a LinearRegression model that will attempt to predict it (a code sketch of this step follows the list).
    • A DataFrame is created with the coefficient of determination obtained for each transformer when trying to predict each numerical column.
  4. For each data type, we create a DataFrame containing columns for the transformer name, dataset name, name of the predicted column and the coefficient of determination found in the previous step.
  5. The tables created above are used to make one final results table. The results table has the following columns:
    • Dataset name
    • Transformer name
    • Transformer's average score for the columns in that dataset (only columns that yielded an average score above a predetermined threshold are included).
    • Transformer's average score for the columns in that dataset relative to the other transformers' average score (again, only columns above the threshold are counted).
  6. Finally, a subtest is done for each transformer to assert that its relative score is higher than a cutoff or, if it is the only transformer for a data type, that its average score is above the cutoff.
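
A minimal sketch of steps 3 and 4, to make the scoring concrete. This is not the code from the PR: the transformers are assumed to expose a scikit-learn-style `fit_transform` on a single column (the real RDT transformer API may differ), and `score_transformers`, `typed_columns` and `numerical_columns` are names introduced only for this example.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression


def score_transformers(data, typed_columns, numerical_columns, transformers,
                       dataset_name):
    """Return one R^2 score per (transformer, predicted numerical column)."""
    records = []
    for transformer in transformers:
        # Step 3: transform every column of the tested data type with this
        # transformer and stack the outputs as a feature matrix.
        features = pd.concat(
            [pd.DataFrame(transformer.fit_transform(data[column]))
             for column in typed_columns],
            axis=1,
        )
        features.columns = range(features.shape[1])

        # Use the transformed columns as features for a LinearRegression model
        # predicting each numerical column, and record the coefficient of
        # determination (R^2) it achieves.
        for target in numerical_columns:
            model = LinearRegression().fit(features, data[target])
            records.append({
                'transformer': type(transformer).__name__,
                'dataset': dataset_name,
                'column': target,
                'score': model.score(features, data[target]),
            })

    # Step 4: one row per (transformer, dataset, predicted column, R^2).
    return pd.DataFrame(records)
```

From a table like this, one way to obtain the relative score in step 5 would be to average each transformer's scores per dataset and divide by the mean of all transformers' averages for that dataset, which is what the final cutoff in step 6 would then be compared against.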

amontanez24 (Contributor) commented:

In the PR (#287) the thresholds were set fairly low to accommodate transformers that aren't designed for quality. This will be changed in issue #296.

amontanez24 self-assigned this on Oct 26, 2021
amontanez24 added this to the 0.6.0 milestone on Oct 26, 2021