
Implement Quality Tests for Transformers #252

Closed
csala opened this issue Sep 24, 2021 · 2 comments
csala (Contributor) commented Sep 24, 2021

We should implement a new type of test that evaluates the quality of the Transformers:

  • compute the score mentioned in Create validate_transformer_quality function #253 by looking at how well the column distributions and the correlations with other columns are captured after using the transformer, as well as how good the data that the SDV models generate when using it is, and
  • make a test that fails if the obtained score is below a predefined threshold (a rough sketch of such a test is given at the end of this comment). We still need to decide how this threshold is chosen.

The details of how this will be achieved and implemented can be discussed below, and this description will be updated once everything is clear.
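
As a rough illustration of the second bullet, the quality test could look like the sketch below. `compute_quality_score` stands in for the function proposed in #253 and is only stubbed here, and the threshold value is a placeholder; how the cutoff is chosen is exactly the open question above.

```python
QUALITY_THRESHOLD = 0.8  # placeholder value; how to choose the cutoff is still open


def compute_quality_score(transformer, data):
    """Stub for the scoring function proposed in #253.

    It would measure how well column distributions and correlations with
    other columns are preserved after applying ``transformer`` to ``data``.
    """
    raise NotImplementedError


def test_transformer_quality(transformer, data):
    # Fail the test whenever the transformer's quality score falls below the cutoff.
    score = compute_quality_score(transformer, data)
    assert score >= QUALITY_THRESHOLD, (
        f'{type(transformer).__name__} scored {score:.3f}, '
        f'below the {QUALITY_THRESHOLD} threshold.'
    )
```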

csala added the internal (The issue doesn't change the API or functionality) and needs discussion labels on Sep 24, 2021
amontanez24 (Contributor) commented Oct 14, 2021

The approach agreed upon is as follows:

  1. We create the test cases by going through a collection of datasets and selecting the ones that have the desired data_types and are under a certain size
  2. Each test case consists of the following:
    • The name of the dataset and table to download
    • The data type to test against
  3. For each test case we do the following:
    • Get all the transformers for the test's data type.
    • For each transformer, transform all of the columns of that type using that transformer.
    • For each numerical column in the data, the transformed data is used as a set of features for a LinearRegression model that will attempt to predict it (a code sketch of this step follows the list).
    • A DataFrame is created with the coefficient of determination obtained for each transformer when trying to predict each numerical column.
  4. For each data type, we create a DataFrame containing columns for the transformer name, dataset name, name of the predicted column and the coefficient of determination found in the previous step.
  5. The tables created above are used to make one final results table. The results table has the following columns:
    • Dataset name
    • Transformer name
    • Transformer's average score for the columns in that dataset (only columns that yielded an average score above a predetermined threshold are included).
    • Transformer's average score for the columns in that dataset relative to the other transformers' average score (again, only columns above the threshold are counted).
  6. Finally, a subtest is done for each transformer to assert that its relative score is higher than a cutoff or, if it is the only transformer for a data type, that its average score is above the cutoff.
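
A minimal sketch of steps 3 and 4, to make the scoring concrete. This is not the code from the PR: the transformers are assumed to expose a scikit-learn-style `fit_transform` on a single column (the real RDT transformer API may differ), and `score_transformers`, `typed_columns` and `numerical_columns` are names introduced only for this example.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression


def score_transformers(data, typed_columns, numerical_columns, transformers,
                       dataset_name):
    """Return one R^2 score per (transformer, predicted numerical column)."""
    records = []
    for transformer in transformers:
        # Step 3: transform every column of the tested data type with this
        # transformer and stack the outputs as a feature matrix.
        features = pd.concat(
            [pd.DataFrame(transformer.fit_transform(data[column]))
             for column in typed_columns],
            axis=1,
        )
        features.columns = range(features.shape[1])

        # Use the transformed columns as features for a LinearRegression model
        # predicting each numerical column, and record the coefficient of
        # determination (R^2) it achieves.
        for target in numerical_columns:
            model = LinearRegression().fit(features, data[target])
            records.append({
                'transformer': type(transformer).__name__,
                'dataset': dataset_name,
                'column': target,
                'score': model.score(features, data[target]),
            })

    # Step 4: one row per (transformer, dataset, predicted column, R^2).
    return pd.DataFrame(records)
```

From a table like this, one way to obtain the relative score in step 5 would be to average each transformer's scores per dataset and divide by the mean of all transformers' averages for that dataset, which is what the final cutoff in step 6 would then be compared against.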

amontanez24 (Contributor) commented:

In the PR (#287) the thresholds were set fairly low to accommodate transformers that aren't designed for quality. This will be changed in issue #296.

amontanez24 self-assigned this on Oct 26, 2021
amontanez24 added this to the 0.6.0 milestone on Oct 26, 2021