We should implement a new type of test that evaluates the quality of the Transformers:

- compute the score mentioned in Create `validate_transformer_quality` function #253 by looking at how well the column distributions and the correlations with other columns are captured after using the transformer, as well as the quality of the data that the SDV models generate when they use it, and
- make a test that fails if the obtained score is below a predefined threshold. We still need to decide how this threshold is chosen.

The details about how this will be achieved and implemented can be discussed below, and this description will be updated once everything is clear.
We create the test cases by going through a collection of datasets and selecting the ones that have the desired `data_types` and are under a certain size.
Each test case consists of the following (one possible representation is sketched after the list):
- The name of the dataset and table to download
- The data type to test against
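One possible way to represent these test cases, sketched in Python. The `TestCase` dataclass, the `MAX_ROWS` cutoff, the `has_columns_of_type` helper, and the shape of `datasets` are all hypothetical names introduced for illustration, not part of the actual codebase:

```python
from dataclasses import dataclass

import pandas as pd


@dataclass
class TestCase:
    """Hypothetical container for one quality-test case."""
    dataset_name: str
    table_name: str
    data_type: str  # e.g. 'numerical', 'categorical', 'datetime'


MAX_ROWS = 50_000  # hypothetical stand-in for "under a certain size"


def has_columns_of_type(table: pd.DataFrame, data_type: str) -> bool:
    # Hypothetical mapping from abstract data types to pandas dtypes.
    kind = {'numerical': 'number', 'datetime': 'datetime'}.get(data_type, 'object')
    return not table.select_dtypes(include=kind).empty


def build_test_cases(datasets, data_types):
    # `datasets` is assumed to be an iterable of (name, tables) pairs,
    # where `tables` maps table names to DataFrames.
    cases = []
    for dataset_name, tables in datasets:
        for table_name, table in tables.items():
            if len(table) > MAX_ROWS:
                continue  # skip tables over the size limit
            for data_type in data_types:
                if has_columns_of_type(table, data_type):
                    cases.append(TestCase(dataset_name, table_name, data_type))
    return cases
```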
For each test case we do the following:
- Get all the transformers for the test's data type.
- For each transformer, we transform all of the columns of that type using that transformer.
- For each numerical column in the data, the transformed data is used as a set of features for a LinearRegression model that attempts to predict it (see the scoring sketch after this list).
- The coefficient of determination obtained for each transformer when trying to predict each numerical column is recorded.
- For each data type, we create a DataFrame containing columns for the transformer name, dataset name, name of the predicted column, and the coefficient of determination found in the previous step.
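To make the scoring step concrete, here is a minimal sketch of the regression check using scikit-learn. How the transformed features are produced is elided (it depends on the transformer API), and the function name and in-sample scoring are assumptions; the real test might use a train/test split or cross-validation instead:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression


def r2_scores(transformed: pd.DataFrame, data: pd.DataFrame,
              numerical_columns: list) -> pd.Series:
    """Predict each numerical column from the transformed columns and
    record the coefficient of determination (R^2) for each one."""
    scores = {}
    features = transformed.to_numpy()
    for column in numerical_columns:
        target = data[column].to_numpy()
        model = LinearRegression().fit(features, target)
        # `score` returns R^2 on the data it is given.
        scores[column] = model.score(features, target)
    return pd.Series(scores, name='r2')
```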
The tables created above are used to make one final results table (a pandas sketch follows the list). The results table has the following columns:
- Dataset name
- Transformer name
- Transformer's average score for the columns in that dataset (only columns that yielded an average score above a predetermined threshold are included).
- Transformer's average score for the columns in that dataset relative to the other transformers' average scores (again, only columns above the threshold are counted).
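A pandas sketch of how the results table could be assembled. The column names, the `SCORE_THRESHOLD` value, and the interpretation of "relative" as a ratio to the per-dataset mean are all assumptions:

```python
import pandas as pd

SCORE_THRESHOLD = 0.1  # hypothetical per-column cutoff


def build_results_table(scores: pd.DataFrame) -> pd.DataFrame:
    # `scores` is assumed to hold one row per prediction, with columns:
    # transformer_name, dataset_name, column_name, score.
    column_avg = scores.groupby(
        ['dataset_name', 'column_name'])['score'].transform('mean')
    kept = scores[column_avg > SCORE_THRESHOLD]  # drop low-scoring columns

    # Average score per transformer per dataset.
    results = (kept.groupby(['dataset_name', 'transformer_name'], as_index=False)
                   ['score'].mean())

    # Score relative to the mean over all transformers on the same dataset.
    dataset_avg = results.groupby('dataset_name')['score'].transform('mean')
    results['relative_score'] = results['score'] / dataset_avg
    return results
```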
Finally, a subtest is done for each transformer to assert that its relative score is higher than a cutoff or, if it is the only transformer for a data type, that its average score is above the cutoff.
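The final assertion could look roughly like this using the `pytest-subtests` plugin; the cutoff values and the `results`/`transformer_count` fixtures are hypothetical:

```python
RELATIVE_CUTOFF = 0.75  # hypothetical
AVERAGE_CUTOFF = 0.2    # hypothetical


def test_transformer_quality(subtests, results, transformer_count):
    # `results` is the table built above; `transformer_count` is the
    # number of transformers available for the data type under test.
    for name, group in results.groupby('transformer_name'):
        with subtests.test(transformer=name):
            if transformer_count == 1:
                # Sole transformer for this data type: check its raw
                # average score against the cutoff instead.
                assert group['score'].mean() > AVERAGE_CUTOFF
            else:
                assert group['relative_score'].mean() > RELATIVE_CUTOFF
```

Running each transformer in its own subtest means one failing transformer doesn't hide the results for the others.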
In the PR (#287) the thresholds were set fairly low to accommodate transformers that aren't designed for quality. This will be changed in issue #296.