Problem Description
As a user, I'd like to have information on how well my synthesized data maintained relationships between different tables.
Expected behavior
Add a new multi-table property class called InterTableTrends.
Add this property to the multi-table quality report.
Score calculation
This property is similar in principle to the ColumnPairTrends property, except it compares columns in different tables. In order to do this, we propose the following procedure:
For every relationship listed in the metadata, the property should merge on the primary/foreign keys in order to create a denormalized table. We do not need to merge any non-statistical columns (PII).
Note that in the denormalized table, the parent table's values will repeat. For example, since one user has many transactions, the user's info will be repeated for each transaction:
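| user.id | user.age | user.gender | transaction.amt | transaction.date | ... |
| --- | --- | --- | --- | --- | --- |
| 0 | 22 | F | 23.50 | 2/12/23 | … |
| 0 | 22 | F | 12.50 | 4/29/22 | … |
| 0 | 22 | F | 100.00 | 8/31/23 | … |
| 1 | 45 | M | 50.50 | 3/21/22 | … |
| 1 | 45 | M | 500.00 | 8/23/22 | … |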
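The merge step could be sketched as follows. This is only an illustration with pandas, not the proposed implementation: the `denormalize` helper, its arguments, and the toy tables are hypothetical, and prefixing column names with the table name is an assumption made here to keep parent and child columns distinguishable.

```python
import pandas as pd

# Hypothetical toy tables mirroring the example above.
users = pd.DataFrame({"id": [0, 1], "age": [22, 45], "gender": ["F", "M"]})
transactions = pd.DataFrame({
    "user_id": [0, 0, 0, 1, 1],
    "amt": [23.50, 12.50, 100.00, 50.50, 500.00],
    "date": ["2/12/23", "4/29/22", "8/31/23", "3/21/22", "8/23/22"],
})


def denormalize(parent, child, parent_name, child_name, primary_key, foreign_key):
    """Merge a child table onto its parent, prefixing columns with table names."""
    parent = parent.add_prefix(f"{parent_name}.")
    child = child.add_prefix(f"{child_name}.")
    # One output row per child row; the parent's values repeat for each child.
    return child.merge(
        parent,
        left_on=f"{child_name}.{foreign_key}",
        right_on=f"{parent_name}.{primary_key}",
        how="inner",
    )


denormalized = denormalize(users, transactions, "user", "transaction", "id", "user_id")
```

The same helper would be applied to both the real and the synthetic data, once per relationship in the metadata.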
On the denormalized dataset, we can then compute the Column Pair Trends as we would on a single table.
Notable exception: Do NOT compute Column Pair Trends for 2 columns that come from the same table (e.g. user.age and user.gender), because we will already have results for those pairs.
The property score will be the average of all these scores.
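A rough sketch of the scoring step, under stated assumptions: it handles numerical column pairs only and uses a Pearson-correlation score of the form 1 - |r_real - r_synth| / 2. The function names and the toy data are hypothetical, and which metric applies to which column types is left to the actual implementation.

```python
import itertools

import pandas as pd


def correlation_similarity(real, synthetic, col_a, col_b):
    """Return (score, real_corr, synth_corr) for one numerical column pair."""
    real_corr = real[col_a].corr(real[col_b])
    synth_corr = synthetic[col_a].corr(synthetic[col_b])
    # Correlations lie in [-1, 1], so the difference is at most 2.
    return 1 - abs(real_corr - synth_corr) / 2, real_corr, synth_corr


def inter_table_score(real, synthetic, parent_cols, child_cols):
    """Average the pairwise scores over parent x child column pairs only."""
    scores = [
        correlation_similarity(real, synthetic, parent_col, child_col)[0]
        # product() never pairs two columns from the same table.
        for parent_col, child_col in itertools.product(parent_cols, child_cols)
    ]
    return sum(scores) / len(scores)


# Hypothetical denormalized data; the synthetic copy is identical to the real one.
real = pd.DataFrame({
    "user.age": [22, 22, 22, 45, 45],
    "transaction.amt": [23.5, 12.5, 100.0, 50.5, 500.0],
})
synthetic = real.copy()
score = inter_table_score(real, synthetic, ["user.age"], ["transaction.amt"])
```

Passing the parent and child column lists separately is what enforces the exception above: same-table pairs are never generated.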
Note: Performance is important! Ideally, the addition of this property should not add significant time to the computation. If it does, then we may want to consider subsetting the data that we use for the computation.
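If subsetting turns out to be necessary, one possible approach is to sample rows from the denormalized table before computing the trends. The `subsample` helper, the row cap, and the seed below are assumptions for illustration only.

```python
import pandas as pd


def subsample(table, max_rows, seed=0):
    """Return at most max_rows rows, sampled reproducibly without replacement."""
    if len(table) <= max_rows:
        return table
    return table.sample(n=max_rows, random_state=seed)


# Hypothetical denormalized table with 1000 rows.
big_table = pd.DataFrame({"user.age": range(1000), "transaction.amt": range(1000)})
small_table = subsample(big_table, max_rows=100)
```

A fixed seed keeps the score reproducible between runs; whether the real and synthetic data should share a seed or a cap is an open design question.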
Methods
get_score(self, real_data, synthetic_data, metadata, progress_bar=None): Return the overall score.
This method is probably where the logic above will go.
get_details(table_name=None): This will return a dataframe with all the details. It should have the following columns:
Parent Table - Name of the parent table.
Child Table - Name of the child table.
Foreign Key - The name of the foreign key column.
Column 1 - Name of column in the parent table being compared.
Column 2 - Name of column in the child table being compared.
Metric - The metric to compare the columns.
Score - The score.
Real Correlation - The correlation in the real data.
Synthetic Correlation - The correlation in the synthetic data.
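For reference, the details dataframe described above could be assembled like this. The `build_details` helper is hypothetical, and the example row uses made-up values (including the foreign key name and the metric label) purely to show the shape.

```python
import pandas as pd

# Column names taken from the list above.
DETAIL_COLUMNS = [
    "Parent Table", "Child Table", "Foreign Key", "Column 1", "Column 2",
    "Metric", "Score", "Real Correlation", "Synthetic Correlation",
]


def build_details(rows):
    """Build the details dataframe from a list of per-pair result dicts."""
    return pd.DataFrame(rows, columns=DETAIL_COLUMNS)


# Made-up example row; the values are not real results.
details = build_details([{
    "Parent Table": "user",
    "Child Table": "transaction",
    "Foreign Key": "user_id",
    "Column 1": "age",
    "Column 2": "amt",
    "Metric": "CorrelationSimilarity",
    "Score": 0.95,
    "Real Correlation": 0.5,
    "Synthetic Correlation": 0.4,
}])
```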
get_visualization(table_name=None): Create a simple bar graph that shows the final score between every pair of columns (similar to the bar graph for Column Shapes).
Label the bars with the names of the table(s) as well as the names of the columns
E.g. "user.age, transaction.amt"
The color of the bars can signify which metric we used, similar to Column Shapes
On hover, there should be more details about the correlations. It should be like a tooltip with the following information.
Note that table_name is required, but we need to consider all relationships that the table is involved in (either as a parent or as a child).
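Selecting the rows relevant to a given table_name could look like the sketch below. It assumes the details dataframe has the "Parent Table" and "Child Table" columns described earlier; the `filter_details` helper and the table names are hypothetical.

```python
import pandas as pd


def filter_details(details, table_name):
    """Keep rows where the table appears as either the parent or the child."""
    mask = (details["Parent Table"] == table_name) | (details["Child Table"] == table_name)
    return details[mask].reset_index(drop=True)


# Hypothetical details for three relationships.
all_details = pd.DataFrame({
    "Parent Table": ["user", "transaction", "user"],
    "Child Table": ["transaction", "refund", "session"],
    "Score": [0.9, 0.8, 0.7],
})
transaction_details = filter_details(all_details, "transaction")
```

Here "transaction" matches two rows: one where it is the child of "user" and one where it is the parent of "refund".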
Additional context
The new property should also have a progress bar that increments by 1 every time we compute a pairwise trend.
Either the base _get_num_iterations will have to change or it will need to be overridden in this class because none of the current options match this scenario.
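One way to think about the iteration count: for each relationship, the number of pairwise trends is the number of statistical parent columns times the number of statistical child columns. The sketch below is not the existing `_get_num_iterations` code; the function name, the relationship keys, and the `columns_by_table` mapping (assumed to already exclude keys and PII columns) are assumptions.

```python
def count_inter_table_iterations(relationships, columns_by_table):
    """Count parent x child column pairs across all relationships."""
    total = 0
    for relationship in relationships:
        parent_columns = columns_by_table[relationship["parent_table_name"]]
        child_columns = columns_by_table[relationship["child_table_name"]]
        # One progress-bar tick per cross-table column pair.
        total += len(parent_columns) * len(child_columns)
    return total


# Hypothetical metadata fragments for the user/transaction example.
relationships = [{"parent_table_name": "user", "child_table_name": "transaction"}]
columns_by_table = {"user": ["age", "gender"], "transaction": ["amt", "date"]}
num_iterations = count_inter_table_iterations(relationships, columns_by_table)
```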
See _get_num_iterations in SDMetrics/sdmetrics/reports/multi_table/_properties/base.py (lines 29 to 39 in 3667b37).