Add InterTable Trends property #451

Closed
amontanez24 opened this issue Sep 15, 2023 · 0 comments · Fixed by #456

Problem Description

As a user, I'd like to have information on how well my synthesized data maintained relationships between different tables.

Expected behavior

  • Add a new multi table property class called InterTableTrends.
  • Add this property to the multi table quality report

Score calculation

This property is similar in principle to the ColumnPairTrends property, except it compares columns in different tables. In order to do this, we propose the following procedure:

  1. For every relationship listed in the metadata, the property should merge on the primary/foreign keys in order to create a denormalized table. We do not need to merge any non-statistical columns (PII).
    1. Note that in the denormalized table, the parent table's values will repeat. For example, since one user has many transactions, the user's info is repeated for each transaction.
        user.id  user.age  user.gender  transaction.amt  transaction.date  ...
        0        22        F            23.50            2/12/23
        0        22        F            12.50            4/29/22
        0        22        F            100.00           8/31/23
        1        45        M            50.50            3/21/22
        1        45        M            500.00           8/23/22
  2. On the denormalized dataset, we can then compute the Column Pair Trends as we would on a single table.
    1. Notable exception: Do NOT compute Column Pair Trends for 2 columns that come from the same table (e.g. user.age and user.gender), because the single-table Column Pair Trends property already provides those results.
  3. The property score will be the average of all these scores.
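
Steps 2 and 3 of the procedure above could be sketched roughly as follows. This is only an illustration, not the actual implementation: the helper names, the `columns_by_table` mapping, and the `pair_score` callable (standing in for the Column Pair Trends computation on the denormalized real and synthetic tables) are all hypothetical, and the merge step itself is left out.

```python
from itertools import product


def inter_table_pairs(parent_columns, child_columns):
    """Pair every parent column with every child column.

    Pairs of two columns from the same table are never produced, which
    matches the "notable exception" in the procedure.
    """
    return list(product(parent_columns, child_columns))


def property_score(relationships, columns_by_table, pair_score):
    """Average a pairwise score over all cross-table pairs of all relationships.

    `pair_score(parent, parent_col, child, child_col)` is a hypothetical
    callable that returns the trend score for one pair of columns.
    """
    scores = []
    for rel in relationships:
        parent = rel['parent_table_name']
        child = rel['child_table_name']
        pairs = inter_table_pairs(columns_by_table[parent], columns_by_table[child])
        for parent_col, child_col in pairs:
            scores.append(pair_score(parent, parent_col, child, child_col))
    # With no pairs at all there is nothing to average.
    return sum(scores) / len(scores) if scores else float('nan')


# Toy input: statistical columns only (keys and PII already dropped).
relationships = [{'parent_table_name': 'user', 'child_table_name': 'transaction'}]
columns_by_table = {'user': ['age', 'gender'], 'transaction': ['amt', 'date']}
pairs = inter_table_pairs(columns_by_table['user'], columns_by_table['transaction'])
```

Here `pairs` contains the four cross-table combinations (user.age with transaction.amt, and so on), while user.age with user.gender never appears.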

Note: Performance is important! Ideally, the addition of this property should not add significant time to the computation. If it does, then we may want to consider subsetting the data that we use for the computation.
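
One possible way to subset, sketched here with plain Python records rather than dataframes: sample a limited number of parent rows and keep only the child rows that reference them, so the relationship stays intact. The function name, the `max_parents` parameter, and the fixed seed are assumptions made for illustration; the issue does not prescribe a sampling strategy.

```python
import random


def subset_relationship(parent_rows, child_rows, primary_key, foreign_key,
                        max_parents, seed=0):
    """Keep at most `max_parents` random parents and the children that reference them."""
    if len(parent_rows) <= max_parents:
        return parent_rows, child_rows

    rng = random.Random(seed)  # fixed seed so repeated runs pick the same subset
    kept_parents = rng.sample(parent_rows, max_parents)
    kept_keys = {row[primary_key] for row in kept_parents}
    kept_children = [row for row in child_rows if row[foreign_key] in kept_keys]
    return kept_parents, kept_children


# Toy data: 10 users with 5 transactions each.
users = [{'id': i} for i in range(10)]
transactions = [{'user_id': i % 10, 'amt': float(i)} for i in range(50)]
sub_users, sub_transactions = subset_relationship(
    users, transactions, 'id', 'user_id', max_parents=3
)
```

Sampling parents rather than children keeps every sampled parent's full set of child rows, which seems preferable if the trends depend on how child values are distributed per parent.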

Methods

  • get_score(self, real_data, synthetic_data, metadata, progress_bar=None): Return the overall score.
    • This method is probably where the logic above will go.
  • get_details(table_name=None): This will return a dataframe with all the details. It should have the following columns:
    • Parent Table - Name of the parent table.
    • Child Table - Name of the child table.
    • Foreign Key - The name of the foreign key column.
    • Column 1 - Name of column in the parent table being compared.
    • Column 2 - Name of column in the child table being compared.
    • Metric - The metric to compare the columns.
    • Score - The score.
    • Real Correlation - The correlation in the real data.
    • Synthetic Correlation - The correlation in the synthetic data.
  • get_visualization(table_name=None): Create a simple bar graph that shows the final score between every pair of columns (similar to the bar graph for Column Shapes).

    • Label the bars with the names of the table(s) as well as the names of the columns, e.g. "user.age, transaction.amt".
    • The color of the bars can signify which metric we used, similar to Column Shapes.
    • On hover, there should be more details about the correlations, shown as a tooltip with the following information:
        user.age, transaction.amt
        Foreign Key: user_id

        Metric=CorrelationSimilarity
        Score=0.28123123
        Real Correlation=0.5845345
        Synthetic Correlation=0.345345123
    • Note that table_name is required, and we need to consider all relationships that the table is involved in (either as a parent or as a child).
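
As a rough sketch of how the details rows could feed the visualization, the helpers below select the rows for a given table and build the bar label and hover text described above. Each row is modeled as a plain dict keyed by the get_details column names. The helper names are hypothetical, and the actual plotting call is left out.

```python
def rows_for_table(details, table_name):
    """Keep rows where the table takes part as either the parent or the child."""
    return [
        row for row in details
        if table_name in (row['Parent Table'], row['Child Table'])
    ]


def bar_label(row):
    """Build a label such as 'user.age, transaction.amt'."""
    return (
        f"{row['Parent Table']}.{row['Column 1']}, "
        f"{row['Child Table']}.{row['Column 2']}"
    )


def hover_text(row):
    """Build the tooltip text for one bar."""
    return '\n'.join([
        bar_label(row),
        f"Foreign Key: {row['Foreign Key']}",
        '',
        f"Metric={row['Metric']}",
        f"Score={row['Score']}",
        f"Real Correlation={row['Real Correlation']}",
        f"Synthetic Correlation={row['Synthetic Correlation']}",
    ])


# Example row using the values from the tooltip mock-up in this issue.
details = [{
    'Parent Table': 'user',
    'Child Table': 'transaction',
    'Foreign Key': 'user_id',
    'Column 1': 'age',
    'Column 2': 'amt',
    'Metric': 'CorrelationSimilarity',
    'Score': 0.28123123,
    'Real Correlation': 0.5845345,
    'Synthetic Correlation': 0.345345123,
}]
selected = rows_for_table(details, 'transaction')
label = bar_label(selected[0])
tooltip = hover_text(selected[0])
```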

Additional context

  • The new property should also have a progress bar that increments by 1 every time we compute a pairwise trend.
    • Either the base _get_num_iterations will have to change or it will need to be overridden in this class because none of the current options match this scenario.
      def _get_num_iterations(self, metadata):
          """Get the number of iterations for the property."""
          if self._num_iteration_case == 'column':
              return sum(len(metadata['tables'][table]['columns']) for table in metadata['tables'])
          elif self._num_iteration_case == 'table':
              return len(metadata['tables'])
          elif self._num_iteration_case == 'relationship':
              return len(metadata['relationships'])
          elif self._num_iteration_case == 'column_pair':
              num_columns = [len(table['columns']) for table in metadata['tables'].values()]
              return sum([(n_cols * (n_cols - 1)) // 2 for n_cols in num_columns])
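
A count that matches this property might look like the sketch below: for each relationship, multiply the number of statistical parent columns by the number of statistical child columns. The function names are hypothetical, and two things are assumed rather than taken from this issue: that relationships carry `parent_table_name` and `child_table_name` keys, and that the sdtypes listed in `STATISTICAL_SDTYPES` are the ones that take part in pairwise trends.

```python
# Assumption: only these sdtypes are compared; keys and PII are skipped.
STATISTICAL_SDTYPES = {'numerical', 'datetime', 'categorical', 'boolean'}


def count_statistical_columns(table_metadata):
    """Count the columns of one table that would take part in pairwise trends."""
    return sum(
        1 for column in table_metadata['columns'].values()
        if column.get('sdtype') in STATISTICAL_SDTYPES
    )


def get_num_inter_table_pairs(metadata):
    """Number of cross-table column pairs, summed over all relationships."""
    total = 0
    for relationship in metadata['relationships']:
        parent = metadata['tables'][relationship['parent_table_name']]
        child = metadata['tables'][relationship['child_table_name']]
        total += count_statistical_columns(parent) * count_statistical_columns(child)
    return total


# Toy metadata mirroring the user/transaction example in this issue.
metadata = {
    'tables': {
        'user': {'columns': {
            'id': {'sdtype': 'id'},
            'age': {'sdtype': 'numerical'},
            'gender': {'sdtype': 'categorical'},
        }},
        'transaction': {'columns': {
            'id': {'sdtype': 'id'},
            'user_id': {'sdtype': 'id'},
            'amt': {'sdtype': 'numerical'},
            'date': {'sdtype': 'datetime'},
        }},
    },
    'relationships': [
        {'parent_table_name': 'user', 'child_table_name': 'transaction'},
    ],
}
num_iterations = get_num_inter_table_pairs(metadata)
```

For this toy metadata, two statistical parent columns and two statistical child columns give four pairwise trends, so the progress bar would advance four times.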
@amontanez24 amontanez24 added the feature request Request for a new feature label Sep 15, 2023
@amontanez24 amontanez24 added this to the 0.12.0 milestone Sep 15, 2023