Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Seed fix #101

Merged
merged 9 commits into from
Apr 28, 2023
Merged

Seed fix #101

merged 9 commits into from
Apr 28, 2023

Conversation

mikeknep
Copy link
Contributor

Fixes two bugs.

  1. When using Ancestral strategy, we were previously basing the "highly NaN" and "highly unique categorical" calculations (to determine whether or not we include them in training and seeding) on the column data as it appeared in the joined ancestral, not the source parent table. A column could be highly unique in the source (e.g. team name in the NBA database Teams table), but as a result of the particular foreign key frequency on some child table it is possible to not hit the uniqueness threshold when appearing as a joined ancestral column. This led to a bug where we would train on that data and attempt to use it as a seed column, but the synthetic child data used as a seed would include categorical values not seen during training, ultimately resulting in having to throw away records as invalid conditional seeds.
  2. When label encoding keys on tables, we were previously iterating through the tables in an arbitrary order. This led to a bug (possibly limited to composite keys): a child table with a composite PK made up of two individual FK columns could be label-encoded "in isolation" before the parent table is label-encoded, and as a result we lost referential integrity between the two. We now ensure that tables are label-encoded in proper order, i.e. parents before children.

Implementation note

The logic for determining if a column is highly NaN or highly unique categorical now exists in the RelationalData class itself, rather than being owned by the AncestralStrategy. One immediate benefit is we now perform the calculation just once for each table and cache the result, rather than recalculate it every time the table is joined onto a child table. Additionally, in the future there might be ways for us to get this metadata from a more sophisticated connector... if that is possible, the result would need to live in RelationalData anyways, so making this change now sets us up better for that possibility.

Copy link
Contributor

@gracecvking gracecvking left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@mikeknep mikeknep merged commit c89fa08 into main Apr 28, 2023
@mikeknep mikeknep deleted the seed-fix branch April 28, 2023 20:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants