Seed fix #101

mikeknep · 2023-04-27T15:24:14Z

Fixes two bugs.

1. When using the Ancestral strategy, we previously based the "highly NaN" and "highly unique categorical" calculations (which determine whether a column is included in training and seeding) on the column data as it appeared in the joined ancestral data, not in the source parent table. A column could be highly unique in the source (e.g. team name in the NBA database Teams table) yet, because of the particular foreign key frequency on some child table, fall short of the uniqueness threshold when it appears as a joined ancestral column. This led to a bug where we would train on that data and attempt to use it as a seed column, but the synthetic child data used as a seed would include categorical values not seen during training, so records had to be thrown away as invalid conditional seeds.
2. When label encoding keys on tables, we previously iterated through the tables in an arbitrary order. This led to a bug (possibly limited to composite keys): a child table with a composite PK made up of two individual FK columns could be label-encoded "in isolation" before its parent table was label-encoded, and as a result we lost referential integrity between the two. We now ensure that tables are label-encoded in the proper order, i.e. parents before children.
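To illustrate the second fix, here is a minimal sketch of label-encoding keys in parent-to-child order. It is not the code from this PR: the toy schema, the dictionaries `PARENTS`, `PRIMARY_KEYS`, `FOREIGN_KEYS`, and the helpers `parent_to_child_order` and `label_encode_keys` are all hypothetical names chosen for the example, and it uses the standard library's `graphlib` rather than whatever graph the project maintains internally.

```python
from graphlib import TopologicalSorter

# Hypothetical toy schema (not the project's real API).
# Each table maps to the set of its parent tables; the child is listed
# first on purpose to show that dict order does not matter.
PARENTS = {
    "rosters": {"teams", "players"},
    "teams": set(),
    "players": set(),
}
PRIMARY_KEYS = {"teams": "id", "players": "id"}
# rosters has a composite PK made up of two FK columns.
FOREIGN_KEYS = {
    "rosters": {"team_id": ("teams", "id"), "player_id": ("players", "id")},
}


def parent_to_child_order(parents):
    """Return table names so that every parent precedes its children."""
    # TopologicalSorter expects {node: predecessors}, so mapping each table
    # to its parents yields parents first.
    return list(TopologicalSorter(parents).static_order())


def label_encode_keys(tables, parents, primary_keys, foreign_keys):
    """Replace key values with integers, reusing each parent's mapping for FKs."""
    mappings = {}  # (table, pk column) -> {original value: encoded value}
    encoded = {}
    for name in parent_to_child_order(parents):
        pk = primary_keys.get(name)
        if pk is not None:
            originals = sorted({row[pk] for row in tables[name]})
            mappings[(name, pk)] = {v: i for i, v in enumerate(originals)}
        new_rows = []
        for row in tables[name]:
            new_row = dict(row)
            if pk is not None:
                new_row[pk] = mappings[(name, pk)][row[pk]]
            # The parent mapping already exists because parents come first.
            for fk_col, parent_ref in foreign_keys.get(name, {}).items():
                new_row[fk_col] = mappings[parent_ref][row[fk_col]]
            new_rows.append(new_row)
        encoded[name] = new_rows
    return encoded


TABLES = {
    "rosters": [
        {"team_id": "LAL", "player_id": "p3"},
        {"team_id": "BOS", "player_id": "p9"},
    ],
    "teams": [{"id": "LAL", "name": "Lakers"}, {"id": "BOS", "name": "Celtics"}],
    "players": [{"id": "p9"}, {"id": "p3"}],
}
ENCODED = label_encode_keys(TABLES, PARENTS, PRIMARY_KEYS, FOREIGN_KEYS)
```

Because the child looks up the mapping its parents already produced, the encoded FK values in `rosters` still point at the right encoded rows in `teams` and `players`. Encoding `rosters` first, with its own independent mapping, is the situation the bug description calls "in isolation".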

Implementation note

The logic for determining whether a column is highly NaN or highly unique categorical now lives in the RelationalData class itself, rather than being owned by the AncestralStrategy. One immediate benefit is that we now perform the calculation just once for each table and cache the result, rather than recalculating it every time the table is joined onto a child table. Additionally, in the future there might be ways for us to get this metadata from a more sophisticated connector. If that is possible, the result would need to live in RelationalData anyway, so making this change now sets us up better for that possibility.
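A rough sketch of the idea, under stated assumptions: `TinyRelationalData`, its method name, the two threshold values, and the helper functions below are hypothetical stand-ins, not the real RelationalData API, and the sketch treats every column as categorical. It shows the calculation being done lazily on the source table, cached per table, and why the same column can look different once joined onto a child.

```python
def unique_ratio(values):
    """Share of distinct values among the non-missing entries."""
    present = [v for v in values if v is not None]
    return len(set(present)) / len(present) if present else 0.0


def nan_ratio(values):
    """Share of missing (None) entries."""
    return sum(v is None for v in values) / len(values) if values else 0.0


class TinyRelationalData:
    """Hypothetical stand-in for RelationalData; thresholds are assumptions."""

    def __init__(self, tables, unique_threshold=0.7, nan_threshold=0.6):
        self._tables = tables  # {table name: {column name: [values]}}
        self._unique_threshold = unique_threshold
        self._nan_threshold = nan_threshold
        self._safe_cache = {}
        self.calculations = 0  # counts real computations, for illustration

    def safe_ancestral_seed_columns(self, table):
        # Computed from the SOURCE table, only when first requested,
        # then reused for every child the table is joined onto.
        if table not in self._safe_cache:
            self.calculations += 1
            self._safe_cache[table] = {
                col
                for col, values in self._tables[table].items()
                if nan_ratio(values) < self._nan_threshold
                and unique_ratio(values) < self._unique_threshold
            }
        return self._safe_cache[table]


rd = TinyRelationalData(
    {
        "teams": {
            "name": ["Lakers", "Celtics", "Bulls", "Heat"],
            "conference": ["West", "East", "East", "East"],
            "arena": [None, None, None, "Garden"],
        }
    }
)

# Team names as they might appear after joining onto a child table whose
# foreign keys repeat some teams and never reference others.
joined_names = ["Lakers", "Lakers", "Lakers", "Celtics", "Celtics", "Bulls"]

source_ratio = unique_ratio(rd._tables["teams"]["name"])
joined_ratio = unique_ratio(joined_names)
safe_first = rd.safe_ancestral_seed_columns("teams")
safe_second = rd.safe_ancestral_seed_columns("teams")
```

In this toy data the `name` column is fully unique in the source table but only half unique in the joined view, so a check run on the joined data would wrongly treat it as a safe seed column. The second call to `safe_ancestral_seed_columns` returns the cached set without recomputing.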

gracecvking

👍

mikeknep added 8 commits April 26, 2023 16:06

Cache table column names on RelationalData graph

a78a732

Cache safe ancestral seed columns on RelationalData

eaeae6a

You can ask for a subset (columns) of table data

62ec836

Prefactor: ancestry can scope to safe ancestral seed columns

6914b02

Remove now-superfluous safe seed logic from strategy and use ancestry approach

4db5fb5

Add method to list tables in parent-child order

24f32b9

Label-encode tables in parent-to-child order

f1cd432

Defer safe ancestral seed column calculation until actually needed

f42e77b

mikeknep requested review from tylersbray, gracecvking and amysteier April 27, 2023 15:24

Fix a comment

34cabf5

gracecvking approved these changes Apr 28, 2023

mikeknep merged commit c89fa08 into main Apr 28, 2023

mikeknep deleted the seed-fix branch April 28, 2023 20:43

mikeknep mentioned this pull request May 22, 2023

Preserve tables before training #111

Merged
