
Preserve tables before training #111

Merged (14 commits) May 30, 2023
Conversation

mikeknep (Contributor)

@mikeknep mikeknep commented May 22, 2023

Main user-facing change

The train method is deprecated in favor of the new train_synthetics method. The latter accepts optional only | ignore lists to limit which tables are trained—the idea here is that if some table contains static reference data that you don't want synthesized, you can omit it from the synthetics workflow entirely.
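The `only` | `ignore` resolution described above can be sketched in plain Python. This is a minimal illustration of the intended semantics, not the library's actual implementation; `resolve_tables` is a hypothetical helper name:

```python
def resolve_tables(all_tables, only=None, ignore=None):
    """Return the subset of all_tables that should have synthetics models trained.

    Hypothetical sketch: `only` and `ignore` are mutually exclusive, and
    unrecognized table names in `only` are rejected up front.
    """
    if only is not None and ignore is not None:
        raise ValueError("Cannot specify both `only` and `ignore`.")
    if only is not None:
        unknown = set(only) - set(all_tables)
        if unknown:
            raise ValueError(f"Unrecognized tables: {sorted(unknown)}")
        # Preserve the original table ordering rather than the order of `only`
        return [t for t in all_tables if t in set(only)]
    if ignore is not None:
        return [t for t in all_tables if t not in set(ignore)]
    return list(all_tables)
```

With neither argument supplied, all tables train, which matches the default behavior described above.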

The main reason for introducing this as a new method instead of optional arguments to the existing method is that the original train method resets the synthetic train state when called a second time. That felt like behavior a user might rely on, so I figured it'd be safest to leave it alone and introduce a new method (which, FYI, works "additively"—see this conversation discussing it in the context of Transforms). I also think the original method name is too generic and it's nicer to specify "synthetics" in the method name given what all Relational can do.

Note: previously we let users skip table synthesis at generate time via the `preserve: list[str]` argument, but of course that method is called after train/train_synthetics, so we would have unnecessarily trained models for the preserved tables. The `preserve` argument still exists on generate and is not currently marked for deprecation—we want to see whether any customers have a use case for it (e.g. "I have 10 tables; I know 2 will never be synthesized, so I omit them from training; later, during generation, there is one more table I want to preserve with source data instead of synthesizing"). TBD.
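The interaction between train-time omission and generate-time `preserve` reduces to set arithmetic: a table gets synthetic data only if a model was trained for it and it is not preserved at generation. A hedged sketch (`tables_to_synthesize` is a hypothetical name, not part of the library):

```python
def tables_to_synthesize(all_tables, trained_tables, preserve=None):
    """Return the tables that receive synthetic data at generate time.

    Everything else (omitted from training, or listed in `preserve`)
    passes through as source data.
    """
    preserved = set(preserve or [])
    trained = set(trained_tables)
    return [t for t in all_tables if t in trained and t not in preserved]
```

This mirrors the 10-table scenario above: tables omitted via `only`/`ignore` never appear in `trained_tables`, and `preserve` can still carve out one more at generation.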

Delete models method

The same conversation with @pimlock on #109 got me thinking that we should have a delete_models method—it seems like a useful thing to have alongside the only | ignore params. It was a pretty straightforward thing to add, but we could also skip it / remove it from this PR. @gracecvking any thoughts?

EDIT: decided not to include this, can be added sometime in the future if/when a well-defined use case arrives.

Internal improvements

  • Previously we had been keeping track of which columns were used in model training so that the ancestral strategy could be sure not to include those columns in seed data. In Seed fix #101 we moved the logic of which columns can be used in seeding to the core RelationalData class. Here, I realized since that info is calculated and cached there, we don't need to carry it in the MultiTable state and backups—we can just ask for it again—so training_columns is removed from several places.
  • Previously we had a dedicated field related to synthetics generate state called missing_model for tables that didn't train successfully. However, since we now allow users to omit tables from training, we can no longer say "if we can't get a completed model, treat it as failed"—we now need to know if the table attempted training unsuccessfully, or was deliberately omitted. Fortunately, like above, we can calculate this from other data in state, so missing_model is dropped.
  • We now create and upload the training data archive file before submitting and waiting for the model training jobs.
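The derived model status replacing `missing_model` (second bullet above) can be sketched as a pure function over two pieces of state. The names here are hypothetical illustrations of the idea, not the actual `SyntheticsTrain` fields:

```python
from enum import Enum

class ModelStatus(Enum):
    TRAINED = "trained"  # model completed successfully
    FAILED = "failed"    # training was attempted but did not complete
    OMITTED = "omitted"  # deliberately excluded via only/ignore

def model_status(table, attempted, completed):
    """Derive a table's status from training state instead of storing it.

    `attempted` is the set of tables submitted for training;
    `completed` is the subset whose models finished successfully.
    """
    if table in completed:
        return ModelStatus.TRAINED
    if table in attempted:
        return ModelStatus.FAILED
    return ModelStatus.OMITTED
```

Because the status is recomputable from state that already exists, there is nothing extra to persist in backups, which is the point of dropping the dedicated field.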

mikeknep added 9 commits May 19, 2023 09:32
- ensure compatibility of foreign key values between what the table
  trains on and what gets seeded during generation
- validate presence of all required tables at train time
- instead of calculating and storing `missing_model` as a dedicated
  field on the SyntheticsRun state, we can calculate it from data in the
  SyntheticsTrain object
- skip evaluation of tables omitted from synthetics
@gracecvking (Contributor) left a comment


LGTM

@tylersbray (Contributor) left a comment


I had questions but not concerns 🚢

Oh, this PR had a long README but I didn't see a ton of docstrings. And maybe as a bigger thing, is there a Confluence page with an architecture/state diagram/strategy walkthrough...?

Five review threads on src/gretel_trainer/relational/multi_table.py (all resolved; one marked outdated).
@pimlock (Contributor) left a comment


LG!

@mikeknep force-pushed the preserve-tables-before-training branch from 145686d to 07bcdb1 on May 30, 2023 16:12
@mikeknep mikeknep merged commit c08b3d9 into main May 30, 2023
@mikeknep mikeknep deleted the preserve-tables-before-training branch May 30, 2023 17:36