
Preserve tables before training #111

Merged (14 commits) May 30, 2023
Conversation

mikeknep (Contributor)

@mikeknep mikeknep commented May 22, 2023

Main user-facing change

The train method is deprecated in favor of the new train_synthetics method. The latter accepts optional only | ignore lists to limit which tables are trained—the idea here is that if some table contains static reference data that you don't want synthesized, you can omit it from the synthetics workflow entirely.
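The `only` | `ignore` resolution described above can be sketched in plain Python. This is a minimal illustration of the intended semantics, not the library's actual implementation; `resolve_tables` is a hypothetical helper name:

```python
def resolve_tables(all_tables, only=None, ignore=None):
    """Return the subset of all_tables that should have synthetics models trained.

    Hypothetical sketch: `only` and `ignore` are mutually exclusive, and
    unrecognized table names in `only` are rejected up front.
    """
    if only is not None and ignore is not None:
        raise ValueError("Cannot specify both `only` and `ignore`.")
    if only is not None:
        unknown = set(only) - set(all_tables)
        if unknown:
            raise ValueError(f"Unrecognized tables: {sorted(unknown)}")
        # Preserve the original table ordering rather than the order of `only`
        return [t for t in all_tables if t in set(only)]
    if ignore is not None:
        return [t for t in all_tables if t not in set(ignore)]
    return list(all_tables)
```

With neither argument supplied, all tables train, which matches the default behavior described above.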

The main reason for introducing this as a new method instead of optional arguments to the existing method is that the original train method resets the synthetic train state when called a second time. That felt like behavior a user might rely on, so I figured it'd be safest to leave it alone and introduce a new method (which, FYI, works "additively"—see this conversation discussing it in the context of Transforms). I also think the original method name is too generic and it's nicer to specify "synthetics" in the method name given what all Relational can do.

Note: previously we let users skip table synthesis at generate time via the `preserve: list[str]` argument, but of course that method is called after train/train_synthetics, so we would have unnecessarily trained models for the preserved tables. The `preserve` argument still exists on generate and is not currently marked for deprecation—we want to see whether any customers have a use case for it (e.g. "I have 10 tables; I know 2 will never be synthesized, so I omit them from training; later, during generation, there is one more table I want to preserve with source data instead of synthesizing"). TBD.
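The interaction between train-time omission and generate-time `preserve` reduces to set arithmetic: a table gets synthetic data only if a model was trained for it and it is not preserved at generation. A hedged sketch (`tables_to_synthesize` is a hypothetical name, not part of the library):

```python
def tables_to_synthesize(all_tables, trained_tables, preserve=None):
    """Return the tables that receive synthetic data at generate time.

    Everything else (omitted from training, or listed in `preserve`)
    passes through as source data.
    """
    preserved = set(preserve or [])
    trained = set(trained_tables)
    return [t for t in all_tables if t in trained and t not in preserved]
```

This mirrors the 10-table scenario above: tables omitted via `only`/`ignore` never appear in `trained_tables`, and `preserve` can still carve out one more at generation.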

Delete models method

The same conversation with @pimlock on #109 got me thinking that we should have a delete_models method—it seems like a useful thing to have alongside the only | ignore params. It was a pretty straightforward thing to add, but we could also skip it / remove it from this PR. @gracecvking any thoughts?

EDIT: decided not to include this, can be added sometime in the future if/when a well-defined use case arrives.

Internal improvements

  • Previously we had been keeping track of which columns were used in model training so that the ancestral strategy could be sure not to include those columns in seed data. In Seed fix #101 we moved the logic of which columns can be used in seeding to the core RelationalData class. Here, I realized since that info is calculated and cached there, we don't need to carry it in the MultiTable state and backups—we can just ask for it again—so training_columns is removed from several places.
  • Previously we had a dedicated field related to synthetics generate state called missing_model for tables that didn't train successfully. However, since we now allow users to omit tables from training, we can no longer say "if we can't get a completed model, treat it as failed"—we now need to know if the table attempted training unsuccessfully, or was deliberately omitted. Fortunately, like above, we can calculate this from other data in state, so missing_model is dropped.
  • We now create and upload the training data archive file before submitting and waiting for the model training jobs.
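The derived model status replacing `missing_model` (second bullet above) can be sketched as a pure function over two pieces of state. The names here are hypothetical illustrations of the idea, not the actual `SyntheticsTrain` fields:

```python
from enum import Enum

class ModelStatus(Enum):
    TRAINED = "trained"  # model completed successfully
    FAILED = "failed"    # training was attempted but did not complete
    OMITTED = "omitted"  # deliberately excluded via only/ignore

def model_status(table, attempted, completed):
    """Derive a table's status from training state instead of storing it.

    `attempted` is the set of tables submitted for training;
    `completed` is the subset whose models finished successfully.
    """
    if table in completed:
        return ModelStatus.TRAINED
    if table in attempted:
        return ModelStatus.FAILED
    return ModelStatus.OMITTED
```

Because the status is recomputable from state that already exists, there is nothing extra to persist in backups, which is the point of dropping the dedicated field.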

mikeknep added 9 commits May 19, 2023 09:32
- ensure compatibility of foreign key values between what the table
  trains on and what gets seeded during generation
- validate presence of all required tables at train time
- instead of calculating and storing `missing_model` as a dedicated
  field on the SyntheticsRun state, we can calculate it from data in the
  SyntheticsTrain object
- skip evaluation of tables omitted from synthetics
@gracecvking (Contributor) left a comment


LGTM

@tylersbray (Contributor) left a comment


I had questions but not concerns 🚢

Oh, this PR had a long README but I didn't see a ton of docstrings. And maybe as a bigger thing, is there a Confluence page with an architecture/state diagram/strategy walkthrough...?

Five review threads on src/gretel_trainer/relational/multi_table.py (all resolved; one marked outdated).
@pimlock (Contributor) left a comment


LG!

@mikeknep force-pushed the preserve-tables-before-training branch from 145686d to 07bcdb1 on May 30, 2023 16:12
@mikeknep mikeknep merged commit c08b3d9 into main May 30, 2023
@mikeknep mikeknep deleted the preserve-tables-before-training branch May 30, 2023 17:36