Preserve tables before training #111
Conversation
- ensure compatibility of foreign key values between what the table trains on and what gets seeded during generation
- validate presence of all required tables at train time (see the sketch below)
- instead of calculating and storing `missing_model` as a dedicated field on the SyntheticsRun state, we can calculate it from data in the SyntheticsTrain object
- skip evaluation of tables omitted from synthetics
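For illustration, a minimal sketch of the kind of train-time table validation described above; the helper and its inputs are hypothetical, not the actual implementation:

```python
# Hypothetical sketch of train-time table validation; names and structure are
# illustrative only, not the library's actual internals.
def validate_all_tables_known(requested: set[str], all_tables: set[str]) -> None:
    """Fail fast if a requested table is not part of the relational data."""
    unknown = requested - all_tables
    if unknown:
        raise ValueError(f"Unrecognized table(s): {sorted(unknown)}")
```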
LGTM
I had questions but not concerns 🚢
Oh, this PR had a long README but I didn't see a ton of docstrings. And maybe as a bigger thing, is there a confluence page with an architecture/state diagram/strategy walkthrough...?
LG!
Main user-facing change
The `train` method is deprecated in favor of the new `train_synthetics` method. The latter accepts optional `only` | `ignore` lists to limit which tables are trained—the idea here is that if some table contains static reference data that you don't want synthesized, you can omit it from the synthetics workflow entirely.
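For example, a minimal usage sketch (the `RelationalData`/`MultiTable` setup, import path, and table names here are assumptions; only the `train_synthetics` call with `only`/`ignore` reflects this PR):

```python
from gretel_trainer.relational import MultiTable, RelationalData

# Assumed setup: register source tables and their foreign keys.
relational_data = RelationalData()
# relational_data.add_table(...), relational_data.add_foreign_key(...), etc.

mt = MultiTable(relational_data)

# Train synthetics models for every table except a static lookup table.
mt.train_synthetics(ignore=["state_codes"])

# Or, alternatively, list exactly the tables that should be trained:
# mt.train_synthetics(only=["users", "orders", "order_items"])
```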
The main reason for introducing this as a new method instead of optional arguments to the existing method is that the original `train` method resets the synthetic train state when called a second time. That felt like behavior a user might rely on, so I figured it'd be safest to leave it alone and introduce a new method (which, FYI, works "additively"—see this conversation discussing it in the context of Transforms). I also think the original method name is too generic and it's nicer to specify "synthetics" in the method name given what all Relational can do.
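Roughly, the additive behavior means something like this (a sketch continuing the example above, with hypothetical table names, not a test of the actual behavior):

```python
# Sketch of the additive behavior described above (hypothetical table names).
mt.train_synthetics(only=["users"])    # trains a model for users
mt.train_synthetics(only=["orders"])   # adds a model for orders; users' model is kept
```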
Note: previously we let users skip table synthesis at `generate` time via the `preserve: list[str]` argument, but of course that method is called after `train`/`train_synthetics`, so we would have unnecessarily trained models for the preserved tables. The `preserve` argument still exists on `generate` and currently is not marked for deprecation—we want to see if any customers have a use case for it (e.g. I have 10 tables, I know 2 will never be synthesized so I omit them from training; later, during generation, there is one more table I want to preserve with source data instead of synthesizing. TBD.)
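Continuing the hypothetical example above, preserving one more table at generation time looks roughly like this (the table name is made up):

```python
# Keep source data for one more table at generation time via the existing
# `preserve` argument (not currently marked for deprecation).
mt.generate(preserve=["payment_methods"])
```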
Delete models method

The same conversation with @pimlock on #109 got me thinking that we should have a `delete_models` method—it seems like a useful thing to have alongside the `only` | `ignore` params. It was a pretty straightforward thing to add, but we could also skip it / remove it from this PR. @gracecvking any thoughts?

EDIT: decided not to include this; it can be added sometime in the future if/when a well-defined use case arises.
Internal improvements
- Training columns are now calculated and cached on the `RelationalData` class. Since that info is calculated and cached there, we don't need to carry it in the MultiTable state and backups—we can just ask for it again—so `training_columns` is removed from several places.
- Previously, we set `missing_model` for tables that didn't train successfully. However, since we now allow users to omit tables from training, we can no longer say "if we can't get a completed model, treat it as failed"—we now need to know whether the table attempted training unsuccessfully or was deliberately omitted. Fortunately, like above, we can calculate this from other data in state, so `missing_model` is dropped (a sketch follows this list).
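A purely illustrative sketch of the "calculate it from other data in state" idea; the inputs below are made-up names, not the actual SyntheticsTrain/SyntheticsRun attributes:

```python
# Hypothetical names, for illustration only.
def tables_missing_model(
    training_attempted: set[str],     # tables the user asked to train
    trained_successfully: set[str],   # tables with a completed model
) -> set[str]:
    """Tables that attempted synthetics training but have no completed model.

    Tables deliberately omitted via `only`/`ignore` never appear in
    `training_attempted`, so they are not treated as failures.
    """
    return training_attempted - trained_successfully
```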