Support for JSON #107

Merged
merged 62 commits into from
May 16, 2023

Conversation

mikeknep

How it works

Whenever a table is added (via add_table), we check whether any of its columns contain structured data: either actual dictionaries/lists or valid JSON strings. If any do, we invent one or more tables that flatten the nested data by lifting dictionary keys to standalone columns and moving lists to new child tables. We train on all the flattened tables, and at the very end reassemble the generated data into the original nested format.
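To make the flattening step concrete, here is a minimal standalone sketch of the idea. It is not the code in this PR: the function names, the `_id` / `_parent_id` / `_position` columns, the `.` and `/` separators, and the `value` column for scalar list items are all assumptions made for illustration, and the reassembly step is not shown.

```python
import json


def parse_structured(value):
    """Return a dict/list if value is one (or is a JSON string encoding one), else None."""
    if isinstance(value, (dict, list)):
        return value
    if isinstance(value, str):
        try:
            parsed = json.loads(value)
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            return None
        return parsed if isinstance(parsed, (dict, list)) else None
    return None


def flatten(rows, table="root"):
    """Flatten nested rows into {table_name: [flat_row, ...]} (illustrative only)."""
    tables = {}

    def add_row(table_name, record, extra):
        table_rows = tables.setdefault(table_name, [])
        row_id = len(table_rows)
        flat = {"_id": row_id, **extra}
        table_rows.append(flat)
        lift("", record, flat, table_name, row_id)

    def lift(prefix, value, flat, table_name, row_id):
        parsed = parse_structured(value)
        if isinstance(parsed, dict):
            # Dictionary keys are lifted to standalone columns on the same row.
            for key, inner in parsed.items():
                lift(f"{prefix}.{key}" if prefix else key, inner, flat, table_name, row_id)
        elif isinstance(parsed, list):
            # Lists move to a child table that points back at the parent row.
            child = f"{table_name}/{prefix or 'items'}"
            for position, item in enumerate(parsed):
                record = item if parse_structured(item) is not None else {"value": item}
                add_row(child, record, {"_parent_id": row_id, "_position": position})
        else:
            flat[prefix] = value

    for row in rows:
        add_row(table, row, {})
    return tables


rows = [
    {"id": 1, "user": '{"name": "a", "tags": ["x", "y"]}'},  # JSON string
    {"id": 2, "user": {"name": "b", "tags": []}},            # actual dict
]
tables = flatten(rows, table="t")
```

In this sketch the parent table keeps scalar columns such as `user.name`, while each list element becomes a row in a child table (`t/user.tags`) that carries the parent row's id and its position in the list, which is the information a reassembly step would need.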

What's user-facing?

This is designed to be almost entirely transparent to users, but there are some public-facing things to note:

  • In Console, you'll see models for invented tables
  • In the Relational Report, we do not show any invented tables. The SQS score comes from running Evaluate on the root invented table only. The same applies to the evaluations dict property on MultiTable.
  • The list_all_tables method now takes an optional "scope" parameter to return different subsets of the known tables. By default it exposes the invented table names rather than the original table name, since this method is used so frequently by MultiTable (which generally needs exactly that). Users can get their original table names by passing scope="public" or scope=Scope.PUBLIC.
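As a rough illustration of the scope behavior described in the last bullet, here is a toy stand-in. It does not import gretel_trainer and is not its implementation: `ToyRelationalData`, the `INTERNAL` member, the `invented` argument, and the table names are hypothetical; only `Scope.PUBLIC` / `"public"` and the `list_all_tables` method name come from the description above.

```python
from enum import Enum


class Scope(str, Enum):
    PUBLIC = "public"      # table names as the user originally added them
    INTERNAL = "internal"  # hypothetical name for the default view


class ToyRelationalData:
    """Toy model of a table registry; not the gretel_trainer class."""

    def __init__(self):
        self._plain = []     # tables without JSON columns
        self._invented = {}  # original table name -> invented table names

    def add_table(self, name, invented=None):
        if invented:
            self._invented[name] = list(invented)
        else:
            self._plain.append(name)

    def list_all_tables(self, scope=Scope.INTERNAL):
        scope = Scope(scope)  # accepts either "public" or Scope.PUBLIC
        if scope is Scope.PUBLIC:
            return self._plain + list(self._invented)
        # Default: expose invented tables in place of the original JSON table.
        return self._plain + [n for names in self._invented.values() for n in names]


rd = ToyRelationalData()
rd.add_table("users")
rd.add_table("orders", invented=["orders_invented_1", "orders_invented_2"])

default_names = rd.list_all_tables()
public_names = rd.list_all_tables(scope="public")
```

Here `default_names` contains the invented names in place of `orders`, while `public_names` contains `orders` itself.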

All the manual modification methods (e.g. set_primary_key, remove_foreign_key, etc.) have been updated to work correctly when referencing a source-with-JSON table. The public interface is unchanged. (This accounts for the majority of the changes in the relational.core module.)


@pimlock pimlock left a comment


Well, this PR is dense :)
It looks great! I got through most of it and added some nits, and I think it would be easier to continue by walking through this together. I'm going to set up some time next week.


@pimlock pimlock left a comment


Looks good!

Thanks for adding some logging. I think we could add more debug logging over time (even around core relational); at least for me, it would make it easier to understand what's happening when I run it.

@mikeknep mikeknep merged commit 58e8891 into main May 16, 2023
@mikeknep mikeknep deleted the json branch May 16, 2023 20:23
3 participants