
fix: nested json table string length #135

Merged: 6 commits, Jul 17, 2023

Conversation

benmccown
Contributor

When parsing nested JSON with lengthy key strings, the breadcrumb-based table name (built by combining those key strings) could cause an error when training the model, because JSON filenames are limited to 128 characters. This change reuses the existing make_suffix function (which ensures no sanitized_str collisions) and truncates the table name so that it never exceeds the 128-character limit.
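A minimal sketch of the truncate-plus-suffix idea described above. It does not reproduce the real make_suffix or sanitize_str from gretel_trainer: `illustrative_suffix`, `truncated_table_name`, the sanitizing regex, and the 8-character hash length are all assumptions made for illustration. Only the 128-character limit comes from the description.

```python
import hashlib
import re

MAX_NAME_LENGTH = 128  # limit cited in the PR description


def illustrative_suffix(raw: str) -> str:
    """Hypothetical stand-in for make_suffix: a short, deterministic hash of the raw string."""
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:8]


def truncated_table_name(breadcrumb: str, max_length: int = MAX_NAME_LENGTH) -> str:
    """Sanitize a breadcrumb path, then truncate it so name + suffix fits in max_length."""
    # Assumed sanitization rule: replace anything outside [A-Za-z0-9_-] with "_".
    sanitized = re.sub(r"[^A-Za-z0-9_\-]", "_", breadcrumb)
    # The suffix is computed from the untruncated breadcrumb, so two paths that
    # share a long prefix still end up with different names after truncation.
    suffix = "_" + illustrative_suffix(breadcrumb)
    return sanitized[: max_length - len(suffix)] + suffix
```

Computing the suffix before truncating is the design point: truncation alone would make long paths with a common prefix collide.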

@benmccown benmccown requested a review from mikeknep July 12, 2023 22:19
@benmccown
Contributor Author

Let me know if I am missing anything in the unit test coverage. I think the tests I added should cover us, but I wasn't sure whether any other existing tests would benefit from running the real sanitize_str method instead of the mocked function.

Instead of sanitizing the full json path to an invented table, we create
a table name based on the producer table with a uuid to prevent
collisions, and we map the full json path to the unique table string.
When nested json structures are extracted to individual dataframes they
will be written to resulting individual tables. Since we are creating
table names with the convention <producer_table>_invented_<uuid> we now
store the mapping details for tables in the producer table's debug
summary, as well as the json breadcrumb path in each invented table's
individual debug summary.
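
The naming convention in this commit message can be sketched roughly as follows. This is not the code from the PR: `invent_table_names` is a hypothetical helper, and the use of `uuid.uuid4().hex` is an assumption about how the uuid is generated and formatted.

```python
import uuid


def invent_table_names(producer_table: str, json_paths: list[str]) -> dict[str, str]:
    """Map each full JSON breadcrumb path to a '<producer_table>_invented_<uuid>' name."""
    # Assumption: a random uuid4 rendered as 32 hex characters.
    return {
        path: f"{producer_table}_invented_{uuid.uuid4().hex}"
        for path in json_paths
    }
```

With this scheme the table name length depends only on the producer table name, not on the depth or key lengths of the nested JSON. The returned mapping is the kind of information the commit message says is recorded in the debug summary.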
@benmccown
Contributor Author

The latest commits incorporate feedback gathered via Slack: invented tables are now named <producer_table>_invented_<uuid>, and the resulting mapping info is stored in the debug_summary file.

Contributor

@mikeknep mikeknep left a comment


Awesome work! Most of these suggestions come down to style / lesser-known features and functions; there is one test change that is confusing me.

src/gretel_trainer/relational/core.py (outdated)
src/gretel_trainer/relational/json.py (outdated)
tests/relational/conftest.py (outdated)
tests/relational/test_backup.py
tests/relational/test_relational_data_with_json.py (outdated)
tests/relational/test_relational_data_with_json.py (outdated)
tests/relational/test_relational_data_with_json.py (outdated)
src/gretel_trainer/relational/core.py (outdated)
tests/relational/conftest.py
Contributor

@mikeknep mikeknep left a comment


Great work!

@benmccown benmccown merged commit 1a59fe0 into main Jul 17, 2023