fix: nested json table string length #135

benmccown · 2023-07-12T22:19:41Z

When parsing nested json with lengthy key strings, the breadcrumb-based table name (built by combining those key strings) could cause an error when training the model, due to a maximum json filename length of 128 chars. This change uses the existing make_suffix function (which ensures no sanitized_str collisions) and truncates the table name so that it never exceeds the 128 char limit.
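To illustrate the truncate-plus-suffix idea described above, here is a minimal sketch. It is not the code from this PR: `sanitize`, `suffix_for`, and `truncated_table_name` are hypothetical stand-ins (the real `sanitize_str` and `make_suffix` may behave differently), and the 10-character hash suffix is an assumption chosen for the example.

```python
import hashlib
import re

MAX_NAME_LENGTH = 128  # limit quoted in the PR description


def sanitize(s: str) -> str:
    """Hypothetical stand-in: keep alphanumeric/underscore runs, joined by '_'."""
    return "_".join(re.findall(r"[A-Za-z0-9_]+", s))


def suffix_for(s: str) -> str:
    """Hypothetical stand-in for make_suffix: short, stable hash of the raw string."""
    return hashlib.sha256(s.encode("utf-8")).hexdigest()[:10]


def truncated_table_name(breadcrumb: str, max_len: int = MAX_NAME_LENGTH) -> str:
    suffix = suffix_for(breadcrumb)
    # Reserve room for "_" plus the suffix so the result never exceeds max_len.
    head = sanitize(breadcrumb)[: max_len - len(suffix) - 1]
    return f"{head}_{suffix}"
```

Because the suffix is derived from the full, untruncated breadcrumb, two long paths that share the same first characters should still end up with different names after truncation.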

benmccown · 2023-07-12T22:20:30Z

Let me know if I am missing anything in the unit test coverage. I think the tests I added should cover us, but I wasn't sure whether any other existing tests would benefit from running the real sanitize_str method instead of the mocked function.

Instead of sanitizing the full json path to an invented table, we create a table name based on the producer table with a uuid to prevent collisions, and we map the full json path to the unique table string.

When nested json structures are extracted to individual dataframes, each dataframe is written to its own table. Since table names now follow the convention <producer_table>_invented_<uuid>, we store the mapping details for the invented tables in the producer table's debug summary, and the json breadcrumb path in each invented table's individual debug summary.
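The naming convention and path mapping described above could be sketched roughly as follows. This is an illustration only: `InventedTableRegistry`, its methods, and the summary keys are hypothetical names, not identifiers from gretel_trainer, and the layout of the real debug summary is not shown in this conversation.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class InventedTableRegistry:
    """Hypothetical helper: maps full json breadcrumb paths to invented table names."""

    producer_table: str
    path_to_table: Dict[str, str] = field(default_factory=dict)

    def table_for(self, json_path: str) -> str:
        # Reuse the existing name so one path always maps to one table.
        if json_path not in self.path_to_table:
            name = f"{self.producer_table}_invented_{uuid.uuid4().hex}"
            self.path_to_table[json_path] = name
        return self.path_to_table[json_path]

    def producer_debug_summary(self) -> Dict[str, object]:
        # Assumed shape: the producer's summary lists every path -> table mapping.
        return {
            "table": self.producer_table,
            "invented_tables": dict(self.path_to_table),
        }
```

With this approach the table name length depends only on the producer table name plus a fixed-size suffix, no matter how deep or verbose the json path is.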

benmccown · 2023-07-14T20:44:35Z

The latest commits incorporate feedback gathered via Slack to name invented tables <producer_table>_invented_<uuid>. The resulting mapping info is now also stored in the debug_summary file.

mikeknep

Awesome work! Most of these suggestions come down to style / lesser-known features and functions; there is one test change that is confusing me.

src/gretel_trainer/relational/core.py

src/gretel_trainer/relational/json.py

tests/relational/conftest.py

tests/relational/test_backup.py

tests/relational/test_relational_data_with_json.py

src/gretel_trainer/relational/core.py

tests/relational/conftest.py

mikeknep

Great work!

benmccown requested a review from mikeknep July 12, 2023 22:19

benmccown added 2 commits July 14, 2023 12:45

feat: remove json breadcrumbs from table names

a43da64


mikeknep reviewed Jul 17, 2023


benmccown added 2 commits July 17, 2023 10:29

feedback: cleaning up tests

c3eb9c5

feedback: cleaning up tests

bbbe9a2

mikeknep reviewed Jul 17, 2023


src/gretel_trainer/relational/core.py (outdated)

tests/relational/conftest.py

feedback: typing for conftest

6f3cbc7

mikeknep approved these changes Jul 17, 2023


benmccown merged commit 1a59fe0 into main Jul 17, 2023
