fix: nested json table string length #135
Conversation
When parsing nested JSON with lengthy key strings, the combination of key strings in the breadcrumb-based table name can exceed the maximum JSON filename length of 128 characters, which causes an error when training the model. Here we use the existing make_suffix function (which ensures no sanitized_str collisions) and truncate the table name so it never exceeds the 128-character limit.
Let me know if I am missing anything in the unit test coverage. I think the tests I added should cover us, but I wasn't sure if there were any other existing tests that would benefit from having the real
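To make the idea concrete, here is a minimal sketch of the truncation approach. The helper below is a hypothetical stand-in for the repo's existing make_suffix function (its real signature and sanitization rules live in the codebase); the 128-character limit comes from the PR description, while the function and constant names here are illustrative only.

```python
import hashlib

MAX_TABLE_NAME_LEN = 128  # JSON filename limit described in the PR


def make_suffix(original: str) -> str:
    # Hypothetical stand-in for the existing make_suffix helper: derive a short,
    # deterministic suffix so two different long breadcrumb paths never collapse
    # to the same truncated table name.
    return hashlib.sha1(original.encode("utf-8")).hexdigest()[:8]


def truncated_table_name(breadcrumb_path: str) -> str:
    """Sanitize a JSON breadcrumb path into a table name that fits the length limit."""
    sanitized = breadcrumb_path.replace(".", "_").replace(">", "_")
    if len(sanitized) <= MAX_TABLE_NAME_LEN:
        return sanitized
    suffix = make_suffix(breadcrumb_path)
    # Keep room for the collision-avoiding suffix plus a separator.
    keep = MAX_TABLE_NAME_LEN - len(suffix) - 1
    return f"{sanitized[:keep]}_{suffix}"
```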
Instead of sanitizing the full JSON path into an invented table name, we now build the name from the producer table plus a UUID to prevent collisions, and we map the full JSON path to that unique table name.
When nested JSON structures are extracted into individual dataframes, each is written to its own table. Since table names follow the convention <producer_table>_invented_<uuid>, we now store the table mapping details in the producer table's debug summary, and the JSON breadcrumb path in each invented table's own debug summary. A sketch of this bookkeeping follows the next comment.
The latest commits incorporate feedback gathered via Slack on how to name invented tables
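A rough sketch of the naming convention and the two-sided bookkeeping described above, under the assumption that the debug summaries are plain dictionaries; the field names (table_name_mappings, json_breadcrumb_path) and helper functions here are hypothetical and only illustrate the shape of the data, not the repo's actual API.

```python
import uuid


def invented_table_name(producer_table: str) -> str:
    # <producer_table>_invented_<uuid> convention from the PR description.
    return f"{producer_table}_invented_{uuid.uuid4().hex}"


# Full JSON breadcrumb path -> unique invented table name.
table_name_mappings: dict[str, str] = {}


def register_invented_table(producer_table: str, json_breadcrumb: str) -> str:
    # Create the unique table name and remember which JSON path it came from.
    table_name = invented_table_name(producer_table)
    table_name_mappings[json_breadcrumb] = table_name
    return table_name


# The producer table's debug summary keeps the complete mapping...
producer_debug_summary = {
    "table_name_mappings": table_name_mappings,
}

# ...while each invented table's debug summary records only its own breadcrumb path.
invented_debug_summary = {
    "json_breadcrumb_path": "orders>items>sku",  # example breadcrumb for one invented table
}
```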
Awesome work! Most of these suggestions come down to style / lesser-known features and functions; there is one test change that is confusing me.
Great work!