Reduce unnecessary JSON parsing #112
Conversation
I had a nit and some confusion, so I'm approving this... but if my confusion is justified and you want me to look at changes, lemme know.
```python
# Quoted from the diff: a column is considered JSON only if every
# previewed value is a dict, or every previewed value is a list.
return [
    col
    for col, series in column_previews.items()
    if series.apply(is_dict).all() or series.apply(is_list).all()
]
```
And I like the `all` here instead of an `any`; if it doesn't seem to be consistently JSON, then an `any` might just catch a stray and not a fully valid column.

For the check itself, just wondering: would attempting a `json.loads` be a good way to check? With what you have now, are we forcing the column to be either all dicts or all lists, vs. allowing a mix? Is that what you want?

EDIT: Okay, so I read the bottom of this file first, then saw `_normalize_json` above. I see that you handle dicts and lists separately and differently. But I also saw that `elif`... again, what if there's a mix of dicts and lists? Should that `elif` be an `if`?
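A minimal sketch of the `all`-vs-`any` distinction being discussed; `is_dict` here is a hypothetical stand-in for the helper used in this PR, not the actual implementation:

```python
import pandas as pd

def is_dict(value):
    # Hypothetical stand-in; the PR's real helper may differ.
    return isinstance(value, dict)

# One stray dict in an otherwise non-JSON column:
series = pd.Series([{"a": 1}, "plain text", 42])
mask = series.apply(is_dict)

print(mask.any())  # True  -- an any() check would misclassify this column
print(mask.all())  # False -- the all() check correctly rejects it
```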
This is a really interesting call-out.

First, current context: the `if/elif` works currently even with a mix of dicts and lists, because if we enter the `if dict_cols` branch, we transform the dict values (lifting them to columns) and add the resulting df to the `nested_dfs` list; then on the next iteration of the function we shouldn't find any dicts, but we will find the (unaltered, original) list columns and deal with them then. What's interesting, though, is that this means we're parsing the list columns the first time through, but then dropping the result on the floor and picking it up again the next time.

I tried switching the `elif` to an `if` and a bunch of stuff blows up; I think trying to combine the two transformations in one iteration through the function is too complicated.

Instead, I'm pushing up a change that defers the check for any list columns, so that if we find any dict columns first, we just deal with them immediately. We'll still have a few iterations of the function that check both, but at least it'll be fewer? (A sketch of this control flow follows below.)
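A hedged sketch of the deferred check described above; the names (`one_pass`, `cols_to_scan`) and the lifting/exploding logic are assumptions that mirror the discussion, not the real code:

```python
import pandas as pd

# Hypothetical stand-ins for the PR's helpers; the real ones may differ.
def is_dict(value): return isinstance(value, dict)
def is_list(value): return isinstance(value, list)

def one_pass(df: pd.DataFrame, cols_to_scan) -> pd.DataFrame:
    """One iteration of the flattening loop: dict columns are handled
    immediately; the list-column scan is deferred until none remain."""
    dict_cols = [c for c in cols_to_scan if df[c].dropna().apply(is_dict).all()]
    if dict_cols:
        # Lift dict keys into columns; any list columns are deliberately
        # left untouched and handled on a later pass over the result.
        for c in dict_cols:
            lifted = df[c].apply(pd.Series).add_prefix(f"{c}.")
            df = df.drop(columns=c).join(lifted)
        return df
    # Only computed when there are no dict columns (the deferral above).
    list_cols = [c for c in cols_to_scan if df[c].dropna().apply(is_list).all()]
    for c in list_cols:
        df = df.explode(c)  # one list element per row
    return df
```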
I'm happy to talk through the logic here if easier. The code currently works with dicts within lists and lists within dicts, but the top level of each column must be either all lists or all dicts. I don't think anything is dropped on the floor (I could be wrong, though). A table might be processed multiple times because it contains a list inside a dict inside a list inside a dict, etc., so flattening each of these is handled in recursive steps, where each run sees the output of the previous run, i.e. the same table with one less level of nesting. (A toy example of these passes is sketched below.)
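A toy illustration of those recursive passes (hypothetical data, not from the PR; the pass-by-pass behavior follows the description above):

```python
# Each pass removes exactly one level of nesting:
row = {"payload": [{"user": {"id": 1}, "tags": ["a", "b"]}]}

# pass 1: "payload" is an all-list column   -> explode the list into rows
# pass 2: "payload" is now an all-dict column -> lift keys into
#         "payload.user" and "payload.tags" columns
# pass 3: "payload.user" is all-dict -> lift into "payload.user.id";
#         "payload.tags" is all-list -> explode into rows
# pass 4: nothing nested remains -> done
```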
In other words, the code expects JSON with some sort of reasonable schema. Optional/non-existent params/columns are OK, but the same param/column (including both top-level columns and invented columns) cannot be a list sometimes and a dict other times. Handling the latter is quite complex and may require making assumptions that aren't always correct, and I'm hoping that most production datasets are much better behaved than that, so I left it out for now. ("Left it out" here means that the JSON special handling is not applied to these columns, so they are treated as categorical.)
Oh good call; I didn't read @tylersbray's comment that way initially, but re-reading it now I see that interpretation, and yeah, totally agree that there's an expectation here that a single column does not contain e.g. both `{"foo": 2}` and `[1, 2, 3]`.
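To make that concrete, a small demonstration (reusing the same hypothetical `is_dict`/`is_list` stand-ins as above) that such a mixed column fails both `all()` checks and therefore falls back to categorical handling:

```python
import pandas as pd

def is_dict(v): return isinstance(v, dict)   # hypothetical stand-ins
def is_list(v): return isinstance(v, list)

mixed = pd.Series([{"foo": 2}, [1, 2, 3]])
print(mixed.apply(is_dict).all())  # False
print(mixed.apply(is_list).all())  # False
# Neither check passes, so the column is not treated as JSON.
```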
Looking at the code again, this line might be computed unnecessarily if `dict_cols` is not empty: `list_cols = [col for col in cols_to_scan if df[col].dropna().apply(is_list).all()]`. So depending on how expensive it is, it might make sense to defer computing it until after we know that `dict_cols` is empty (change the `elif` to `else: list_cols = ...`, then `if list_cols: ...`).
```diff
@@ -759,6 +759,33 @@ def test_lists_of_lists():
     )


+def test_mix_of_dict_and_list_cols():
```
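The diff excerpt truncates the test body; a plausible shape for it, based on the discussion above (the data and the expected behavior noted in comments are assumptions, not the actual test):

```python
import pandas as pd

def test_mix_of_dict_and_list_cols():
    # One all-dict column and one all-list column in the same table;
    # per the discussion, dict columns should be flattened first and
    # list columns handled on a later pass.
    df = pd.DataFrame(
        {
            "d": [{"a": 1}, {"a": 2}],
            "l": [[1, 2], [3, 4]],
        }
    )
    # ... feed df through the normalization entry point and assert that
    # both the lifted dict keys and the exploded list rows come out.
```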
Thanks for enshrining this in a test.
I haven't looked at the code in detail, but the logic in the PR description SGTM. Since we currently only consider a column to be a JSON column if all of its rows (excluding NaN) are either dictionaries or lists, sampling 5 non-NaN values seems sufficient. In other words, columns that contain some JSON and some literals, or even all JSON but some lists and some dicts, are not considered to be JSON columns downstream in the code (IIRC), so there is no need to keep scanning them once we determine that there is at least one mismatch. One reason for a larger sample is to more quickly rule out columns that are (for example) 90%+ JSON, but I hope most datasets are much better behaved than that, which would make a smaller sample size faster in aggregate.
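A minimal sketch of the 5-row preview being discussed; `column_previews` matches the snippet quoted earlier, but `PREVIEW_SIZE` and the function name are assumptions:

```python
PREVIEW_SIZE = 5  # arbitrary, per the discussion

def build_column_previews(df):
    # Only object-dtype columns can hold dicts/lists or JSON strings,
    # so everything else is skipped up front.
    return {
        col: df[col].dropna().head(PREVIEW_SIZE)
        for col in df.columns
        if df[col].dtype == object
    }
```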
I noticed during an end-to-end test of a set of large tables that the initial check-for-JSON-and-normalize-it work that happens when we call `add_table` was taking a long time, even though none of the columns contained JSON data! The changes here speed up the entire process:

- We check only a small preview (the first 5 non-NaN values) of each candidate column, and pass the resulting set of JSON columns explicitly into `_normalize_json` to further reduce the amount of parsing necessary (in addition to skipping non-object-dtype columns, we'd also skip columns that we just determined contain "non-JSON strings").
- Subsequent invocations of `_normalize_json` (as we descend levels of nesting) do not pass in an explicit set of columns.

Using 5 records for the preview/check was totally arbitrary; I'm open to including more or fewer records if anyone has an opinion / good reason.
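A hedged sketch of that call pattern; the signature of `_normalize_json` and the `detect_json_columns` helper are assumptions for illustration, not the PR's actual API:

```python
def add_table(df):
    # First pass: decide which columns look like JSON using the 5-row
    # preview, and hand that explicit set to _normalize_json.
    json_cols = detect_json_columns(df)  # hypothetical helper
    if json_cols:
        df = _normalize_json(df, columns=json_cols)
    return df

def _normalize_json(df, columns=None):
    if columns is None:
        # Subsequent (recursive) invocations rescan, because flattening
        # one level of nesting can surface new dict/list columns.
        columns = detect_json_columns(df)
    ...
```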