Reduce unnecessary JSON parsing #112

Merged: 6 commits into main on May 25, 2023
Conversation

@mikeknep (Contributor) commented May 24, 2023

I noticed during an end-to-end test of a set of large tables that the initial check-for-JSON-and-normalize-it work that happens when we call add_table was taking a long time, even though none of the columns contained JSON data! The changes here speed up the entire process:

  • Only operate on columns with dtype "object" (we can confidently ignore int, float, etc. columns)
  • Before kicking off the full json normalization function, preview the first 5 non-null values in each object-dtype column in the dataframe...
    • If none of these contain JSON, don't bother executing json normalization at all
    • If any do contain JSON, pass those column names in to the "initial round" of _normalize_json to further reduce the amount of parsing necessary (in addition to skipping non-object-dtype columns, we'd also skip columns that we just determined contain "non-JSON strings"). Subsequent invocations of _normalize_json (as we descend levels of nesting) do not pass in an explicit set of columns

Using 5 records for the preview/check was totally arbitrary; I'm open to including more or fewer records if anyone has an opinion / good reason.
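[Editor's note] A minimal sketch of the peek-before-parsing idea described above. PREVIEW_SIZE and get_json_candidate_columns are illustrative names, not the PR's actual API; is_dict, is_list, and column_previews mirror identifiers that appear in the diff below.

    import pandas as pd

    PREVIEW_SIZE = 5  # arbitrary, per the description above

    def is_dict(value) -> bool:
        return isinstance(value, dict)

    def is_list(value) -> bool:
        return isinstance(value, list)

    def get_json_candidate_columns(df: pd.DataFrame) -> list:
        # Only object-dtype columns can hold dicts/lists; int, float, etc. are skipped.
        object_cols = df.select_dtypes(include="object").columns
        # Peek at the first few non-null values instead of parsing every row.
        column_previews = {col: df[col].dropna().head(PREVIEW_SIZE) for col in object_cols}
        return [
            col
            for col, series in column_previews.items()
            if not series.empty
            and (series.apply(is_dict).all() or series.apply(is_list).all())
        ]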

@mikeknep requested review from pimlock, tylersbray and sami-t on May 24, 2023 at 20:27
@tylersbray (Contributor) left a comment:
I had a nit and some confusion, so I'm approving this... but if my confusion is justified and you want me to look at changes, lemme know.

    return [
        col
        for col, series in column_previews.items()
        if series.apply(is_dict).all() or series.apply(is_list).all()
    ]
Inline comment (Contributor):
And I like the all here instead of an any: if the column doesn't seem to be consistently JSON, an any might just catch a stray value rather than a fully valid column.

For the check itself, just wondering, would attempting a json.loads be a good way to check? With what you have now, are we forcing the column to be either all dicts or all lists, vs. allowing a mix? Is that what you want?

EDIT: Okay, so I read the bottom of this file first, then saw _normalize_json above. I see that you handle dicts and lists separately and differently. But I also saw that elif... again, what if there's a mix of dicts and lists? Should that elif be an if?
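[Editor's note] A quick illustration (not from the PR) of why .all() is preferable to .any() here, reusing the is_dict helper sketched above:

    mixed = pd.Series(["plain text", {"k": 1}, "more text"])  # one stray dict among strings
    json_col = pd.Series([{"a": 1}, {"b": 2}, {"c": 3}])      # consistently dicts

    mixed.apply(is_dict).any()     # True  -> .any() would wrongly flag the column as JSON
    mixed.apply(is_dict).all()     # False -> .all() correctly skips it
    json_col.apply(is_dict).all()  # True  -> only consistently-JSON columns pass the check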

@mikeknep (Contributor, Author) replied:
This is a really interesting call-out.

First, current context: the if/elif works currently even with a mix of dicts and lists because if we enter the if dict_cols branch, we transform the dict values (lifting them to columns) and add the resulting df to the nested_dfs list; then on the next iteration of the function we shouldn't find any dicts, but we will find the (unaltered, original) list columns and deal with them then. What's interesting, though, is that this means we're parsing the list columns the first time through, but then dropping the result on the floor and picking it up again the next time.

I tried switching the elif to an if and a bunch of stuff blows up—I think trying to combine the two transformations in one iteration through the function is too complicated.

Instead, I'm pushing up a change that defers the check for any list_columns so that if we find any dict columns first, we just deal with them immediately. We'll still have a few iterations of the function that check both, but at least it'll be fewer?

Reply:
I'm happy to talk through the logic here if easier. The code currently works with dicts within lists and lists within dicts, but the top level of each column must be either all list or all dict. I don't think anything is dropped on the floor - I could be wrong though. A table might be processed multiple times because it contains a list inside a dict inside a list inside a dict, etc., so flattening each level is handled in recursive steps (where the next run sees the output of the previous run, i.e. the same table with one less level of nesting).

Follow-up:

In other words the code expects JSON with some sort of reasonable schema. Optional/non-existent params/columns are OK, but the same param/column (including both top-level columns and invented columns) cannot be a list sometimes and a dict other times. Handling the latter is quite complex and may require making some assumptions which may not always be correct, and I'm hoping that most production datasets are much better behaved than that, so I left it out for now ("left it out" here means that the JSON special handling is not applied to such columns, so they are treated as categorical).
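[Editor's note] A hypothetical example of the constraint being described (not from the PR): nesting one type inside the other is fine, but a single column may not mix the two at the top level.

    supported = pd.DataFrame({
        "a": [{"x": [1, 2]}, {"x": [3]}],  # all dicts (lists nested inside are fine)
        "b": [[{"y": 1}], [{"y": 2}]],     # all lists (dicts nested inside are fine)
    })
    unsupported = pd.DataFrame({
        "c": [{"foo": 2}, [1, 2, 3]],      # mixes dict and list at the top level,
    })                                      # so "c" is treated as categorical instead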

@mikeknep (Contributor, Author):

Oh good call; I didn't read @tylersbray's comment that way initially, but re-reading it now I see that interpretation and yeah, totally agree that there's an expectation here that a single column does not contain e.g. {"foo": 2} and [1, 2, 3].

@sami-t commented May 25, 2023:
Looking at the code again, this line might be computed unnecessarily when dict_cols is not empty: list_cols = [col for col in cols_to_scan if df[col].dropna().apply(is_list).all()]. Depending on how expensive it is, it might make sense to defer computing it until after we know that dict_cols is empty (change the elif to else: list_cols = ..., then if list_cols: ...).
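[Editor's note] A rough sketch of the deferral being suggested here (surrounding names assumed; not necessarily the literal change that landed):

    dict_cols = [col for col in cols_to_scan if df[col].dropna().apply(is_dict).all()]
    if dict_cols:
        ...  # lift dict keys into columns; any list columns are picked up on a later pass
    else:
        # list_cols is only computed once we know there are no dict columns left
        list_cols = [col for col in cols_to_scan if df[col].dropna().apply(is_list).all()]
        if list_cols:
            ...  # flatten/explode the list columns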

@mikeknep (Contributor, Author) replied:
@sami-t exactly, that's the change I made in 623e15b

@@ -759,6 +759,33 @@ def test_lists_of_lists():
)


def test_mix_of_dict_and_list_cols():
Review comment (Contributor):

Thanks for enshrining this in a test.
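[Editor's note] The test body itself isn't shown in this view; a minimal sketch of the kind of input such a test might exercise (hypothetical, the actual test may differ):

    df = pd.DataFrame({
        "dict_col": [{"a": 1}, {"a": 2}],
        "list_col": [[1, 2], [3]],
        "plain_col": ["x", "y"],
    })
    # Expectation: dict_col is lifted into flat columns on the first pass, list_col is
    # flattened on a subsequent pass, and plain_col is left untouched.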

@sami-t left a comment:
I haven't looked at the code in detail but the logic in the PR description SGTM. Since we currently only consider a column to be a JSON column if all its rows (excluding NaN) are dictionaries or all are lists, sampling 5 non-NaN values seems sufficient. In other words, columns that contain some JSON and some literals, or even all JSON but some lists and some dicts, are not considered to be JSON columns downstream in the code (IIRC), so there is no need to scan them once we determine that there is at least one mismatch. One reason for a larger sample is to more quickly rule out columns that are (for example) 90%+ JSON, but I hope most datasets are much better behaved than that, which would make a smaller sample size faster in aggregate.

@mikeknep merged commit e89359a into main on May 25, 2023
@mikeknep deleted the json-peek branch on May 25, 2023 at 19:39