Reduce unnecessary JSON parsing #112

Merged: 6 commits into main on May 25, 2023
Conversation

@mikeknep (Contributor) commented May 24, 2023

I noticed during an end-to-end test of a set of large tables that the initial check-for-JSON-and-normalize-it work that happens when we call add_table was taking a long time, even though none of the columns contained JSON data! The changes here speed up the entire process:

  • Only operate on columns with dtype "object" (we can confidently ignore int, float, etc. columns)
  • Before kicking off the full json normalization function, preview the first 5 non-null values in each object-dtype column in the dataframe...
    • If none of these contain JSON, don't bother executing json normalization at all
    • If any do contain JSON, pass those column names in to the "initial round" of _normalize_json to further reduce the amount of parsing necessary (in addition to skipping non-object-dtype columns, we'd also skip columns that we just determined contain "non-JSON strings"). Subsequent invocations of _normalize_json (as we descend levels of nesting) do not pass in an explicit set of columns

Using 5 records for the preview/check was totally arbitrary; I'm open to including more or fewer records if anyone has an opinion / good reason.
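[Editor's note] A minimal sketch of the peek-before-parsing idea described above. PREVIEW_SIZE and get_json_candidate_columns are illustrative names, not the PR's actual API; is_dict, is_list, and column_previews mirror identifiers that appear in the diff below.

    import pandas as pd

    PREVIEW_SIZE = 5  # arbitrary, per the description above

    def is_dict(value) -> bool:
        return isinstance(value, dict)

    def is_list(value) -> bool:
        return isinstance(value, list)

    def get_json_candidate_columns(df: pd.DataFrame) -> list:
        # Only object-dtype columns can hold dicts/lists; int, float, etc. are skipped.
        object_cols = df.select_dtypes(include="object").columns
        # Peek at the first few non-null values instead of parsing every row.
        column_previews = {col: df[col].dropna().head(PREVIEW_SIZE) for col in object_cols}
        return [
            col
            for col, series in column_previews.items()
            if not series.empty
            and (series.apply(is_dict).all() or series.apply(is_list).all())
        ]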

@mikeknep requested review from pimlock, tylersbray and sami-t on May 24, 2023 at 20:27
@tylersbray (Contributor) left a comment:
I had a nit and some confusion, so I'm approving this... but if my confusion is justified and you want me to look at changes, lemme know.

    return [
        col
        for col, series in column_previews.items()
        if series.apply(is_dict).all() or series.apply(is_list).all()
    ]
Inline comment (Contributor):
And I like the all here instead of an any: if the column doesn't seem to be consistently JSON, an any might just catch a stray value rather than a fully valid column.

For the check itself, just wondering, would attempting a json.loads be a good way to check? With what you have now, are we forcing the column to be either all dicts or all lists, vs. allowing a mix? Is that what you want?

EDIT: Okay, so I read the bottom of this file first, then saw _normalize_json above. I see that you handle dicts and lists separately and differently. But I also saw that elif... again, what if there's a mix of dicts and lists? Should that elif be an if?
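[Editor's note] A quick illustration (not from the PR) of why .all() is preferable to .any() here, reusing the is_dict helper sketched above:

    mixed = pd.Series(["plain text", {"k": 1}, "more text"])  # one stray dict among strings
    json_col = pd.Series([{"a": 1}, {"b": 2}, {"c": 3}])      # consistently dicts

    mixed.apply(is_dict).any()     # True  -> .any() would wrongly flag the column as JSON
    mixed.apply(is_dict).all()     # False -> .all() correctly skips it
    json_col.apply(is_dict).all()  # True  -> only consistently-JSON columns pass the check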

@mikeknep (Contributor, Author) replied:
This is a really interesting call-out.

First, current context: the if/elif works currently even with a mix of dicts and lists because if we enter the if dict_cols branch, we transform the dict values (lifting them to columns) and add the resulting df to the nested_dfs list; then on the next iteration of the function we shouldn't find any dicts, but we will find the (unaltered, original) list columns and deal with them then. What's interesting, though, is that this means we're parsing the list columns the first time through, but then dropping the result on the floor and picking it up again the next time.

I tried switching the elif to an if and a bunch of stuff blows up—I think trying to combine the two transformations in one iteration through the function is too complicated.

Instead, I'm pushing up a change that defers the check for any list_columns so that if we find any dict columns first, we just deal with them immediately. We'll still have a few iterations of the function that check both, but at least it'll be fewer?

Reply:
I'm happy to talk through the logic here if easier. The code currently works with dicts within lists and lists within dicts, but the top level of each column must be either all list or all dict. I don't think anything is dropped on the floor - I could be wrong though. A table might be processed multiple times because it contains a list inside a dict inside a list inside a dict, etc., so flattening each level is handled in recursive steps (where the next run sees the output of the previous run, i.e. the same table with one less level of nesting).

Follow-up:

In other words the code expects JSON with some sort of reasonable schema. Optional/non-existent params/columns are OK, but the same param/column (including both top-level columns and invented columns) cannot be a list sometimes and a dict other times. Handling the latter is quite complex and may require making some assumptions which may not always be correct, and I'm hoping that most production datasets are much better behaved than that, so I left it out for now ("left it out" here means that the JSON special handling is not applied to such columns, so they are treated as categorical).
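[Editor's note] A hypothetical example of the constraint being described (not from the PR): nesting one type inside the other is fine, but a single column may not mix the two at the top level.

    supported = pd.DataFrame({
        "a": [{"x": [1, 2]}, {"x": [3]}],  # all dicts (lists nested inside are fine)
        "b": [[{"y": 1}], [{"y": 2}]],     # all lists (dicts nested inside are fine)
    })
    unsupported = pd.DataFrame({
        "c": [{"foo": 2}, [1, 2, 3]],      # mixes dict and list at the top level,
    })                                      # so "c" is treated as categorical instead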

@mikeknep (Contributor, Author):

Oh good call; I didn't read @tylersbray's comment that way initially, but re-reading it now I see that interpretation and yeah, totally agree that there's an expectation here that a single column does not contain e.g. {"foo": 2} and [1, 2, 3].

@sami-t commented May 25, 2023:
Looking at the code again, this line might be computed unnecessarily when dict_cols is not empty: list_cols = [col for col in cols_to_scan if df[col].dropna().apply(is_list).all()]. Depending on how expensive it is, it might make sense to defer computing it until after we know that dict_cols is empty (change the elif to else: list_cols = ..., then if list_cols: ...).
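[Editor's note] A rough sketch of the deferral being suggested here (surrounding names assumed; not necessarily the literal change that landed):

    dict_cols = [col for col in cols_to_scan if df[col].dropna().apply(is_dict).all()]
    if dict_cols:
        ...  # lift dict keys into columns; any list columns are picked up on a later pass
    else:
        # list_cols is only computed once we know there are no dict columns left
        list_cols = [col for col in cols_to_scan if df[col].dropna().apply(is_list).all()]
        if list_cols:
            ...  # flatten/explode the list columns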

@mikeknep (Contributor, Author) replied:
@sami-t exactly, that's the change I made in 623e15b

@@ -759,6 +759,33 @@ def test_lists_of_lists():
)


def test_mix_of_dict_and_list_cols():
Review comment (Contributor):

Thanks for enshrining this in a test.
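[Editor's note] The test body itself isn't shown in this view; a minimal sketch of the kind of input such a test might exercise (hypothetical, the actual test may differ):

    df = pd.DataFrame({
        "dict_col": [{"a": 1}, {"a": 2}],
        "list_col": [[1, 2], [3]],
        "plain_col": ["x", "y"],
    })
    # Expectation: dict_col is lifted into flat columns on the first pass, list_col is
    # flattened on a subsequent pass, and plain_col is left untouched.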

@sami-t left a comment:
I haven't looked at the code in detail but the logic in the PR description SGTM. Since we currently only consider a column to be a JSON column if all its rows (excluding NaN) are dictionaries or all are lists, sampling 5 non-NaN values seems sufficient. In other words, columns that contain some JSON and some literals, or even all JSON but some lists and some dicts, are not considered to be JSON columns downstream in the code (IIRC), so there is no need to scan them once we determine that there is at least one mismatch. One reason for a larger sample is to more quickly rule out columns that are (for example) 90%+ JSON, but I hope most datasets are much better behaved than that, which would make a smaller sample size faster in aggregate.

@mikeknep merged commit e89359a into main on May 25, 2023
@mikeknep deleted the json-peek branch on May 25, 2023 at 19:39