[REVIEW] Enable `schema_element` & `keep_quotes` support in json reader #11746

galipremsagar · 2022-09-22T20:21:26Z

Description

This PR plumbs `schema_element` and `keep_quotes` support into the JSON reader.

Deprecation: This PR also deprecates passing dtype as a list. This is an outdated legacy feature that we continued to support, and it cannot be supported together with schema_element.

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

codecov · 2022-09-22T22:28:14Z

Codecov Report

❗ No coverage uploaded for pull request base (branch-22.10@e64c2da).
Patch has no changes to coverable lines.

❗ Current head abae509 differs from pull request most recent head 3b5bacc. Consider uploading reports for the commit 3b5bacc to get more accurate results

Additional details and impacted files

@@               Coverage Diff               @@
##             branch-22.10   #11746   +/-   ##
===============================================
  Coverage                ?   87.52%           
===============================================
  Files                   ?      133           
  Lines                   ?    21796           
  Branches                ?        0           
===============================================
  Hits                    ?    19076           
  Misses                  ?     2720           
  Partials                ?        0

GregoryKimball · 2022-09-23T00:11:01Z

python/cudf/cudf/tests/test_json.py

+    cu_df = cudf.read_json(
+        buffer, lines=True, dtype={"0": "bool", "1": "long"}
+    )


Is the dtype argument going to be a breaking change for the engine='cudf' reader?

Nope, dtype will continue to work for the engine='cudf' reader. The reason I've deprecated passing a list of dtypes to the dtype param is that a list gives us no way to attach a name to each column, which we need now that schema_info is being introduced.

dtype=list is going to be a breaking change once we drop it, but this PR only deprecates it. pandas doesn't have it, so we won't need it either.
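
The deprecation path described in this reply can be sketched as follows. This is a minimal illustration, not the code from this PR: the helper name `check_dtype_argument` and the warning text are hypothetical. It only shows the general pattern of warning on list input while leaving dict input untouched.

```python
import warnings


def check_dtype_argument(dtype):
    """Hypothetical helper: warn when `dtype` is a list, pass dicts through."""
    if isinstance(dtype, list):
        # A list carries no column names, so it cannot be mapped onto a
        # name-keyed schema. Warn now so the feature can be removed later.
        warnings.warn(
            "Passing a list to `dtype` is deprecated; "
            "use a dict of column name to dtype instead.",
            FutureWarning,
            stacklevel=2,
        )
    return dtype


# Dict input: no warning is recorded.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    check_dtype_argument({"0": "bool", "1": "long"})
    dict_warnings = len(caught)

# List input: a FutureWarning is recorded.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    check_dtype_argument(["bool", "long"])
    list_warnings = [w.category for w in caught]
```

Using `FutureWarning` matches the category visible in the diff excerpt further down the thread; everything else here is an assumption.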

wence-

A few minor issues, but looks good overall

python/cudf/cudf/_lib/json.pyx

python/cudf/cudf/io/json.py

python/cudf/cudf/tests/test_json.py

python/cudf/cudf/utils/ioutils.py

karthikeyann

The new tests and the modified old tests could also use cudf_experimental as an engine.

karthikeyann · 2022-09-26T20:04:46Z

python/cudf/cudf/io/json.py

+            FutureWarning,
+        )
+
+    if engine in {"cudf", "cudf_experimental"} and not lines:


With PR #11714, cudf_experimental will support both JSON Lines and records.
Also, the old cudf_experimental already supports the records format.

Suggested change

-    if engine in {"cudf", "cudf_experimental"} and not lines:
+    if engine == "cudf" and not lines:

@karthikeyann, is #11714 ready to merge for 22.10?

If tests pass, it should be. I addressed most of the review comments and left some for follow-up PRs.
I also merged this PR locally and tested it with the above change. Tests pass.
The experimental engine could use more Python tests.
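
The effect of the suggested condition can be sketched in isolation. The thread does not show what the reader does when the condition is true, so the function name `validate_engine`, the `ValueError`, and its message are all assumptions made for illustration.

```python
def validate_engine(engine, lines):
    """Hypothetical check mirroring the suggested condition."""
    # With the suggested change, only the "cudf" engine is restricted to
    # JSON Lines input; "cudf_experimental" passes through with lines=False.
    if engine == "cudf" and not lines:
        raise ValueError("the 'cudf' engine only supports JSON Lines input")
    return engine


# Under the suggested condition this combination is accepted.
accepted = validate_engine("cudf_experimental", lines=False)
```

Under the original condition, `engine in {"cudf", "cudf_experimental"}`, the same call would have been caught by the check.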

Is "[1, 2, 3]\n[4, 5, 6]\n[7, 8, 9]\n" valid JSON Lines?
This is not supported by the cudf_experimental engine. Only the records orient, with lines=True or lines=False, is supported.

Is "[1, 2, 3]\n[4, 5, 6]\n[7, 8, 9]\n" valid JSON Lines?

It is:

In [1]: import pandas as pd

In [2]: json = "[1, 2, 3]\n[4, 5, 6]\n[7, 8, 9]\n"

In [5]: pd.read_json(json, lines=True, orient="records")
Out[5]:
   0  1  2
0  1  2  3
1  4  5  6
2  7  8  9
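
The same point can be checked without pandas. The sketch below uses only the standard-library `json` module and verifies that each line of the string is a complete JSON value, which is the requirement of the JSON Lines format. It says nothing about which layouts the cudf_experimental engine accepts.

```python
import json

text = "[1, 2, 3]\n[4, 5, 6]\n[7, 8, 9]\n"

# JSON Lines: every non-empty line must parse on its own as a JSON value.
rows = [json.loads(line) for line in text.splitlines() if line.strip()]
```

Each line parses to a list of three integers, so the input is valid JSON Lines even though its rows are arrays rather than objects (records).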

…ent_python

python/cudf/cudf/tests/test_json.py

…ent_python

vuule

The new tests look more complicated than I'd imagine, but I'm sure there's a web of inter-library bugs that forces us to validate in a roundabout way :)

vuule · 2022-09-27T07:25:18Z

python/cudf/cudf/tests/test_json.py

+        dtype={
+            "a": cudf.StructDtype(
+                {
+                    "a": cudf.StructDtype({"b": cudf.dtype("int64")}),
+                    "b": cudf.dtype("float64"),
+                }
+            ),
+            "b": cudf.ListDtype(cudf.ListDtype("int64")),
+        },


Cool to see the API in use in a more demanding case.
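
To make the shape of that nested dtype easier to see, here is a plain-Python sketch that builds a tree with the same structure. `SchemaNode` and `to_schema` are hypothetical names invented for this illustration; they are not cuDF or libcudf types, and the real reader may represent the schema differently.

```python
from dataclasses import dataclass, field


@dataclass
class SchemaNode:
    """Hypothetical schema entry: a type name plus named child entries."""
    type: str
    children: dict = field(default_factory=dict)


def to_schema(spec):
    """Convert a nested description into SchemaNode objects.

    dict -> struct with named fields
    list -> list with one element type (stored under the key "element")
    str  -> leaf type
    """
    if isinstance(spec, dict):
        return SchemaNode("struct", {k: to_schema(v) for k, v in spec.items()})
    if isinstance(spec, list):
        return SchemaNode("list", {"element": to_schema(spec[0])})
    return SchemaNode(spec)


# Same shape as the dtype in the test above, written with plain containers.
columns = {
    "a": {"a": {"b": "int64"}, "b": "float64"},
    "b": [["int64"]],
}
schema = {name: to_schema(spec) for name, spec in columns.items()}
```

Every struct field in the tree is addressed by name, which is consistent with the earlier remark that a bare list of dtypes has no way to name columns.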

vuule · 2022-09-27T07:36:24Z

python/cudf/cudf/tests/test_json.py

+    )
+
+    pdf = pd.read_json(
+        StringIO(expected_json_str), orient="records", lines=True


Surprised we need expected_json_str. Can pandas not read actual_json_str and enforce the same types as cudf here?

Nope, pandas will read them as str/object types, so we would need complex workarounds for that too. Since the test already has a workaround, I refrained from adding more workaround code.

wence-

Other than the discussion at https://github.com/rapidsai/cudf/pull/11746/files#diff-ccd1c71f32bc5826f9accf0951f80c6c58c2acc0ca8dfe4f37272c668d25ae14L29, this looks good to my eyes.

galipremsagar · 2022-09-27T09:50:31Z

@gpucibot merge

galipremsagar added 6 commits September 22, 2022 11:54

add schema_element support

3425dd7

add keep_quotes param

4621855

fix warning

dd05c2c

fix warning

7235cbd

add docstring

323c5b2

fix

9594fc0

galipremsagar added the 3 - Ready for Review (Ready for review by team), Python (Affects Python cuDF API.), and 4 - Needs cuDF (Python) Reviewer labels Sep 22, 2022

galipremsagar self-assigned this Sep 22, 2022

galipremsagar requested a review from a team as a code owner September 22, 2022 20:21

galipremsagar requested review from wence- and bdice September 22, 2022 20:21

galipremsagar requested a review from vuule September 22, 2022 20:22

galipremsagar added the non-breaking (Non-breaking change) and improvement (Improvement / enhancement to an existing function) labels Sep 22, 2022

change tests

c5dbb1c

galipremsagar added the deprecation label Sep 22, 2022

cleanup

4ac82da

galipremsagar removed the request for review from wence- September 22, 2022 20:51

GregoryKimball reviewed Sep 23, 2022

wence- requested changes Sep 26, 2022

galipremsagar and others added 4 commits September 26, 2022 09:43

Apply suggestions from code review

cad50cf

Co-authored-by: Lawrence Mitchell <[email protected]>

Merge remote-tracking branch 'upstream/branch-22.10' into schema_elem…

43dc243

…ent_python

address reviews

66be512

add docstring example

6e6a55c

address review

daea9b2

galipremsagar requested a review from wence- September 26, 2022 17:37

karthikeyann reviewed Sep 26, 2022

karthikeyann mentioned this pull request Sep 26, 2022

JSON parser integration #11717

Closed

3 tasks

galipremsagar added 3 commits September 26, 2022 13:17

Merge remote-tracking branch 'upstream/branch-22.10' into schema_elem…

e5bdc74

…ent_python

updates

af32f3e

Merge branch 'rapidsai:branch-22.10' into schema_element_python

289cac2

vuule reviewed Sep 27, 2022

galipremsagar added 2 commits September 26, 2022 17:51

Merge remote-tracking branch 'upstream/branch-22.10' into schema_elem…

55e3b32

…ent_python

add nested dtypes test

3b5bacc

galipremsagar requested a review from vuule September 27, 2022 01:23

vuule approved these changes Sep 27, 2022

galipremsagar removed the 4 - Needs cuIO Reviewer label Sep 27, 2022

wence- approved these changes Sep 27, 2022

galipremsagar added the 5 - Ready to Merge (Testing and reviews complete, ready to merge) label and removed the 3 - Ready for Review (Ready for review by team) and 4 - Needs cuDF (Python) Reviewer labels Sep 27, 2022

rapids-bot bot merged commit 35b0a52 into rapidsai:branch-22.10 Sep 27, 2022

GregoryKimball added this to the Nested JSON reader milestone Nov 3, 2022
