-
Notifications
You must be signed in to change notification settings - Fork 916
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[REVIEW] Enable schema_element
& keep_quotes
support in json reader
#11746
[REVIEW] Enable schema_element
& keep_quotes
support in json reader
#11746
Conversation
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## branch-22.10 #11746 +/- ##
===============================================
Coverage ? 87.52%
===============================================
Files ? 133
Lines ? 21796
Branches ? 0
===============================================
Hits ? 19076
Misses ? 2720
Partials ? 0 Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
cu_df = cudf.read_json( | ||
buffer, lines=True, dtype={"0": "bool", "1": "long"} | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the dtype
argument going to be a breaking change for the engine='cudf'
reader?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope, dtype
is going to continue to work for engine='cudf'
reader. But the reason I've deprecated supporting the list of dtypes in dtype
param is there is no way we can give the column a name with schema_info
being introduced going forward.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dtype=list
is going to a breaking change once we drop it. But this PR is just deprecating. Pandas don't have it and hence we won't be needing it either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few minor issues, but looks good overall
Co-authored-by: Lawrence Mitchell <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
new tests and modified old tests could also use cudf_experimental
also as engine.
python/cudf/cudf/io/json.py
Outdated
FutureWarning, | ||
) | ||
|
||
if engine in {"cudf", "cudf_experimental"} and not lines: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With PR #11714, cudf_experimental
will support both JSON lines, and also records.
Also, old cudf_experimental
already supports records format.
if engine in {"cudf", "cudf_experimental"} and not lines: | |
if engine == "cudf" and not lines: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@karthikeyann, is #11714 ready to merge for 22.10?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if tests pass, it should be. Addressed most of review comments, left some for next PRs.
I merged this PR too locally and tested with above change. Tests pass.
experimental engine could use more python tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is "[1, 2, 3]\n[4, 5, 6]\n[7, 8, 9]\n"
a valid JSON lines?
This is not supported by cudf_experimental
engine. Record orient with lines=True/False is only supported.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is
"[1, 2, 3]\n[4, 5, 6]\n[7, 8, 9]\n"
a valid JSON lines?
It is:
In [1]: import pandas as pd
In [2]: json = "[1, 2, 3]\n[4, 5, 6]\n[7, 8, 9]\n"
In [5]: pd.read_json(json, lines=True, orient="records")
Out[5]:
0 1 2
0 1 2 3
1 4 5 6
2 7 8 9
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new tests looks more complicated than I'd imagine, but I'm sure there's a web of inter-library bugs that forces us to validate in a roundabout way :)
dtype={ | ||
"a": cudf.StructDtype( | ||
{ | ||
"a": cudf.StructDtype({"b": cudf.dtype("int64")}), | ||
"b": cudf.dtype("float64"), | ||
} | ||
), | ||
"b": cudf.ListDtype(cudf.ListDtype("int64")), | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cool to see the API in use in a more demanding case.
) | ||
|
||
pdf = pd.read_json( | ||
StringIO(expected_json_str), orient="records", lines=True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Surprised we need expected_json_str
. Pandas cannot read actual_json_str
and enforce the same types as cudf here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope, pandas will read them as str/object types. So we would need complex workarounds for that too. Since the test already has a workaround I refrained from adding more workaround code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other than the discussion at https://github.com/rapidsai/cudf/pull/11746/files#diff-ccd1c71f32bc5826f9accf0951f80c6c58c2acc0ca8dfe4f37272c668d25ae14L29 looks good to my eyes.
@gpucibot merge |
Description
This PR plumbs
schema_element
andkeep_quotes
support in json reader.Deprecation: This PR also contains changes deprecating
dtype
aslist
inputs. This seems to be a very outdated legacy feature we continued to support and cannot be supported with theschema_element
.Checklist