Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds option to take explicit nested schema for nested JSON reader #11682

Merged
merged 253 commits into from
Sep 22, 2022

Conversation

elstehle
Copy link
Contributor

@elstehle elstehle commented Sep 12, 2022

Description

This PR adds the option to take an explicit nested schema, allowing users to specify the target data types of the leave columns in the nested JSON reader. This PR adds the corresponding interface and implementation to libcudf.

In addition, the PR makes existing JSON reader tests parametrised tests and enables those tests for dual execution of (1) the existing JSON reader and (2) the new nested JSON reader.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@codecov
Copy link

codecov bot commented Sep 19, 2022

Codecov Report

❗ No coverage uploaded for pull request base (branch-22.10@5c91739). Click here to learn what that means.
Patch has no changes to coverable lines.

Additional details and impacted files
@@               Coverage Diff               @@
##             branch-22.10   #11682   +/-   ##
===============================================
  Coverage                ?   85.90%           
===============================================
  Files                   ?      151           
  Lines                   ?    23601           
  Branches                ?        0           
===============================================
  Hits                    ?    20275           
  Misses                  ?     3326           
  Partials                ?        0           

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@github-actions github-actions bot removed the Python Affects Python cuDF API. label Sep 20, 2022
@elstehle elstehle removed the 5 - DO NOT MERGE Hold off on merging; see PR for details label Sep 20, 2022
cpp/src/io/json/nested_json_gpu.cu Outdated Show resolved Hide resolved
<< (ret.has_value() ? std::to_string(static_cast<int>(ret->type.id())) : "n/a")
<< ", with " << (ret.has_value() ? ret->child_types.size() : 0) << " children"
<< "\n";
#endif
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will these be removed?

Copy link
Contributor

@karthikeyann karthikeyann Sep 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are the debug prints inconvenient while reading the code?
These debug prints help a lot during development and debugging.
any suggestions to mitigate this and still keep the debug prints.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find them in the way as a reader/reviewer. And most code is read more than written.
Does this code need to be maintained? Should it be reviewed for correctness?
If yes to both then we should leave it.
We should probably have rules for this since I'm seeing this show up in other code reviews.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. I also find them hindering the reading flow but, as @karthikeyann pointed out, they also proofed to be really helpful for debugging. My plan and suggestion would be to keep them for a couple more weeks until we gained confidence that everything is working reliably and to remove them afterwards.

cpp/src/io/json/nested_json_gpu.cu Show resolved Hide resolved
cpp/include/cudf/io/json.hpp Show resolved Hide resolved
cpp/include/cudf/io/json.hpp Outdated Show resolved Hide resolved
Copy link
Contributor

@karthikeyann karthikeyann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍
only comment on std::nullopt. not need to address it in this PR. will address this when I implement this same feature in full gpu json parser path.

-> std::optional<schema_element> {
auto ret = (user_dtypes.find(col_name) != std::end(user_dtypes))
? user_dtypes.find(col_name)->second
: std::optional<schema_element>{};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: std::nullopt is preferred.

@karthikeyann karthikeyann mentioned this pull request Sep 21, 2022
3 tasks
@elstehle
Copy link
Contributor Author

rerun tests

@elstehle
Copy link
Contributor Author

@gpucibot merge

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants