Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds support for json lines format to the nested JSON reader #11534

Merged
merged 53 commits into from
Aug 25, 2022

Conversation

elstehle
Copy link
Contributor

@elstehle elstehle commented Aug 15, 2022

This PR adds support for the json lines (aka newline-delimited json) format to the nested JSON reader.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Thanks to @upsj, who made the translation easier to read for all of us 🙏

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Aug 15, 2022
@elstehle elstehle added feature request New feature or request 2 - In Progress Currently a work in progress cuIO cuIO issue non-breaking Non-breaking change labels Aug 15, 2022
@elstehle elstehle added this to the Nested JSON reader milestone Aug 15, 2022
@codecov
Copy link

codecov bot commented Aug 15, 2022

Codecov Report

❗ No coverage uploaded for pull request base (branch-22.10@c666d7c). Click here to learn what that means.
The diff coverage is n/a.

❗ Current head 16c62cd differs from pull request most recent head 9822ecb. Consider uploading reports for the commit 9822ecb to get more accurate results

@@               Coverage Diff               @@
##             branch-22.10   #11534   +/-   ##
===============================================
  Coverage                ?   86.41%           
===============================================
  Files                   ?      145           
  Lines                   ?    22959           
  Branches                ?        0           
===============================================
  Hits                    ?    19839           
  Misses                  ?     3120           
  Partials                ?        0           

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@elstehle elstehle marked this pull request as ready for review August 15, 2022 19:22
@elstehle elstehle requested a review from a team as a code owner August 15, 2022 19:22
@elstehle elstehle requested review from vyasr and nvdbaranec August 15, 2022 19:22
@elstehle
Copy link
Contributor Author

Looks good, but tests could be beefed up, given the size of the feature. Did not review the state machine :)

Agree 👍 I've added a couple more tests to this PR now. Planning to add more extensive testing once we also cover type inference and type casting

@github-actions github-actions bot added the Python Affects Python cuDF API. label Aug 24, 2022
@elstehle elstehle requested a review from galipremsagar August 24, 2022 19:03
Copy link
Contributor

@galipremsagar galipremsagar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can merge the two tests into one with parametrization and use BytesIO for quicker IO than touching the disk.

python/cudf/cudf/tests/test_json.py Outdated Show resolved Hide resolved
@galipremsagar galipremsagar self-requested a review August 24, 2022 19:18
Copy link
Contributor

@karthikeyann karthikeyann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some comments for documentation.
Rest all looks great 👍 🚀

cpp/src/io/json/nested_json.hpp Show resolved Hide resolved
cpp/src/io/json/nested_json.hpp Show resolved Hide resolved
@@ -468,256 +471,411 @@ auto get_transition_table()
auto get_translation_table()
{
std::array<std::array<std::vector<char>, NUM_PDA_SGIDS>, PD_NUM_STATES> pda_tlt;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on a lighter note, to reduce verbosity, this could be added only inside this function
to get rid of repeated typing of token_t::


  constexpr auto StructBegin       = token_t::StructBegin;
  constexpr auto StructEnd         = token_t::StructEnd;
  constexpr auto ListBegin         = token_t::ListBegin;
  constexpr auto ListEnd           = token_t::ListEnd;
  constexpr auto StructMemberBegin = token_t::StructMemberBegin;
  constexpr auto StructMemberEnd   = token_t::StructMemberEnd;
  constexpr auto FieldNameBegin    = token_t::FieldNameBegin;
  constexpr auto FieldNameEnd      = token_t::FieldNameEnd;
  constexpr auto StringBegin       = token_t::StringBegin;
  constexpr auto StringEnd         = token_t::StringEnd;
  constexpr auto ValueBegin        = token_t::ValueBegin;
  constexpr auto ValueEnd          = token_t::ValueEnd;
  constexpr auto ErrorBegin        = token_t::ErrorBegin;

and follow similar column like arrangement above.
only issue is that, some symbol groups have multiple outputs, so, indentation will not be regular.
so, you could wrap around clang-format offand clang-format on to keep the code's custom alignment (for eg. 5 entries per row)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bdice for suggestions

Copy link
Contributor Author

@elstehle elstehle Aug 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I've shortened the names.

Is there a way to allow people to automatically reformat after they modified the translation table? Because we do want to keep in mind that this is a living piece of code. We pursued this approach because it is flexible and can be adapted to meet current and future feature requests. That means the translation table will be subject to change and additions. I would like to keep barrier to contributions low and avoid people having to fiddle with manually inserting tabs and whitespaces. That's why we commented the table to the best of our abilities.

cpp/tests/io/nested_json_test.cpp Outdated Show resolved Hide resolved
cpp/tests/io/nested_json_test.cpp Outdated Show resolved Hide resolved
cpp/tests/io/json_test.cpp Outdated Show resolved Hide resolved
cpp/tests/io/json_test.cpp Outdated Show resolved Hide resolved
cpp/src/io/json/nested_json_gpu.cu Show resolved Hide resolved
cpp/src/io/json/nested_json_gpu.cu Show resolved Hide resolved
@elstehle elstehle requested a review from karthikeyann August 25, 2022 16:06
Copy link
Contributor

@karthikeyann karthikeyann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍 🚀

@elstehle
Copy link
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit c8c8025 into rapidsai:branch-22.10 Aug 25, 2022
rapids-bot bot pushed a commit that referenced this pull request Aug 26, 2022
Fixes compile warning introduced in #11534 
```
/cudf/cpp/src/io/json/nested_json_gpu.cu(970): warning #177-D: variable "single_item_count" was declared but never referenced
```
Removed unreferenced variable declaration.

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Bradley Dice (https://github.com/bdice)

URL: #11607
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants