Adds support for json lines format to the nested JSON reader #11534
Co-authored-by: Bradley Dice <[email protected]>
Codecov Report
@@ Coverage Diff @@
## branch-22.10 #11534 +/- ##
===============================================
Coverage ? 86.41%
===============================================
Files ? 145
Lines ? 22959
Branches ? 0
===============================================
Hits ? 19839
Misses ? 3120
Partials ? 0
Agree 👍 I've added a couple more tests to this PR now. I'm planning to add more extensive testing once we also cover type inference and type casting.
We can merge the two tests into one with parametrization and use BytesIO, which is quicker than touching the disk.
Added some comments for documentation.
Rest all looks great 👍 🚀
@@ -468,256 +471,411 @@ auto get_transition_table()
auto get_translation_table()
{
  std::array<std::array<std::vector<char>, NUM_PDA_SGIDS>, PD_NUM_STATES> pda_tlt;
On a lighter note, to reduce verbosity, the following aliases could be added inside this function only, to get rid of the repeated typing of token_t::
constexpr auto StructBegin = token_t::StructBegin;
constexpr auto StructEnd = token_t::StructEnd;
constexpr auto ListBegin = token_t::ListBegin;
constexpr auto ListEnd = token_t::ListEnd;
constexpr auto StructMemberBegin = token_t::StructMemberBegin;
constexpr auto StructMemberEnd = token_t::StructMemberEnd;
constexpr auto FieldNameBegin = token_t::FieldNameBegin;
constexpr auto FieldNameEnd = token_t::FieldNameEnd;
constexpr auto StringBegin = token_t::StringBegin;
constexpr auto StringEnd = token_t::StringEnd;
constexpr auto ValueBegin = token_t::ValueBegin;
constexpr auto ValueEnd = token_t::ValueEnd;
constexpr auto ErrorBegin = token_t::ErrorBegin;
and then follow a similar column-like arrangement as above.
The only issue is that some symbol groups have multiple outputs, so the indentation will not be regular.
So you could wrap the table in `// clang-format off` and `// clang-format on` comments to keep the code's custom alignment (e.g. 5 entries per row).
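The suggestion above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from the PR: the `token_t` enum here is a hypothetical stand-in that only reuses some of the enumerator names from the comment, and `NUM_SYMBOL_GROUPS`, `NUM_STATES`, `row_t`, `make_translation_table`, and the table contents are all assumptions made up for the example (the real table stores `char` and is indexed by `NUM_PDA_SGIDS` and `PD_NUM_STATES`).

```cpp
#include <array>
#include <vector>

// Hypothetical stand-in for the reader's token enum. Only the enumerator
// names are taken from the review comment; the underlying type is assumed.
enum class token_t : char {
  StructBegin, StructEnd, ListBegin, ListEnd,
  StructMemberBegin, StructMemberEnd, ValueBegin, ValueEnd
};

// Assumed toy dimensions, much smaller than the real table.
constexpr int NUM_SYMBOL_GROUPS = 3;
constexpr int NUM_STATES        = 2;

using row_t = std::array<std::vector<token_t>, NUM_SYMBOL_GROUPS>;

inline std::array<row_t, NUM_STATES> make_translation_table()
{
  // Function-local aliases: the table below no longer needs the token_t:: prefix.
  constexpr auto StructBegin     = token_t::StructBegin;
  constexpr auto StructEnd       = token_t::StructEnd;
  constexpr auto ListBegin       = token_t::ListBegin;
  constexpr auto ListEnd         = token_t::ListEnd;
  constexpr auto StructMemberEnd = token_t::StructMemberEnd;
  constexpr auto ValueBegin      = token_t::ValueBegin;
  constexpr auto ValueEnd        = token_t::ValueEnd;

  std::array<row_t, NUM_STATES> table;

  // Some entries emit more than one token, so the columns are aligned by hand
  // and clang-format is switched off for just this region.
  // clang-format off
  table[0] = row_t{{ {StructBegin}, {ListBegin}, {ValueBegin}                }};
  table[1] = row_t{{ {StructEnd},   {ListEnd},   {ValueEnd, StructMemberEnd} }};
  // clang-format on

  return table;
}
```

Keeping the aliases local to the function avoids leaking short names into the enclosing namespace, and the `clang-format off`/`on` pair limits the manual formatting to the table itself.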
@bdice for suggestions
Thanks, I've shortened the names.
Is there a way to let people automatically reformat the translation table after they have modified it? We do want to keep in mind that this is a living piece of code. We pursued this approach because it is flexible and can be adapted to meet current and future feature requests. That means the translation table will be subject to changes and additions. I would like to keep the barrier to contribution low and avoid people having to fiddle with manually inserting tabs and whitespace. That's why we commented the table to the best of our abilities.
LGTM 👍 🚀
@gpucibot merge
Fixes compile warning introduced in #11534
```
/cudf/cpp/src/io/json/nested_json_gpu.cu(970): warning #177-D: variable "single_item_count" was declared but never referenced
```
Removed unreferenced variable declaration.
Authors:
- David Wendt (https://github.com/davidwendt)
Approvers:
- Nghia Truong (https://github.com/ttnghia)
- Bradley Dice (https://github.com/bdice)
URL: #11607
Fixes compile error introduced in PR #11466 due to mismatched changes occurring in PR #11534
https://gpuci.gpuopenanalytics.com/job/rapidsai/job/gpuci/job/cudf/job/prb/job/cudf-cpu-cuda-build/CUDA=11.5/11851/console
Authors:
- David Wendt (https://github.com/davidwendt)
Approvers:
- Elias Stehle (https://github.com/elstehle)
- Karthikeyan (https://github.com/karthikeyann)
- Nghia Truong (https://github.com/ttnghia)
URL: #11637
This PR adds support for the json lines (aka newline-delimited json) format to the nested JSON reader.
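For readers unfamiliar with the format: in JSON Lines, each line of the input is one complete JSON value, and because JSON strings cannot contain a raw line feed, every `\n` in the input is a record boundary. The sketch below illustrates only that property on the host. It is not how the nested JSON reader in this PR works (the reader handles this on the GPU through its translation table), and `split_json_lines` is a hypothetical helper name invented for this example.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical helper: splits newline-delimited JSON into one view per record.
// Blank lines are skipped and a trailing '\r' (Windows line endings) is dropped.
// The returned views point into `input`, which must outlive them.
inline std::vector<std::string_view> split_json_lines(std::string_view input)
{
  std::vector<std::string_view> records;
  std::size_t begin = 0;
  while (begin <= input.size()) {
    // Find the end of the current line, or the end of the input.
    std::size_t end = input.find('\n', begin);
    if (end == std::string_view::npos) { end = input.size(); }

    std::string_view line = input.substr(begin, end - begin);
    if (!line.empty() && line.back() == '\r') { line.remove_suffix(1); }
    if (!line.empty()) { records.push_back(line); }

    begin = end + 1;
  }
  return records;
}
```

Each returned view could then be handed to any ordinary JSON parser as an independent record.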
Thanks to @upsj, who made the translation easier to read for all of us 🙏