byte_range support for JSON Lines format #12017
Codecov Report
Base: 87.47% // Head: 88.14% // Increases project coverage by +0.66%
Additional details and impacted files

Coverage Diff

| | branch-22.12 | #12017 | +/- |
|---|---|---|---|
| Coverage | 87.47% | 88.14% | +0.66% |
| Files | 133 | 135 | +2 |
| Lines | 21826 | 22126 | +300 |
| Hits | 19093 | 19503 | +410 |
| Misses | 2733 | 2623 | -110 |
This looks great. Mostly just minor comments / nitpicks.
Two very minor things.
CMake changes look good
@gpucibot merge
The merge of #12017 introduced an issue that triggers a compiler error on some systems:

```
../tests/io/json_chunked_reader.cpp: In function 'std::vector<cudf::io::table_with_metadata> skeleton_for_parellel_chunk_reader(cudf::host_span<std::unique_ptr<cudf::io::datasource> >, const cudf::io::json_reader_options&, int32_t, rmm::cuda_stream_view, rmm::mr::device_memory_resource*)':
../tests/io/json_chunked_reader.cpp:78:19: error: loop variable '<structured bindings>' creates a copy from type 'const std::pair<int, int>' [-Werror=range-loop-construct]
   78 |   for (auto const [chunk_start, chunk_end] : record_ranges) {
      |                   ^~~~~~~~~~~~~~~~~~~~~~~~
../tests/io/json_chunked_reader.cpp:78:19: note: use reference type to prevent copying
   78 |   for (auto const [chunk_start, chunk_end] : record_ranges) {
      |                   ^~~~~~~~~~~~~~~~~~~~~~~~
      |                   &
```

The fix simply follows the compiler's recommendation: "use reference type to prevent copying".

Authors:
- Nghia Truong (https://github.com/ttnghia)

Approvers:
- David Wendt (https://github.com/davidwendt)
- Yunsong Wang (https://github.com/PointKernel)

URL: #12280
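The change the compiler asks for can be sketched in isolation. The snippet below is a minimal, self-contained illustration rather than the actual test code: `total_range_size` is a hypothetical helper, and only the loop header mirrors the line quoted in the error message, with `&` added so the structured binding refers to each pair instead of copying it.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical helper (not from the PR): sums the lengths of [start, end) ranges.
int total_range_size(std::vector<std::pair<int, int>> const& record_ranges)
{
  int total = 0;
  // `auto const&` binds to each element in place. Writing `auto const` here
  // copies every pair, which is what -Werror=range-loop-construct rejects.
  for (auto const& [chunk_start, chunk_end] : record_ranges) {
    assert(chunk_start <= chunk_end);  // sanity check on the illustrative input
    total += chunk_end - chunk_start;
  }
  return total;
}
```

For example, `total_range_size({{0, 10}, {10, 25}})` adds a range of length 10 and a range of length 15.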
Description
This PR adds byte_range support to the nested JSON parser for the JSON Lines format (newline-delimited JSON, http://ndjson.org/).
The record delimiter (a newline) is expected only at the end of each record. Newlines in the middle of a record or within quotes are not expected and will lead to undefined behaviour. The record delimiters are not context-aware in this PR.
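To make the "not context-aware" restriction concrete, here is a small sketch of a splitter that treats every newline as a record delimiter. It is an illustration only: `split_records_naive` is a hypothetical name, and the code is not taken from libcudf. It simply shows why a newline inside a quoted string breaks a record apart when quotes are not tracked.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper (not part of libcudf): splits on every '\n' and
// deliberately ignores whether the newline sits inside a quoted string.
std::vector<std::string> split_records_naive(std::string_view input)
{
  std::vector<std::string> records;
  std::size_t begin = 0;
  while (begin < input.size()) {
    std::size_t end = input.find('\n', begin);
    if (end == std::string_view::npos) { end = input.size(); }  // last record without trailing newline
    if (end > begin) { records.emplace_back(input.substr(begin, end - begin)); }  // skip empty lines
    begin = end + 1;
  }
  return records;
}
```

With well-formed input such as `{"a":1}\n{"a":2}\n` this yields two records. With `{"a":"x\ny"}\n`, where the newline is inside the string value, it also yields two pieces, neither of which is valid JSON on its own.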
This PR provides libcudf APIs, Cython APIs and Python tests to enable byte range support. This will allow Dask to do distributed/segmented parsing of JSON.
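One way to picture segmented parsing is sketched below. It rests on an assumption that the PR text does not spell out: a record belongs to the byte range that contains its first byte, and the reader may read past the end of the range to finish that record. This is a common convention for line-delimited readers. `records_in_byte_range` is a hypothetical helper and not the libcudf API.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper (not the libcudf API). Returns the records whose first
// byte lies in [offset, offset + size), under the ownership rule assumed above.
std::vector<std::string> records_in_byte_range(std::string_view input,
                                               std::size_t offset,
                                               std::size_t size)
{
  std::vector<std::string> records;
  std::size_t const range_end = std::min(input.size(), offset + size);
  std::size_t begin           = 0;
  if (offset > 0) {
    // A record starts right after a newline, so search from offset - 1:
    // a newline at offset - 1 means a record starts exactly at offset.
    std::size_t const newline = input.find('\n', offset - 1);
    if (newline == std::string_view::npos) { return records; }
    begin = newline + 1;
  }
  while (begin < range_end) {
    std::size_t end = input.find('\n', begin);
    if (end == std::string_view::npos) { end = input.size(); }
    // The record may extend past range_end; it is still owned by this range.
    if (end > begin) { records.emplace_back(input.substr(begin, end - begin)); }
    begin = end + 1;
  }
  return records;
}
```

Under this rule, splitting an input into consecutive, non-overlapping byte ranges assigns every record to exactly one range, which is what lets independent workers parse their own segments without coordinating.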
No Dask changes
Addresses part of #11843
Depends on #12060