Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

byte_range support for JSON Lines format #12017

Merged
merged 33 commits into from
Nov 16, 2022

Conversation

karthikeyann
Copy link
Contributor

@karthikeyann karthikeyann commented Oct 27, 2022

Description

This PR adds support for byte_range to be used in nested JSON parser for JSON Lines format (newline delimited JSON http://ndjson.org/)
The record delimiter "New lines" are only expected at the end of each record. Newlines in middle of record or within quotes are not expected and will lead to unknown behaviour. The record delimiters are not context aware in this PR.

This PR provides libcudf APIs, Cython APIs and python tests to enable byte range support. This will allow dask to do distributed/segmented parsing of JSON.

No Dask changes

Addresses part of #11843
Depends on #12060

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@karthikeyann karthikeyann added feature request New feature or request 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. Cython non-breaking Non-breaking change labels Oct 27, 2022
@karthikeyann karthikeyann added this to the Nested JSON reader milestone Oct 27, 2022
@karthikeyann karthikeyann self-assigned this Oct 27, 2022
@github-actions github-actions bot added the CMake CMake build issue label Oct 27, 2022
@codecov
Copy link

codecov bot commented Oct 28, 2022

Codecov Report

Base: 87.47% // Head: 88.14% // Increases project coverage by +0.66% 🎉

Coverage data is based on head (2b1cd6b) compared to base (f817d96).
Patch has no changes to coverable lines.

❗ Current head 2b1cd6b differs from pull request most recent head 80cff7f. Consider uploading reports for the commit 80cff7f to get more accurate results

Additional details and impacted files
@@               Coverage Diff                @@
##           branch-22.12   #12017      +/-   ##
================================================
+ Coverage         87.47%   88.14%   +0.66%     
================================================
  Files               133      135       +2     
  Lines             21826    22126     +300     
================================================
+ Hits              19093    19503     +410     
+ Misses             2733     2623     -110     
Impacted Files Coverage Δ
python/cudf/cudf/core/column/interval.py 85.45% <0.00%> (-9.10%) ⬇️
python/cudf/cudf/io/text.py 91.66% <0.00%> (-8.34%) ⬇️
python/cudf/cudf/core/_base_index.py 81.28% <0.00%> (-4.27%) ⬇️
python/cudf/cudf/io/json.py 92.06% <0.00%> (-2.68%) ⬇️
python/cudf/cudf/utils/utils.py 89.91% <0.00%> (-0.69%) ⬇️
python/cudf/cudf/core/column/timedelta.py 90.17% <0.00%> (-0.58%) ⬇️
python/cudf/cudf/core/column/datetime.py 89.21% <0.00%> (-0.51%) ⬇️
python/cudf/cudf/core/column/column.py 87.96% <0.00%> (-0.46%) ⬇️
python/dask_cudf/dask_cudf/core.py 73.72% <0.00%> (-0.41%) ⬇️
python/cudf/cudf/io/parquet.py 90.45% <0.00%> (-0.39%) ⬇️
... and 43 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@github-actions github-actions bot added the Python Affects Python cuDF API. label Nov 2, 2022
@karthikeyann karthikeyann requested a review from rjzamora November 3, 2022 15:05
@karthikeyann karthikeyann removed the 2 - In Progress Currently a work in progress label Nov 3, 2022
@karthikeyann karthikeyann added 4 - Needs Review Waiting for reviewer to review or respond 4 - Needs Dask Reviewer and removed Cython labels Nov 16, 2022
Copy link
Contributor

@elstehle elstehle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great. Mostly just minor comments / nitpicks.

cpp/src/io/json/experimental/byte_range_info.cu Outdated Show resolved Hide resolved
cpp/src/io/json/experimental/byte_range_info.cu Outdated Show resolved Hide resolved
cpp/tests/io/json_chunked_reader.cpp Outdated Show resolved Hide resolved
cpp/tests/io/json_chunked_reader.cpp Outdated Show resolved Hide resolved
cpp/tests/io/json_chunked_reader.cpp Outdated Show resolved Hide resolved
cpp/tests/io/json_chunked_reader.cpp Outdated Show resolved Hide resolved
Copy link
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two very minor things.

cpp/src/io/json/experimental/read_json.cpp Show resolved Hide resolved
python/cudf/cudf/tests/test_json.py Outdated Show resolved Hide resolved
python/cudf/cudf/tests/test_json.py Outdated Show resolved Hide resolved
Copy link
Contributor

@robertmaynard robertmaynard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CMake changes look good

@karthikeyann karthikeyann added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team 4 - Needs Review Waiting for reviewer to review or respond 4 - Needs cuIO Reviewer labels Nov 16, 2022
@karthikeyann
Copy link
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit defad5e into rapidsai:branch-22.12 Nov 16, 2022
rapids-bot bot pushed a commit that referenced this pull request Dec 1, 2022
This issue was introduced in #12017 merged, which triggers compiler error on some systems:
```
../tests/io/json_chunked_reader.cpp: In function 'std::vector<cudf::io::table_with_metadata> skeleton_for_parellel_chunk_reader(cudf::host_span<std::unique_ptr<cudf::io::datasource> >, const cudf::io::json_reader_options&, int32_t, rmm::cuda_stream_view, rmm::mr::device_memory_resource*)':
../tests/io/json_chunked_reader.cpp:78:19: error: loop variable '<structured bindings>' creates a copy from type 'const std::pair<int, int>' [-Werror=range-loop-construct]
   78 |   for (auto const [chunk_start, chunk_end] : record_ranges) {
      |                   ^~~~~~~~~~~~~~~~~~~~~~~~
../tests/io/json_chunked_reader.cpp:78:19: note: use reference type to prevent copying
   78 |   for (auto const [chunk_start, chunk_end] : record_ranges) {
      |                   ^~~~~~~~~~~~~~~~~~~~~~~~
      |                   &
```

Fixing it is just by following the compiler recommendation: "use reference type to prevent copying".

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - David Wendt (https://github.com/davidwendt)
  - Yunsong Wang (https://github.com/PointKernel)

URL: #12280
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge CMake CMake build issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants