Optimizing multi-source byte range reading in JSON reader #15396

shrshi · 2024-03-26T22:57:39Z

Description

This PR has two goals: (i) reducing repeated reading of byte range chunks in the JSON reader, and (ii) enabling multi-source byte range reading for chunks that span sources.

We expand on the idea outlined in Reduce IO when byte_range option is used in read_json #15185 to reduce the repeated reading of follow-on chunks while searching for the end of the last row in the requested chunk. The data that follows the requested chunk is divided into subchunks, which are read one at a time until the delimiter character is reached.
We also estimate the buffer size needed for the entire byte range, and compute each source's offset into that buffer.
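The two ideas above can be sketched on the host in plain C++. This is only an illustration, not the code in read_json.cu: `find_row_end` and `source_offsets` are hypothetical names, the input is assumed to be already in memory as a `std::string_view`, and the delimiter is assumed to be a newline.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <string_view>
#include <vector>

// Hypothetical helper: scan past `range_end` one subchunk at a time and
// return the offset one past the first delimiter found. If no delimiter
// exists, the row runs to the end of the data.
std::size_t find_row_end(std::string_view data,
                         std::size_t range_end,
                         std::size_t subchunk_size,
                         char delimiter = '\n')
{
  // Guard against a zero step, which would never advance.
  std::size_t const step = std::max<std::size_t>(subchunk_size, 1);
  std::size_t pos        = std::min(range_end, data.size());
  while (pos < data.size()) {
    std::size_t const len = std::min(step, data.size() - pos);
    // Only this subchunk is inspected, not the whole following chunk.
    auto const hit = data.substr(pos, len).find(delimiter);
    if (hit != std::string_view::npos) { return pos + hit + 1; }
    pos += len;
  }
  return data.size();
}

// Hypothetical helper: exclusive prefix sum of the source sizes, giving the
// offset of each source in one contiguous buffer. The last entry is the
// total size.
std::vector<std::size_t> source_offsets(std::vector<std::size_t> const& source_sizes)
{
  std::vector<std::size_t> offsets(source_sizes.size() + 1, 0);
  std::partial_sum(source_sizes.begin(), source_sizes.end(), offsets.begin() + 1);
  return offsets;
}
```

With small subchunks, the search stops as soon as the row terminator is seen, so at most one subchunk past the delimiter is read rather than a full follow-on chunk.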

Visualization of the performance improvement with this optimization

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

vuule

suggestions with a side of scope creep :D

cpp/src/io/json/read_json.cu

copy-pr-bot · 2024-04-05T00:22:04Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

shrshi · 2024-04-05T00:23:58Z

/ok to test

shrshi · 2024-04-05T10:31:36Z

/ok to test

shrshi · 2024-04-05T10:49:37Z

/ok to test

shrshi · 2024-04-05T10:50:04Z

/ok to test

shrshi · 2024-04-23T23:45:55Z

/ok to test

shrshi · 2024-04-24T17:35:43Z

/ok to test

shrshi · 2024-04-25T21:19:34Z

/ok to test

mythrocks · 2024-04-26T23:16:30Z

cpp/src/io/json/json_normalization.cu

+  datasource::owning_buffer<rmm::device_uvector<char>> outdata(std::move(outbuf));
+  std::swap(indata, outdata);


This looks like it's significant, but I didn't grasp it.
What's the advantage of swapping in place over, say, assigning to indata?

shrshi · 2024-04-29

My reasoning here was that once the RMM buffer in the indata datasource has been normalized, we can discard it. Rather than copying outdata to indata with the implicitly generated copy assignment operator in owning_buffer, I thought swapping would be faster.
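The trade-off in this exchange can be sketched with a stand-in type. Everything here is hypothetical: `owning_buffer_sketch` is a simplified move-only struct, not cudf's `datasource::owning_buffer`, and the two helper functions exist only for comparison. For a type like this, both forms hand over ownership without copying bytes; they differ only in when the old allocation is released.

```cpp
#include <cstddef>
#include <memory>
#include <utility>

// Hypothetical move-only stand-in for an owning device buffer.
struct owning_buffer_sketch {
  std::unique_ptr<char[]> data;
  std::size_t size = 0;
  explicit owning_buffer_sketch(std::size_t n)
    : data(std::make_unique<char[]>(n)), size(n)
  {
  }
};

// Swap: `in` takes the new buffer, and the old one is released when the
// by-value parameter `out` is destroyed at the end of the call.
void replace_by_swap(owning_buffer_sketch& in, owning_buffer_sketch out)
{
  std::swap(in, out);
}

// Move assignment: `in` takes the new buffer, and the old one is released
// inside the assignment itself.
void replace_by_move(owning_buffer_sketch& in, owning_buffer_sketch out)
{
  in = std::move(out);
}
```

Whether plain assignment would copy in the real code depends on which assignment operators the actual buffer type provides, which this sketch does not model.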

cpp/src/io/json/json_normalization.cu

cpp/src/io/json/read_json.cu

mythrocks

Some minor nitpicks, and a clarifying question.

hyperbolic2346

A few nits and questions

cpp/src/io/json/json_normalization.cu

cpp/src/io/json/read_json.cu

hyperbolic2346 · 2024-04-29T17:13:38Z

cpp/src/io/json/read_json.cu

+  // of subchunks.
+  size_t buffer_size =
+    reader_compression != compression_type::NONE
+      ? total_source_size * compression_ratio + 4096


Why are we adding 4096? Is that for headers?

shrshi · 2024-04-29

Yes, 4096 is for the headers. I've used the uncompressed buffer size estimate from

cudf/cpp/src/io/comp/uncomp.cpp

Line 361 in 064dd7b

uncomp_len = comp_len * 4 + 4096; // In case uncompressed size isn't known in advance, assume

hyperbolic2346 · 2024-04-30

It would be nice to have this guess defined as a named constant instead of a hard-coded value. I'm ok either way though.
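One way to act on this suggestion is sketched below. The constant and function names are hypothetical and are not taken from the PR; the values 4 and 4096 come from the uncomp.cpp line quoted above, with 4096 covering headers as explained in the reply.

```cpp
#include <cstddef>

// Assumed compression ratio when the uncompressed size is not known in
// advance (the factor 4 in the quoted uncomp.cpp line).
constexpr std::size_t estimated_compression_ratio = 4;

// Extra bytes reserved for headers.
constexpr std::size_t header_allowance_bytes = 4096;

// Hypothetical helper: upper-bound guess for the uncompressed size.
constexpr std::size_t estimate_uncompressed_size(std::size_t compressed_size)
{
  return compressed_size * estimated_compression_ratio + header_allowance_bytes;
}

// The guess is now checkable at compile time and defined in one place.
static_assert(estimate_uncompressed_size(0) == header_allowance_bytes);
```

Naming the two numbers keeps the JSON reader and the decompression code from drifting apart if the heuristic is ever tuned.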

cpp/src/io/json/read_json.cu

shrshi · 2024-04-29T21:24:11Z

/ok to test

shrshi · 2024-04-29T21:29:03Z

/ok to test

hyperbolic2346

Just nits from me this time.

cpp/src/io/json/read_json.cu

shrshi · 2024-04-30T17:15:16Z

/ok to test

shrshi · 2024-04-30T18:33:49Z

/merge

byte range reader improvement

697cf65

github-actions bot added the libcudf label Mar 26, 2024

shrshi added the feature request, Performance, improvement, and non-breaking labels and removed the feature request label Mar 26, 2024

shrshi added 2 commits March 27, 2024 00:39

subchunk size heuristic; multistream d2d copy; small logic fix

115c2c6

Merge branch 'branch-24.06' into byte-range-improvement

c99e4ef

shrshi marked this pull request as ready for review March 27, 2024 16:02

shrshi requested a review from a team as a code owner March 27, 2024 16:03

shrshi requested review from hyperbolic2346, mythrocks and vuule March 27, 2024 16:03

Merge branch 'branch-24.06' into byte-range-improvement

7f97196

vuule reviewed Mar 27, 2024

cpp/src/io/json/read_json.cu (outdated)

cpp/src/io/json/read_json.cu (outdated)

cpp/src/io/json/read_json.cu (outdated)

shrshi added 5 commits April 1, 2024 17:35

Merge branch 'branch-24.06' into byte-range-improvement

615f005

overhaul commit

0ac251d

Merge branch 'branch-24.06' into byte-range-improvement

5c21ee4

format fix

09641db

Merge branch 'byte-range-improvement' of github.com:shrshi/cudf into …

c186435

…byte-range-improvement

shrshi marked this pull request as draft April 5, 2024 10:21

more fixes

e912671

cleanup

8557cf9

Merge branch 'branch-24.06' into byte-range-improvement

16f7e7f

shrshi added 2 commits April 23, 2024 23:43

addressing PR reviews

5dc53d8

Merge branch 'branch-24.06' into byte-range-improvement

d29fdf8

shrshi added 2 commits April 24, 2024 17:17

fix

bd18397

Merge branch 'byte-range-improvement' of github.com:shrshi/cudf into …

a5e49af

…byte-range-improvement

Merge branch 'branch-24.06' into byte-range-improvement

54daff2

mythrocks reviewed Apr 26, 2024

cpp/src/io/json/json_normalization.cu (outdated)

mythrocks reviewed Apr 26, 2024

cpp/src/io/json/read_json.cu (outdated)

mythrocks reviewed Apr 26, 2024

cpp/src/io/json/read_json.cu (outdated)

mythrocks reviewed Apr 26, 2024

cpp/src/io/json/read_json.cu (outdated)

mythrocks reviewed Apr 27, 2024

cpp/src/io/json/read_json.cu (outdated)

mythrocks approved these changes Apr 27, 2024

hyperbolic2346 reviewed Apr 29, 2024

shrshi added 3 commits April 29, 2024 19:50

partially addressing reviews

9075159

PR reviews

7d826af

Merge branch 'byte-range-improvement' of github.com:shrshi/cudf into …

fb9fdae

…byte-range-improvement

Merge branch 'branch-24.06' into byte-range-improvement

032a5da

shrshi requested a review from hyperbolic2346 April 29, 2024 23:16

hyperbolic2346 approved these changes Apr 30, 2024

shrshi added 2 commits April 30, 2024 17:13

adding consts

329f9ae

Merge branch 'byte-range-improvement' of github.com:shrshi/cudf into …

df68938

…byte-range-improvement

rapids-bot bot merged commit f3206ea into rapidsai:branch-24.06 Apr 30, 2024
70 checks passed

vuule mentioned this pull request Nov 5, 2024

Reduce IO when byte_range option is used in read_json #15185

Closed
