
Compute column sizes in Parquet preprocess with single kernel #12931

Merged: 19 commits into branch-23.06 on Apr 7, 2023

Conversation

SrikarVanavasam (Contributor):

Addresses #11922

Currently in Parquet preprocessing, a thrust::reduce() and a thrust::exclusive_scan_by_key() are performed per nested column to compute its size and offsets. For complicated schemas this results in a large number of kernel invocations. This PR instead calculates the sizes and offsets of all columns in single calls to thrust::reduce_by_key() and thrust::exclusive_scan_by_key().
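As a rough illustration of the batching pattern (editor's standalone sketch with toy data, not the PR's actual code; the real keys and iterators are built from the page and column metadata):

#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/iterator/discard_iterator.h>
#include <vector>
#include <cstdio>

int main() {
  // flattened per-page sizes for three columns; keys mark the owning column
  std::vector<int> h_keys {0, 0, 1, 1, 1, 2};
  std::vector<int> h_sizes{4, 6, 1, 2, 3, 9};
  thrust::device_vector<int> keys  = h_keys;
  thrust::device_vector<int> sizes = h_sizes;

  // before (conceptually): one thrust::reduce launch per column segment
  // after: a single reduce_by_key launch computes every column size at once
  thrust::device_vector<int> col_sizes(3);
  thrust::reduce_by_key(keys.begin(), keys.end(), sizes.begin(),
                        thrust::make_discard_iterator(), col_sizes.begin());

  for (int v : col_sizes) std::printf("%d ", v);  // prints: 10 6 9
  return 0;
}

The per-column exclusive offsets batch the same way, with a single exclusive_scan_by_key over the same keys.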

This change results in around a 1.3x speedup when reading a complicated schema.

Before: [profiler timeline screenshot]

After: [profiler timeline screenshot]

@SrikarVanavasam SrikarVanavasam requested a review from a team as a code owner March 13, 2023 20:21
@github-actions bot added the libcudf label (Affects libcudf (C++/CUDA) code) Mar 13, 2023
@SrikarVanavasam added the improvement (Improvement / enhancement to an existing function), non-breaking (Non-breaking change), and 3 - Ready for Review (Ready for review by team) labels Mar 13, 2023
@SrikarVanavasam SrikarVanavasam changed the title Compute column sizes in Parquet preprocess in single kernel Compute column sizes in Parquet preprocess with single kernel Mar 14, 2023
GregoryKimball (Contributor):

Would it be accurate to say that column size computation is 4x faster for this schema? Would you please share a few details about the schema in your test case?

SrikarVanavasam (Contributor, Author):

> Would it be accurate to say that column size computation is 4x faster for this schema? Would you please share a few details about the schema in your test case?

The highlighted region in the timelines also includes the memsets for allocating the columns, so the column size and offset computation alone is actually many times faster in this case; but I think it would be fair to say that column allocation overall is 4x faster for this schema. This schema has 1629 columns with up to 8 levels of nesting, which is why there were so many thrust::reduce() and thrust::exclusive_scan_by_key() calls before.

divyegala (Member) left a comment:

I don't have enough domain knowledge to review this PR. The code looks fine to me, but someone else will need to verify the logic.

(1 resolved review thread on cpp/src/io/parquet/reader_impl_preprocess.cu)
PointKernel (Member) left a comment:
Looks good in general, with some small questions/suggestions.

(4 resolved review threads on cpp/src/io/parquet/reader_impl_preprocess.cu)
vuule (Contributor) left a comment:
Impressive speedups!
Some surface-level comments; I did not go into the actual algorithm changes yet.
Echoing @PointKernel's observations about mixed integral types.
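For context, a standalone illustration of why mixed integral types in Thrust reductions draw review attention (editor's sketch; the hidden review threads may have discussed a different case): the accumulator type is deduced from the init argument, not from the input values.

#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <cstddef>
#include <cstdio>

int main() {
  // 2^20 elements of 2^13 each: the true sum (2^33) exceeds INT_MAX
  thrust::device_vector<std::size_t> sizes(1 << 20, std::size_t{1} << 13);

  // an int literal as init silently gives an int accumulator, which overflows
  auto bad  = thrust::reduce(sizes.begin(), sizes.end(), 0);
  // a size_t init gives a size_t accumulator and the correct sum
  auto good = thrust::reduce(sizes.begin(), sizes.end(), std::size_t{0});

  std::printf("%lld vs %zu\n", static_cast<long long>(bad), good);
  return 0;
}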

(4 resolved review threads on cpp/src/io/parquet/reader_impl_preprocess.cu)
Review comment on cpp/src/io/parquet/reader_impl_preprocess.cu, lines 1684 to 1701:
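// compute the size of each column in a single pass, reducing the per-page
// sizes within each reduction key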
thrust::reduce_by_key(rmm::exec_policy(_stream),
reduction_keys,
reduction_keys + num_keys,
size_input,
thrust::make_discard_iterator(),
sizes.d_begin());

// for nested hierarchies, compute per-page start offset
thrust::exclusive_scan_by_key(rmm::exec_policy(_stream),
reduction_keys,
reduction_keys + num_keys,
size_input,
start_offset_output_iterator{pages.device_ptr(),
page_index.begin(),
0,
input_cols.device_ptr(),
max_depth,
pages.size()});
ttnghia (Contributor) commented Mar 23, 2023:

I wonder if we can do even better by combining these two kernel calls, since they operate on the same reduction_keys. Maybe just one reduce_by_key with a custom device lambda/functor that does both the reduce and the scan?

SrikarVanavasam (Contributor, Author) replied:
I'm not quite sure how to create a functor for reduce_by_key that would also perform the scan, but it could be possible. Do you have an idea in mind?
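One conceivable direction (editor's sketch, not something proposed in this thread): a single thrust::inclusive_scan_by_key yields both quantities, since the inclusive prefix minus each element's own value is its exclusive offset, and the prefix at the last element of each key segment is that key's total. All names below are illustrative.

#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <thrust/transform.h>
#include <thrust/functional.h>
#include <vector>
#include <cstdio>

int main() {
  std::vector<int> h_keys {0, 0, 0, 1, 1, 2};   // three key segments
  std::vector<int> h_sizes{2, 3, 4, 5, 6, 7};   // per-page sizes
  thrust::device_vector<int> keys  = h_keys;
  thrust::device_vector<int> sizes = h_sizes;
  thrust::device_vector<int> inclusive(keys.size());

  // one pass: running totals within each key segment
  thrust::inclusive_scan_by_key(keys.begin(), keys.end(),
                                sizes.begin(), inclusive.begin());

  // exclusive offsets fall out by subtracting each element's own size
  thrust::device_vector<int> offsets(keys.size());
  thrust::transform(inclusive.begin(), inclusive.end(), sizes.begin(),
                    offsets.begin(), thrust::minus<int>());

  // inclusive = {2, 5, 9, 5, 11, 7}: the segment-end values 9, 11, 7 are the
  // per-key totals, i.e. the reduce_by_key results
  for (int v : offsets) std::printf("%d ", v);  // prints: 0 2 5 0 5 0
  return 0;
}

Extracting the segment-end totals into a compact array would still take a gather (or a copy over segment boundaries), so it is not obvious this beats the two fused-key calls the PR already uses.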

karthikeyann (Contributor) left a comment:
Prefer auto and const everywhere; the same comments apply throughout.

(2 resolved review threads on cpp/src/io/parquet/reader_impl_preprocess.cu)
vuule (Contributor) commented Mar 28, 2023:

@SrikarVanavasam there are failing Parquet tests with the current version. Looks like it's related to the review changes, since CI passed with e8b7c24.
Edit: I can't find what could be causing the regression; the last commit looks good.

nvdbaranec (Contributor) left a comment:
This looks good to me. The only minor thing I'd bring up is that in the reduction_indices struct, I'd suggest using underscores for the constructor parameters and keeping the fields of the struct itself without them.
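A minimal sketch of the suggested convention (editor's illustration; the actual fields of reduction_indices are not shown in this thread, so page and depth here are assumed names):

#include <cstddef>

struct reduction_indices {
  // underscores on the constructor parameters, none on the fields
  reduction_indices(std::size_t page_, std::size_t depth_)
    : page(page_), depth(depth_) {}

  std::size_t page;
  std::size_t depth;
};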

I like the abstraction of reduction_indices itself; it clarifies how we're organizing the keys.

Also, if there's a bug that got introduced during PR changes, be sure to also re-run the Spark integration tests after fixing.

@SrikarVanavasam SrikarVanavasam changed the base branch from branch-23.04 to branch-23.06 March 31, 2023 22:40
PointKernel (Member) left a comment:
I'm happy with the current state of this PR. Thanks for your effort in addressing the review comments!

@SrikarVanavasam SrikarVanavasam requested a review from ttnghia April 7, 2023 02:51
@ttnghia ttnghia requested a review from nvdbaranec April 7, 2023 04:11
vuule (Contributor) commented Apr 7, 2023:

/merge
