Parquet reader list microkernel #16538

Merged

rapids-bot merged 52 commits into rapidsai:branch-24.12 from pmattione-nvidia:parquet_list_kernel

Oct 29, 2024

Contributor

pmattione-nvidia commented Aug 12, 2024

This PR refactors fixed-width Parquet list decoding into its own set of microkernels by templatizing the existing fixed-width microkernels. When rows are skipped for lists, the definition, repetition, and dictionary rle_streams are skipped ahead as well. The list kernel uses 128 threads per block and 71 registers per thread, so I've changed the launch_bounds to enforce a minimum of 8 blocks per SM. This causes a small register spill, but the benchmarks are still faster, as seen below:

DEVICE_BUFFER list benchmarks (decompress + decode, not bound by IO):
run_length 1, cardinality 0, no byte_limit: 24.7% faster
run_length 32, cardinality 1000, no byte_limit: 18.3% faster
run_length 1, cardinality 0, 500kb byte_limit: 57% faster
run_length 32, cardinality 1000, 500kb byte_limit: 53% faster

Compressed list of ints on hard drive: 5.5% faster
Sample real data on hard drive (many columns not lists): 0.5% faster
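
The register spill mentioned above follows from simple occupancy arithmetic. As a rough sketch, assume an SM with 65,536 32-bit registers (the case on many recent NVIDIA GPUs, but an assumption here, so check the target architecture) and ignore register allocation granularity. Requiring 8 resident blocks of 128 threads then leaves at most 64 registers per thread, 7 fewer than the 71 the list kernel would otherwise use. The function and constant names below are hypothetical and are not part of cudf.

```cpp
#include <algorithm>
#include <cassert>

// Assumption: 65,536 32-bit registers per SM. This is not stated in the PR;
// verify it for the architecture you are targeting.
int const registers_per_sm = 65536;

// Largest per-thread register count that still allows `min_blocks_per_sm`
// blocks of `threads_per_block` threads to be resident on one SM.
// Ignores register allocation granularity, so the real cap may be lower.
int max_registers_per_thread(int threads_per_block, int min_blocks_per_sm)
{
  assert(threads_per_block > 0 && min_blocks_per_sm > 0);
  return registers_per_sm / (threads_per_block * min_blocks_per_sm);
}

// Number of registers per thread that no longer fit in the budget and
// would have to be spilled (zero if the kernel already fits).
int registers_over_budget(int registers_used, int threads_per_block, int min_blocks_per_sm)
{
  int const budget = max_registers_per_thread(threads_per_block, min_blocks_per_sm);
  return std::max(0, registers_used - budget);
}
```

Under these assumptions, `max_registers_per_thread(128, 8)` is 64 and `registers_over_budget(71, 128, 8)` is 7, which is consistent with the small spill reported in the description.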

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.
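
The idea of skipping ahead in a run-length-encoded stream, as described above for the definition, repetition, and dictionary streams, can be sketched on the host with a simplified model. This is not the cudf `rle_stream` implementation: it assumes a stream made only of (value, length) runs, whereas Parquet's hybrid encoding also has bit-packed literal runs, and every type and function name here is hypothetical. The point is that skipping walks the same cursor as decoding but never writes any output.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical simplified run: `length` repetitions of `value`.
struct run {
  int value;
  std::size_t length;
};

// Hypothetical cursor into a sequence of runs.
struct run_cursor {
  std::size_t run_index     = 0;  // run currently being consumed
  std::size_t offset_in_run = 0;  // values already consumed from that run
};

// Consume up to `count` values starting at `cur`. If `out` is non-null the
// values are appended to it; if it is null they are only skipped.
// Returns the number of values actually consumed.
std::size_t consume_values(std::vector<run> const& runs,
                           run_cursor& cur,
                           std::size_t count,
                           std::vector<int>* out)
{
  std::size_t consumed = 0;
  while (consumed < count && cur.run_index < runs.size()) {
    run const& r = runs[cur.run_index];
    assert(cur.offset_in_run <= r.length);
    std::size_t const step = std::min(r.length - cur.offset_in_run, count - consumed);
    if (out != nullptr) { out->insert(out->end(), step, r.value); }
    cur.offset_in_run += step;
    consumed += step;
    if (cur.offset_in_run == r.length) {  // run exhausted, move to the next one
      ++cur.run_index;
      cur.offset_in_run = 0;
    }
  }
  return consumed;
}

// Skip ahead without materializing any values.
std::size_t skip_values(std::vector<run> const& runs, run_cursor& cur, std::size_t count)
{
  return consume_values(runs, cur, count, nullptr);
}

// Decode values into `out`, continuing from wherever the cursor is.
std::size_t decode_values(std::vector<run> const& runs,
                          run_cursor& cur,
                          std::size_t count,
                          std::vector<int>& out)
{
  return consume_values(runs, cur, count, &out);
}
```

For example, with runs `{7, 3}, {9, 2}, {4, 4}`, skipping 4 values and then decoding 3 appends 9, 4, 4 to the output. In the real reader, mapping skipped rows to a number of values for list columns also depends on the repetition levels, which this sketch leaves out.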


          work in progress

b5ec22e

github-actions bot added the libcudf label

pmattione-nvidia and others added 28 commits

August 16, 2024 16:19


          Further work in list code

2ca9618


          Tests working

4b5f91a


          Revert page_decode changes

ead17b8


          Merge branch 'branch-24.10' into parquet_list_kernel

cc32409


          Add debugging

0dccec5


          Tests working

e239e79


          Merge branch 'branch-24.10' into parquet_list_kernel

8f25453


          compile fixes

24c9ab1


          No need to decode def levels if not nullable

342c2f4


          Manual block scan

50bbc94


          Optimize parquet reader block scans, simplify and consolidate non-nullable column code


          tweak syncing

3ef7b0d


          small tweaks


          Merge branch 'branch-24.10' into parquet_list_kernel


          Add skipping to rle_stream, use for lists (chunked reads)

e285fbf


          tweak scan interface for linked lists

254f3e9


          Merge branch 'branch-24.12' into mukernels_fixedwidth_optimize

18d989c


          style fixes

8ea1e0e


          Merge branch 'mukernels_fixedwidth_optimize' of https://github.com/pmattione-nvidia/cudf into mukernels_fixedwidth_optimize

326b386


          Update cpp/src/io/parquet/decode_fixed.cu

41cb982

Co-authored-by: nvdbaranec


          Update cpp/src/io/parquet/decode_fixed.cu

6e70554

Co-authored-by: nvdbaranec


          Update cpp/src/io/parquet/decode_fixed.cu

9ad4415

Co-authored-by: nvdbaranec


          Unroll block-count loop

3a1fc95


          Merge branch 'mukernels_fixedwidth_optimize' of https://github.com/pmattione-nvidia/cudf into mukernels_fixedwidth_optimize

0babf46


          more style fixes

5ab9829


          Merge branch 'branch-24.12' into mukernels_fixedwidth_optimize

310d50c


          Disable manual block scan for non-lists


          Update cpp/src/io/parquet/decode_fixed.cu

c0ed2cb

Co-authored-by: Vukasin Milovanovic

pmattione-nvidia and others added 2 commits

October 11, 2024 12:49


          revert cmakelists change

e51406c


          Merge branch 'branch-24.12' into parquet_list_kernel

0237e5c

pmattione-nvidia marked this pull request as ready for review

October 11, 2024 18:05

pmattione-nvidia requested a review from a team as a code owner

October 11, 2024 18:05

pmattione-nvidia requested review from mythrocks, vuule and nvdbaranec

October 11, 2024 18:05

Contributor

nvdbaranec commented Oct 17, 2024

Seems like this is also adding list support to the split page path as well. Am I reading this right?

Contributor

nvdbaranec commented Oct 17, 2024

One thing I've been thinking about is maybe splitting this file into two or three pieces:

One .cu file containing the core loops for each of the major kernels (and the host-side launch code)
A .cuh file for the "update" functions
A .cuh file for the "decode values" functions

Definitely not for this PR, but something to think about down the road. I think it might help make the volume of code that has built up here more tractable.

nvdbaranec requested changes

Contributor

nvdbaranec left a comment

Skimmed through it. Will get more in-depth tomorrow.

cpp/src/io/parquet/rle_stream.cuh (outdated, resolved)

cpp/src/io/parquet/decode_fixed.cu (outdated, resolved)

cpp/src/io/parquet/decode_fixed.cu (outdated, resolved)

cpp/src/io/parquet/decode_fixed.cu (outdated, resolved)

cpp/src/io/parquet/decode_fixed.cu (outdated, resolved)

cpp/src/io/parquet/decode_fixed.cu (resolved)

cpp/src/io/parquet/decode_fixed.cu (resolved)

cpp/src/io/parquet/rle_stream.cuh (resolved)


          Update cpp/src/io/parquet/rle_stream.cuh

07ffbf2

Co-authored-by: nvdbaranec

Contributor Author

pmattione-nvidia commented Oct 18, 2024

> Seems like this is also adding list support to the split page path as well. Am I reading this right?

Yes.


          refactor rle_stream

32fe8b9

mythrocks reviewed

cpp/src/io/parquet/decode_fixed.cu (resolved)

vuule reviewed

Contributor

vuule left a comment

Few minor questions/suggestions
Not sure I could find issues with the decode algorithm :D

cpp/src/io/parquet/decode_fixed.cu (resolved)

cpp/src/io/parquet/rle_stream.cuh (outdated, resolved)

cpp/src/io/parquet/rle_stream.cuh (outdated, resolved)


          Use divide function

031ac6b

vuule requested a review from nvdbaranec

October 23, 2024 16:12


          Merge branch 'branch-24.12' into parquet_list_kernel

a82ae40

nvdbaranec requested changes

Contributor

nvdbaranec left a comment

Looks great. Mostly just more small stuff.

cpp/src/io/parquet/decode_fixed.cu (outdated, resolved)

cpp/src/io/parquet/decode_fixed.cu (outdated, resolved)

cpp/src/io/parquet/rle_stream.cuh (resolved)

cpp/src/io/parquet/decode_fixed.cu (outdated, resolved)

cpp/src/io/parquet/decode_fixed.cu (outdated, resolved)

cpp/src/io/parquet/decode_fixed.cu (resolved)

cpp/src/io/parquet/decode_fixed.cu (outdated, resolved)

pmattione-nvidia added 2 commits

October 25, 2024 16:06


          address comments

534e67d


          Merge branch 'parquet_list_kernel' of https://github.com/pmattione-nvidia/cudf into parquet_list_kernel

45c00b8

vuule approved these changes

nvdbaranec approved these changes

ttnghia approved these changes

Contributor

ttnghia left a comment

Please also run compute-sanitizer on the unit tests to make sure everything is good.


          Change scan interface to pass in shared memory to avoid sync issues

a6adb0d

Contributor Author

pmattione-nvidia commented Oct 28, 2024

> Please also run compute-sanitizer on the unit tests to make sure everything is good.

Tests pass.


          switch to sharing memory between scans

f4aedb9

vuule added the 5 - Ready to Merge label

Contributor Author

pmattione-nvidia commented Oct 29, 2024

/merge

rapids-bot merged commit eeb4d27 into rapidsai:branch-24.12

102 checks passed

Labels

5 - Ready to Merge, improvement, libcudf, non-breaking, Performance