Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parquet reader list microkernel #16538

Merged
merged 52 commits into from
Oct 29, 2024

Conversation

pmattione-nvidia
Copy link
Contributor

@pmattione-nvidia pmattione-nvidia commented Aug 12, 2024

This PR refactors fixed-width parquet list reader decoding into its own set of micro-kernels, templatizing the existing fixed-width microkernels. When skipping rows for lists, this will skip ahead the decoding of the definition, repetition, and dictionary rle_streams as well. The list kernel uses 128 threads per block and 71 registers per thread, so I've changed the launch_bounds to enforce a minimum of 8 blocks per SM. This causes a small register spill but the benchmarks are still faster, as seen below:

DEVICE_BUFFER list benchmarks (decompress + decode, not bound by IO):
run_length 1, cardinality 0, no byte_limit: 24.7% faster
run_length 32, cardinality 1000, no byte_limit: 18.3% faster
run_length 1, cardinality 0, 500kb byte_limit: 57% faster
run_length 32, cardinality 1000, 500kb byte_limit: 53% faster

Compressed list of ints on hard drive: 5.5% faster
Sample real data on hard drive (many columns not lists): 0.5% faster

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Aug 12, 2024
pmattione-nvidia and others added 28 commits August 16, 2024 16:19
@pmattione-nvidia pmattione-nvidia marked this pull request as ready for review October 11, 2024 18:05
@pmattione-nvidia pmattione-nvidia requested a review from a team as a code owner October 11, 2024 18:05
@nvdbaranec
Copy link
Contributor

Seems like this is also adding list support to the split page path as well. Am I reading this right?

@nvdbaranec
Copy link
Contributor

nvdbaranec commented Oct 17, 2024

One thing I've been thinking about is maybe splitting this file into two or three pieces.

  • One cu file containing the core loops for each of the major kernels (and the host side launch code)
  • A cuh file for the "update" functions
  • A cuh file for the "decode values" functions.

Definitely not for this PR, but something to think about down the road. I think it might help make the volume of code that has built up here more tractable.

Copy link
Contributor

@nvdbaranec nvdbaranec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Skimmed through it. Will get more in-depth tomorrow.

cpp/src/io/parquet/rle_stream.cuh Outdated Show resolved Hide resolved
cpp/src/io/parquet/decode_fixed.cu Outdated Show resolved Hide resolved
cpp/src/io/parquet/decode_fixed.cu Outdated Show resolved Hide resolved
cpp/src/io/parquet/decode_fixed.cu Outdated Show resolved Hide resolved
cpp/src/io/parquet/decode_fixed.cu Outdated Show resolved Hide resolved
cpp/src/io/parquet/decode_fixed.cu Show resolved Hide resolved
cpp/src/io/parquet/decode_fixed.cu Show resolved Hide resolved
cpp/src/io/parquet/rle_stream.cuh Show resolved Hide resolved
@pmattione-nvidia
Copy link
Contributor Author

Seems like this is also adding list support to the split page path as well. Am I reading this right?

Yes.

Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few minor questions/suggestions
Not sure I could find issues with the decode algorithm :D

cpp/src/io/parquet/decode_fixed.cu Show resolved Hide resolved
cpp/src/io/parquet/rle_stream.cuh Outdated Show resolved Hide resolved
cpp/src/io/parquet/rle_stream.cuh Outdated Show resolved Hide resolved
@vuule vuule requested a review from nvdbaranec October 23, 2024 16:12
Copy link
Contributor

@nvdbaranec nvdbaranec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great. Mostly just more small stuff.

cpp/src/io/parquet/decode_fixed.cu Outdated Show resolved Hide resolved
cpp/src/io/parquet/decode_fixed.cu Outdated Show resolved Hide resolved
cpp/src/io/parquet/rle_stream.cuh Show resolved Hide resolved
cpp/src/io/parquet/decode_fixed.cu Outdated Show resolved Hide resolved
cpp/src/io/parquet/decode_fixed.cu Outdated Show resolved Hide resolved
cpp/src/io/parquet/decode_fixed.cu Show resolved Hide resolved
cpp/src/io/parquet/decode_fixed.cu Outdated Show resolved Hide resolved
Copy link
Contributor

@ttnghia ttnghia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also run compute-sanitizer on the unit tests to make sure everything is good.

@pmattione-nvidia
Copy link
Contributor Author

Please also run compute-sanitizer on the unit tests to make sure everything is good.

Tests pass.

@vuule vuule added the 5 - Ready to Merge Testing and reviews complete, ready to merge label Oct 29, 2024
@pmattione-nvidia
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit eeb4d27 into rapidsai:branch-24.12 Oct 29, 2024
102 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Performance Performance related issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants