Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parallelize gpuInitStringDescriptors for fixed length byte array data #16109

Merged

Conversation

mhaseeb123
Copy link
Member

@mhaseeb123 mhaseeb123 commented Jun 27, 2024

Description

Closes #14113

This PR parallelizes the gpuInitStringDescriptors function for the fixed length byte array (FLBA) data at either warp or thread block level via cooperative groups. The function continues to execute serially (thread rank 0 in the group) for variable length arrays.

CC: @etseidl

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Jun 27, 2024
@mhaseeb123 mhaseeb123 added 2 - In Progress Currently a work in progress cuIO cuIO issue improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Jun 27, 2024
@mhaseeb123
Copy link
Member Author

mhaseeb123 commented Jun 27, 2024

Performance Improvement

Marginal only. Comparison ran several times to and similar improvements seen.

Measured by running Nsight systems on the Gtest ParquetWriterTest.WriteFixedLenByteArray with: constexpr cudf::size_type num_rows = 80'000'000;.

Testbed:

NVIDIA RTX 5880 Ada Generation
AMD Ryzen Threadripper PRO 5975WX 32-Cores
NVIDIA-SMI 550.67
Driver Version: 550.67
CUDA Version: 12.4
Devcontainer: cuda-12.2-pip

Old:

Time (%)  Total Time (ns)  Instances    Avg (ns)       Med (ns)      Min (ns)     Max (ns)    StdDev (ns)                                                   Name   
8.3       34,382,525          1   34,382,525.0   34,382,525.0   34,382,525   34,382,525           0.0  void cudf::io::parquet::detail::<unnamed>::gpuDecodeStringPageData<unsigned char>(cudf::io::parquet…

New:

Time (%)  Total Time (ns)  Instances    Avg (ns)       Med (ns)      Min (ns)     Max (ns)    StdDev (ns)                                                   Name                                                
8.3       34,268,127          1   34,268,127.0   34,268,127.0   34,268,127   34,268,127           0.0  void cudf::io::parquet::detail::<unnamed>::gpuDecodeStringPageData<unsigned char>(cudf::io::parquet…

@mhaseeb123 mhaseeb123 self-assigned this Jun 27, 2024
@mhaseeb123 mhaseeb123 marked this pull request as ready for review June 27, 2024 18:08
@mhaseeb123 mhaseeb123 requested a review from a team as a code owner June 27, 2024 18:08
@mhaseeb123 mhaseeb123 added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Jun 27, 2024
Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good, few optional suggestions.

cpp/src/io/parquet/page_data.cu Outdated Show resolved Hide resolved
@@ -277,6 +279,7 @@ CUDF_KERNEL void __launch_bounds__(decode_block_size)
}
// this needs to be here to prevent warp 3 modifying src_pos before all threads have read it
__syncthreads();
auto const tile32 = cg::tiled_partition<cudf::detail::warp_size>(cg::this_thread_block());
if (t < 32) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be nice to use tile32 here since we already have it, but I'm not convinced it can be done in a simple way.

Copy link
Member Author

@mhaseeb123 mhaseeb123 Jul 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are right that we could replace this t < 32 with tile_warp.meta_group_rank() == 0 and it should be good but the logic at L289 is messier to replace since it may be tile_warp.meta_group_rank() == 0 or 1 depending on out_threads0 == 32 or 64 so I left this as is for simplicity.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could/should probably port the whole logic to thread_groups and avoid the magic 32 multiples. I'd expect that it would not be more complex than the current logic.
Not something for this PR.

cpp/src/io/parquet/page_string_decode.cu Outdated Show resolved Hide resolved
Copy link
Contributor

@shrshi shrshi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One clarifying question but looks good to me otherwise!

cpp/src/io/parquet/page_decode.cuh Show resolved Hide resolved
cpp/src/io/parquet/page_string_decode.cu Outdated Show resolved Hide resolved
@mhaseeb123 mhaseeb123 added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Jul 9, 2024
@mhaseeb123
Copy link
Member Author

/merge

@rapids-bot rapids-bot bot merged commit 7cc01be into rapidsai:branch-24.08 Jul 9, 2024
80 checks passed
@mhaseeb123 mhaseeb123 deleted the paralllel-init-str-descriptors branch July 9, 2024 21:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge cuIO cuIO issue improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Parallelize gpuInitStringDescriptors when Parquet input type is FIXED_LEN_BYTE_ARRAY
4 participants