Parallelize `gpuInitStringDescriptors` for fixed length byte array data #16109

mhaseeb123 · 2024-06-27T01:57:42Z

Description

This PR parallelizes the `gpuInitStringDescriptors` function for fixed-length byte array (FLBA) data at either the warp or the thread block level via cooperative groups. For variable-length arrays, the function continues to execute serially (on thread rank 0 of the group).
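A minimal sketch of the idea, not the cuDF implementation: in Parquet plain encoding, fixed-length values are stored back to back, so descriptor i can be computed from i alone, while variable-length values carry a 4-byte little-endian length prefix, so each offset depends on every earlier length. The names `str_desc`, `init_descriptors`, `init_kernel`, and `run_init` are hypothetical and exist only for this illustration; the block size of 128 is likewise an assumption.

```cuda
#include <cooperative_groups.h>
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

namespace cg = cooperative_groups;

// Hypothetical descriptor (not the cuDF type): start offset and byte length of one value.
struct str_desc {
  int offset;
  int length;
};

// Illustrative helper, not the cuDF function. Works for any cooperative group
// (a warp-sized tile or a whole thread block), passed by const reference.
template <typename Group>
__device__ void init_descriptors(Group const& g, unsigned char const* data, int data_size,
                                 int count, int fixed_len, str_desc* out)
{
  if (fixed_len > 0) {
    // Fixed length: descriptor i depends only on i, so the group strides over the values.
    for (int i = static_cast<int>(g.thread_rank()); i < count; i += static_cast<int>(g.size())) {
      out[i] = str_desc{i * fixed_len, fixed_len};
    }
  } else if (g.thread_rank() == 0) {
    // Variable length: each offset depends on all earlier lengths, so one thread walks serially.
    int pos = 0;
    for (int i = 0; i < count; ++i) {
      int len = 0;
      if (pos + 4 <= data_size) {
        len = static_cast<int>(data[pos] | (data[pos + 1] << 8) | (data[pos + 2] << 16) |
                               (static_cast<unsigned>(data[pos + 3]) << 24));
        pos += 4;
      }
      out[i] = str_desc{pos, len};
      pos += len;
    }
  }
  g.sync();  // the whole group waits until the descriptors are written
}

__global__ void init_kernel(unsigned char const* data, int data_size, int count, int fixed_len,
                            bool warp_level, str_desc* out)
{
  auto const block = cg::this_thread_block();
  if (warp_level) {
    // Warp level: only the first 32-thread tile does the work.
    auto const warp = cg::tiled_partition<32>(block);
    if (warp.meta_group_rank() == 0) {
      init_descriptors(warp, data, data_size, count, fixed_len, out);
    }
  } else {
    // Block level: every thread in the block participates.
    init_descriptors(block, data, data_size, count, fixed_len, out);
  }
}

// Host-side driver for the sketch (assumes count > 0 and non-empty data).
std::vector<str_desc> run_init(std::vector<unsigned char> const& data, int count, int fixed_len,
                               bool warp_level)
{
  unsigned char* d_data = nullptr;
  str_desc* d_out       = nullptr;
  cudaError_t err       = cudaMalloc(&d_data, data.size());
  assert(err == cudaSuccess);
  err = cudaMalloc(&d_out, count * sizeof(str_desc));
  assert(err == cudaSuccess);
  err = cudaMemcpy(d_data, data.data(), data.size(), cudaMemcpyHostToDevice);
  assert(err == cudaSuccess);
  init_kernel<<<1, 128>>>(d_data, static_cast<int>(data.size()), count, fixed_len, warp_level, d_out);
  std::vector<str_desc> out(count);
  err = cudaMemcpy(out.data(), d_out, count * sizeof(str_desc), cudaMemcpyDeviceToHost);
  assert(err == cudaSuccess);
  (void)err;
  cudaFree(d_data);
  cudaFree(d_out);
  return out;
}
```

Calling `run_init(data, count, 16, true)` exercises the parallel warp-level path, while passing `fixed_len = 0` falls back to the serial walk on thread rank 0. Templating on the group type is what lets the same helper serve both the warp and the thread block case.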

CC: @etseidl

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

mhaseeb123 · 2024-06-27T04:56:31Z

Performance Improvement

Marginal only. The comparison was run several times and similar improvements were seen each time.

Measured by running Nsight Systems on the GTest ParquetWriterTest.WriteFixedLenByteArray with constexpr cudf::size_type num_rows = 80'000'000;.

Testbed:

NVIDIA RTX 5880 Ada Generation
AMD Ryzen Threadripper PRO 5975WX 32-Cores
NVIDIA-SMI 550.67
Driver Version: 550.67
CUDA Version: 12.4
Devcontainer: cuda-12.2-pip

Old:

Time (%)  Total Time (ns)  Instances    Avg (ns)       Med (ns)      Min (ns)     Max (ns)    StdDev (ns)                                                   Name   
8.3       34,382,525          1   34,382,525.0   34,382,525.0   34,382,525   34,382,525           0.0  void cudf::io::parquet::detail::<unnamed>::gpuDecodeStringPageData<unsigned char>(cudf::io::parquet…

New:

Time (%)  Total Time (ns)  Instances    Avg (ns)       Med (ns)      Min (ns)     Max (ns)    StdDev (ns)                                                   Name                                                
8.3       34,268,127          1   34,268,127.0   34,268,127.0   34,268,127   34,268,127           0.0  void cudf::io::parquet::detail::<unnamed>::gpuDecodeStringPageData<unsigned char>(cudf::io::parquet…
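
To put a number on "marginal", the relative change can be worked out from the two totals above. This is a rough figure only, since each profile contains a single kernel instance and therefore gives no variance estimate:

$$
\frac{34{,}382{,}525 - 34{,}268{,}127}{34{,}382{,}525} = \frac{114{,}398}{34{,}382{,}525} \approx 0.33\%
$$

That is roughly 0.11 ms saved out of about 34.4 ms of kernel time.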

cpp/src/io/parquet/page_decode.cuh

vuule

Looks good, a few optional suggestions.

cpp/src/io/parquet/page_data.cu

vuule · 2024-07-09T06:23:43Z

cpp/src/io/parquet/page_data.cu

@@ -277,6 +279,7 @@ CUDF_KERNEL void __launch_bounds__(decode_block_size)
    }
    // this needs to be here to prevent warp 3 modifying src_pos before all threads have read it
    __syncthreads();
+    auto const tile32 = cg::tiled_partition<cudf::detail::warp_size>(cg::this_thread_block());
    if (t < 32) {


Would be nice to use tile32 here since we already have it, but I'm not convinced it can be done in a simple way.

Yes, you are right that we could replace this t < 32 with tile_warp.meta_group_rank() == 0, and that should be fine. The logic at L289 is messier to replace, though, since the condition may be tile_warp.meta_group_rank() == 0 or 1 depending on whether out_threads0 is 32 or 64, so I left this as is for simplicity.

We could/should probably port the whole logic to thread_groups and avoid the magic 32 multiples. I'd expect that it would not be more complex than the current logic.
Not something for this PR.
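
As a rough illustration of what such a port could look like (a sketch under assumptions, not the code in page_data.cu): for a one-dimensional block, `t < 32 * k` is equivalent to `meta_group_rank() < k` on a 32-thread tile, so index comparisons against multiples of 32 can be rewritten as comparisons on the warp rank. The `role` enum, `classify` kernel, and `roles_match` helper are hypothetical, the three-way split is only a stand-in for the real branching, and `out_threads0` is assumed to be a multiple of 32.

```cuda
#include <cooperative_groups.h>
#include <cuda_runtime.h>
#include <vector>

namespace cg = cooperative_groups;

// Hypothetical role labels, for illustration only.
enum role : int { first_warp = 0, middle = 1, output = 2 };

__global__ void classify(int out_threads0, int* by_index, int* by_tile)
{
  int const t          = threadIdx.x;
  auto const tile_warp = cg::tiled_partition<32>(cg::this_thread_block());
  int const warp_id    = static_cast<int>(tile_warp.meta_group_rank());

  // Index-based branching with magic multiples of 32.
  by_index[t] = (t < 32) ? first_warp : (t < out_threads0) ? middle : output;

  // Tile-based branching: the same split expressed through the warp rank.
  int const first_output_warp = out_threads0 / 32;
  by_tile[t] = (warp_id == 0) ? first_warp : (warp_id < first_output_warp) ? middle : output;
}

// Returns true when both classifications agree for every thread of a 128-thread block.
bool roles_match(int out_threads0)
{
  constexpr int block_size = 128;  // assumed block size for this sketch
  int* d_index = nullptr;
  int* d_tile  = nullptr;
  bool ok = cudaMalloc(&d_index, block_size * sizeof(int)) == cudaSuccess;
  ok = ok && cudaMalloc(&d_tile, block_size * sizeof(int)) == cudaSuccess;
  std::vector<int> by_index(block_size, -1), by_tile(block_size, -2);
  if (ok) {
    classify<<<1, block_size>>>(out_threads0, d_index, d_tile);
    ok = cudaMemcpy(by_index.data(), d_index, block_size * sizeof(int),
                    cudaMemcpyDeviceToHost) == cudaSuccess;
    ok = ok && cudaMemcpy(by_tile.data(), d_tile, block_size * sizeof(int),
                          cudaMemcpyDeviceToHost) == cudaSuccess;
  }
  cudaFree(d_index);
  cudaFree(d_tile);
  return ok && by_index == by_tile;
}
```

With `out_threads0 == 32` the middle group is empty, and with `out_threads0 == 64` it is exactly the tile with `meta_group_rank() == 1`; `roles_match(32)` and `roles_match(64)` check that the two formulations agree in both cases.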

cpp/src/io/parquet/page_string_decode.cu

shrshi

One clarifying question but looks good to me otherwise!

cpp/src/io/parquet/page_decode.cuh

cpp/src/io/parquet/page_string_decode.cu

Co-authored-by: Vukasin Milovanovic <[email protected]> Co-authored-by: Shruti Shivakumar <[email protected]>

cpp/src/io/parquet/page_decode.cuh

Co-authored-by: Yunsong Wang <[email protected]>

mhaseeb123 · 2024-07-09T21:05:07Z

/merge

parallelize gpuInitStringDescriptors for FLBA data

f34b5e9

github-actions bot added the libcudf (Affects libcudf (C++/CUDA) code.) label Jun 27, 2024

mhaseeb123 added the labels 2 - In Progress (Currently a work in progress), cuIO (cuIO issue), improvement (Improvement / enhancement to an existing function), and non-breaking (Non-breaking change) Jun 27, 2024

mhaseeb123 and others added 2 commits June 26, 2024 21:22

Merge branch 'rapidsai:branch-24.08' into paralllel-init-str-descriptors

7cc5cf8

minor improvements

b8a1ae3

mhaseeb123 self-assigned this Jun 27, 2024

mhaseeb123 marked this pull request as ready for review June 27, 2024 18:08

mhaseeb123 requested a review from a team as a code owner June 27, 2024 18:08

mhaseeb123 requested review from karthikeyann, shrshi, vuule and PointKernel June 27, 2024 18:08

mhaseeb123 and others added 2 commits June 27, 2024 11:12

Merge branch 'branch-24.08' into paralllel-init-str-descriptors

0024713

Add const where possible

65e74c1

mhaseeb123 added the label 3 - Ready for Review (Ready for review by team) and removed the label 2 - In Progress (Currently a work in progress) Jun 27, 2024

mhaseeb123 commented Jun 27, 2024

cpp/src/io/parquet/page_decode.cuh (outdated, resolved)

mhaseeb123 added 2 commits June 27, 2024 20:09

remove temp variable len and directly assign to str_len

423985a

Simplify dict_idx and k computation.

d4ee15f

vuule approved these changes Jul 9, 2024

shrshi reviewed Jul 9, 2024

cpp/src/io/parquet/page_decode.cuh (resolved)

cpp/src/io/parquet/page_string_decode.cu (outdated, resolved)

mhaseeb123 and others added 3 commits July 9, 2024 10:49

Apply suggestions from code review

c674b4a

Co-authored-by: Vukasin Milovanovic <[email protected]> Co-authored-by: Shruti Shivakumar <[email protected]>

Rename tile32 to tile_warp

d207dc5

Rename tile32 to tile_warp

64f36ba

PointKernel reviewed Jul 9, 2024

cpp/src/io/parquet/page_decode.cuh (outdated, resolved)

Pass cg via const reference instead of value

4d20aeb

Co-authored-by: Yunsong Wang <[email protected]>

Merge branch 'branch-24.08' into paralllel-init-str-descriptors

fb1ac34

shrshi approved these changes Jul 9, 2024

mhaseeb123 added the label 5 - Ready to Merge (Testing and reviews complete, ready to merge) and removed the label 3 - Ready for Review (Ready for review by team) Jul 9, 2024

rapids-bot bot merged commit 7cc01be into rapidsai:branch-24.08 Jul 9, 2024
80 checks passed

mhaseeb123 deleted the paralllel-init-str-descriptors branch July 9, 2024 21:05

mhaseeb123 mentioned this pull request Jul 9, 2024

[FEA] Port the logic at page_data.cu:282 to use thread_groups and avoid the magic 32 multiples. #16235

Open
