Address poor performance of Parquet string decoding (#15304)
See #15297. The Parquet string decoder can become a bottleneck when a page contains strings of widely varying sizes. This PR addresses that, at least as a stopgap. A more complete solution may be to rework the string decoder to operate block-wide, like the new micro-kernels added in #15159.

Authors:
  - Ed Seidl (https://github.com/etseidl)
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #15304
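
To illustrate why widely varying string sizes are a problem for a single page-wide decision, here is a minimal host-side C++ sketch (not the kernel code). It assumes the page's valid strings are processed in batches of `warp_size`, and it compares a page-wide average against a per-batch average. The helper names `page_level_choice`, `batch_level_choices`, and `make_mixed_page` are hypothetical and exist only for this illustration; the two threshold expressions are modeled on the ones in this commit's diff.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

constexpr int warp_size = 32;

// Page-wide decision: one answer for every batch in the page.
// Modeled on `str_bytes / num_valids >= warp_size` from the removed lines.
bool page_level_choice(std::vector<int> const& lens)
{
  if (lens.empty()) { return false; }
  long long const str_bytes = std::accumulate(lens.begin(), lens.end(), 0LL);
  return str_bytes / static_cast<long long>(lens.size()) >= warp_size;
}

// Per-batch decision: one answer per group of up to warp_size strings.
// Modeled on `warp_total / warp_size >= warp_size` from the added lines.
std::vector<bool> batch_level_choices(std::vector<int> const& lens)
{
  std::vector<bool> choices;
  for (std::size_t start = 0; start < lens.size(); start += warp_size) {
    std::size_t const end = std::min(lens.size(), start + warp_size);
    long long const warp_total =
      std::accumulate(lens.begin() + start, lens.begin() + end, 0LL);
    choices.push_back(warp_total / warp_size >= warp_size);
  }
  return choices;
}

// Hypothetical page: three batches of 8-byte strings, then one batch of
// 1000-byte strings.
std::vector<int> make_mixed_page()
{
  std::vector<int> lens(3 * warp_size, 8);
  lens.insert(lens.end(), warp_size, 1000);
  return lens;
}
```

For this page the page-wide average is 256 bytes, so a page-wide decision selects the character-parallel copy for every batch, while the per-batch decision selects it only for the last batch (batch averages of 8, 8, 8, and 1000 bytes).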
etseidl authored Mar 19, 2024
1 parent ea40596 commit 7cc02e5
Showing 1 changed file with 5 additions and 8 deletions.
13 changes: 5 additions & 8 deletions cpp/src/io/parquet/page_string_decode.cu
@@ -1045,12 +1045,6 @@ CUDF_KERNEL void __launch_bounds__(decode_block_size)
 //
 if (!has_repetition) { dst_pos -= s->first_row; }

-// need to do this before we branch on src_pos/dst_pos so we don't deadlock
-// choose a character parallel string copy when the average string is longer than a warp
-using cudf::detail::warp_size;
-auto const use_char_ll =
-  s->page.num_valids > 0 && (s->page.str_bytes / s->page.num_valids) >= warp_size;
-
 if (me < warp_size) {
   for (int i = 0; i < decode_block_size - out_thread0; i += warp_size) {
     dst_pos = sb->nz_idx[rolling_index<rolling_buf_size>(src_pos + i)];
@@ -1061,10 +1055,13 @@ CUDF_KERNEL void __launch_bounds__(decode_block_size)
         : cuda::std::pair<char const*, size_t>{nullptr, 0};

     __shared__ cub::WarpScan<size_type>::TempStorage temp_storage;
-    size_type offset;
-    cub::WarpScan<size_type>(temp_storage).ExclusiveSum(len, offset);
+    size_type offset, warp_total;
+    cub::WarpScan<size_type>(temp_storage).ExclusiveSum(len, offset, warp_total);
     offset += last_offset;

+    // choose a character parallel string copy when the average string is longer than a warp
+    auto const use_char_ll = warp_total / warp_size >= warp_size;
+
     if (use_char_ll) {
       __shared__ __align__(8) uint8_t const* pointers[warp_size];
       __shared__ __align__(4) size_type offsets[warp_size];
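
The added lines rely on the three-argument form of the warp-wide exclusive sum, which returns both each lane's offset and the warp's total. A sequential C++ sketch of what that computes for one batch follows. It is a host-side emulation under stated assumptions, not the CUB implementation: `batch_plan` and `plan_batch` are hypothetical names, each array slot stands in for one lane, and `size_type` is assumed to be a 32-bit signed integer.

```cpp
#include <array>
#include <cstdint>

constexpr int warp_size = 32;
using size_type = std::int32_t;  // assumed 32-bit signed, for this sketch only

// Hypothetical result type: what one batch of lanes learns from the scan.
struct batch_plan {
  std::array<size_type, warp_size> offsets;  // output offset for each lane's string
  size_type warp_total;                      // sum of all lengths in the batch
  bool use_char_ll;                          // true selects the character-parallel copy
};

// Sequential stand-in for ExclusiveSum(len, offset, warp_total) followed by
// `offset += last_offset` and the per-batch threshold from the diff.
batch_plan plan_batch(std::array<size_type, warp_size> const& len, size_type last_offset)
{
  batch_plan plan{};
  size_type running = 0;
  for (int lane = 0; lane < warp_size; ++lane) {
    plan.offsets[lane] = last_offset + running;  // exclusive: sum of earlier lanes only
    running += len[lane];
  }
  plan.warp_total  = running;
  plan.use_char_ll = plan.warp_total / warp_size >= warp_size;
  return plan;
}
```

Because the total comes out of a scan the warp already has to perform, the per-batch decision costs one extra integer division and comparison; lanes without a string contribute a length of 0 and so lower the batch average.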
