[FEA] Optimization of repetition and definition level decoding in the parquet reader kernel. #12633
Labels
cuIO
cuIO issue
feature request
New feature or request
improvement
Improvement / enhancement to an existing function
libcudf
Affects libcudf (C++/CUDA) code.
Performance
Performance related issue
The parquet reader kernels (`gpuDecodePages` and `gpuComputePageSizes`) are heavily bottlenecked by the repetition/definition level decoding steps. These data streams produce the information needed to decode output value indices, nulls and nesting depth information. We currently use a single warp to decode batches of the above data in a double-buffered way, handing these buffers off to 3 further warps which then decode the actual output data.
The idea is that the level decoding warp can provide batches of information faster than the value decoding warps can consume them. In practice, this doesn't happen: the value decoding warps sit idle while the level decoding warp slogs through the data.
A proposed optimization here is:
Change the fundamental way the kernel is structured to a more direct producer-consumer model. The primary bottleneck is the single warp that walks through the level data, which is laid out as:
[4 byte header + level data chunk] [4 byte header + level data chunk] ....
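A minimal sketch of enumerating the chunks in this layout, assuming (as a simplification) that each 4-byte header holds the little-endian byte length of the chunk that follows. The `ChunkRef`/`enumerate_chunks` names are invented for illustration; this is the cheap, serial walk that a single warp could do to build the work queue.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// One entry per chunk: where its payload starts and how big it is.
// In the proposed scheme, each entry becomes a work item for one warp.
struct ChunkRef {
  size_t offset;   // byte offset of the chunk payload
  uint32_t size;   // payload size taken from the 4-byte header (assumed)
};

std::vector<ChunkRef> enumerate_chunks(const uint8_t* data, size_t len) {
  std::vector<ChunkRef> chunks;
  size_t pos = 0;
  while (pos + 4 <= len) {
    uint32_t chunk_size;
    std::memcpy(&chunk_size, data + pos, 4);  // read the 4-byte header
    pos += 4;
    chunks.push_back({pos, chunk_size});      // record the payload span
    pos += chunk_size;                        // skip ahead to next header
  }
  return chunks;
}
```

Note that this enumeration only touches the headers, not the chunk payloads, which is why it should be much cheaper than decoding the level data itself.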
A better structure would be to have a single warp produce a list of the chunks and put them in a work queue, with each individual chunk then getting its own warp to do the decoding. So instead of 1 warp decoding, say, 64 chunks, we could have 64 warps each decoding one chunk. There would have to be some additional coordination to determine sizes and output positions, but that would mostly take the form of a scan/prefix-sum across the warps, which is fast. In addition, if we do this, it seems possible that we could eliminate the double-buffered intermediate storage of output indices: each warp decoding the level stream could simply copy/decode the output values/validity itself, directly.
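The coordination step above can be sketched as follows: once each chunk's decoded value count is known, an exclusive prefix-sum gives every chunk (i.e. every decoding warp) its starting position in the output. In the kernel this would be a block-wide scan (e.g. via CUB's `BlockScan`); here it is shown on the host with `std::exclusive_scan` purely to illustrate the arithmetic.

```cpp
#include <numeric>
#include <vector>

// Given the number of values each chunk decodes to, compute the
// output offset at which each chunk's warp should start writing.
// Chunk i writes to [pos[i], pos[i] + counts[i]).
std::vector<int> output_positions(const std::vector<int>& counts) {
  std::vector<int> pos(counts.size());
  std::exclusive_scan(counts.begin(), counts.end(), pos.begin(), 0);
  return pos;
}
```

Because the scan is over per-chunk counts (at most a few dozen elements per page), its cost is negligible next to the decoding work it coordinates.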