[FEA] Improve ORC reader performance for decimal types #13251
Labels: 0 - Backlog, cuIO, feature request, libcudf, Performance
Decode of decimal files is an order of magnitude slower than decode of integral types.
The cause is that a single thread finds the sizes of the next batch of elements, which the whole block then decodes. To improve kernel performance, multiple threads should find the varint boundaries:
Pass 1: every thread runs `is_boundary_byte` (highest bit == 0) to find if it's at the last byte of a varint element.
Pass 2: Scan A (the per-byte flags from pass 1) to produce B. This also yields the number of elements (3 in this case).
Pass 3: Threads at the first byte of an element decode the element that starts at their index and store it at col[B[t]]:
t=0 writes to [0]
t=4 writes to [1]
t=7 writes to [2]
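The three passes can be sketched on the host as follows. This is a minimal illustration, not libcudf code: each loop iteration stands in for one thread, the scan is assumed to be an exclusive prefix sum, and values are decoded as unsigned 64-bit base-128 varints (least-significant group first) without the zigzag or wide-decimal handling a real reader needs. The names `decode_varint_at` and `decode_varints` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Pass 1 predicate: a byte ends a varint when its highest (continuation) bit is 0.
inline bool is_boundary_byte(std::uint8_t b) { return (b & 0x80) == 0; }

// Hypothetical helper: decodes one unsigned varint that starts at `start`.
// Assumption: the value fits in 64 bits; sign handling is omitted.
inline std::uint64_t decode_varint_at(std::vector<std::uint8_t> const& bytes, std::size_t start)
{
  std::uint64_t value = 0;
  int shift           = 0;
  for (std::size_t i = start; i < bytes.size(); ++i) {
    assert(shift < 64);  // sketch only handles values up to 64 bits
    value |= static_cast<std::uint64_t>(bytes[i] & 0x7f) << shift;
    if (is_boundary_byte(bytes[i])) break;
    shift += 7;
  }
  return value;
}

// Hypothetical driver: each loop below corresponds to one parallel pass.
std::vector<std::uint64_t> decode_varints(std::vector<std::uint8_t> const& bytes)
{
  std::size_t const n = bytes.size();

  // Pass 1: A[t] = 1 if byte t is the last byte of an element.
  std::vector<std::uint32_t> A(n);
  for (std::size_t t = 0; t < n; ++t) { A[t] = is_boundary_byte(bytes[t]) ? 1 : 0; }

  // Pass 2: exclusive scan of A. B[t] counts the elements that end before byte t,
  // which is the index of the element containing byte t.
  std::vector<std::uint32_t> B(n);
  std::uint32_t running = 0;
  for (std::size_t t = 0; t < n; ++t) {
    B[t] = running;
    running += A[t];
  }
  std::size_t const num_elements = running;

  // Pass 3: a thread sits on an element start if it is byte 0 or follows a boundary byte.
  // The bounds check skips a trailing element that is cut off at the end of the buffer.
  std::vector<std::uint64_t> col(num_elements);
  for (std::size_t t = 0; t < n; ++t) {
    bool const starts_element = (t == 0) || (A[t - 1] == 1);
    if (starts_element && B[t] < num_elements) { col[B[t]] = decode_varint_at(bytes, t); }
  }
  return col;
}
```

Calling `decode_varints` on eight bytes whose elements start at offsets 0, 4, and 7 reproduces the write pattern listed above: only the threads at t=0, t=4, and t=7 write, to `col[0]`, `col[1]`, and `col[2]`.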
Alternatively, step 3 can store the offsets of each element so they can be decoded in parallel.
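A sketch of that offsets variant, under the same assumptions (host-side loops standing in for threads, unsigned 64-bit values, no zigzag handling): the boundary passes only record where each element starts, and a second step decodes with one thread per element instead of one thread per byte. The names `find_varint_offsets` and `decode_from_offsets` are hypothetical, not libcudf functions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: returns the start offset of every complete varint in `bytes`.
std::vector<std::size_t> find_varint_offsets(std::vector<std::uint8_t> const& bytes)
{
  // One complete element per byte whose highest bit is clear (the is_boundary_byte test).
  std::size_t num_elements = 0;
  for (std::uint8_t b : bytes) { num_elements += ((b & 0x80) == 0) ? 1 : 0; }

  std::vector<std::size_t> offsets(num_elements);
  std::size_t element_index = 0;  // running exclusive scan of the boundary flags
  for (std::size_t t = 0; t < bytes.size(); ++t) {
    bool const starts_element = (t == 0) || ((bytes[t - 1] & 0x80) == 0);
    // Skip a trailing element that is cut off at the end of the buffer.
    if (starts_element && element_index < num_elements) { offsets[element_index] = t; }
    if ((bytes[t] & 0x80) == 0) { ++element_index; }
  }
  return offsets;
}

// Hypothetical helper: decodes element i from offsets[i]; iterations are independent,
// so in a kernel each one could be handled by a separate thread.
std::vector<std::uint64_t> decode_from_offsets(std::vector<std::uint8_t> const& bytes,
                                               std::vector<std::size_t> const& offsets)
{
  std::vector<std::uint64_t> col(offsets.size());
  for (std::size_t i = 0; i < offsets.size(); ++i) {
    assert(offsets[i] < bytes.size());
    std::uint64_t value = 0;
    // Stop at the boundary byte, or once the shift would leave 64 bits.
    for (std::size_t j = offsets[i], shift = 0; j < bytes.size() && shift < 64; ++j, shift += 7) {
      value |= static_cast<std::uint64_t>(bytes[j] & 0x7f) << shift;
      if ((bytes[j] & 0x80) == 0) break;
    }
    col[i] = value;
  }
  return col;
}
```

Separating the two steps keeps consecutive threads working on consecutive elements during decode, at the cost of an extra buffer for the offsets.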