[FEA] Improve ORC reader performance for decimal types #13251

Open
vuule opened this issue Apr 29, 2023 · 2 comments
Assignees
Labels
0 - Backlog In queue waiting for assignment cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Performance Performance related issue

Comments

@vuule
Contributor

vuule commented Apr 29, 2023

Decoding decimal columns is an order of magnitude slower than decoding integral types.
The reason is that a single thread finds the sizes of the next batch of elements, which are then decoded using the whole block. To improve performance, the kernel needs to use multiple threads to find varint boundaries:

Pass 1: Every thread runs is_boundary_byte (highest bit == 0) on its byte to check whether it is the last byte of a varint element. The results form the array A.

A = 0 0 0 0 1 0 0 1 0 0 0 1

Pass 2: An inclusive scan of A produces B. The last value of B is also the number of elements (3 in this case).

B = 0 0 0 0 1 1 1 2 2 2 2 3
    ^         ^     ^
    t0        t5    t8

Pass 3: Threads at the first byte of an element (t = 0, or A[t-1] == 1) decode the element that starts at their index and store it at col[B[t-1]] (col[0] for t = 0).
t=0 writes to [0]
t=5 writes to [1]
t=8 writes to [2]
Alternatively, pass 3 can store the offset of each element so that the elements can be decoded in parallel afterwards.
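
The three passes can be sketched on the host as a sequential C++ simulation, where each loop iteration stands in for one GPU thread. This is only an illustration under stated assumptions, not the libcudf kernel: `decode_varint` and `decode_three_pass` are hypothetical names, varints are assumed to be base-128 with the low 7-bit group first, values are limited to 64 bits, and any sign (zigzag) handling is left out. The sketch treats thread t as the start of an element when t == 0 or byte t-1 is a boundary, and uses the scan value just before t as the output index.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Named after the issue text: the last byte of a varint has its highest bit cleared.
inline bool is_boundary_byte(std::uint8_t b) { return (b & 0x80u) == 0; }

// Hypothetical helper: decodes one varint starting at `start`.
// Assumes low 7-bit group first; bits beyond 64 are dropped in this sketch.
inline std::uint64_t decode_varint(const std::vector<std::uint8_t>& data, std::size_t start)
{
  std::uint64_t value = 0;
  int shift = 0;
  for (std::size_t i = start; i < data.size(); ++i) {
    if (shift < 64) { value |= static_cast<std::uint64_t>(data[i] & 0x7Fu) << shift; }
    if (is_boundary_byte(data[i])) { break; }
    shift += 7;
  }
  return value;
}

// Hypothetical driver: each loop iteration plays the role of one thread.
inline std::vector<std::uint64_t> decode_three_pass(const std::vector<std::uint8_t>& data)
{
  const std::size_t n = data.size();
  // The sketch assumes the buffer does not end in the middle of an element.
  assert(n == 0 || is_boundary_byte(data[n - 1]));

  // Pass 1: A[t] == 1 if byte t is the last byte of an element.
  std::vector<std::uint32_t> A(n);
  for (std::size_t t = 0; t < n; ++t) { A[t] = is_boundary_byte(data[t]) ? 1u : 0u; }

  // Pass 2: inclusive scan; the last value is the element count.
  std::vector<std::uint32_t> B(n);
  std::partial_sum(A.begin(), A.end(), B.begin());
  const std::size_t num_elements = (n == 0) ? 0 : B[n - 1];

  // Pass 3: threads at the first byte of an element decode it.
  std::vector<std::uint64_t> col(num_elements);
  for (std::size_t t = 0; t < n; ++t) {
    const bool is_start = (t == 0) || (A[t - 1] == 1u);
    if (is_start) { col[t == 0 ? 0 : B[t - 1]] = decode_varint(data, t); }
  }
  return col;
}
```

In a real kernel, pass 2 would be a block-wide scan rather than a sequential sum, and pass 3 could write `t` into an offsets array instead of decoding, which corresponds to the alternative mentioned above.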

@vuule vuule added cuIO cuIO issue Performance Performance related issue labels Apr 29, 2023
@vuule
Contributor Author

vuule commented Apr 29, 2023

Additional optimization:

  • Divide the threads into 2 chunks based on the average length of a varint.
  • For example, at 2 bytes per varint the ratio is 2:1.
  • A block of, say, 768 threads is then divided into chunks of 512 and 256.
  • Overlap generating the next set of offsets (512 threads) with decoding the previous set (256 threads).
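
The split arithmetic can be sketched as a small C++ helper. The names `BlockSplit` and `split_block` are hypothetical, and so is the rule itself: it assumes that offset finding does work per byte and decoding does work per element, so the offset chunk gets a share of avg / (avg + 1) of the block, rounded to whole warps of 32 threads. The issue only gives the 2:1 example, so this is one possible reading rather than a specified formula.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical result type: thread counts for the two chunks of a block.
struct BlockSplit {
  int offset_threads;  // threads that find varint boundaries/offsets
  int decode_threads;  // threads that decode the previous batch
};

// Hypothetical helper: splits a block in the ratio avg_varint_bytes : 1,
// rounded to whole warps, keeping at least one warp in each chunk.
inline BlockSplit split_block(int block_size, double avg_varint_bytes, int warp_size = 32)
{
  assert(block_size % warp_size == 0);
  assert(block_size >= 2 * warp_size);
  assert(avg_varint_bytes >= 1.0);

  const int num_warps = block_size / warp_size;
  // Round the proportional share to the nearest whole warp.
  int offset_warps =
    static_cast<int>(num_warps * avg_varint_bytes / (avg_varint_bytes + 1.0) + 0.5);
  offset_warps = std::max(1, std::min(offset_warps, num_warps - 1));

  return BlockSplit{offset_warps * warp_size, (num_warps - offset_warps) * warp_size};
}
```

For the example in the list, `split_block(768, 2.0)` yields 512 offset threads and 256 decode threads.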

@GregoryKimball
Contributor

Also see #12677 for profiling examples

@GregoryKimball GregoryKimball moved this to Needs owner in libcudf Jul 5, 2023
@GregoryKimball GregoryKimball added libcudf Affects libcudf (C++/CUDA) code. 0 - Backlog In queue waiting for assignment labels Jul 5, 2023
@GregoryKimball GregoryKimball added the feature request New feature or request label Jul 10, 2023
@GregoryKimball GregoryKimball changed the title Improve ORC reader performance for decimal types [FEA] Improve ORC reader performance for decimal types Jul 10, 2023
@vuule vuule assigned vyasr and vuule Aug 21, 2023
@GregoryKimball GregoryKimball removed the status in libcudf Sep 25, 2023
@GregoryKimball GregoryKimball moved this to In progress in libcudf Oct 26, 2023
Projects
Status: In progress

3 participants