
Optimize ORC reader performance for list data #13708

Merged: 6 commits merged into rapidsai:branch-23.08 on Jul 25, 2023

Conversation

@vyasr (Contributor) commented on Jul 17, 2023

Description

For list types, the ORC reader needs to generate offsets from the sizes of nested lists. This step was previously parallelized over columns. In practice, even with wide tables there are enough rows that parallelizing over rows always makes more sense, so this PR swaps the parallelization strategy.

I also removed what appears to be an unnecessary stream synchronization. That likely won't show up in any microbenchmark, but it is worthwhile in case it improves asynchronous execution overall.

There are still noticeable bottlenecks for deeply nested lists, but those are in the decode kernels so optimizing them is a separate task for future work.
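
To make the swap concrete, below is a minimal, self-contained sketch (not the actual libcudf code) of the sizes-to-offsets step using Thrust: each list column's offsets come from a device-wide scan over its rows, so the available parallelism grows with the row count rather than the column count. All function and variable names are illustrative only.

```cpp
// Illustrative sketch only: converts per-row list sizes into offsets with a
// device-wide scan per column, so every row contributes parallel work.
// (The previous approach effectively assigned one unit of work per column.)
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/scan.h>

#include <cstdint>
#include <iostream>
#include <vector>

// Offsets for one list column: offsets[i] = sum of sizes[0..i-1], with the
// final entry holding the total child-element count.
thrust::device_vector<int32_t> sizes_to_offsets(thrust::device_vector<int32_t> const& sizes)
{
  thrust::device_vector<int32_t> offsets(sizes.size() + 1);
  offsets[0] = 0;  // single host-to-device element assignment
  thrust::inclusive_scan(thrust::device, sizes.begin(), sizes.end(), offsets.begin() + 1);
  return offsets;
}

int main()
{
  // One list column whose four rows have sizes {2, 0, 3, 1};
  // the expected offsets are {0, 2, 2, 5, 6}.
  std::vector<int32_t> h_sizes{2, 0, 3, 1};
  thrust::device_vector<int32_t> d_sizes(h_sizes.begin(), h_sizes.end());
  auto const offsets = sizes_to_offsets(d_sizes);
  std::cout << "total child elements: " << offsets.back() << "\n";  // prints 6
  return 0;
}
```

The real kernels additionally have to deal with nesting depth and null masks; the sketch only shows why a per-row scan exposes far more parallelism than assigning one unit of work per column.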

Resolves #13674

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@vyasr added the '2 - In Progress', 'libcudf', 'cuIO', 'improvement', and 'non-breaking' labels on Jul 17, 2023
@vyasr self-assigned this on Jul 17, 2023
@vuule (Contributor) left a comment:

Great stuff!
Requesting a change for a suspected simplification.

Two review threads on cpp/src/io/orc/reader_impl.cu (outdated, resolved).
@vyasr (Contributor, Author) commented on Jul 19, 2023

Moving benchmarks to a separate thread to avoid polluting the Git log when the PR merges:

Test script from #13674

# Old implementation
parquet (numrows 1000) 0.0062103271484375 seconds                                                                                                                                                          
orc (numrows 1000) 0.0033025741577148438 seconds                                                                                                                                                           
parquet (numrows 50000000) 0.016933441162109375 seconds                                                                                                                                                    
orc (numrows 50000000) 1.1314363479614258 seconds                      

# New implementation
parquet (numrows 1000) 0.0062923431396484375 seconds                                                                                                                                                       
orc (numrows 1000) 0.003666400909423828 seconds                                                                                                                                                            
parquet (numrows 50000000) 0.016930341720581055 seconds                                                                                                                                                    
orc (numrows 50000000) 0.045706748962402344 seconds    

The changes clearly close the gap for the above case.
libcudf ORC reader benchmarks (columns: data_type, io, cardinality, run_length, samples, CPU time, noise, GPU time, noise, bytes/s, peak memory usage, encoded file size)

# Old implementation
  |            LIST | DEVICE_BUFFER |           0 |          1 |     10x |  52.919 ms | 0.06% |  52.913 ms | 0.06% |      10146288118 |         1.005 GiB |       488.481 MiB |                            
  |            LIST | DEVICE_BUFFER |        1000 |          1 |      6x |  87.514 ms | 0.45% |  87.508 ms | 0.45% |       6135114527 |         1.014 GiB |       363.095 MiB |                            
  |            LIST | DEVICE_BUFFER |           0 |         32 |      8x |  67.902 ms | 0.12% |  67.897 ms | 0.12% |       7907179514 |       574.083 MiB |        34.179 MiB |                            
  |            LIST | DEVICE_BUFFER |        1000 |         32 |      8x |  70.422 ms | 0.09% |  70.416 ms | 0.09% |       7624239955 |       572.610 MiB |        23.229 MiB |    


# New implementation
  |            LIST | DEVICE_BUFFER |           0 |          1 |     12x |  45.033 ms | 0.09% |  45.028 ms | 0.09% |      11923067760 |         1.005 GiB |       488.481 MiB |                            
  |            LIST | DEVICE_BUFFER |        1000 |          1 |      7x |  79.599 ms | 0.49% |  79.593 ms | 0.49% |       6745162704 |         1.014 GiB |       363.095 MiB |                            
  |            LIST | DEVICE_BUFFER |           0 |         32 |      9x |  60.006 ms | 0.12% |  60.000 ms | 0.12% |       8947826725 |       574.083 MiB |        34.179 MiB |                            
  |            LIST | DEVICE_BUFFER |        1000 |         32 |      8x |  62.578 ms | 0.18% |  62.573 ms | 0.18% |       8579974589 |       572.610 MiB |        23.229 MiB |   

The ORC reader benchmarks above show a marked improvement for the benchmarked list data. All other benchmarks in that file remain unchanged.

To also test the behavior of the new code on wide tables, I reran the test script from #13674, except that instead of tall tables I used wide tables with varying numbers of columns.

Wide table

# Old code
orc (numrows 1, numcols 10) 0.00503087043762207 seconds
orc (numrows 1, numcols 100) 0.012778520584106445 seconds
orc (numrows 1, numcols 1000) 0.1071467399597168 seconds
orc (numrows 1, numcols 2000) 0.21944570541381836 seconds

orc (numrows 1000, numcols 10) 0.015061140060424805 seconds
orc (numrows 1000, numcols 100) 0.012740135192871094 seconds
orc (numrows 1000, numcols 1000) 0.10806846618652344 seconds
orc (numrows 1000, numcols 2000) 0.2716226577758789 seconds

orc (numrows 10000, numcols 10) 0.016198396682739258 seconds
orc (numrows 10000, numcols 100) 0.013859033584594727 seconds
orc (numrows 10000, numcols 1000) 0.11646103858947754 seconds
orc (numrows 10000, numcols 2000) 0.23884344100952148 seconds

orc (numrows 100000, numcols 10) 0.02451324462890625 seconds
orc (numrows 100000, numcols 100) 0.02280116081237793 seconds
orc (numrows 100000, numcols 1000) 0.19908523559570312 seconds
orc (numrows 100000, numcols 2000) 0.43782830238342285 seconds

# New code
orc (numrows 1, numcols 10) 0.004733562469482422 seconds
orc (numrows 1, numcols 100) 0.014241456985473633 seconds
orc (numrows 1, numcols 1000) 0.12213611602783203 seconds
orc (numrows 1, numcols 2000) 0.25175976753234863 seconds

orc (numrows 1000, numcols 10) 0.014937639236450195 seconds
orc (numrows 1000, numcols 100) 0.01426076889038086 seconds
orc (numrows 1000, numcols 1000) 0.12375545501708984 seconds
orc (numrows 1000, numcols 2000) 0.30136775970458984 seconds

orc (numrows 10000, numcols 10) 0.016460657119750977 seconds
orc (numrows 10000, numcols 100) 0.015311002731323242 seconds
orc (numrows 10000, numcols 1000) 0.13204264640808105 seconds
orc (numrows 10000, numcols 2000) 0.2677123546600342 seconds

orc (numrows 100000, numcols 10) 0.0243988037109375 seconds
orc (numrows 100000, numcols 100) 0.02396106719970703 seconds
orc (numrows 100000, numcols 1000) 0.214888334274292 seconds
orc (numrows 100000, numcols 2000) 0.4545779228210449 seconds

In this case the new code is a bit slower, as expected. Given that the slowdown is at worst around a factor of two, whereas the improvements in the other cases are a couple of orders of magnitude, the tradeoff seems more than worth it. In the future we could explore dispatching to different algorithms based on table width, but we'd want to come up with reliable heuristics first.
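
As a purely hypothetical illustration of what such a dispatch might look like (nothing like this exists in the PR, and the threshold is invented), the choice could key off the ratio of rows to list columns:

```cpp
// Hypothetical heuristic sketch; the strategy names and the cutoff are made up.
#include <cstddef>
#include <iostream>

enum class offsets_strategy { over_rows, over_columns };

offsets_strategy choose_offsets_strategy(std::size_t num_rows, std::size_t num_list_columns)
{
  // Very wide but very short tables might still favor per-column work;
  // everything else prefers per-row parallelism, as the benchmarks above suggest.
  constexpr std::size_t rows_per_column_cutoff = 32;  // arbitrary placeholder
  return num_rows < rows_per_column_cutoff * num_list_columns ? offsets_strategy::over_columns
                                                              : offsets_strategy::over_rows;
}

int main()
{
  auto const strategy = choose_offsets_strategy(1, 2000);  // 1 row, 2000 list columns
  std::cout << (strategy == offsets_strategy::over_columns ? "over_columns" : "over_rows") << "\n";
  return 0;
}
```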

@vyasr marked this pull request as ready for review on July 19, 2023
@vyasr requested a review from a team as a code owner on July 19, 2023
@vyasr requested review from karthikeyann, ttnghia, and vuule on July 19, 2023
@vuule (Contributor) left a comment:

Looks good. Just the few posted comments.

Two review threads on cpp/src/io/orc/reader_impl.cu (outdated, resolved).
@vyasr requested review from karthikeyann and vuule on July 24, 2023
@vyasr added the '3 - Ready for Review' label and removed the '2 - In Progress' label on Jul 24, 2023
@vyasr (Contributor, Author) commented on Jul 25, 2023

In the interest of wrapping things up in time for code freeze, I'm going to merge this since there are two approvals and @karthikeyann's requests have been addressed. I'm happy to do a follow-up if you have more requests afterwards, @karthikeyann.

@vyasr (Contributor, Author) commented on Jul 25, 2023

/merge

@vyasr changed the title to "Optimize ORC reader performance for list data" on Jul 25, 2023
@vyasr dismissed karthikeyann's stale review on July 25, 2023: "Requests were addressed."

@rapids-bot (bot) merged commit 67e81ae into rapidsai:branch-23.08 on Jul 25, 2023
@vyasr deleted the fix/issue_13674 branch on July 25, 2023
Labels: 3 - Ready for Review, cuIO, improvement, libcudf, non-breaking
Projects: none yet
Development: successfully merging this pull request may close #13674 ("[BUG] Improve ORC reader performance with list columns and high row counts")
Participants: 4