
Optimize ORC reader performance for list data #13708

Merged: 6 commits merged into rapidsai:branch-23.08 on Jul 25, 2023

Conversation

@vyasr (Contributor) commented on Jul 17, 2023

Description

For list types, the ORC reader needs to generate offsets from the sizes of nested lists. This step was previously parallelized over columns. In practice, even with wide tables there are enough rows that parallelizing over rows always makes more sense, so this PR swaps the parallelization strategy.

I also removed what appears to be an unnecessary stream synchronization. That likely won't show up in any microbenchmark, but it is worthwhile in case it improves asynchronous execution overall.

There are still noticeable bottlenecks for deeply nested lists, but those are in the decode kernels so optimizing them is a separate task for future work.
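
To make the swap concrete, below is a minimal, self-contained sketch (not the actual libcudf code) of the sizes-to-offsets step using Thrust: each list column's offsets come from a device-wide scan over its rows, so the available parallelism grows with the row count rather than the column count. All function and variable names are illustrative only.

```cpp
// Illustrative sketch only: converts per-row list sizes into offsets with a
// device-wide scan per column, so every row contributes parallel work.
// (The previous approach effectively assigned one unit of work per column.)
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/scan.h>

#include <cstdint>
#include <iostream>
#include <vector>

// Offsets for one list column: offsets[i] = sum of sizes[0..i-1], with the
// final entry holding the total child-element count.
thrust::device_vector<int32_t> sizes_to_offsets(thrust::device_vector<int32_t> const& sizes)
{
  thrust::device_vector<int32_t> offsets(sizes.size() + 1);
  offsets[0] = 0;  // single host-to-device element assignment
  thrust::inclusive_scan(thrust::device, sizes.begin(), sizes.end(), offsets.begin() + 1);
  return offsets;
}

int main()
{
  // One list column whose four rows have sizes {2, 0, 3, 1};
  // the expected offsets are {0, 2, 2, 5, 6}.
  std::vector<int32_t> h_sizes{2, 0, 3, 1};
  thrust::device_vector<int32_t> d_sizes(h_sizes.begin(), h_sizes.end());
  auto const offsets = sizes_to_offsets(d_sizes);
  std::cout << "total child elements: " << offsets.back() << "\n";  // prints 6
  return 0;
}
```

The real kernels additionally have to deal with nesting depth and null masks; the sketch only shows why a per-row scan exposes far more parallelism than assigning one unit of work per column.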

Resolves #13674

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@vyasr added the '2 - In Progress', 'libcudf', 'cuIO', 'improvement', and 'non-breaking' labels on Jul 17, 2023
@vyasr self-assigned this on Jul 17, 2023
@vuule (Contributor) left a comment:

Great stuff!
Requesting a change for a suspected simplification.

Two review threads on cpp/src/io/orc/reader_impl.cu (outdated, resolved).
@vyasr (Contributor, Author) commented on Jul 19, 2023

Moving benchmarks to a separate thread to avoid polluting the Git log when the PR merges:

Test script from #13674

# Old implementation
parquet (numrows 1000) 0.0062103271484375 seconds                                                                                                                                                          
orc (numrows 1000) 0.0033025741577148438 seconds                                                                                                                                                           
parquet (numrows 50000000) 0.016933441162109375 seconds                                                                                                                                                    
orc (numrows 50000000) 1.1314363479614258 seconds                      

# New implementation
parquet (numrows 1000) 0.0062923431396484375 seconds                                                                                                                                                       
orc (numrows 1000) 0.003666400909423828 seconds                                                                                                                                                            
parquet (numrows 50000000) 0.016930341720581055 seconds                                                                                                                                                    
orc (numrows 50000000) 0.045706748962402344 seconds    

The changes clearly close the gap for the above case.
libcudf ORC reader benchmarks (columns: data_type, io, cardinality, run_length, samples, CPU time, noise, GPU time, noise, bytes/s, peak memory usage, encoded file size)

# Old implementation
  |            LIST | DEVICE_BUFFER |           0 |          1 |     10x |  52.919 ms | 0.06% |  52.913 ms | 0.06% |      10146288118 |         1.005 GiB |       488.481 MiB |                            
  |            LIST | DEVICE_BUFFER |        1000 |          1 |      6x |  87.514 ms | 0.45% |  87.508 ms | 0.45% |       6135114527 |         1.014 GiB |       363.095 MiB |                            
  |            LIST | DEVICE_BUFFER |           0 |         32 |      8x |  67.902 ms | 0.12% |  67.897 ms | 0.12% |       7907179514 |       574.083 MiB |        34.179 MiB |                            
  |            LIST | DEVICE_BUFFER |        1000 |         32 |      8x |  70.422 ms | 0.09% |  70.416 ms | 0.09% |       7624239955 |       572.610 MiB |        23.229 MiB |    


# New implementation
  |            LIST | DEVICE_BUFFER |           0 |          1 |     12x |  45.033 ms | 0.09% |  45.028 ms | 0.09% |      11923067760 |         1.005 GiB |       488.481 MiB |                            
  |            LIST | DEVICE_BUFFER |        1000 |          1 |      7x |  79.599 ms | 0.49% |  79.593 ms | 0.49% |       6745162704 |         1.014 GiB |       363.095 MiB |                            
  |            LIST | DEVICE_BUFFER |           0 |         32 |      9x |  60.006 ms | 0.12% |  60.000 ms | 0.12% |       8947826725 |       574.083 MiB |        34.179 MiB |                            
  |            LIST | DEVICE_BUFFER |        1000 |         32 |      8x |  62.578 ms | 0.18% |  62.573 ms | 0.18% |       8579974589 |       572.610 MiB |        23.229 MiB |   

The ORC reader benchmarks above show a marked improvement for the benchmarked list data. All other benchmarks in that file remain unchanged.

To also test the behavior of the new code on wide tables, I reran the test script from #13674, except that instead of tall tables I used wide tables with varying numbers of columns.

Wide table

# Old code
orc (numrows 1, numcols 10) 0.00503087043762207 seconds
orc (numrows 1, numcols 100) 0.012778520584106445 seconds
orc (numrows 1, numcols 1000) 0.1071467399597168 seconds
orc (numrows 1, numcols 2000) 0.21944570541381836 seconds

orc (numrows 1000, numcols 10) 0.015061140060424805 seconds
orc (numrows 1000, numcols 100) 0.012740135192871094 seconds
orc (numrows 1000, numcols 1000) 0.10806846618652344 seconds
orc (numrows 1000, numcols 2000) 0.2716226577758789 seconds

orc (numrows 10000, numcols 10) 0.016198396682739258 seconds
orc (numrows 10000, numcols 100) 0.013859033584594727 seconds
orc (numrows 10000, numcols 1000) 0.11646103858947754 seconds
orc (numrows 10000, numcols 2000) 0.23884344100952148 seconds

orc (numrows 100000, numcols 10) 0.02451324462890625 seconds
orc (numrows 100000, numcols 100) 0.02280116081237793 seconds
orc (numrows 100000, numcols 1000) 0.19908523559570312 seconds
orc (numrows 100000, numcols 2000) 0.43782830238342285 seconds

# New code
orc (numrows 1, numcols 10) 0.004733562469482422 seconds
orc (numrows 1, numcols 100) 0.014241456985473633 seconds
orc (numrows 1, numcols 1000) 0.12213611602783203 seconds
orc (numrows 1, numcols 2000) 0.25175976753234863 seconds

orc (numrows 1000, numcols 10) 0.014937639236450195 seconds
orc (numrows 1000, numcols 100) 0.01426076889038086 seconds
orc (numrows 1000, numcols 1000) 0.12375545501708984 seconds
orc (numrows 1000, numcols 2000) 0.30136775970458984 seconds

orc (numrows 10000, numcols 10) 0.016460657119750977 seconds
orc (numrows 10000, numcols 100) 0.015311002731323242 seconds
orc (numrows 10000, numcols 1000) 0.13204264640808105 seconds
orc (numrows 10000, numcols 2000) 0.2677123546600342 seconds

orc (numrows 100000, numcols 10) 0.0243988037109375 seconds
orc (numrows 100000, numcols 100) 0.02396106719970703 seconds
orc (numrows 100000, numcols 1000) 0.214888334274292 seconds
orc (numrows 100000, numcols 2000) 0.4545779228210449 seconds

In this case the new code is a bit slower, as expected. Given that the slowdown is at worst around a factor of two, whereas the improvements in the other cases are a couple of orders of magnitude, the tradeoff seems more than worth it. In the future we could explore dispatching to different algorithms based on table width, but we'd want to come up with reliable heuristics first.
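
As a purely hypothetical illustration of what such a dispatch might look like (nothing like this exists in the PR, and the threshold is invented), the choice could key off the ratio of rows to list columns:

```cpp
// Hypothetical heuristic sketch; the strategy names and the cutoff are made up.
#include <cstddef>
#include <iostream>

enum class offsets_strategy { over_rows, over_columns };

offsets_strategy choose_offsets_strategy(std::size_t num_rows, std::size_t num_list_columns)
{
  // Very wide but very short tables might still favor per-column work;
  // everything else prefers per-row parallelism, as the benchmarks above suggest.
  constexpr std::size_t rows_per_column_cutoff = 32;  // arbitrary placeholder
  return num_rows < rows_per_column_cutoff * num_list_columns ? offsets_strategy::over_columns
                                                              : offsets_strategy::over_rows;
}

int main()
{
  auto const strategy = choose_offsets_strategy(1, 2000);  // 1 row, 2000 list columns
  std::cout << (strategy == offsets_strategy::over_columns ? "over_columns" : "over_rows") << "\n";
  return 0;
}
```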

@vyasr marked this pull request as ready for review on July 19, 2023
@vyasr requested a review from a team as a code owner on July 19, 2023
@vyasr requested review from karthikeyann, ttnghia, and vuule on July 19, 2023
@vuule (Contributor) left a comment:

Looks good. Just the few posted comments.

Two review threads on cpp/src/io/orc/reader_impl.cu (outdated, resolved).
@vyasr requested review from karthikeyann and vuule on July 24, 2023
@vyasr added the '3 - Ready for Review' label and removed the '2 - In Progress' label on Jul 24, 2023
@vyasr (Contributor, Author) commented on Jul 25, 2023

In the interest of wrapping things up in time for code freeze, I'm going to merge this since there are two approvals and @karthikeyann's requests have been addressed. I'm happy to do a follow-up if you have more requests afterwards, @karthikeyann.

@vyasr (Contributor, Author) commented on Jul 25, 2023

/merge

@vyasr changed the title to "Optimize ORC reader performance for list data" on Jul 25, 2023
@vyasr dismissed karthikeyann's stale review on July 25, 2023: "Requests were addressed."

@rapids-bot (bot) merged commit 67e81ae into rapidsai:branch-23.08 on Jul 25, 2023
@vyasr deleted the fix/issue_13674 branch on July 25, 2023
Labels: 3 - Ready for Review, cuIO, improvement, libcudf, non-breaking
Projects: none yet
Development: successfully merging this pull request may close #13674 ("[BUG] Improve ORC reader performance with list columns and high row counts")
Participants: 4