Optimize ORC reader performance for list data #13708
Conversation
Great stuff!
Requesting a change for a suspected simplification.
Moving benchmarks to a separate thread to avoid polluting the Git log when the PR merges: Test script from #13674
libcudf ORC reader benchmarks
To also test the behavior of the new code on wide tables, I reran the test script from #13674, using wide tables containing varying numbers of columns instead of tall tables.
In this case the new code is a bit slower, as expected. Given that the slowdown is at worst a factor of two, whereas the improvements in the other cases are a couple of orders of magnitude, the tradeoff seems more than worth it. In the future we could explore dispatching to different algorithms based on table width, but we'd want to come up with reliable heuristics first.
Looks good. Just the few posted comments.
In the interest of wrapping things up in time for code freeze, I'm going to merge this since there are two approvals and @karthikeyann's requests have been addressed. Happy to do a follow-up if you have more requests afterwards, @karthikeyann.
/merge
Description
For list types, the ORC reader needs to generate offsets from the sizes of nested lists. This process was previously parallelized over columns. In practice, even with wide tables we have enough rows that parallelizing over rows always makes more sense, so this PR swaps the parallelization strategy.
I also removed what appears to be an unnecessary stream synchronization. That likely won't affect performance in microbenchmarks, but it is worthwhile in case it helps improve asynchronous execution overall.
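For context, the offsets-from-sizes step is an exclusive prefix sum over each list column's row sizes. The sketch below illustrates the transformation in NumPy; on the GPU, libcudf performs the equivalent scan with device-wide parallelism over rows. The function name and layout here are illustrative only, not libcudf's actual API.

```python
import numpy as np

def offsets_from_sizes(sizes: np.ndarray) -> np.ndarray:
    """Convert per-row list sizes into an offsets array.

    An offsets array for a list column has length (num_rows + 1):
    offsets[i] is where row i's elements start, and offsets[i + 1]
    is where they end. This is an exclusive prefix sum of the sizes.
    """
    offsets = np.empty(len(sizes) + 1, dtype=np.int64)
    offsets[0] = 0
    # Cumulative sum of sizes fills offsets[1:], giving the end
    # position of each row's list.
    np.cumsum(sizes, out=offsets[1:])
    return offsets

# A list column with four rows of sizes 3, 0, 2, 5:
sizes = np.array([3, 0, 2, 5], dtype=np.int64)
print(offsets_from_sizes(sizes).tolist())  # [0, 3, 3, 5, 10]
```

Because each column's scan is independent and the work per row is uniform, flattening the work so threads are assigned across all rows (rather than one scan per column) keeps the GPU saturated even when individual columns are short, which is the intuition behind the parallelization swap described above.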
There are still noticeable bottlenecks for deeply nested lists, but those are in the decode kernels so optimizing them is a separate task for future work.
Resolves #13674
Checklist