Improve performance of expression evaluation #9210
Conversation
…e wherever possible.
…rather than the constructor and forward it along.
This reverts commit 7f3c775.
Here are some benchmarks.

Conditional joins on a Tesla T4: [before/after benchmark tables not captured in this export]

Conditional joins on an RTX 8000: [before/after benchmark tables not captured in this export]
```diff
-__device__ possibly_null_value_t<Element, has_nulls> resolve_input(
-  detail::device_data_reference device_data_reference,
+CUDA_DEVICE_CALLABLE possibly_null_value_t<Element, has_nulls> resolve_input(
+  detail::device_data_reference const& input_reference,
```
Out of curiosity, did you try seeing if passing by const value had the same impact?
I did test that, and it seemed to work in some cases but not others. In particular, the nullable template had different performance characteristics from the non-nullable one, with respect to both this change and a number of others. The obvious culprit is the fact that the nullable template simply has more local state, so I need to reduce it even further to see measurable improvements in occupancy, but I also observed more subtle differences that I'd attribute to the compiler just not optimizing as well at some point. I have some ideas for redesigning the data references that might make these issues moot, but I'll need to experiment with those a bit, and I figured that this perf boost was a nice short-term win in the meantime.
Nice work, this is much cleaner and the performance boost is nice. I have a few minor suggestions.
Approving once comments are fixed.
Codecov Report
```
@@            Coverage Diff             @@
##        branch-21.10    #9210   +/- ##
===============================================
  Coverage           ?   10.84%
===============================================
  Files              ?      115
  Lines              ?    19173
  Branches           ?        0
===============================================
  Hits               ?     2080
  Misses             ?    17093
  Partials           ?        0
```

Continue to review the full report at Codecov.
rerun tests

rerun tests

rerun tests
@gpucibot merge
This PR does some minor reworking of the internals of expression evaluation to improve performance. The largest performance improvements come from passing device data references down the call stack by reference rather than by value. The nullable kernel template experiences significantly higher register pressure, and these changes do not seem to be as effective at increasing occupancy on the benchmarks with null data, but in general we see performance improvements across the board for non-null data and in some cases for nullable data, with improvements ranging up to 40%. This PR also does some minor cleanup: removing some unused functions, replacing `__device__` with `CUDA_DEVICE_CALLABLE` to ensure compatibility with host compilers, and fixing the templating of various functions to ensure proper usage of CRTP. These changes are intended to facilitate future redesigning of the `device_data_reference` internals to reduce the depth of these call stacks, simplify the code, and reduce register pressure.