-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[PERF] AST kernel has nonzero stack frame #5902
Comments
This issue has been marked rotten due to no recent activity in the past 90d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. |
@harrism I think this issue can be closed. It provides some helpful information about the development of the AST feature but no further action is needed. Please re-open if needed. 👍 |
Given we're picking back up on the AST work, I'd actually like to keep this open so we don't forget about it. |
@jrhemstad I don't think there's anything actionable here any longer, is there? We haven't seen any performance regressions in the various merged iterations of the AST code since, so I don't think we have incurred any unforeseen additional local memory accesses. There are definitely things we can try to do reduce register pressure from the complex internals (mostly relating to trying to redesign the data reference structures), but I don't think those are directly related to this issue any longer and we know to keep an eye out for this problem as a potential cause if we observe future performance regressions. |
If you compare to the performance @bdice originally saw before there was any stack frame, I think there is still a pretty sizable regression. I think there is still room for analysis and optimization, but it's not pressing. |
Interesting. As of #8214 I found that the AST-based equality join was just under 2x slower than the raw nested loop join using a simple row equality comparator. I wouldn't have anticipated being able to reduce the overhead any further, but that's encouraging if you think there's further remove to improve that. |
I'm going to close this. Over time we may be able to leverage some new compiler options or similar to improve this situation marginally, but in the long term I think the only way we'll really be able to overcome the performance issues with evaluation are to switch away from the AST approach and go with something like #15366 |
During development of #5494, I have been expanding the supported AST interpreter's feature set. With a smaller set of supported operators and data types, the kernel ran quickly and had high memory throughput. As the set of features expanded, a nonzero stack frame was introduced and performance dropped significantly. With a nonzero stack frame, each thread must access "local memory" (actually in the global memory bank) during execution, which reduces performance (about 4-10x slower, from ~450 GB/s to ~50-100 GB/s). This issue documents my results from investigating this stack frame and attempts to uncover what caused it.
I have attached a spreadsheet of raw data from building and testing many commits, summarized below.
Stack Frame Tracing.xlsx
The commit "Add typed_operator_dispatch_functor" is the commit introducing double-dispatch to the AST kernel. This commit introduces a stack frame. Prior to this commit, the previous approach to AST device-side code was much more limited in functionality, which is why we redesigned the internals to use double-dispatch.
During early development of the double-dispatch code, I had zero stack frame with double-dispatch, but only until I merged in new code from upstream in
branch-0.15
. This suggests that the addition of new types (possibly fixed point) caused the kernel size to exceed some unknown limit, forcing it to have a nonzero stack frame. To reproduce this, I rebased the AST branch onto an older version ofbranch-0.15
from June 17th (the date the AST branch was created) and did not merge in any upstream changes frombranch-0.15
after that date (except for the required double-dispatch PR [REVIEW] Add a double type dispatcher. #5716). It looks like this commit triggered the introduction of a stack frame on branchast-rebase3
: bdice@12cdb22One way to get these statistics is:
or to build with nvcc flags
-Xptxas="-v"
, but that's much slower (ccache won't work if the compilation produces output).The text was updated successfully, but these errors were encountered: