[PERF] AST kernel has nonzero stack frame #5902

bdice · 2020-08-10T16:36:45Z

While developing #5494, I have been expanding the feature set supported by the AST interpreter. With a smaller set of supported operators and data types, the kernel ran quickly and had high memory throughput. As the feature set expanded, a nonzero stack frame was introduced and performance dropped significantly. With a nonzero stack frame, each thread must access "local memory" (which physically resides in global device memory) during execution, which reduces performance by about 4-10x (from ~450 GB/s to ~50-100 GB/s). This issue documents my results from investigating this stack frame and my attempts to uncover what caused it.

I have attached a spreadsheet of raw data from building and testing many commits, summarized below.
Stack Frame Tracing.xlsx

The commit "Add typed_operator_dispatch_functor" introduces double-dispatch to the AST kernel, and with it a stack frame. The previous approach to AST device-side code was much more limited in functionality, which is why we redesigned the internals to use double-dispatch.
During early development of the double-dispatch code, I had zero stack frame with double-dispatch, but only until I merged in new code from upstream in branch-0.15. This suggests that the addition of new types (possibly fixed point) caused the kernel size to exceed some unknown limit, forcing it to have a nonzero stack frame. To reproduce this, I rebased the AST branch onto an older version of branch-0.15 from June 17th (the date the AST branch was created) and did not merge in any upstream changes from branch-0.15 after that date (except for the required double-dispatch PR [REVIEW] Add a double type dispatcher. #5716). It looks like this commit triggered the introduction of a stack frame on branch ast-rebase3: bdice@12cdb22
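To illustrate why adding types can inflate the kernel, here is a minimal host-side sketch of double-dispatch in plain C++. It is not the libcudf implementation: `type_id`, `type_dispatcher`, `double_type_dispatcher`, and `size_sum` are hypothetical stand-ins written for this example, and the real dispatcher covers many more types and runs in device code. The point is only that a functor templated on two types gets one instantiation per (LHS, RHS) pair, so the amount of generated code grows with the square of the number of supported types.

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical runtime type tag with only two entries, for illustration.
enum class type_id { INT32, FLOAT64 };

// Single dispatch: map a runtime tag to a compile-time type and call f<T>().
template <typename Functor>
int type_dispatcher(type_id id, Functor f) {
  switch (id) {
    case type_id::INT32: return f.template operator()<std::int32_t>();
    case type_id::FLOAT64: return f.template operator()<double>();
  }
  throw std::invalid_argument("unsupported type_id");
}

// Second stage: LHS is already resolved, resolve RHS and call f<LHS, RHS>().
template <typename LHS, typename Functor>
struct rhs_dispatch {
  Functor f;
  template <typename RHS>
  int operator()() const { return f.template operator()<LHS, RHS>(); }
};

// First stage: resolve LHS, then dispatch again on the RHS tag.
template <typename Functor>
struct lhs_dispatch {
  type_id rhs_id;
  Functor f;
  template <typename LHS>
  int operator()() const {
    return type_dispatcher(rhs_id, rhs_dispatch<LHS, Functor>{f});
  }
};

// Double dispatch: with N supported types this instantiates N * N code paths.
template <typename Functor>
int double_type_dispatcher(type_id lhs, type_id rhs, Functor f) {
  return type_dispatcher(lhs, lhs_dispatch<Functor>{rhs, f});
}

// Example binary functor: returns sizeof(LHS) + sizeof(RHS).
struct size_sum {
  template <typename LHS, typename RHS>
  int operator()() const {
    return static_cast<int>(sizeof(LHS) + sizeof(RHS));
  }
};
```

Calling `double_type_dispatcher(type_id::INT32, type_id::FLOAT64, size_sum{})` selects the `<std::int32_t, double>` instantiation at runtime. With two types there are four such paths; each additional type adds a full row and column, which is consistent with the hypothesis above that new upstream types pushed the kernel past some limit.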

One way to get these statistics is:

cuobjdump --dump-resource-usage cpp/build/release/CMakeFiles/cudf.dir/src/ast/ast.cu.o | c++filt

Another is to build with the nvcc flag -Xptxas="-v", but that's much slower (ccache won't work if the compilation produces output).
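When checking many commits, it can help to pull the stack size out of the resource-usage text automatically. The sketch below is a small C++ helper, not part of cuDF; `parse_stack_bytes` is a hypothetical name, and it assumes each kernel's usage line contains a field of the form `STACK:<bytes>` (for example ` REG:40 STACK:0 SHARED:0 LOCAL:0 ...`). Check that against the output of your toolkit version before relying on it.

```cpp
#include <cstddef>
#include <string>

// Returns the number after "STACK:" in one line of resource-usage output,
// or -1 if the line has no such field. The "STACK:<n>" layout is an
// assumption about the tool's output format.
long parse_stack_bytes(const std::string& line) {
  const std::string key = "STACK:";
  const std::size_t pos = line.find(key);
  if (pos == std::string::npos) return -1;

  std::size_t i = pos + key.size();
  if (i >= line.size() || line[i] < '0' || line[i] > '9') return -1;

  long value = 0;
  for (; i < line.size() && line[i] >= '0' && line[i] <= '9'; ++i) {
    value = value * 10 + (line[i] - '0');
  }
  return value;
}
```

A result greater than zero for the AST kernel's line means the kernel has a stack frame and is a candidate for the local memory slowdown described above.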


github-actions · 2021-02-16T21:18:41Z

This issue has been marked rotten due to no recent activity in the past 90d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

bdice · 2021-02-19T02:56:56Z

@harrism I think this issue can be closed. It provides some helpful information about the development of the AST feature but no further action is needed. Please re-open if needed. 👍

jrhemstad · 2021-02-19T18:39:03Z

Given we're picking back up on the AST work, I'd actually like to keep this open so we don't forget about it.

vyasr · 2021-08-10T00:40:18Z

@jrhemstad I don't think there's anything actionable here any longer, is there? We haven't seen any performance regressions in the various merged iterations of the AST code since, so I don't think we have incurred any unforeseen additional local memory accesses. There are definitely things we can try to do to reduce register pressure from the complex internals (mostly relating to redesigning the data reference structures), but I don't think those are directly related to this issue any longer, and we know to keep an eye out for this problem as a potential cause if we observe future performance regressions.

jrhemstad · 2021-08-10T03:15:06Z

> We haven't seen any performance regressions in the various merged iterations of the AST code since

If you compare to the performance @bdice originally saw before there was any stack frame, I think there is still a pretty sizable regression. I think there is still room for analysis and optimization, but it's not pressing.

vyasr · 2021-08-11T00:17:20Z

Interesting. As of #8214 I found that the AST-based equality join was just under 2x slower than the raw nested loop join using a simple row equality comparator. I wouldn't have anticipated being able to reduce the overhead any further, but that's encouraging if you think there's further room to improve that.

vyasr · 2024-05-16T17:06:31Z

I'm going to close this. Over time we may be able to leverage some new compiler options or similar to improve this situation marginally, but in the long term I think the only way we'll really be able to overcome the performance issues with evaluation is to switch away from the AST approach and go with something like #15366.

bdice added the Performance Performance related issue label Aug 10, 2020

bdice self-assigned this Aug 10, 2020

github-actions bot added the rotten label Feb 16, 2021

bdice closed this as completed Feb 19, 2021

jrhemstad reopened this Feb 19, 2021

jrhemstad added 0 - Backlog In queue waiting for assignment and removed inactive-90d labels Feb 19, 2021

kkraus14 added the improvement Improvement / enhancement to an existing function label Mar 29, 2021

vyasr mentioned this issue Jun 8, 2021

Enable AST-based joining #8214

Merged

beckernick added this to the Conditional Joins milestone Jul 23, 2021

vyasr mentioned this issue Oct 28, 2021

[FEA] Restructure AST internals to reduce stack depth and register pressure #9557

Open

bdice removed their assignment Jan 24, 2022

GregoryKimball modified the milestones: Conditional Joins, Expression evaluation Oct 24, 2022

GregoryKimball added the libcudf Affects libcudf (C++/CUDA) code. label Nov 21, 2022

vyasr closed this as completed May 16, 2024
