-
Notifications
You must be signed in to change notification settings - Fork 197
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CAGRA template instantiations #1428
CAGRA template instantiations #1428
Conversation
99d08e5
to
03f781d
Compare
To define instantiations for other input types, one needs to add the types to the dictionaries here and here, generate the files, and edit CMakeList.txt here |
The reason for the large file size is that we instantiate the search kernels with a large combination of template parameters, e.g.: https://github.com/rapidsai/raft/pull/1428/files#diff-87898671b06ea9a5f8a5be95034a72ccdaecaab932ff73e60ef9b62edb8a3398R75-R272 |
5682bee
to
60184c1
Compare
Note to reviewer: it is recommended to review the commits separately:
|
uint32_t* const num_executed_iterations); | ||
|
||
// search_multi_cta_float_uint32_dim1024_t32.cu | ||
instantiate_multi_cta_search_kernel(32, 64, 16, 64, 1024, float, float, uint32_t, uint4); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a lot of instantaitions. Has any analysis been done to the perf impact of moving some of these to runtime arguments (specifically the first 5)? Or is the overall impact of the compile time/binary size negligable enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have started a discussion with Akira and @enp1s0 about this:
- CAGRA pad dataset for 128bit vectorized load #1505 will cut the template instantiations by half (having just
uint4
for last arg) - The next target is the (BLOCK_SIZE, BLOCK_COUNT) parameter pair. We have five variants of these. I will measure the perf impact of changing these to runtime arg. If that works out, then binary size would be already tolerable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. If it helps compile times, I don't mind merging this PR. Just generally would be nice to see the number of specializations reduced if it's impacting either compile time or binary size.
60184c1
to
ec5689b
Compare
This comment was marked as outdated.
This comment was marked as outdated.
700a057
to
5d6ac51
Compare
I have restored the instantiations for |
Without this PR
Note the prims bench (last line) will be added by #1496. In its current form the PR adds the following object files:
The compile time of tests and benchmark changes accordingly
|
Currently the explicit template instantiations are awfully verbose, because we spell out each parameter combination. We could make that shorter by having a higher lever wrapper, and instantiating that with fewer combination. |
@tfeher could you summarize the sum of change in compile times and binary sizes before/after this PR? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved, pending request for summary
Thanks @divyegala for the review. I would prefer to close this in favor of #1650. The new PR is essentially the same as this, the only difference is that in the new PR we replace the kernel dispatch macros with functions and we instantiate the dispatch functions. This way we need to enumerate less parameter combinations, and that saves around 2800 lines of template instantiations / declarations. The binary size increase, compile time decrease shall be the same for the two PRs, which is expected to be the same as listed in this commit: #1428 (comment). I will add a shorter summary message to the new PR once CI finishes. |
Cagra was introduced header only in #1375. This PR adds a precompiled single- and multi-cta search kernels to libraft.so. The single- and multi-cta search kernels were moved to separate header files to make it easier to specify extern template instantiations for these. The macros for dispatching the kernels were replaced by functions. We define explicit instantiations for the top level dispatch functions. (This is in contrast to #1428 where the kernels themselves were instantiated, which resulted in a large number of parameter combinations that had to be explicitly spelled out.) This PR fixes #1443. Authors: - Tamas Bela Feher (https://github.com/tfeher) Approvers: - Corey J. Nolet (https://github.com/cjnolet) URL: #1650
Cagra was introduced header only in #1375. This PR adds a precompiled headers to NEIGHBORS_TEST. This accelerates compilation of cagra test. Once the binary size is reduced, it is expected to move the precompiled headers into libraft.
The single-cta search kernels are moved to separate header files, to make it easier to specify extern template instantiations for these. (These are necessary, because otherwise the kernels are implicitly instantiated when
struct search
is processed by the compiler, even if we have extern templates forsearch
).This PR fixes #1443.