CAGRA template instantiations #1428

Closed

Conversation

tfeher
Contributor

@tfeher tfeher commented Apr 18, 2023

CAGRA was introduced as a header-only implementation in #1375. This PR adds precompiled headers to NEIGHBORS_TEST, which speeds up compilation of the CAGRA tests. Once the binary size is reduced, we expect to move the precompiled headers into libraft.

The single-cta search kernels are moved to separate header files to make it easier to specify extern template instantiations for them. (These are necessary because otherwise the kernels are implicitly instantiated when struct search is processed by the compiler, even if we have extern templates for search.)
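A minimal host-side sketch of the mechanism described above, not the actual RAFT code: `kernel_like` and `search_like` are hypothetical stand-ins for a search kernel and for struct search. The point it illustrates is that the callee template gets its own `extern template` declaration and a matching explicit instantiation definition, in addition to the ones for the wrapper class. Both parts are shown in one file here; in a real build the declarations would sit in a header and the definitions in a single source file.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a search kernel template (not RAFT's real signature).
// Sums the squares of the first n elements, emulating a team-strided loop.
template <unsigned TeamSize, unsigned MaxDim, typename DataT, typename IndexT>
DataT kernel_like(const DataT* data, IndexT n)
{
  assert(data != nullptr && n <= MaxDim);
  DataT acc = DataT{0};
  for (unsigned lane = 0; lane < TeamSize; ++lane) {
    for (IndexT i = lane; i < n; i += TeamSize) { acc += data[i] * data[i]; }
  }
  return acc;
}

// Hypothetical stand-in for the struct that launches the kernel.
template <typename DataT, typename IndexT>
struct search_like {
  DataT operator()(const DataT* data, IndexT n) const
  {
    return kernel_like<8, 128, DataT, IndexT>(data, n);
  }
};

// Would live in a header: explicit instantiation declarations.
// They tell every including translation unit not to instantiate these itself.
extern template float kernel_like<8, 128, float, std::uint32_t>(const float*, std::uint32_t);
extern template struct search_like<float, std::uint32_t>;

// Would live in exactly one source file: explicit instantiation definitions.
template float kernel_like<8, 128, float, std::uint32_t>(const float*, std::uint32_t);
template struct search_like<float, std::uint32_t>;
```

With this layout, only the source file holding the definitions pays the compile cost of the kernel body; other translation units link against it.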

This PR fixes #1443.

@tfeher tfeher force-pushed the cagra_template_instantiations branch from 99d08e5 to 03f781d Compare April 23, 2023 22:44
@tfeher tfeher marked this pull request as ready for review April 23, 2023 22:57
@tfeher tfeher requested review from a team as code owners April 23, 2023 22:57
@tfeher
Contributor Author

tfeher commented Apr 23, 2023

To define instantiations for other input types, add the types to the dictionaries here and here, generate the files, and edit CMakeLists.txt here.

@tfeher tfeher added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Apr 24, 2023
@tfeher
Contributor Author

tfeher commented Apr 25, 2023

Note: this PR increases the binary size of libraft.so by 136 MB. It is an open question whether to hold this PR until #1459 is resolved. [Update] Due to the large binary size, the precompiled headers are only used in NEIGHBORS_TEST. Some details:

| filename | Size (MB) |
|---|---|
| CMakeFiles/raft_lib.dir/src/neighbors/cagra_build_float_uint32.cu.o | 11.239 |
| CMakeFiles/raft_lib.dir/src/neighbors/cagra_search_float_uint32.cu.o | 5.202 |
| CMakeFiles/raft_lib.dir/src/neighbors/cagra_prune_float_uint32.cu.o | 4.768 |
| CMakeFiles/raft_lib.dir/src/neighbors/detail/cagra/search_single_cta_float_uint32_dim256_t16.cu.o | 24.329 |
| CMakeFiles/raft_lib.dir/src/neighbors/detail/cagra/search_single_cta_float_uint32_dim512_t32.cu.o | 25.797 |
| CMakeFiles/raft_lib.dir/src/neighbors/detail/cagra/search_single_cta_float_uint32_dim128_t8.cu.o | 23.628 |
| CMakeFiles/raft_lib.dir/src/neighbors/detail/cagra/search_multi_cta_float_uint32_dim1024_t32.cu.o | 4.115 |
| CMakeFiles/raft_lib.dir/src/neighbors/detail/cagra/search_multi_cta_float_uint32_dim128_t8.cu.o | 3.698 |
| CMakeFiles/raft_lib.dir/src/neighbors/detail/cagra/search_single_cta_float_uint32_dim1024_t32.cu.o | 26.599 |
| CMakeFiles/raft_lib.dir/src/neighbors/detail/cagra/search_multi_cta_float_uint32_dim256_t16.cu.o | 3.718 |
| CMakeFiles/raft_lib.dir/src/neighbors/detail/cagra/search_multi_cta_float_uint32_dim512_t32.cu.o | 3.859 |
| total | 136.952 |

The file size is large because we instantiate the search kernels for a large number of template parameter combinations, e.g.: https://github.com/rapidsai/raft/pull/1428/files#diff-87898671b06ea9a5f8a5be95034a72ccdaecaab932ff73e60ef9b62edb8a3398R75-R272

@tfeher tfeher force-pushed the cagra_template_instantiations branch from 5682bee to 60184c1 Compare May 14, 2023 20:02
@tfeher
Contributor Author

tfeher commented May 14, 2023

Note to reviewer: it is recommended to review the commits separately:

  • 95428b3 moves kernels to separate files, without any code change,
  • 97f7a48 adds template instantiations (bulky, but repetitive),
  • 60184c1 removes the precompiled CAGRA headers from libraft.cu and adds them to NEIGHBORS_TEST. We plan to revert this once #1459 (CAGRA binary size too large) is solved.

uint32_t* const num_executed_iterations);

// search_multi_cta_float_uint32_dim1024_t32.cu
instantiate_multi_cta_search_kernel(32, 64, 16, 64, 1024, float, float, uint32_t, uint4);
Member

This is a lot of instantiations. Has any analysis been done on the perf impact of moving some of these to runtime arguments (specifically the first 5)? Or is the overall impact on compile time and binary size negligible enough?
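To make the trade-off in this question concrete, here is a minimal host-side sketch. It is not CAGRA code: `sum_static`, `sum_dynamic`, `dispatch_static`, and the block sizes are hypothetical. A compile-time parameter needs one instantiation per supported value plus a dispatch step, while a runtime argument needs a single compiled function, possibly at the cost of optimizations (such as full loop unrolling) that rely on the value being known at compile time.

```cpp
#include <cassert>
#include <cstddef>

// Compile-time variant: the block size is a template parameter, so the
// compiler emits one copy of this function per value that is used.
template <std::size_t BlockSize>
float sum_static(const float* data)
{
  assert(data != nullptr);
  float acc = 0.0f;
  for (std::size_t i = 0; i < BlockSize; ++i) { acc += data[i]; }
  return acc;
}

// Runtime variant: one compiled function serves every block size.
inline float sum_dynamic(const float* data, std::size_t block_size)
{
  assert(data != nullptr);
  float acc = 0.0f;
  for (std::size_t i = 0; i < block_size; ++i) { acc += data[i]; }
  return acc;
}

// Dispatch from a runtime value to the precompiled static variants.
// Each case adds one more instantiation to the binary; unsupported
// sizes fall back to the runtime variant.
inline float dispatch_static(const float* data, std::size_t block_size)
{
  switch (block_size) {
    case 64: return sum_static<64>(data);
    case 128: return sum_static<128>(data);
    case 256: return sum_static<256>(data);
    default: return sum_dynamic(data, block_size);
  }
}
```

Whether the runtime variant is fast enough for a given kernel is an empirical question that only a benchmark can answer.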

Contributor Author

I have started a discussion with Akira and @enp1s0 about this:

Member

Got it. If it helps compile times, I don't mind merging this PR. Just generally would be nice to see the number of specializations reduced if it's impacting either compile time or binary size.

@tfeher tfeher force-pushed the cagra_template_instantiations branch from 60184c1 to ec5689b Compare July 3, 2023 21:33
@tfeher tfeher requested review from a team as code owners July 3, 2023 21:33
@tfeher tfeher changed the base branch from branch-23.06 to branch-23.08 July 3, 2023 21:34
@tfeher

This comment was marked as outdated.

@tfeher tfeher force-pushed the cagra_template_instantiations branch from 700a057 to 5d6ac51 Compare July 4, 2023 10:19
@tfeher
Contributor Author

tfeher commented Jul 4, 2023

I have restored the instantiations for search_multi_cta_kernel, and also added instantiations for the uint64_t index type. The 64-bit index is only used in tests, so the corresponding objects are only added to the tests.

@tfeher
Contributor Author

tfeher commented Jul 5, 2023

Without this PR

| without template instantiations | compile time (h:min:s) | size (MB) |
|---|---|---|
| bench/ann/../raft_cagra.cu.o | 1:01:42 | 152.539 |
| test_float_uint32_t.cu.o | 0:24:23 | 51.633 |
| test_int8_t_uint32_t.cu.o | 0:23:37 | 56.561 |
| test_uint8_t_uint32_t.cu.o | 0:23:29 | 56.667 |
| test_float_int64_t.cu.o | 0:20:43 | 40.784 |
| bench/prims/../cagra_float_uint32_t.cu.o | 0:22:30 | 49.958 |

Note the prims bench (last line) will be added by #1496.

In its current form the PR adds the following object files:

| files (libraft.so) | compile time (h:min:s) | size (MB) |
|---|---|---|
| search_single_cta_float_uint32_dim1024_t32.cu.o | 0:07:59 | 13 |
| search_single_cta_float_uint32_dim128_t8.cu.o | 0:07:42 | 11 |
| search_single_cta_float_uint32_dim512_t32.cu.o | 0:07:32 | 12 |
| search_single_cta_float_uint32_dim256_t16.cu.o | 0:07:31 | 12 |
| search_single_cta_int8_uint32_dim1024_t32.cu.o | 0:07:17 | 14 |
| search_single_cta_uint8_uint32_dim1024_t32.cu.o | 0:07:10 | 14 |
| search_single_cta_uint8_uint32_dim256_t16.cu.o | 0:07:05 | 12 |
| search_single_cta_uint8_uint32_dim128_t8.cu.o | 0:07:03 | 12 |
| search_single_cta_int8_uint32_dim512_t32.cu.o | 0:06:59 | 12 |
| search_single_cta_uint8_uint32_dim512_t32.cu.o | 0:06:57 | 12 |
| search_single_cta_int8_uint32_dim128_t8.cu.o | 0:06:47 | 12 |
| search_single_cta_int8_uint32_dim256_t16.cu.o | 0:06:42 | 12 |
| search_multi_cta_float_uint32_dim1024_t32.cu.o | 0:01:33 | 1 |
| search_multi_cta_uint8_uint32_dim1024_t32.cu.o | 0:01:32 | 2 |
| search_multi_cta_int8_uint32_dim1024_t32.cu.o | 0:01:32 | 2 |
| search_multi_cta_float_uint32_dim128_t8.cu.o | 0:01:25 | 1 |
| search_multi_cta_float_uint32_dim256_t16.cu.o | 0:01:24 | 1 |
| search_multi_cta_float_uint32_dim512_t32.cu.o | 0:01:22 | 1 |
| search_multi_cta_uint8_uint32_dim512_t32.cu.o | 0:01:22 | 1 |
| search_multi_cta_uint8_uint32_dim256_t16.cu.o | 0:01:22 | 1 |
| search_multi_cta_int8_uint32_dim512_t32.cu.o | 0:01:21 | 1 |
| search_multi_cta_uint8_uint32_dim128_t8.cu.o | 0:01:21 | 1 |
| search_multi_cta_int8_uint32_dim256_t16.cu.o | 0:01:21 | 1 |
| search_multi_cta_int8_uint32_dim128_t8.cu.o | 0:01:21 | 1 |
| sum | 1:43:40 | 166 |

The compile times of the tests and the benchmark change accordingly:

| files (cagra test only) | compile time (h:min:s) | size (MB) |
|---|---|---|
| bench/ann/.../raft_cagra.cu.o | 0:03:20 | 6 |
| search_single_cta_float_uint64_dim1024_t32.cu.o | 0:05:43 | 11 |
| search_single_cta_float_uint64_dim256_t16.cu.o | 0:05:16 | 10 |
| search_single_cta_float_uint64_dim512_t32.cu.o | 0:05:14 | 11 |
| search_single_cta_float_uint64_dim128_t8.cu.o | 0:05:12 | 10 |
| test_int8_t_uint32_t.cu.o | 0:05:22 | 6 |
| test_float_uint32_t.cu.o | 0:05:21 | 6 |
| test_uint8_t_uint32_t.cu.o | 0:05:20 | 6 |
| test_float_int64_t.cu.o | 0:03:39 | 5 |
| sum | 0:44:37 | 72 |

@tfeher
Contributor Author

tfeher commented Jul 5, 2023

Currently the explicit template instantiations are very verbose because we spell out each parameter combination. We could shorten them by adding a higher-level wrapper and instantiating that with fewer combinations.

@divyegala
Member

@tfeher could you summarize the total change in compile times and binary sizes before and after this PR?

divyegala
divyegala previously approved these changes Jul 14, 2023
Member

@divyegala divyegala left a comment

Approved, pending request for summary

@tfeher
Contributor Author

tfeher commented Jul 17, 2023

Thanks @divyegala for the review. I would prefer to close this in favor of #1650. The new PR is essentially the same as this one; the only difference is that in the new PR we replace the kernel dispatch macros with functions and instantiate the dispatch functions. This way we need to enumerate fewer parameter combinations, which saves around 2800 lines of template instantiations / declarations.
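The idea can be sketched as follows. This is a simplified host-side illustration rather than the code in #1650: `kernel_like` and `dispatch_like` are hypothetical names, and the selection logic is an assumption. The (team size, max dim) pairs mirror the suffixes of the object file names listed in this thread. Because the dispatch function maps a runtime value to the kernel's compile-time parameters, one explicit instantiation per data/index type pair pulls in every kernel variant it can select.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Hypothetical kernel stand-in with two compile-time tuning parameters.
// Computes the squared L2 norm with a team-strided loop.
template <unsigned TeamSize, unsigned MaxDim, typename DataT, typename IndexT>
float kernel_like(const DataT* v, IndexT dim)
{
  assert(v != nullptr && dim <= MaxDim);
  float acc = 0.0f;
  for (unsigned lane = 0; lane < TeamSize; ++lane) {
    for (IndexT i = lane; i < dim; i += TeamSize) {
      acc += static_cast<float>(v[i]) * static_cast<float>(v[i]);
    }
  }
  return acc;
}

// Dispatch function replacing a dispatch macro: it picks the kernel variant
// from a runtime dimension. Instantiating it implicitly instantiates all
// four kernel variants for the given DataT and IndexT.
template <typename DataT, typename IndexT>
float dispatch_like(const DataT* v, IndexT dim)
{
  if (dim <= 128) { return kernel_like<8, 128, DataT, IndexT>(v, dim); }
  if (dim <= 256) { return kernel_like<16, 256, DataT, IndexT>(v, dim); }
  if (dim <= 512) { return kernel_like<32, 512, DataT, IndexT>(v, dim); }
  if (dim <= 1024) { return kernel_like<32, 1024, DataT, IndexT>(v, dim); }
  throw std::invalid_argument("unsupported dimension");
}

// Header part: one extern declaration per type pair, not per kernel variant.
extern template float dispatch_like<float, std::uint32_t>(const float*, std::uint32_t);
extern template float dispatch_like<std::int8_t, std::uint32_t>(const std::int8_t*, std::uint32_t);
extern template float dispatch_like<std::uint8_t, std::uint32_t>(const std::uint8_t*, std::uint32_t);

// Source part: the matching explicit instantiation definitions.
template float dispatch_like<float, std::uint32_t>(const float*, std::uint32_t);
template float dispatch_like<std::int8_t, std::uint32_t>(const std::int8_t*, std::uint32_t);
template float dispatch_like<std::uint8_t, std::uint32_t>(const std::uint8_t*, std::uint32_t);
```

In this sketch three instantiation lines cover twelve kernel variants, which is the kind of reduction in boilerplate the comment describes.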

The binary size increase and compile time decrease should be the same for the two PRs, and are expected to match the numbers listed in this comment: #1428 (comment). I will add a shorter summary message to the new PR once CI finishes.

@tfeher tfeher closed this Jul 17, 2023
rapids-bot bot pushed a commit that referenced this pull request Jul 19, 2023
CAGRA was introduced as a header-only implementation in #1375. This PR adds precompiled single- and multi-cta search kernels to libraft.so.

The single- and multi-cta search kernels were moved to separate header files to make it easier to specify extern template instantiations for these. 

The macros for dispatching the kernels were replaced by functions. We define explicit instantiations for the top level dispatch functions. (This is in contrast to #1428 where the kernels themselves were instantiated, which resulted in a large number of parameter combinations that had to be explicitly spelled out.)

This PR fixes #1443.

Authors:
  - Tamas Bela Feher (https://github.com/tfeher)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: #1650
@tfeher tfeher deleted the cagra_template_instantiations branch October 25, 2023 21:04
Labels
CMake cpp improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Vector Search
Projects
Development

Successfully merging this pull request may close these issues.

Re-introduce CAGRA template instantiations to reduce compile time
3 participants