[FEA] Investigating build times and number of specializations #1201
Comments
Definitely, it's a well-known problem of the What we potentially can improve is specializations of this kernel (which are done via the ivfpq_compute_similarity helper struct). On the one hand, they speed up the compile times by allowing compiling different instances in parallel. On the other hand, the current implementation of this produces some instances that do not make sense. For example, one could argue that it does not make much sense to have |
This sounds like a good idea. Perhaps we can make it a run-time error? Something like:
if constexpr (sizeof(OutT) < sizeof(LutT)) {
  RAFT_EXPECTS(false, "Size of LutT may not be larger than size of OutT");
} else {
  // launch kernel
} |
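To make the suggestion above concrete, here is a minimal sketch in plain C++. It is not RAFT code: `select_kernel` and `launch_kernel` are hypothetical names, and `std::invalid_argument` stands in for `RAFT_EXPECTS`. The point is that the discarded `if constexpr` branch is never instantiated, so the unwanted type combination produces no kernel instance and only fails when it is actually requested at run time.

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical stand-in for a kernel launch. It returns a value derived from
// the type sizes so that the selected instantiation is observable.
template <typename OutT, typename LutT>
int launch_kernel()
{
  return static_cast<int>(sizeof(OutT) * 10 + sizeof(LutT));
}

// Hypothetical selection helper: combinations where LutT is wider than OutT
// never instantiate launch_kernel, they only throw when called.
template <typename OutT, typename LutT>
int select_kernel()
{
  if constexpr (sizeof(OutT) < sizeof(LutT)) {
    throw std::invalid_argument("Size of LutT may not be larger than size of OutT");
  } else {
    return launch_kernel<OutT, LutT>();
  }
}
```

For example, `select_kernel<float, std::uint16_t>()` launches normally, while `select_kernel<std::uint8_t, float>()` compiles but throws.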
Sure, that's not a problem, we can restrict the instantiation in the kernel selection logic ( |
For the pairwise-distance kernels, I think the following would help:
For measurements, I would propose we also save the As a general coding practice, maybe the following could help:
template <typename T1, typename T2, typename T3>
struct foo {
  T2 do1(T1 x) { /* .. */ }
  T3 do2(T2 x) { /* .. */ }
  T3 call(T1 x) {
    auto y = do1(x);
    auto z = do2(y);  // pass the intermediate result, not x
    return z;
  }
};
We could use this pattern instead:
template <typename T1, typename T2>
T2 do1(T1 x) { /* .. */ }
template <typename T2, typename T3>
T3 do2(T2 x) { /* .. */ }
template <typename T1, typename T2, typename T3>
T3 call(T1 x) {
  // return types cannot be deduced, so name them explicitly
  auto y = do1<T1, T2>(x);
  auto z = do2<T2, T3>(y);
  return z;
}
This will reduce the number of instantiations. For instance, when the types I am guessing that @Nyrio has some input as well! |
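A small experiment can illustrate the difference between the two patterns. The sketch below is illustrative only: the function bodies and the two counter helpers are made up for this example. Each `do1` instantiation bumps a counter once via a function-local static, so the counters show how many distinct `do1` instances were actually used when only `T3` varies.

```cpp
// Counters of distinct do1 instantiations that have run (illustration only).
inline int& struct_do1_instances() { static int n = 0; return n; }
inline int& free_do1_instances() { static int n = 0; return n; }

// Pattern 1: do1 is a member, so it is instantiated again for every
// (T1, T2, T3) combination, even though it never uses T3.
template <typename T1, typename T2, typename T3>
struct foo {
  T2 do1(T1 x)
  {
    static const int once = ++struct_do1_instances();  // once per instantiation
    (void)once;
    return static_cast<T2>(x) / 2;
  }
  T3 do2(T2 y) { return static_cast<T3>(y * 4); }
  T3 call(T1 x) { return do2(do1(x)); }
};

// Pattern 2: free functions depend only on the types they actually use.
template <typename T1, typename T2>
T2 do1(T1 x)
{
  static const int once = ++free_do1_instances();  // once per instantiation
  (void)once;
  return static_cast<T2>(x) / 2;
}

template <typename T2, typename T3>
T3 do2(T2 y)
{
  return static_cast<T3>(y * 4);
}

template <typename T1, typename T2, typename T3>
T3 call(T1 x)
{
  return do2<T2, T3>(do1<T1, T2>(x));
}
```

Calling `foo<int, float, double>{}.call(3)` and `foo<int, float, long>{}.call(3)` uses two separate `do1` instances, whereas `call<int, float, double>(3)` and `call<int, float, long>(3)` share the single `do1<int, float>`.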
Hi, I am sharing my progress in PR #1228. So far I have reduced compile times by 22% and the number of kernels by more than 50% with two 2-line changes in the pairwise distance code. |
Another measurement technique we can use is to add the
|
For IVF-PQ compute similarity kernels, it does not add much value to have both raft/cpp/src/distance/neighbors/specializations/detail/ivfpq_compute_similarity_float_fast.cu Lines 23 to 25 in aa14ed4
[update] Unfortunately we have public interface both for
@achirkin mentioned that it might be possible to keep in this internal kernel only uint32 (because we are working with smaller chunks of the dataset). He is investigating this option. |
Posting a ninjatracing log here to provide a breakdown of the compile times on my workstation. Just to focus on the things that other projects depend directly upon, this trace only includes source files compiled into the shared libs and not tests or benchmarks. One good thing to note is that most of the source files which are bottlenecks are in a Other offenders (not necessarily in order) in addition to ivfpq_search specializations:
|
Here's the timeline for the end-to-end build w/o #1232: And with #1232: Unfortunately, I need to revert a couple changes for the release because the build time in CI has caused some problems for users. For 23.04, we should focus on getting those changes back while also introducing more changes like #1230 #1228 and #1220 to lower the build time further. It would be great if we can get the end-to-end build time under 20 minutes again on a single architecture. Now that we're building for 6 architectures w/ CUDA 11.8, we should strive to keep the initial cpp-build times in CI to just over an hour if possible. |
This is some additional analysis on the ninja log traces that @cjnolet shared. It shows the compilation units that have more than 1 minute compile time and compares between before and after the revert in #1232. The primary impact of the revert seems to be on |
@ahendriksen that analysis is great! Strangely, I wasn't seeing that issue at all when compiling exclusively for sm_70. It appears to be something that starts w/ sm_80 and above? But out of the 6 architectures that compile in CI, I guess at least 4 of them are sm_80 and above so that might explain why that file alone is taking 4+ hours. |
I did have 42 minute compile time on the |
I opened #1235 to remove all the uint32_t specializations. |
Another idea for ivf-pq.
enum class compute_optimization {
  /** Represent the internal scores and the lookup table as float32. */
  LEVEL_0,
  /** Represent the internal scores as float32 and the lookup table as float16. */
  LEVEL_1,
  /** Represent the internal scores as float16 and the lookup table as float8. */
  LEVEL_2
};
This should give us another 60% reduction in number of instances. |
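One way to read this proposal: instead of exposing the score type and the lookup-table type as two independent template parameters, a single run-time level selects one of a few fixed pairs. The sketch below assumes that reading. `half_t`, `fp8_t`, `type_sizes`, `launch`, and `dispatch` are hypothetical placeholders for this example, not RAFT or CUDA types.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>

enum class compute_optimization { LEVEL_0, LEVEL_1, LEVEL_2 };

// Hypothetical placeholders for 16-bit and 8-bit floating-point storage types.
struct half_t { std::uint16_t bits; };
struct fp8_t { std::uint8_t bits; };

// Reports which type pair was selected, so the dispatch is observable.
struct type_sizes {
  std::size_t score_bytes;
  std::size_t lut_bytes;
};

// Stand-in for the templated kernel launch.
template <typename ScoreT, typename LutT>
type_sizes launch()
{
  return {sizeof(ScoreT), sizeof(LutT)};
}

// Only three (ScoreT, LutT) pairs are ever instantiated, one per level.
inline type_sizes dispatch(compute_optimization level)
{
  switch (level) {
    case compute_optimization::LEVEL_0: return launch<float, float>();
    case compute_optimization::LEVEL_1: return launch<float, half_t>();
    case compute_optimization::LEVEL_2: return launch<half_t, fp8_t>();
  }
  throw std::invalid_argument("unknown compute_optimization level");
}
```

With this shape, adding a new level is an explicit decision to add one more instance, rather than a side effect of a new type combination.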
…1249)
Refactor `ivf_pq::index` to keep the cluster data in separate lists/buffers and reduce the number of template instantiations where possible.
- Breaking change: the public structure `ivf_pq::index` is changed.
- Partially addresses #1201: removing the `mdarray<IdxT, ...> list_offsets` member makes it possible to tweak the search kernel to remove the `IdxT` template parameter altogether and stick to `uint32_t` _sample_ indices, which are then backtraced to the database indices (`IdxT`) during the post-processing stage.
- Partially addresses #1170
- Improves the index extending time:
  - No need to calculate offsets, reorder clusters, etc.
  - The individual lists are shared between multiple versions of the index and amortize the allocation/copy costs.
- Small addition to the tests: split the serialize/deserialize tests into separate test cases to allow checking the index state in all possible scenarios (search after: building new, extending, deserializing).
Authors:
- Artem M. Chirkin (https://github.com/achirkin)
- Corey J. Nolet (https://github.com/cjnolet)
Approvers:
- Tamas Bela Feher (https://github.com/tfeher)
- Corey J. Nolet (https://github.com/cjnolet)
URL: #1249
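The index backtracing idea described above can be sketched roughly as follows. This is a simplified host-side illustration under my own assumptions, not the code from the PR: `backtrace_indices` is a hypothetical name, and a flat `database_ids` table stands in for whatever per-list storage the index actually uses. Only this small post-processing step is templated on `IdxT`, while the search itself would work with `uint32_t` sample indices.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical post-processing step: translate the uint32_t sample indices
// produced by the search into database indices of type IdxT.
template <typename IdxT>
std::vector<IdxT> backtrace_indices(const std::vector<std::uint32_t>& sample_indices,
                                    const std::vector<IdxT>& database_ids)
{
  std::vector<IdxT> out;
  out.reserve(sample_indices.size());
  for (std::uint32_t s : sample_indices) {
    // at() throws std::out_of_range if a sample index is invalid
    out.push_back(database_ids.at(s));
  }
  return out;
}
```

The database ids may exceed the 32-bit range (for example with `IdxT = std::int64_t`), as long as each searched chunk holds fewer than 2^32 samples.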
On this branch, I have documented how to reduce the compile times of a specific translation unit ( I think some of these techniques can be applied to RAFT. |
As a result of the investigation, I have opened an issue to consider removing dependency on spdlog: #1300 |
Following up on the investigation I linked above, I have opened a PR #1307 that reduces the compilation times of pairwise distance instantiations (specializations) by up to 5x. It requires quite some rearchitecting of the code and I am not sure if that is acceptable, but it shows that big wins are possible. |
@ahendriksen I've looked through your PR with the refactors to the pairwise distance impl details and I think it looks great so far from my side. I've held off on submitting formal feedback to it because I noticed a couple things marked TODO in the code still. Let me know if it's ready and I can review it more closely. |
The PR is ready for review. |
While investigating some causes of our slow build times, @ahendriksen has provided some statistics (and scripts) to compute
Allard provided a script to extract the information from a binary using `cuobjdump`:
It's clear that the `ivf_pq::ivfpq_compute_similarity_kernel` has the most specializations and is also adding to the larger binary size. It's also likely that this kernel is contributing significantly to the compile times of `libraft-distance`.
cc @tfeher @achirkin
The second largest kernel looks like the `pairwiseDistanceMatKernel`. We should consider addressing some of the items at the top of this list to improve compile times. Now that CI is compiling w/ CUDA 11.8, it's compiling 6 different architectures (`sm_60`, `sm_70`, `sm_75`, `sm_80`, `sm_86`, `sm_90`).
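For anyone who wants to reproduce this kind of count, here is a rough sketch of the grouping step only. It is not the script mentioned above and it does not invoke `cuobjdump`; it assumes you already have a list of demangled kernel names, one string per kernel instance, and `count_specializations` is a hypothetical helper name. Each name is grouped by the text before the first `<`.

```cpp
#include <map>
#include <string>
#include <vector>

// Count kernel instances per template name. Assumes demangled names such as
// "ns::kernel<float, int>"; names without '<' are counted as they are.
inline std::map<std::string, int> count_specializations(
  const std::vector<std::string>& demangled_names)
{
  std::map<std::string, int> counts;
  for (const auto& name : demangled_names) {
    // find() returns npos when there is no '<', so substr keeps the full name
    ++counts[name.substr(0, name.find('<'))];
  }
  return counts;
}
```

Sorting the resulting map by count would give a list of the kernels with the most specializations.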