[FEA] Investigating build times and number of specializations #1201

cjnolet · 2023-01-27T18:55:47Z

While investigating some causes of our slow build times, @ahendriksen has provided some statistics (and scripts) to compute

a ranking of which device functions in a cuda-compiled binary have the most specializations
the total sizes of the functions from 1, to help determine the total impact.

Allard privided a script to extract the information from a binary using cuobjdump:

# Get binary sizes per kernel


# See: <https://nvidia.slack.com/archives/C02DP33ERFS/p1661468240691609>

mkdir tmp_objdump
cp libraft_distance.so tmp_objdump/
cd tmp_objdump/
cuobjdump --extract-elf all libraft_distance.so
    
rm -f kernel_sizes.txt
    
# Write all kernel sizes to a file
for f in libraft_distance.*.sm_70.cubin ; do
    echo ${f};
    nm --print-size --size-sort --radix=d ${f} | sed 's/$_/_/'| sed 's/$.*//' | cu++filt >> kernel_sizes.txt ;
done
    
# Remove template arguments and remove kernel addresses. Output looks like:
# 0000000000000064 W void raft::matrix::detail::select::warpsort::block_kernel
cat kernel_sizes.txt | sed 's/^[0-9]*//' | sed 's/<.*//' | sort -h > kernel_sizes_filtered.txt

```

and a script to interpret the results:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import io
from pathlib import Path

txt = Path('kernel_sizes_filtered.txt').read_text()

split_lines = [l.strip().replace(' void', '').split(' ') for l in txt.splitlines()]

for l in split_lines: 
    if len(l) != 3: 
        print(l)
        
split_lines[1]

sizes_str, indicators, names = zip(*split_lines)

sizes = np.array(list(map(int, sizes_str)))
sizes_MB = sizes / 1e6

df = pd.DataFrame(dict(size_MB=sizes_MB, name=names, indicator=indicators))

top20 = (df[df.indicator=='T']
          .groupby('name')['size_MB']
          .agg(['count', 'mean', 'sum'])
          .sort_values(by='sum', ascending=False)[:20])

top20

It's clear that the ivf_pq::ivfpq_compute_similarity_kernel has the most specializations and is also adding to the larger binary size. It's also likely that this kernel is contributing significantly to the compile times of libraft-distance.

cc @tfeher @achirkin

The second largest kernel looks like the pairwiseDitanceMatKernel. We should consider addressing some of the items at the top of this list to improve compile times. Now that CI is compiling w/ CUDA 11.8, it's compiling 6 different architectures (sm_60, sm_70, sm_75, sm_80, sm_86, sm_90)

The text was updated successfully, but these errors were encountered:

achirkin · 2023-01-28T07:54:08Z

Definitely, it's a well-known problem of the ivfpq_compute_similarity_kernel! At some point during integration, I downlifted the PqBits to runtime, but that made the compute scores part of the kernel much heavier on ALU and resulted in 10-20% downgrade in the search QPS. Hence we've decided to bring it back to the template parameters. It can take values from 4 to 8, resulting in 5x more template instantiations.

What we potentially can improve is specializations of this kernel (which are done via the ivfpq_compute_similarity helper struct). On the one hand, they speed up the compile times by allowing compiling different instances in parallel. On the other hand, the current implementation of this produces some instances that do not make sense. For example, one could argue that it does not make much sense to have LutT (float8/float16/float32 possible) larger than OutT (float16/float32 possible), hence we can avoid compiling (float32, float16) combination. EnableSMemLut boolean in the current selection logic implies PrecompBaseDiff, so we can disable one of the four combinations of the two booleans. With a bit of code tweaking, we can fix IdxT = uin32_t when the fused sort is disabled (Capacity == 0).

ahendriksen · 2023-01-30T08:14:48Z

For example, one could argue that it does not make much sense to have LutT (float8/float16/float32 possible) larger than OutT (float16/float32 possible), hence we can avoid compiling (float32, float16) combination

This sounds like a good idea. Perhaps we can make it a run-time error? Something like:

if constexpr (sizeof(OutT) < sizeof(LutT) {
  RAFT_EXPECTS(false, "Size of LutT may not be larger than size of OutT"); 
} else {
 // launch kernel
}

achirkin · 2023-01-30T08:40:12Z

Sure, that's not a problem, we can restrict the instantiation in the kernel selection logic (OutT is the ScoreT there) and throw the error right there, or along with the other runtime checks. This will only reduce the number of instances by 1/6 though.

ahendriksen · 2023-01-30T08:53:27Z

For the pairwise-distance kernels, I think the following would help:

Consider removing some specializations for data alignments that confer little performance benefit (e.g.:VecLen==2 for floats)
Standardize on a single index type. We now have specialization for int and uint32_t. That seems excessive.

For measurements, I would propose we also save the ninja.log file as a downloadable build artifact (I hope this is possible with github actions). This way, we can run ninjatracing easily to identify the bottlenecks.

As a general coding practice, maybe the following could help:

Shrink the "surface area" of the template specializations. Often we have multiple templates arguments, each of which is only used in a specific portion of the code. We often use the following patten:

template <typename T1, typename T2, typename T3>
struct foo {
  T2 do1(T1 x) { /* .. */}
  T3 do2(T2 x) { /* .. */}
  T3 call(T1 x) {
    auto y = do1(x);
    auto z = do2(x);
    return z;
  }
};

We could use this pattern instead:

template <typename T1, T2>
T2 do1(T1 x) { /* .. */}
template <typename T2, T3>
T3 do2(T2 x) { /* .. */}

template <typename T1, T2, T3>
T3 call(T1 x) {
    auto y = do1(x);
    auto z = do2(x);
    return z;
}

This will reduce the number of instantiations. For instance, when the types Ti can be float or double, then will have 8 instantiations of foo::do1 (one for each of the two possibilities of T1, T2, T3), but only 4 instantiations of do1 (one for each of the two possibilities of T1, T2). I hope that this can reduce build times, but I have not measured it.

I am guessing that @Nyrio has some input as well!

ahendriksen · 2023-02-02T14:32:40Z

Hi I am sharing my progress in this PR: #1228

I have so far reduced compile times by 22% and number of kernels by more than 50% with two 2-line changes in the pairwisedistance code.

ahendriksen · 2023-02-02T14:56:55Z

Another measurement technique we can use is add the --time nvcc_compile.log flag to the nvcc command-line. This will write the time taken by all intermediate compilation stages of nvcc to the file in csv format. We can make this file a downloadable artifact for further analysis, or create a graph/plot/table immediately in CI.

--time <file name>                              (-time)                         
        Generate a comma separated value table with the time taken by each compilation
        phase, and append it at the end of the file given as the option argument.
        If the file is empty, the column headings are generated in the first row
        of the table. If the file name is '-', the timing data is generated in stdout.

tfeher · 2023-02-02T18:32:32Z

For IVF-PQ compute similarity kernels, it does not add much value to have both int64 and uint64 specialization. ~~I quick fix would be to just remove the int64 version from all instantiations (and tests and benchmarks)~~:

raft/cpp/src/distance/neighbors/specializations/detail/ivfpq_compute_similarity_float_fast.cu

Lines 23 to 25 in aa14ed4

    
           template struct ivfpq_compute_similarity<uint64_t, float, float>::configured<true, true>; 
        
           template struct ivfpq_compute_similarity<int64_t, float, float>::configured<true, true>; 
        
           template struct ivfpq_compute_similarity<uint32_t, float, float>::configured<true, true>;

[update] Unfortunately we have public interface both for uint64 and int64 specialization:

cuML uses the int64 specialization https://github.com/rapidsai/cuml/blob/85b33dfd9d77d3ee38785bd3d5e1720c89fb5ebe/python/cuml/neighbors/nearest_neighbors.pyx#L757
our python wrapppers for IVF-PQ use uint64

raft/python/pylibraft/pylibraft/neighbors/ivf_pq/cpp/c_ivf_pq.pxd

Line 152 in cd40591

const index[uint64_t]& index,

@achirkin mentioned that it might be possible to keep in this internal kernel only uint32 (because we are working with smaller chunks of the dataset). He is investigating this option.

cjnolet · 2023-02-03T01:26:14Z

Posting a ninjatracing log here to provide a breakdown of the compile times on my workstation. Just to focus on the things that other projects depend directly upon, this trace only includes source files compiled into the shared libs and not tests or benchmarks.

One good thing to note is that most of the source files which are bottlenecks are in a specializations directory. Any source files compiling code for raft::runtime should be directly calling a specialization. neighbors/ivfpq_search.cu violates this currently so we should look into that for sure (as outlined in #1194) because it 1) needs to be split into separate compilation units and 2) use consistent template types so it can use the existing specializations.

Other offenders (not necessarily in order) in addition to ivfpq_search specializations:

ball_cover.cu
refine_*.cu
knn.cu (hoping this might be largely fixed once @benfred's new brute-force changes are merged)
lp_unexpanded_*.cu
cosine.cu
jensen_shannon.cu

cjnolet · 2023-02-03T12:26:42Z

Here's the timeline for the end-to-end build w/o #1232:

And with #1232:

Unfortunately, I need to revert a couple changes for the release because the build time in CI has caused some problems for users. For 23.04, we should focus on getting those changes back while also introducing more changes like #1230 #1228 and #1220 to lower the build time further. It would be great it we can get the end-to-end build time under 20 minutes again on a single architecture. Now that we're building for 6 architectures w/ CUDA 11.8, we should strive to keep the initial cpp-build times in CI to just over an hour if possible.

ahendriksen · 2023-02-03T14:32:21Z

This is some additional analysis on the ninja log traces that @cjnolet shared. It shows the compilation units that have more than 1 minute compile time and compares between before and after the revert in #1232. The primary impact of the revert seems to be on lp_unexpanded_float_float_float_uint32.cu and canberra.cu.

cjnolet · 2023-02-03T14:38:45Z

@ahendriksen that analysis is great! Strangely, I wasn't seeing that issue at all when compiling exclusively for sm_70. It appears so be something that starts w/ sm_80 and above? But out of the 6 architectures that compile in CI, I guess at least 4 of them are sm_80 and above so that might explain why that file alone is taking 4+ hours.

ahendriksen · 2023-02-03T14:42:11Z

I did have 42 minute compile time on the lp_unexpanded with sm_70. (pre-revert). I would be surprised if it were architecture specific.

cjnolet · 2023-02-03T15:20:41Z

I opened #1235 to remove all the uint32_t specializations.

achirkin · 2023-02-06T16:19:15Z

Another idea for ivf-pq.
Atm, we have lut_dtype and internal_distance_type as orthogonal parameters, which gives 8 different variants (lut_dtype = {fp8_signed, fp8_unsigned, fp16, fp32 }, internal_distance_type = {fp16, fp32}). I suggested above to remove the variants where sizeof(lut_dtype) > sizeof(internal_distance_type), which is only one combination out of eight (fp32, fp16). What if, instead, we'd introduce just a few configurations which make most sense in terms of performance/recall compromise? i.e. remove these two search parameters in favor of something like

enum class compute_optimization {
  /** Represent the internal scores and the lookup table as float32. */
  LEVEL_0,
  /** Represent the internal scores as float32 and the lookup table as float16. */
  LEVEL_1,
  /** Represent the internal scores as float16 and the lookup table as float8. */
  LEVEL_2
}

This should give us another 60% reduction in number of instances.

…1249) Refactor `ivf_pq::index` to keep the cluster data in separate lists/buffers and reduce the number of template instantiations where possible. - Breaking change: the public structure `ivf_pq::index` is changed. - Partially addresses #1201: removing the `mdarray<IdxT, ...> list_offsets` member makes it possible to tweak the search kernel to remove the `IdxT` template parameter altogether and stick to `uint32_t` _sample_ indices, which are then backtraced to the database indices (`IdxT`) during the post-processing stage. - Partially addresses #1170 - Improves the index extending time: - No need to calculate offsets, reorder clusters, etc. - The individual lists are shared between multiple versions of the index and amortize the allocation/copy costs. - Small addition to the tests: split serialize/deserialize tests in a separate test cases to allow checking the index state in all possible scenarios (search after: building new, extending, deserializing). Authors: - Artem M. Chirkin (https://github.com/achirkin) - Corey J. Nolet (https://github.com/cjnolet) Approvers: - Tamas Bela Feher (https://github.com/tfeher) - Corey J. Nolet (https://github.com/cjnolet) URL: #1249

ahendriksen · 2023-02-23T15:22:20Z

On this branch, I have documented how to reduce the compile times of a specific translation unit (l1_float_float_float_int.cu) from 20s to 4.8s and potentially to 2.8s.

I think some of these techniques can be applied to RAFT.

ahendriksen · 2023-02-23T15:58:15Z

As a result of the investigation, I have opened an issue to consider removing dependency on spdlog: #1300

ahendriksen · 2023-02-28T18:17:54Z

Following up on the investigation I linked above, I have opened a PR #1307 that reduces the compilation times of pairwise distance instantiations (specializations) by up to 5x.

It requires quite some rearchitecting of the code and I am not sure if that is acceptable, but it shows that big wins are possible.

@teju85 , @tfeher for visibility.

cjnolet · 2023-02-28T20:05:44Z

@ahendriksen I've looked through your PR with the refactors to the pairwise distance impl details and I think it looks great so far from my side. I've held off on submitting formal feedback to it because I noticed a couple things marked TODO in the code still. Let me know if it's ready and I can review it more closely.

ahendriksen · 2023-03-01T12:56:41Z

The PR is ready for review.

cjnolet added the feature request New feature or request label Jan 27, 2023

cjnolet added this to VS/ML/DM Primitives Release Board Jan 27, 2023

cjnolet mentioned this issue Jan 30, 2023

[BUG] Inconsistent use of IVF-PQ specialized types #1194

Closed

achirkin mentioned this issue Feb 6, 2023

IVF-PQ: store cluster data in individual lists and reduce templates #1249

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEA] Investigating build times and number of specializations #1201

[FEA] Investigating build times and number of specializations #1201

cjnolet commented Jan 27, 2023 •

edited

Loading

achirkin commented Jan 28, 2023

ahendriksen commented Jan 30, 2023

achirkin commented Jan 30, 2023

ahendriksen commented Jan 30, 2023

ahendriksen commented Feb 2, 2023

ahendriksen commented Feb 2, 2023

tfeher commented Feb 2, 2023 •

edited

Loading

cjnolet commented Feb 3, 2023

cjnolet commented Feb 3, 2023 •

edited

Loading

ahendriksen commented Feb 3, 2023

cjnolet commented Feb 3, 2023

ahendriksen commented Feb 3, 2023

cjnolet commented Feb 3, 2023

achirkin commented Feb 6, 2023

ahendriksen commented Feb 23, 2023

ahendriksen commented Feb 23, 2023

ahendriksen commented Feb 28, 2023

cjnolet commented Feb 28, 2023

ahendriksen commented Mar 1, 2023

[FEA] Investigating build times and number of specializations #1201

[FEA] Investigating build times and number of specializations #1201

Comments

cjnolet commented Jan 27, 2023 • edited Loading

achirkin commented Jan 28, 2023

ahendriksen commented Jan 30, 2023

achirkin commented Jan 30, 2023

ahendriksen commented Jan 30, 2023

ahendriksen commented Feb 2, 2023

ahendriksen commented Feb 2, 2023

tfeher commented Feb 2, 2023 • edited Loading

cjnolet commented Feb 3, 2023

cjnolet commented Feb 3, 2023 • edited Loading

ahendriksen commented Feb 3, 2023

cjnolet commented Feb 3, 2023

ahendriksen commented Feb 3, 2023

cjnolet commented Feb 3, 2023

achirkin commented Feb 6, 2023

ahendriksen commented Feb 23, 2023

ahendriksen commented Feb 23, 2023

ahendriksen commented Feb 28, 2023

cjnolet commented Feb 28, 2023

ahendriksen commented Mar 1, 2023

cjnolet commented Jan 27, 2023 •

edited

Loading

tfeher commented Feb 2, 2023 •

edited

Loading

cjnolet commented Feb 3, 2023 •

edited

Loading