Commit 0de9ece: Integrate accumulate_into_selected from ANN utils into `linalg::reduce_rows_by_keys` (#909)

`accumulate_into_selected` achieves much better performance than the previous implementation of `reduce_rows_by_keys` for large `nkeys` (`sum_rows_by_key_large_nkeys_kernel_rowmajor`). According to the benchmark that I added for this primitive, the difference is a factor of 240x for sizes relevant to IVF-Flat (and a factor of ~10x for smaller `nkeys`, e.g. 64).

This is mostly because the legacy implementation, probably in an attempt to reduce atomic conflicts, assigned a key and a tile of the matrix to each block, and each block only reduced the rows matching its assigned key. With a very large number of keys, e.g. 1k, this means each block iterates over a large number of rows (possibly tens of thousands) while reading and accumulating only 1 in 1k of them.
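For intuition, here is a minimal serial host-side sketch (hypothetical code, not the actual device kernel) of the atomic-style accumulation that `accumulate_into_selected` performs: each input row is read exactly once and added into the output row selected by its key, so the total work is O(rows × cols) regardless of `nkeys`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Serial sketch of accumulation by key: every row is visited once and
// added into the output row chosen by its key. On the GPU the inner
// addition becomes an atomicAdd, but the access pattern is the same:
// no row is read and then discarded because of a key mismatch.
void accumulate_by_key(const std::vector<float>& in,
                       const std::vector<int>& keys,
                       std::size_t rows,
                       std::size_t cols,
                       std::vector<float>& out)  // nkeys * cols, zero-initialized
{
  for (std::size_t r = 0; r < rows; ++r) {
    float* out_row      = out.data() + static_cast<std::size_t>(keys[r]) * cols;
    const float* in_row = in.data() + r * cols;
    for (std::size_t c = 0; c < cols; ++c) {
      out_row[c] += in_row[c];  // atomicAdd on the device
    }
  }
}
```

By contrast, in the legacy scheme a block scans its whole tile and skips every row whose key does not match its assigned key, so with 1k keys roughly 1023 of every 1024 rows a block reads are thrown away.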

This PR:

- Replaces `sum_rows_by_key_large_nkeys_rowmajor` with `accumulate_into_selected` (I didn't find any cases in which the old kernel performed better).
- Removes `accumulate_into_selected` from `ann_utils.cuh`.
- Fixes support for custom iterators in `reduce_rows_by_keys`.
- Uses the raft prims in `calc_centers_and_sizes`.

Perf notes:

- The original kmeans gets a 15-20% speedup for large numbers of clusters.
- The performance of `ivf_flat::build` stays the same as before.
- Separating the cluster size count from the reduction by key adds a few extra steps, but their cost is negligible in comparison.
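The decomposition described above can be sketched serially as follows (hypothetical names; in the PR the first two steps map onto raft device primitives): a reduction by key for the per-cluster coordinate sums, a separate pass for the cluster sizes, then a normalization.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Serial sketch of the calc_centers_and_sizes decomposition:
//   1. reduce rows by key  -> per-cluster coordinate sums
//   2. separate size count -> points per cluster
//   3. normalization       -> divide each non-empty center by its size
void calc_centers_and_sizes_sketch(const std::vector<float>& points,  // n * dim
                                   const std::vector<int>& labels,    // n
                                   std::size_t n, std::size_t dim, std::size_t k,
                                   std::vector<float>& centers,  // k * dim, zeroed
                                   std::vector<int>& sizes)      // k, zeroed
{
  // Step 1: reduction by key over the rows.
  for (std::size_t i = 0; i < n; ++i)
    for (std::size_t d = 0; d < dim; ++d)
      centers[static_cast<std::size_t>(labels[i]) * dim + d] += points[i * dim + d];

  // Step 2: cluster size count, now a separate step.
  for (std::size_t i = 0; i < n; ++i) sizes[labels[i]]++;

  // Step 3: normalize non-empty clusters.
  for (std::size_t c = 0; c < k; ++c)
    if (sizes[c] > 0)
      for (std::size_t d = 0; d < dim; ++d) centers[c * dim + d] /= sizes[c];
}
```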

Question: this change breaks support for host-side-only arrays in `calc_centers_and_sizes`. Is that a realistic use case? Should I add a branch that avoids the raft prims when all arrays are host-side?

cc @achirkin @tfeher @cjnolet

Authors:
  - Louis Sugy (https://github.com/Nyrio)

Approvers:
  - Tamas Bela Feher (https://github.com/tfeher)
  - Corey J. Nolet (https://github.com/cjnolet)

URL: #909
Nyrio authored Oct 19, 2022
1 parent 7182819 commit 0de9ece
Showing 9 changed files with 350 additions and 276 deletions.
1 change: 1 addition & 0 deletions cpp/bench/CMakeLists.txt
```diff
@@ -95,6 +95,7 @@ if(BUILD_BENCH)
   bench/linalg/add.cu
   bench/linalg/map_then_reduce.cu
   bench/linalg/matrix_vector_op.cu
+  bench/linalg/reduce_rows_by_key.cu
   bench/linalg/reduce.cu
   bench/main.cpp
 )
```
3 changes: 2 additions & 1 deletion cpp/bench/common/benchmark.hpp
```diff
@@ -53,8 +53,9 @@ struct using_pool_memory_res {
     rmm::mr::set_current_device_resource(&pool_res_);
   }

-  using_pool_memory_res() : using_pool_memory_res(size_t(1) << size_t(30), size_t(16) << size_t(30))
+  using_pool_memory_res() : orig_res_(rmm::mr::get_current_device_resource()), pool_res_(&cuda_res_)
   {
+    rmm::mr::set_current_device_resource(&pool_res_);
   }

   ~using_pool_memory_res() { rmm::mr::set_current_device_resource(orig_res_); }
```
88 changes: 88 additions & 0 deletions cpp/bench/linalg/reduce_rows_by_key.cu
@@ -0,0 +1,88 @@
/*
* Copyright (c) 2022, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#include <common/benchmark.hpp>
#include <raft/linalg/reduce_rows_by_key.cuh>
#include <raft/random/rng.cuh>

#include <rmm/device_uvector.hpp>

namespace raft::bench::linalg {

struct rrbk_params {
  int64_t rows, cols;
  int64_t keys;
};

template <typename T, typename KeyT>
struct reduce_rows_by_key : public fixture {
  reduce_rows_by_key(const rrbk_params& p)
    : params(p),
      in(p.rows * p.cols, stream),
      out(p.keys * p.cols, stream),
      keys(p.rows, stream),
      workspace(p.rows, stream)
  {
    raft::random::RngState rng{42};
    raft::random::uniformInt(rng, keys.data(), p.rows, (KeyT)0, (KeyT)p.keys, stream);
  }

  void run_benchmark(::benchmark::State& state) override
  {
    loop_on_state(state, [this]() {
      raft::linalg::reduce_rows_by_key(in.data(),
                                       params.cols,
                                       keys.data(),
                                       workspace.data(),
                                       params.rows,
                                       params.cols,
                                       params.keys,
                                       out.data(),
                                       stream,
                                       false);
    });
  }

 protected:
  rrbk_params params;
  rmm::device_uvector<T> in, out;
  rmm::device_uvector<KeyT> keys;
  rmm::device_uvector<char> workspace;
};  // struct reduce_rows_by_key

const std::vector<rrbk_params> kInputSizes{
  {10000, 128, 64},
  {100000, 128, 64},
  {1000000, 128, 64},
  {10000000, 128, 64},
  {10000, 128, 256},
  {100000, 128, 256},
  {1000000, 128, 256},
  {10000000, 128, 256},
  {10000, 128, 1024},
  {100000, 128, 1024},
  {1000000, 128, 1024},
  {10000000, 128, 1024},
  {10000, 128, 4096},
  {100000, 128, 4096},
  {1000000, 128, 4096},
  {10000000, 128, 4096},
};

RAFT_BENCH_REGISTER((reduce_rows_by_key<float, uint32_t>), "", kInputSizes);
RAFT_BENCH_REGISTER((reduce_rows_by_key<double, uint32_t>), "", kInputSizes);

} // namespace raft::bench::linalg