diff --git a/DEVELOPER_GUIDE.md b/DEVELOPER_GUIDE.md index 5c1e122525..e1dd682fd9 100644 --- a/DEVELOPER_GUIDE.md +++ b/DEVELOPER_GUIDE.md @@ -4,7 +4,7 @@ Devloping features and fixing bugs for the RAFT library itself is straightforward and only requires building and installing the relevant RAFT artifacts. -The process for working on a CUDA/C++ feature which spans RAFT and one or more consumers can vary slightly depending on whether the consuming project relies on a source build (as outlined in the [BUILD](BUILD.md#install_header_only_cpp) docs). In such a case, the option `CPM_raft_SOURCE=/path/to/raft/source` can be passed to the cmake of the consuming project in order to build the local RAFT from source. The PR with relevant changes to the consuming project can also pin the RAFT version temporarily by explicitly changing the `FORK` and `PINNED_TAG` arguments to the RAFT branch containing their changes when invoking `find_and_configure_raft`. The pin should be reverted after the changed is merged to the RAFT project and before it is merged to the dependent project(s) downstream. +The process for working on a CUDA/C++ feature which might span RAFT and one or more consuming libraries can vary slightly depending on whether the consuming project relies on a source build (as outlined in the [BUILD](BUILD.md#install_header_only_cpp) docs). In such a case, the option `CPM_raft_SOURCE=/path/to/raft/source` can be passed to the cmake of the consuming project in order to build the local RAFT from source. The PR with relevant changes to the consuming project can also pin the RAFT version temporarily by explicitly changing the `FORK` and `PINNED_TAG` arguments to the RAFT branch containing their changes when invoking `find_and_configure_raft`. The pin should be reverted after the change is merged to the RAFT project and before it is merged to the dependent project(s) downstream. 
If building a feature which spans projects and not using the source build in cmake, the RAFT changes (both C++ and Python) will need to be installed into the environment of the consuming project before they can be used. The ideal integration of RAFT into consuming projects will enable both the source build in the consuming project only for this case but also rely on a more stable packaging (such as conda packaging) otherwise. @@ -14,6 +14,16 @@ Since RAFT is a core library with multiple consumers, it's important that the pu The public APIs should be lightweight wrappers around calls to private APIs inside the `detail` namespace. +## Common Design Considerations + +1. Use the `hpp` extension for files which can be compiled with `gcc` against the CUDA runtime. Use the `cuh` extension for files which require `nvcc` to be compiled. `hpp` can also be used for functions marked `__host__ __device__`, but only if proper checks are in place to remove the `__device__` designation when not compiling with `nvcc`. + +2. When additional classes, structs, or general POCO types are needed to represent data in the public API, place them in a new file whose name ends in `_types.hpp`. This tells users it is safe to expose these types on their own public APIs without bringing in device code. At a minimum, the definitions for these types should not require `nvcc`. In general, these classes should only store very simple state and should not perform their own computations. Instead, new functions should be exposed on the public API which accept these objects, reading or updating their state as necessary. + +3. Public APIs should be well documented and easy to use, and it is highly preferred that their documentation include usage instructions. + +4. Before creating a new primitive, check to see if one exists already. If one exists but the API isn't flexible enough to include your use-case, consider first refactoring the existing primitive. 
If that is not possible without an extreme number of changes, consider how the public API could be made more flexible. If the new primitive is different enough from all existing primitives, consider whether an existing public API could invoke the new primitive as an option or argument. If the new primitive is different enough from what exists already, add a header for the new public API function to the appropriate subdirectory and namespace. + ## Testing It's important for RAFT to maintain a high test coverage in order to minimize the potential for downstream projects to encounter unexpected build or runtime behavior as a result of changes. A well-defined public API can help maintain compile-time stability but means more focus should be placed on testing the functional requirements and verifying execution on the various edge cases within RAFT itself. Ideally, bug fixes and new features should be able to be made to RAFT independently of the consuming projects. diff --git a/cpp/doxygen/Doxyfile.in b/cpp/doxygen/Doxyfile.in index 6f29e79146..549862600a 100644 --- a/cpp/doxygen/Doxyfile.in +++ b/cpp/doxygen/Doxyfile.in @@ -880,7 +880,27 @@ RECURSIVE = YES # run. 
EXCLUDE = @CMAKE_CURRENT_SOURCE_DIR@/include/raft/sparse/linalg/symmetrize.hpp \ - \ + @CMAKE_CURRENT_SOURCE_DIR@/include/raft/cache \ + @CMAKE_CURRENT_SOURCE_DIR@/include/raft/common \ + @CMAKE_CURRENT_SOURCE_DIR@/include/raft/lap \ + @CMAKE_CURRENT_SOURCE_DIR@/include/raft/sparse/selection \ + @CMAKE_CURRENT_SOURCE_DIR@/include/raft/sparse/csr.hpp \ + @CMAKE_CURRENT_SOURCE_DIR@/include/raft/linalg/lanczos.cuh \ + @CMAKE_CURRENT_SOURCE_DIR@/include/raft/linalg/lanczos.hpp \ + @CMAKE_CURRENT_SOURCE_DIR@/include/raft/cuda_utils.cuh \ + @CMAKE_CURRENT_SOURCE_DIR@/include/raft/cudart_utils.h \ + @CMAKE_CURRENT_SOURCE_DIR@/include/raft/util/device_atomics.cuh \ + @CMAKE_CURRENT_SOURCE_DIR@/include/raft/device_utils.cuh \ + @CMAKE_CURRENT_SOURCE_DIR@/include/raft/error.hpp \ + @CMAKE_CURRENT_SOURCE_DIR@/include/raft/handle.hpp \ + @CMAKE_CURRENT_SOURCE_DIR@/include/raft/integer_utils.h \ + @CMAKE_CURRENT_SOURCE_DIR@/include/raft/interruptible.hpp \ + @CMAKE_CURRENT_SOURCE_DIR@/include/raft/mdarray.hpp \ + @CMAKE_CURRENT_SOURCE_DIR@/include/raft/pow2_utils.cuh \ + @CMAKE_CURRENT_SOURCE_DIR@/include/raft/span.hpp \ + @CMAKE_CURRENT_SOURCE_DIR@/include/raft/vectorized.cuh \ + @CMAKE_CURRENT_SOURCE_DIR@/include/raft/raft.hpp \ + @CMAKE_CURRENT_SOURCE_DIR@/include/raft/core/cudart_utils.hpp # The EXCLUDE_SYMLINKS tag can be used to select whether or not files or # directories that are symbolic links (a Unix file system feature) are excluded diff --git a/cpp/include/raft.hpp b/cpp/include/raft.hpp index b1b8255b7e..f77d030a2d 100644 --- a/cpp/include/raft.hpp +++ b/cpp/include/raft.hpp @@ -17,7 +17,7 @@ /** * This file is deprecated and will be removed in release 22.06. 
*/ -#include "raft/handle.hpp" +#include "raft/core/handle.hpp" #include "raft/mdarray.hpp" #include "raft/span.hpp" diff --git a/cpp/include/raft/cache/cache_util.cuh b/cpp/include/raft/cache/cache_util.cuh index 3e2222eff1..60da09ca7c 100644 --- a/cpp/include/raft/cache/cache_util.cuh +++ b/cpp/include/raft/cache/cache_util.cuh @@ -1,5 +1,5 @@ /* - * Copyright (c) 2019-2022, NVIDIA CORPORATION. + * Copyright (c) 2020-2022, NVIDIA CORPORATION. * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -13,356 +13,19 @@ * See the License for the specific language governing permissions and * limitations under the License. */ - -#pragma once - -#include -#include - -namespace raft { -namespace cache { - -/** - * @brief Collect vectors of data from the cache into a contiguous memory buffer. - * - * We assume contiguous memory layout for the output buffer, i.e. we get - * column vectors into a column major out buffer, or row vectors into a row - * major output buffer. - * - * On exit, the output array is filled the following way: - * out[i + n_vec*k] = cache[i + n_vec * cache_idx[k]]), where i=0..n_vec-1, and - * k = 0..n-1 where cache_idx[k] >= 0 - * - * We ignore vectors where cache_idx[k] < 0. 
- * - * @param [in] cache stores the cached data, size [n_vec x n_cached_vectors] - * @param [in] n_vec number of elements in a cached vector - * @param [in] cache_idx cache indices, size [n] - * @param [in] n the number of elements that need to be collected - * @param [out] out vectors collected from the cache, size [n_vec * n] - */ -template -__global__ void get_vecs( - const math_t* cache, int_t n_vec, const idx_t* cache_idx, int_t n, math_t* out) -{ - int tid = threadIdx.x + blockIdx.x * blockDim.x; - int row = tid % n_vec; // row idx - if (tid < n_vec * n) { - size_t out_col = tid / n_vec; // col idx - size_t cache_col = cache_idx[out_col]; - if (cache_idx[out_col] >= 0) { - if (row + out_col * n_vec < (size_t)n_vec * n) { out[tid] = cache[row + cache_col * n_vec]; } - } - } -} - -/** - * @brief Store vectors of data into the cache. - * - * Elements within a vector should be contiguous in memory (i.e. column vectors - * for column major data storage, or row vectors of row major data). - * - * If tile_idx==nullptr then the operation is the opposite of get_vecs, - * i.e. we store - * cache[i + cache_idx[k]*n_vec] = tile[i + k*n_vec], for i=0..n_vec-1, k=0..n-1 - * - * If tile_idx != nullptr, then we permute the vectors from tile according - * to tile_idx. 
This allows to store vectors from a buffer where the individual - * vectors are not stored contiguously (but the elements of each vector shall - * be contiguous): - * cache[i + cache_idx[k]*n_vec] = tile[i + tile_idx[k]*n_vec], - * for i=0..n_vec-1, k=0..n-1 - * - * @param [in] tile stores the data to be cashed cached, size [n_vec x n_tile] - * @param [in] n_tile number of vectors in the input tile - * @param [in] n_vec number of elements in a cached vector - * @param [in] tile_idx indices of vectors that we want to store - * @param [in] n number of vectos that we want to store (n <= n_tile) - * @param [in] cache_idx cache indices, size [n], negative values are ignored - * @param [inout] cache updated cache - * @param [in] n_cache_vecs - */ -template -__global__ void store_vecs(const math_t* tile, - int n_tile, - int n_vec, - const int* tile_idx, - int n, - const int* cache_idx, - math_t* cache, - int n_cache_vecs) -{ - int tid = threadIdx.x + blockIdx.x * blockDim.x; - int row = tid % n_vec; // row idx - if (tid < n_vec * n) { - int tile_col = tid / n_vec; // col idx - int data_col = tile_idx ? tile_idx[tile_col] : tile_col; - int cache_col = cache_idx[tile_col]; - - // We ignore negative values. The rest of the checks should be fulfilled - // if the cache is used properly - if (cache_col >= 0 && cache_col < n_cache_vecs && data_col < n_tile) { - cache[row + (size_t)cache_col * n_vec] = tile[row + (size_t)data_col * n_vec]; - } - } -} - -/** - * @brief Map a key to a cache set. - * - * @param key key to be hashed - * @param n_cache_sets number of cache sets - * @return index of the cache set [0..n_cache_set) - */ -int DI hash(int key, int n_cache_sets) { return key % n_cache_sets; } - -/** - * @brief Binary search to find the first element in the array which is greater - * equal than a given value. 
- * @param [in] array sorted array of n numbers - * @param [in] n length of the array - * @param [in] val the value to search for - * @return the index of the first element in the array for which - * array[idx] >= value. If there is no such value, then return n. - */ -int DI arg_first_ge(const int* array, int n, int val) -{ - int start = 0; - int end = n - 1; - if (array[0] == val) return 0; - if (array[end] < val) return n; - while (start + 1 < end) { - int q = (start + end + 1) / 2; - // invariants: - // start < end - // start < q <=end - // array[start] < val && array[end] <=val - // at every iteration d = end-start is decreasing - // when d==0, then array[end] will be the first element >= val. - if (array[q] >= val) { - end = q; - } else { - start = q; - } - } - return end; -} -/** - * @brief Find the k-th occurrence of value in a sorted array. - * - * Assume that array is [0, 1, 1, 1, 2, 2, 4, 4, 4, 4, 6, 7] - * then find_nth_occurrence(cset, 12, 4, 2) == 7, because cset_array[7] stores - * the second element with value = 4. - * If there are less than k values in the array, then return -1 - * - * @param [in] array sorted array of numbers, size [n] - * @param [in] n number of elements in the array - * @param [in] val the value we are searching for - * @param [in] k - * @return the idx of the k-th occurance of val in array, or -1 if - * the value is not found. - */ -int DI find_nth_occurrence(const int* array, int n, int val, int k) -{ - int q = arg_first_ge(array, n, val); - if (q + k < n && array[q + k] == val) { - q += k; - } else { - q = -1; - } - return q; -} - /** - * @brief Rank the entries in a cache set according to the time stamp, return - * the indices that would sort the time stamp in ascending order. 
- * - * Assume we have a single cache set with time stamps as: - * key (threadIdx.x): 0 1 2 3 - * val (time stamp): 8 6 7 5 - * - * The corresponding sorted key-value pairs: - * key: 3 1 2 0 - * val: 5 6 7 8 - * rank: 0th 1st 2nd 3rd - * - * On return, the rank is assigned for each thread: - * threadIdx.x: 0 1 2 3 - * rank: 3 1 2 0 - * - * For multiple cache sets, launch one block per cache set. - * - * @tparam nthreads number of threads per block (nthreads <= associativity) - * @tparam associativity number of items in a cache set - * - * @param [in] cache_time time stamp of caching the data, - size [associativity * n_cache_sets] - * @param [in] n_cache_sets number of cache sets - * @param [out] rank within the cache set size [nthreads * items_per_thread] - * Each block should give a different pointer for rank. + * This file is deprecated and will be removed in release 22.06. + * Please use the raft/util version instead. */ -template -DI void rank_set_entries(const int* cache_time, int n_cache_sets, int* rank) -{ - const int items_per_thread = raft::ceildiv(associativity, nthreads); - typedef cub::BlockRadixSort BlockRadixSort; - __shared__ typename BlockRadixSort::TempStorage temp_storage; - - int key[items_per_thread]; - int val[items_per_thread]; - - int block_offset = blockIdx.x * associativity; - - for (int j = 0; j < items_per_thread; j++) { - int k = threadIdx.x + j * nthreads; - int t = (k < associativity) ? cache_time[block_offset + k] : 32768; - key[j] = t; - val[j] = k; - } - - BlockRadixSort(temp_storage).Sort(key, val); - - for (int j = 0; j < items_per_thread; j++) { - if (val[j] < associativity) { rank[val[j]] = threadIdx.x * items_per_thread + j; } - } - __syncthreads(); -} /** - * @brief Assign cache location to a set of keys using LRU replacement policy. - * - * The keys and the corresponding cache_set arrays shall be sorted according - * to cache_set in ascending order. One block should be launched for every cache - * set. 
- * - * Each cache set is sorted according to time_stamp, and values from keys - * are filled in starting at the oldest time stamp. Entries that were accessed - * at the current time are not reassigned. - * - * @tparam nthreads number of threads per block - * @tparam associativity number of keys in a cache set - * - * @param [in] keys that we want to cache size [n] - * @param [in] n number of keys - * @param [in] cache_set assigned to keys, size [n] - * @param [inout] cached_keys keys of already cached vectors, - * size [n_cache_sets*associativity], on exit it will be updated with the - * cached elements from keys. - * @param [in] n_cache_sets number of cache sets - * @param [inout] cache_time will be updated to "time" for those elements that - * could be assigned to a cache location, size [n_cache_sets*associativity] - * @param [in] time time stamp - * @param [out] cache_idx the cache idx assigned to the input, or -1 if it could - * not be cached, size [n] + * DISCLAIMER: this file is deprecated: use the raft/util version instead */ -template -__global__ void assign_cache_idx(const int* keys, - int n, - const int* cache_set, - int* cached_keys, - int n_cache_sets, - int* cache_time, - int time, - int* cache_idx) -{ - int block_offset = blockIdx.x * associativity; - - const int items_per_thread = raft::ceildiv(associativity, nthreads); - - // the size of rank limits how large associativity can be used in practice - __shared__ int rank[items_per_thread * nthreads]; - rank_set_entries(cache_time, n_cache_sets, rank); - - // Each thread will fill items_per_thread items in the cache. - // It uses a place, only if it was not updated at the current time step - // (cache_time != time). - // We rank the places according to the time stamp, least recently used - // elements come to the front. - // We fill the least recently used elements with the working set. - // there might be elements which cannot be assigned to cache loc. - // these elements are assigned -1. 
- for (int j = 0; j < items_per_thread; j++) { - int i = threadIdx.x + j * nthreads; - int t_idx = block_offset + i; - bool mask = (i < associativity); - // whether this slot is available for writing - mask = mask && (cache_time[t_idx] != time); +#pragma once - // rank[i] tells which element to store by this thread - // we look up where is the corresponding key stored in the input array - if (mask) { - int k = find_nth_occurrence(cache_set, n, blockIdx.x, rank[i]); - if (k > -1) { - int key_val = keys[k]; - cached_keys[t_idx] = key_val; - cache_idx[k] = t_idx; - cache_time[t_idx] = time; - } - } - } -} +#pragma message(__FILE__ \ + " is deprecated and will be removed in a future release." \ + " Please use the raft/util version instead.") -/* Unnamed namespace is used to avoid multiple definition error for the - following non-template function */ -namespace { -/** - * @brief Get the cache indices for keys stored in the cache. - * - * For every key, we look up the corresponding cache position. - * If keys[k] is stored in the cache, then is_cached[k] is set to true, and - * cache_idx[k] stores the corresponding cache idx. - * - * If keys[k] is not stored in the cache, then we assign a cache set to it. - * This cache set is stored in cache_idx[k], and is_cached[k] is set to false. - * In this case AssignCacheIdx should be called, to get an assigned position - * within the cache set. - * - * Cache_time is assigned to the time input argument for all elements in idx. 
- * - * @param [in] keys array of keys that we want to look up in the cache, size [n] - * @param [in] n number of keys to look up - * @param [inout] cached_keys keys stored in the cache, size [n_cache_sets * associativity] - * @param [in] n_cache_sets number of cache sets - * @param [in] associativity number of keys in cache set - * @param [inout] cache_time time stamp when the indices were cached, size [n_cache_sets * - * associativity] - * @param [out] cache_idx cache indices of the working set elements, size [n] - * @param [out] is_cached whether the element is cached size[n] - * @param [in] time iteration counter (used for time stamping) - */ -__global__ void get_cache_idx(int* keys, - int n, - int* cached_keys, - int n_cache_sets, - int associativity, - int* cache_time, - int* cache_idx, - bool* is_cached, - int time) -{ - int tid = threadIdx.x + blockIdx.x * blockDim.x; - if (tid < n) { - int widx = keys[tid]; - int sidx = hash(widx, n_cache_sets); - int cidx = sidx * associativity; - int i = 0; - bool found = false; - // search for empty spot and the least recently used spot - while (i < associativity && !found) { - found = (cache_time[cidx + i] > 0 && cached_keys[cidx + i] == widx); - i++; - } - is_cached[tid] = found; - if (found) { - cidx = cidx + i - 1; - cache_time[cidx] = time; // update time stamp - cache_idx[tid] = cidx; // exact cache idx - } else { - cache_idx[tid] = sidx; // assign cache set - } - } -} -}; // end unnamed namespace -}; // namespace cache -}; // namespace raft +#include diff --git a/cpp/include/raft/sparse/hierarchy/detail/agglomerative.cuh b/cpp/include/raft/cluster/detail/agglomerative.cuh similarity index 97% rename from cpp/include/raft/sparse/hierarchy/detail/agglomerative.cuh rename to cpp/include/raft/cluster/detail/agglomerative.cuh index c8a1eb8304..618f852bba 100644 --- a/cpp/include/raft/sparse/hierarchy/detail/agglomerative.cuh +++ b/cpp/include/raft/cluster/detail/agglomerative.cuh @@ -16,9 +16,9 @@ #pragma once 
-#include -#include -#include +#include +#include +#include #include @@ -35,11 +35,7 @@ #include -namespace raft { - -namespace hierarchy { -namespace detail { - +namespace raft::cluster::detail { template class UnionFind { public: @@ -329,6 +325,4 @@ void extract_flattened_clusters(const raft::handle_t& handle, } } -}; // namespace detail -}; // namespace hierarchy -}; // namespace raft +}; // namespace raft::cluster::detail diff --git a/cpp/include/raft/sparse/hierarchy/detail/connectivities.cuh b/cpp/include/raft/cluster/detail/connectivities.cuh similarity index 86% rename from cpp/include/raft/sparse/hierarchy/detail/connectivities.cuh rename to cpp/include/raft/cluster/detail/connectivities.cuh index f56366f21f..da8adf783d 100644 --- a/cpp/include/raft/sparse/hierarchy/detail/connectivities.cuh +++ b/cpp/include/raft/cluster/detail/connectivities.cuh @@ -16,18 +16,18 @@ #pragma once -#include -#include -#include +#include +#include +#include #include #include -#include +#include +#include #include #include -#include -#include +#include #include #include @@ -35,11 +35,9 @@ #include -namespace raft { -namespace hierarchy { -namespace detail { +namespace raft::cluster::detail { -template +template struct distance_graph_impl { void run(const raft::handle_t& handle, const value_t* X, @@ -58,7 +56,7 @@ struct distance_graph_impl { * @tparam value_t */ template -struct distance_graph_impl { +struct distance_graph_impl { void run(const raft::handle_t& handle, const value_t* X, size_t m, @@ -75,7 +73,7 @@ struct distance_graph_impl knn_graph_coo(stream); - raft::sparse::selection::knn_graph(handle, X, m, n, metric, knn_graph_coo, c); + raft::sparse::spatial::knn_graph(handle, X, m, n, metric, knn_graph_coo, c); indices.resize(knn_graph_coo.nnz, stream); data.resize(knn_graph_coo.nnz, stream); @@ -121,7 +119,7 @@ struct distance_graph_impl +template void get_distance_graph(const raft::handle_t& handle, const value_t* X, size_t m, @@ -140,6 +138,4 @@ void 
get_distance_graph(const raft::handle_t& handle, dist_graph.run(handle, X, m, n, metric, indptr, indices, data, c); } -}; // namespace detail -}; // namespace hierarchy -}; // namespace raft +}; // namespace raft::cluster::detail diff --git a/cpp/include/raft/cluster/detail/kmeans.cuh b/cpp/include/raft/cluster/detail/kmeans.cuh index 303de77078..ba646e8e3f 100644 --- a/cpp/include/raft/cluster/detail/kmeans.cuh +++ b/cpp/include/raft/cluster/detail/kmeans.cuh @@ -27,19 +27,19 @@ #include #include -#include +#include #include #include #include #include -#include -#include +#include #include #include #include #include #include #include +#include #include #include diff --git a/cpp/include/raft/cluster/detail/kmeans_common.cuh b/cpp/include/raft/cluster/detail/kmeans_common.cuh index 358c8ce16e..4c50ea2623 100644 --- a/cpp/include/raft/cluster/detail/kmeans_common.cuh +++ b/cpp/include/raft/cluster/detail/kmeans_common.cuh @@ -27,14 +27,13 @@ #include #include -#include +#include #include #include #include #include -#include #include -#include +#include #include #include #include @@ -42,6 +41,7 @@ #include #include #include +#include #include #include diff --git a/cpp/include/raft/cluster/detail/kmeans_deprecated.cuh b/cpp/include/raft/cluster/detail/kmeans_deprecated.cuh index d57fd5254a..2746b6f657 100644 --- a/cpp/include/raft/cluster/detail/kmeans_deprecated.cuh +++ b/cpp/include/raft/cluster/detail/kmeans_deprecated.cuh @@ -42,13 +42,13 @@ #include #include -#include -#include -#include -#include +#include #include #include #include +#include +#include +#include namespace raft { namespace cluster { diff --git a/cpp/include/raft/sparse/hierarchy/detail/mst.cuh b/cpp/include/raft/cluster/detail/mst.cuh similarity index 86% rename from cpp/include/raft/sparse/hierarchy/detail/mst.cuh rename to cpp/include/raft/cluster/detail/mst.cuh index 545a371850..67935d4623 100644 --- a/cpp/include/raft/sparse/hierarchy/detail/mst.cuh +++ 
b/cpp/include/raft/cluster/detail/mst.cuh @@ -16,25 +16,23 @@ #pragma once -#include -#include +#include +#include -#include #include -#include +#include +#include #include #include #include #include -namespace raft { -namespace hierarchy { -namespace detail { +namespace raft::cluster::detail { template -void merge_msts(raft::Graph_COO& coo1, - raft::Graph_COO& coo2, +void merge_msts(sparse::solver::Graph_COO& coo1, + sparse::solver::Graph_COO& coo2, cudaStream_t stream) { /** Add edges to existing mst **/ @@ -71,7 +69,7 @@ template void connect_knn_graph( const raft::handle_t& handle, const value_t* X, - raft::Graph_COO& msf, + sparse::solver::Graph_COO& msf, size_t m, size_t n, value_idx* color, @@ -82,7 +80,7 @@ void connect_knn_graph( raft::sparse::COO connected_edges(stream); - raft::linkage::connect_components( + raft::sparse::spatial::connect_components( handle, connected_edges, X, color, m, n, reduction_op); rmm::device_uvector indptr2(m + 1, stream); @@ -91,16 +89,17 @@ void connect_knn_graph( // On the second call, we hand the MST the original colors // and the new set of edges and let it restart the optimization process - auto new_mst = raft::mst::mst(handle, - indptr2.data(), - connected_edges.cols(), - connected_edges.vals(), - m, - connected_edges.nnz, - color, - stream, - false, - false); + auto new_mst = + raft::sparse::solver::mst(handle, + indptr2.data(), + connected_edges.cols(), + connected_edges.vals(), + m, + connected_edges.nnz, + color, + stream, + false, + false); merge_msts(msf, new_mst, stream); } @@ -150,18 +149,18 @@ void build_sorted_mst( auto stream = handle.get_stream(); // We want to have MST initialize colors on first call. 
- auto mst_coo = raft::mst::mst( + auto mst_coo = raft::sparse::solver::mst( handle, indptr, indices, pw_dists, (value_idx)m, nnz, color, stream, false, true); int iters = 1; - int n_components = linkage::get_n_components(color, m, stream); + int n_components = raft::sparse::spatial::get_n_components(color, m, stream); while (n_components > 1 && iters < max_iter) { connect_knn_graph(handle, X, mst_coo, m, n, color, reduction_op); iters++; - n_components = linkage::get_n_components(color, m, stream); + n_components = raft::sparse::spatial::get_n_components(color, m, stream); } /** @@ -192,6 +191,4 @@ void build_sorted_mst( raft::copy_async(mst_weight, mst_coo.weights.data(), mst_coo.n_edges, stream); } -}; // namespace detail -}; // namespace hierarchy -}; // namespace raft +}; // namespace raft::cluster::detail diff --git a/cpp/include/raft/sparse/hierarchy/detail/single_linkage.cuh b/cpp/include/raft/cluster/detail/single_linkage.cuh similarity index 91% rename from cpp/include/raft/sparse/hierarchy/detail/single_linkage.cuh rename to cpp/include/raft/cluster/detail/single_linkage.cuh index 4e94b6f65d..7de942444e 100644 --- a/cpp/include/raft/sparse/hierarchy/detail/single_linkage.cuh +++ b/cpp/include/raft/cluster/detail/single_linkage.cuh @@ -16,17 +16,15 @@ #pragma once -#include +#include #include -#include -#include -#include -#include +#include +#include +#include +#include -namespace raft { -namespace hierarchy { -namespace detail { +namespace raft::cluster::detail { static const size_t EMPTY = 0; @@ -82,7 +80,7 @@ void single_linkage(const raft::handle_t& handle, * 2. 
Construct MST, sorted by weights */ rmm::device_uvector color(m, stream); - raft::linkage::FixConnectivitiesRedOp op(color.data(), m); + raft::sparse::spatial::FixConnectivitiesRedOp op(color.data(), m); detail::build_sorted_mst(handle, X, indptr.data(), @@ -123,6 +121,4 @@ void single_linkage(const raft::handle_t& handle, out->n_leaves = m; out->n_connected_components = 1; } -}; // namespace detail -}; // namespace hierarchy -}; // namespace raft \ No newline at end of file +}; // namespace raft::cluster::detail \ No newline at end of file diff --git a/cpp/include/raft/cluster/kmeans.cuh b/cpp/include/raft/cluster/kmeans.cuh index d46f53d9c1..539fc33c40 100644 --- a/cpp/include/raft/cluster/kmeans.cuh +++ b/cpp/include/raft/cluster/kmeans.cuh @@ -17,12 +17,10 @@ #include #include -#include +#include #include -namespace raft { -namespace cluster { - +namespace raft::cluster { /** * @brief Find clusters with k-means algorithm. * Initial centroids are chosen with k-means++ algorithm. Empty @@ -488,5 +486,4 @@ void kmeans_fit_main(const raft::handle_t& handle, detail::kmeans_fit_main( handle, params, X, weight, centroidsRawData, inertia, n_iter, workspace); } -} // namespace cluster -} // namespace raft +} // namespace raft::cluster diff --git a/cpp/include/raft/cluster/kmeans_params.hpp b/cpp/include/raft/cluster/kmeans_params.hpp index 70ea49d36d..433e32f5ff 100644 --- a/cpp/include/raft/cluster/kmeans_params.hpp +++ b/cpp/include/raft/cluster/kmeans_params.hpp @@ -1,5 +1,5 @@ /* - * Copyright (c) 2022, NVIDIA CORPORATION. + * Copyright (c) 2020-2022, NVIDIA CORPORATION. * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -13,61 +13,19 @@ * See the License for the specific language governing permissions and * limitations under the License. 
*/ -#pragma once -#include -#include -#include - -namespace raft { -namespace cluster { - -struct KMeansParams { - enum InitMethod { KMeansPlusPlus, Random, Array }; - - // The number of clusters to form as well as the number of centroids to - // generate (default:8). - int n_clusters = 8; - - /* - * Method for initialization, defaults to k-means++: - * - InitMethod::KMeansPlusPlus (k-means++): Use scalable k-means++ algorithm - * to select the initial cluster centers. - * - InitMethod::Random (random): Choose 'n_clusters' observations (rows) at - * random from the input data for the initial centroids. - * - InitMethod::Array (ndarray): Use 'centroids' as initial cluster centers. - */ - InitMethod init = KMeansPlusPlus; - - // Maximum number of iterations of the k-means algorithm for a single run. - int max_iter = 300; - - // Relative tolerance with regards to inertia to declare convergence. - double tol = 1e-4; - - // verbosity level. - int verbosity = RAFT_LEVEL_INFO; - - // Seed to the random number generator. - raft::random::RngState rng_state = - raft::random::RngState(0, raft::random::GeneratorType::GenPhilox); - - // Metric to use for distance computation. - raft::distance::DistanceType metric = raft::distance::DistanceType::L2Expanded; +/** + * This file is deprecated and will be removed in release 22.06. + * Please use raft/cluster/kmeans_types.hpp instead. + */ - // Number of instance k-means algorithm will be run with different seeds. - int n_init = 1; +/** + * DISCLAIMER: this file is deprecated: use kmeans_types.hpp instead + */ - // Oversampling factor for use in the k-means|| algorithm. - double oversampling_factor = 2.0; +#pragma once - // batch_samples and batch_centroids are used to tile 1NN computation which is - // useful to optimize/control the memory footprint - // Default tile is [batch_samples x n_clusters] i.e. 
when batch_centroids is 0 - // then don't tile the centroids - int batch_samples = 1 << 15; - int batch_centroids = 0; // if 0 then batch_centroids = n_clusters +#pragma message(__FILE__ \ + " is deprecated and will be removed in a future release." \ + " Please use the raft/cluster/kmeans_types.hpp version instead.") - bool inertia_check = false; -}; -} // namespace cluster -} // namespace raft +#include diff --git a/cpp/include/raft/cluster/kmeans_types.hpp b/cpp/include/raft/cluster/kmeans_types.hpp new file mode 100644 index 0000000000..87fc7c1880 --- /dev/null +++ b/cpp/include/raft/cluster/kmeans_types.hpp @@ -0,0 +1,73 @@ +/* + * Copyright (c) 2022, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +#pragma once +#include +#include +#include + +namespace raft { +namespace cluster { + +struct KMeansParams { + enum InitMethod { KMeansPlusPlus, Random, Array }; + + // The number of clusters to form as well as the number of centroids to + // generate (default:8). + int n_clusters = 8; + + /* + * Method for initialization, defaults to k-means++: + * - InitMethod::KMeansPlusPlus (k-means++): Use scalable k-means++ algorithm + * to select the initial cluster centers. + * - InitMethod::Random (random): Choose 'n_clusters' observations (rows) at + * random from the input data for the initial centroids. + * - InitMethod::Array (ndarray): Use 'centroids' as initial cluster centers. 
+ */ + InitMethod init = KMeansPlusPlus; + + // Maximum number of iterations of the k-means algorithm for a single run. + int max_iter = 300; + + // Relative tolerance with regard to inertia to declare convergence. + double tol = 1e-4; + + // Verbosity level. + int verbosity = RAFT_LEVEL_INFO; + + // Seed to the random number generator. + raft::random::RngState rng_state = + raft::random::RngState(0, raft::random::GeneratorType::GenPhilox); + + // Metric to use for distance computation. + raft::distance::DistanceType metric = raft::distance::DistanceType::L2Expanded; + + // Number of times the k-means algorithm will be run with different seeds. + int n_init = 1; + + // Oversampling factor for use in the k-means|| algorithm. + double oversampling_factor = 2.0; + + // batch_samples and batch_centroids are used to tile 1NN computation which is + // useful to optimize/control the memory footprint + // Default tile is [batch_samples x n_clusters] i.e. when batch_centroids is 0 + // then don't tile the centroids + int batch_samples = 1 << 15; + int batch_centroids = 0; // if 0 then batch_centroids = n_clusters + + bool inertia_check = false; +}; +} // namespace cluster +} // namespace raft diff --git a/cpp/include/raft/cluster/single_linkage.cuh b/cpp/include/raft/cluster/single_linkage.cuh new file mode 100644 index 0000000000..98735c74e4 --- /dev/null +++ b/cpp/include/raft/cluster/single_linkage.cuh @@ -0,0 +1,58 @@ +/* + * Copyright (c) 2021-2022, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and + * limitations under the License. + */ +#pragma once + +#include +#include + +namespace raft::cluster { + +/** + * Single-linkage clustering, capable of constructing a KNN graph to + * scale the algorithm beyond the n^2 memory consumption of implementations + * that use the fully-connected graph of pairwise distances by connecting + * a knn graph when k is not large enough to connect it. + + * @tparam value_idx + * @tparam value_t + * @tparam dist_type method to use for constructing connectivities graph + * @param[in] handle raft handle + * @param[in] X dense input matrix in row-major layout + * @param[in] m number of rows in X + * @param[in] n number of columns in X + * @param[in] metric distance metric to use when constructing connectivities graph + * @param[out] out struct containing output dendrogram and cluster assignments + * @param[in] c a constant used when constructing connectivities from knn graph. Allows the indirect + control + * of k. The algorithm will set `k = log(n) + c` + * @param[in] n_clusters number of clusters to assign data samples + */ +template +void single_linkage(const raft::handle_t& handle, + const value_t* X, + size_t m, + size_t n, + raft::distance::DistanceType metric, + linkage_output* out, + int c, + size_t n_clusters) +{ + detail::single_linkage( + handle, X, m, n, metric, out, c, n_clusters); +} +}; // namespace raft::cluster diff --git a/cpp/include/raft/cluster/single_linkage_types.hpp b/cpp/include/raft/cluster/single_linkage_types.hpp new file mode 100644 index 0000000000..1c35cf5c68 --- /dev/null +++ b/cpp/include/raft/cluster/single_linkage_types.hpp @@ -0,0 +1,49 @@ +/* + * Copyright (c) 2021-2022, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#pragma once + +namespace raft::cluster { + +enum LinkageDistance { PAIRWISE = 0, KNN_GRAPH = 1 }; + +/** + * Simple POCO for consolidating linkage results. This closely + * mirrors the trained instance variables populated in + * Scikit-learn's AgglomerativeClustering estimator. + * @tparam value_idx + * @tparam value_t + */ +template +class linkage_output { + public: + value_idx m; + value_idx n_clusters; + + value_idx n_leaves; + value_idx n_connected_components; + + value_idx* labels; // size: m + + value_idx* children; // size: (m-1, 2) +}; + +class linkage_output_int_float : public linkage_output { +}; +class linkage_output__int64_float : public linkage_output { +}; + +}; // namespace raft::cluster \ No newline at end of file diff --git a/cpp/include/raft/common/cub_wrappers.cuh b/cpp/include/raft/common/cub_wrappers.cuh index 32a46968b6..e80d7cccd9 100644 --- a/cpp/include/raft/common/cub_wrappers.cuh +++ b/cpp/include/raft/common/cub_wrappers.cuh @@ -1,5 +1,5 @@ /* - * Copyright (c) 2019, NVIDIA CORPORATION. + * Copyright (c) 2022, NVIDIA CORPORATION. * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -13,41 +13,20 @@ * See the License for the specific language governing permissions and * limitations under the License. */ - -#pragma once - -#include -#include - -namespace raft { +/** + * This file is deprecated and will be removed in release 22.06. + * Please use the cuh version instead. 
+ */ /** - * @brief Convenience wrapper over cub's SortPairs method - * @tparam KeyT key type - * @tparam ValueT value type - * @param workspace workspace buffer which will get resized if not enough space - * @param inKeys input keys array - * @param outKeys output keys array - * @param inVals input values array - * @param outVals output values array - * @param len array length - * @param stream cuda stream + * DISCLAIMER: this file is deprecated: use lanczos.cuh instead */ -template -void sortPairs(rmm::device_uvector& workspace, - const KeyT* inKeys, - KeyT* outKeys, - const ValueT* inVals, - ValueT* outVals, - int len, - cudaStream_t stream) -{ - size_t worksize; - cub::DeviceRadixSort::SortPairs( - nullptr, worksize, inKeys, outKeys, inVals, outVals, len, 0, sizeof(KeyT) * 8, stream); - workspace.resize(worksize, stream); - cub::DeviceRadixSort::SortPairs( - workspace.data(), worksize, inKeys, outKeys, inVals, outVals, len, 0, sizeof(KeyT) * 8, stream); -} -} // namespace raft +#pragma once + +#pragma message(__FILE__ \ + " is deprecated and will be removed in a future release." \ + " Please note that there is no equivalent in RAFT's public API" + " so this file will eventually be removed altogether.") + +#include diff --git a/cpp/include/raft/common/detail/scatter.cuh b/cpp/include/raft/common/detail/scatter.cuh index 4087625320..87a8826aa6 100644 --- a/cpp/include/raft/common/detail/scatter.cuh +++ b/cpp/include/raft/common/detail/scatter.cuh @@ -16,8 +16,8 @@ #pragma once -#include -#include +#include +#include namespace raft::detail { diff --git a/cpp/include/raft/common/device_loads_stores.cuh b/cpp/include/raft/common/device_loads_stores.cuh index 0c4750aa69..f3cfbd81cc 100644 --- a/cpp/include/raft/common/device_loads_stores.cuh +++ b/cpp/include/raft/common/device_loads_stores.cuh @@ -1,5 +1,5 @@ /* - * Copyright (c) 2021-2022, NVIDIA CORPORATION. + * Copyright (c) 2020-2022, NVIDIA CORPORATION. 
* * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -13,526 +13,19 @@ * See the License for the specific language governing permissions and * limitations under the License. */ - -#pragma once - -#include - -namespace raft { - /** - * @defgroup SmemStores Shared memory store operations - * @{ - * @brief Stores to shared memory (both vectorized and non-vectorized forms) - * requires the given shmem pointer to be aligned by the vector - length, like for float4 lds/sts shmem pointer should be aligned - by 16 bytes else it might silently fail or can also give - runtime error. - * @param[out] addr shared memory address (should be aligned to vector size) - * @param[in] x data to be stored at this address + * This file is deprecated and will be removed in release 22.06. + * Please use the cuh version instead. */ -DI void sts(uint8_t* addr, const uint8_t& x) -{ - uint32_t x_int; - x_int = x; - auto s1 = __cvta_generic_to_shared(reinterpret_cast(addr)); - asm volatile("st.shared.u8 [%0], {%1};" : : "l"(s1), "r"(x_int)); -} -DI void sts(uint8_t* addr, const uint8_t (&x)[1]) -{ - uint32_t x_int[1]; - x_int[0] = x[0]; - auto s1 = __cvta_generic_to_shared(reinterpret_cast(addr)); - asm volatile("st.shared.u8 [%0], {%1};" : : "l"(s1), "r"(x_int[0])); -} -DI void sts(uint8_t* addr, const uint8_t (&x)[2]) -{ - uint32_t x_int[2]; - x_int[0] = x[0]; - x_int[1] = x[1]; - auto s2 = __cvta_generic_to_shared(reinterpret_cast(addr)); - asm volatile("st.shared.v2.u8 [%0], {%1, %2};" : : "l"(s2), "r"(x_int[0]), "r"(x_int[1])); -} -DI void sts(uint8_t* addr, const uint8_t (&x)[4]) -{ - uint32_t x_int[4]; - x_int[0] = x[0]; - x_int[1] = x[1]; - x_int[2] = x[2]; - x_int[3] = x[3]; - auto s4 = __cvta_generic_to_shared(reinterpret_cast(addr)); - asm volatile("st.shared.v4.u8 [%0], {%1, %2, %3, %4};" - : - : "l"(s4), "r"(x_int[0]), "r"(x_int[1]), "r"(x_int[2]), "r"(x_int[3])); -} - -DI void sts(int8_t* addr, 
const int8_t& x) -{ - int32_t x_int = x; - auto s1 = __cvta_generic_to_shared(reinterpret_cast(addr)); - asm volatile("st.shared.s8 [%0], {%1};" : : "l"(s1), "r"(x_int)); -} -DI void sts(int8_t* addr, const int8_t (&x)[1]) -{ - int32_t x_int[1]; - x_int[0] = x[0]; - auto s1 = __cvta_generic_to_shared(reinterpret_cast(addr)); - asm volatile("st.shared.s8 [%0], {%1};" : : "l"(s1), "r"(x_int[0])); -} -DI void sts(int8_t* addr, const int8_t (&x)[2]) -{ - int32_t x_int[2]; - x_int[0] = x[0]; - x_int[1] = x[1]; - auto s2 = __cvta_generic_to_shared(reinterpret_cast(addr)); - asm volatile("st.shared.v2.s8 [%0], {%1, %2};" : : "l"(s2), "r"(x_int[0]), "r"(x_int[1])); -} -DI void sts(int8_t* addr, const int8_t (&x)[4]) -{ - int32_t x_int[4]; - x_int[0] = x[0]; - x_int[1] = x[1]; - x_int[2] = x[2]; - x_int[3] = x[3]; - auto s4 = __cvta_generic_to_shared(reinterpret_cast(addr)); - asm volatile("st.shared.v4.s8 [%0], {%1, %2, %3, %4};" - : - : "l"(s4), "r"(x_int[0]), "r"(x_int[1]), "r"(x_int[2]), "r"(x_int[3])); -} - -DI void sts(uint32_t* addr, const uint32_t& x) -{ - auto s1 = __cvta_generic_to_shared(reinterpret_cast(addr)); - asm volatile("st.shared.u32 [%0], {%1};" : : "l"(s1), "r"(x)); -} -DI void sts(uint32_t* addr, const uint32_t (&x)[1]) -{ - auto s1 = __cvta_generic_to_shared(reinterpret_cast(addr)); - asm volatile("st.shared.u32 [%0], {%1};" : : "l"(s1), "r"(x[0])); -} -DI void sts(uint32_t* addr, const uint32_t (&x)[2]) -{ - auto s2 = __cvta_generic_to_shared(reinterpret_cast(addr)); - asm volatile("st.shared.v2.u32 [%0], {%1, %2};" : : "l"(s2), "r"(x[0]), "r"(x[1])); -} -DI void sts(uint32_t* addr, const uint32_t (&x)[4]) -{ - auto s4 = __cvta_generic_to_shared(reinterpret_cast(addr)); - asm volatile("st.shared.v4.u32 [%0], {%1, %2, %3, %4};" - : - : "l"(s4), "r"(x[0]), "r"(x[1]), "r"(x[2]), "r"(x[3])); -} - -DI void sts(int32_t* addr, const int32_t& x) -{ - auto s1 = __cvta_generic_to_shared(reinterpret_cast(addr)); - asm volatile("st.shared.u32 [%0], {%1};" : : 
"l"(s1), "r"(x)); -} -DI void sts(int32_t* addr, const int32_t (&x)[1]) -{ - auto s1 = __cvta_generic_to_shared(reinterpret_cast(addr)); - asm volatile("st.shared.u32 [%0], {%1};" : : "l"(s1), "r"(x[0])); -} -DI void sts(int32_t* addr, const int32_t (&x)[2]) -{ - auto s2 = __cvta_generic_to_shared(reinterpret_cast(addr)); - asm volatile("st.shared.v2.u32 [%0], {%1, %2};" : : "l"(s2), "r"(x[0]), "r"(x[1])); -} -DI void sts(int32_t* addr, const int32_t (&x)[4]) -{ - auto s4 = __cvta_generic_to_shared(reinterpret_cast(addr)); - asm volatile("st.shared.v4.u32 [%0], {%1, %2, %3, %4};" - : - : "l"(s4), "r"(x[0]), "r"(x[1]), "r"(x[2]), "r"(x[3])); -} - -DI void sts(float* addr, const float& x) -{ - auto s1 = __cvta_generic_to_shared(reinterpret_cast(addr)); - asm volatile("st.shared.f32 [%0], {%1};" : : "l"(s1), "f"(x)); -} -DI void sts(float* addr, const float (&x)[1]) -{ - auto s1 = __cvta_generic_to_shared(reinterpret_cast(addr)); - asm volatile("st.shared.f32 [%0], {%1};" : : "l"(s1), "f"(x[0])); -} -DI void sts(float* addr, const float (&x)[2]) -{ - auto s2 = __cvta_generic_to_shared(reinterpret_cast(addr)); - asm volatile("st.shared.v2.f32 [%0], {%1, %2};" : : "l"(s2), "f"(x[0]), "f"(x[1])); -} -DI void sts(float* addr, const float (&x)[4]) -{ - auto s4 = __cvta_generic_to_shared(reinterpret_cast(addr)); - asm volatile("st.shared.v4.f32 [%0], {%1, %2, %3, %4};" - : - : "l"(s4), "f"(x[0]), "f"(x[1]), "f"(x[2]), "f"(x[3])); -} - -DI void sts(double* addr, const double& x) -{ - auto s1 = __cvta_generic_to_shared(reinterpret_cast(addr)); - asm volatile("st.shared.f64 [%0], {%1};" : : "l"(s1), "d"(x)); -} -DI void sts(double* addr, const double (&x)[1]) -{ - auto s1 = __cvta_generic_to_shared(reinterpret_cast(addr)); - asm volatile("st.shared.f64 [%0], {%1};" : : "l"(s1), "d"(x[0])); -} -DI void sts(double* addr, const double (&x)[2]) -{ - auto s2 = __cvta_generic_to_shared(reinterpret_cast(addr)); - asm volatile("st.shared.v2.f64 [%0], {%1, %2};" : : "l"(s2), "d"(x[0]), 
"d"(x[1])); -} -/** @} */ /** - * @defgroup SmemLoads Shared memory load operations - * @{ - * @brief Loads from shared memory (both vectorized and non-vectorized forms) - requires the given shmem pointer to be aligned by the vector - length, like for float4 lds/sts shmem pointer should be aligned - by 16 bytes else it might silently fail or can also give - runtime error. - * @param[out] x the data to be loaded - * @param[in] addr shared memory address from where to load - * (should be aligned to vector size) + * DISCLAIMER: this file is deprecated: use lap.cuh instead */ -DI void lds(uint8_t& x, const uint8_t* addr) -{ - uint32_t x_int; - auto s1 = __cvta_generic_to_shared(reinterpret_cast(addr)); - asm volatile("ld.shared.u8 {%0}, [%1];" : "=r"(x_int) : "l"(s1)); - x = x_int; -} -DI void lds(uint8_t (&x)[1], const uint8_t* addr) -{ - uint32_t x_int[1]; - auto s1 = __cvta_generic_to_shared(reinterpret_cast(addr)); - asm volatile("ld.shared.u8 {%0}, [%1];" : "=r"(x_int[0]) : "l"(s1)); - x[0] = x_int[0]; -} -DI void lds(uint8_t (&x)[2], const uint8_t* addr) -{ - uint32_t x_int[2]; - auto s2 = __cvta_generic_to_shared(reinterpret_cast(addr)); - asm volatile("ld.shared.v2.u8 {%0, %1}, [%2];" : "=r"(x_int[0]), "=r"(x_int[1]) : "l"(s2)); - x[0] = x_int[0]; - x[1] = x_int[1]; -} -DI void lds(uint8_t (&x)[4], const uint8_t* addr) -{ - uint32_t x_int[4]; - auto s4 = __cvta_generic_to_shared(reinterpret_cast(addr)); - asm volatile("ld.shared.v4.u8 {%0, %1, %2, %3}, [%4];" - : "=r"(x_int[0]), "=r"(x_int[1]), "=r"(x_int[2]), "=r"(x_int[3]) - : "l"(s4)); - x[0] = x_int[0]; - x[1] = x_int[1]; - x[2] = x_int[2]; - x[3] = x_int[3]; -} - -DI void lds(int8_t& x, const int8_t* addr) -{ - int32_t x_int; - auto s1 = __cvta_generic_to_shared(reinterpret_cast(addr)); - asm volatile("ld.shared.s8 {%0}, [%1];" : "=r"(x_int) : "l"(s1)); - x = x_int; -} -DI void lds(int8_t (&x)[1], const int8_t* addr) -{ - int32_t x_int[1]; - auto s1 = __cvta_generic_to_shared(reinterpret_cast(addr)); - asm 
volatile("ld.shared.s8 {%0}, [%1];" : "=r"(x_int[0]) : "l"(s1)); - x[0] = x_int[0]; -} -DI void lds(int8_t (&x)[2], const int8_t* addr) -{ - int32_t x_int[2]; - auto s2 = __cvta_generic_to_shared(reinterpret_cast(addr)); - asm volatile("ld.shared.v2.s8 {%0, %1}, [%2];" : "=r"(x_int[0]), "=r"(x_int[1]) : "l"(s2)); - x[0] = x_int[0]; - x[1] = x_int[1]; -} -DI void lds(int8_t (&x)[4], const int8_t* addr) -{ - int32_t x_int[4]; - auto s4 = __cvta_generic_to_shared(reinterpret_cast(addr)); - asm volatile("ld.shared.v4.s8 {%0, %1, %2, %3}, [%4];" - : "=r"(x_int[0]), "=r"(x_int[1]), "=r"(x_int[2]), "=r"(x_int[3]) - : "l"(s4)); - x[0] = x_int[0]; - x[1] = x_int[1]; - x[2] = x_int[2]; - x[3] = x_int[3]; -} - -DI void lds(uint32_t (&x)[4], const uint32_t* addr) -{ - auto s4 = __cvta_generic_to_shared(reinterpret_cast(addr)); - asm volatile("ld.shared.v4.u32 {%0, %1, %2, %3}, [%4];" - : "=r"(x[0]), "=r"(x[1]), "=r"(x[2]), "=r"(x[3]) - : "l"(s4)); -} - -DI void lds(uint32_t (&x)[2], const uint32_t* addr) -{ - auto s2 = __cvta_generic_to_shared(reinterpret_cast(addr)); - asm volatile("ld.shared.v2.u32 {%0, %1}, [%2];" : "=r"(x[0]), "=r"(x[1]) : "l"(s2)); -} - -DI void lds(uint32_t (&x)[1], const uint32_t* addr) -{ - auto s1 = __cvta_generic_to_shared(reinterpret_cast(addr)); - asm volatile("ld.shared.u32 {%0}, [%1];" : "=r"(x[0]) : "l"(s1)); -} - -DI void lds(uint32_t& x, const uint32_t* addr) -{ - auto s1 = __cvta_generic_to_shared(reinterpret_cast(addr)); - asm volatile("ld.shared.u32 {%0}, [%1];" : "=r"(x) : "l"(s1)); -} - -DI void lds(int32_t (&x)[4], const int32_t* addr) -{ - auto s4 = __cvta_generic_to_shared(reinterpret_cast(addr)); - asm volatile("ld.shared.v4.u32 {%0, %1, %2, %3}, [%4];" - : "=r"(x[0]), "=r"(x[1]), "=r"(x[2]), "=r"(x[3]) - : "l"(s4)); -} - -DI void lds(int32_t (&x)[2], const int32_t* addr) -{ - auto s2 = __cvta_generic_to_shared(reinterpret_cast(addr)); - asm volatile("ld.shared.v2.u32 {%0, %1}, [%2];" : "=r"(x[0]), "=r"(x[1]) : "l"(s2)); -} - -DI void 
lds(int32_t (&x)[1], const int32_t* addr) -{ - auto s1 = __cvta_generic_to_shared(reinterpret_cast(addr)); - asm volatile("ld.shared.u32 {%0}, [%1];" : "=r"(x[0]) : "l"(s1)); -} - -DI void lds(int32_t& x, const int32_t* addr) -{ - auto s1 = __cvta_generic_to_shared(reinterpret_cast(addr)); - asm volatile("ld.shared.u32 {%0}, [%1];" : "=r"(x) : "l"(s1)); -} - -DI void lds(float& x, const float* addr) -{ - auto s1 = __cvta_generic_to_shared(reinterpret_cast(addr)); - asm volatile("ld.shared.f32 {%0}, [%1];" : "=f"(x) : "l"(s1)); -} -DI void lds(float (&x)[1], const float* addr) -{ - auto s1 = __cvta_generic_to_shared(reinterpret_cast(addr)); - asm volatile("ld.shared.f32 {%0}, [%1];" : "=f"(x[0]) : "l"(s1)); -} -DI void lds(float (&x)[2], const float* addr) -{ - auto s2 = __cvta_generic_to_shared(reinterpret_cast(addr)); - asm volatile("ld.shared.v2.f32 {%0, %1}, [%2];" : "=f"(x[0]), "=f"(x[1]) : "l"(s2)); -} -DI void lds(float (&x)[4], const float* addr) -{ - auto s4 = __cvta_generic_to_shared(reinterpret_cast(addr)); - asm volatile("ld.shared.v4.f32 {%0, %1, %2, %3}, [%4];" - : "=f"(x[0]), "=f"(x[1]), "=f"(x[2]), "=f"(x[3]) - : "l"(s4)); -} - -DI void lds(float& x, float* addr) -{ - auto s1 = __cvta_generic_to_shared(reinterpret_cast(addr)); - asm volatile("ld.shared.f32 {%0}, [%1];" : "=f"(x) : "l"(s1)); -} -DI void lds(float (&x)[1], float* addr) -{ - auto s1 = __cvta_generic_to_shared(reinterpret_cast(addr)); - asm volatile("ld.shared.f32 {%0}, [%1];" : "=f"(x[0]) : "l"(s1)); -} -DI void lds(float (&x)[2], float* addr) -{ - auto s2 = __cvta_generic_to_shared(reinterpret_cast(addr)); - asm volatile("ld.shared.v2.f32 {%0, %1}, [%2];" : "=f"(x[0]), "=f"(x[1]) : "l"(s2)); -} -DI void lds(float (&x)[4], float* addr) -{ - auto s4 = __cvta_generic_to_shared(reinterpret_cast(addr)); - asm volatile("ld.shared.v4.f32 {%0, %1, %2, %3}, [%4];" - : "=f"(x[0]), "=f"(x[1]), "=f"(x[2]), "=f"(x[3]) - : "l"(s4)); -} -DI void lds(double& x, double* addr) -{ - auto s1 = 
__cvta_generic_to_shared(reinterpret_cast(addr)); - asm volatile("ld.shared.f64 {%0}, [%1];" : "=d"(x) : "l"(s1)); -} -DI void lds(double (&x)[1], double* addr) -{ - auto s1 = __cvta_generic_to_shared(reinterpret_cast(addr)); - asm volatile("ld.shared.f64 {%0}, [%1];" : "=d"(x[0]) : "l"(s1)); -} -DI void lds(double (&x)[2], double* addr) -{ - auto s2 = __cvta_generic_to_shared(reinterpret_cast(addr)); - asm volatile("ld.shared.v2.f64 {%0, %1}, [%2];" : "=d"(x[0]), "=d"(x[1]) : "l"(s2)); -} -/** @} */ - -/** - * @defgroup GlobalLoads Global cached load operations - * @{ - * @brief Load from global memory with caching at L1 level - * @param[out] x data to be loaded from global memory - * @param[in] addr address in global memory from where to load - */ -DI void ldg(float& x, const float* addr) -{ - asm volatile("ld.global.cg.f32 %0, [%1];" : "=f"(x) : "l"(addr)); -} -DI void ldg(float (&x)[1], const float* addr) -{ - asm volatile("ld.global.cg.f32 %0, [%1];" : "=f"(x[0]) : "l"(addr)); -} -DI void ldg(float (&x)[2], const float* addr) -{ - asm volatile("ld.global.cg.v2.f32 {%0, %1}, [%2];" : "=f"(x[0]), "=f"(x[1]) : "l"(addr)); -} -DI void ldg(float (&x)[4], const float* addr) -{ - asm volatile("ld.global.cg.v4.f32 {%0, %1, %2, %3}, [%4];" - : "=f"(x[0]), "=f"(x[1]), "=f"(x[2]), "=f"(x[3]) - : "l"(addr)); -} -DI void ldg(double& x, const double* addr) -{ - asm volatile("ld.global.cg.f64 %0, [%1];" : "=d"(x) : "l"(addr)); -} -DI void ldg(double (&x)[1], const double* addr) -{ - asm volatile("ld.global.cg.f64 %0, [%1];" : "=d"(x[0]) : "l"(addr)); -} -DI void ldg(double (&x)[2], const double* addr) -{ - asm volatile("ld.global.cg.v2.f64 {%0, %1}, [%2];" : "=d"(x[0]), "=d"(x[1]) : "l"(addr)); -} - -DI void ldg(uint32_t (&x)[4], const uint32_t* const& addr) -{ - asm volatile("ld.global.cg.v4.u32 {%0, %1, %2, %3}, [%4];" - : "=r"(x[0]), "=r"(x[1]), "=r"(x[2]), "=r"(x[3]) - : "l"(addr)); -} - -DI void ldg(uint32_t (&x)[2], const uint32_t* const& addr) -{ - asm 
volatile("ld.global.cg.v2.u32 {%0, %1}, [%2];" : "=r"(x[0]), "=r"(x[1]) : "l"(addr)); -} - -DI void ldg(uint32_t (&x)[1], const uint32_t* const& addr) -{ - asm volatile("ld.global.cg.u32 %0, [%1];" : "=r"(x[0]) : "l"(addr)); -} - -DI void ldg(uint32_t& x, const uint32_t* const& addr) -{ - asm volatile("ld.global.cg.u32 %0, [%1];" : "=r"(x) : "l"(addr)); -} - -DI void ldg(int32_t (&x)[4], const int32_t* const& addr) -{ - asm volatile("ld.global.cg.v4.u32 {%0, %1, %2, %3}, [%4];" - : "=r"(x[0]), "=r"(x[1]), "=r"(x[2]), "=r"(x[3]) - : "l"(addr)); -} - -DI void ldg(int32_t (&x)[2], const int32_t* const& addr) -{ - asm volatile("ld.global.cg.v2.u32 {%0, %1}, [%2];" : "=r"(x[0]), "=r"(x[1]) : "l"(addr)); -} - -DI void ldg(int32_t (&x)[1], const int32_t* const& addr) -{ - asm volatile("ld.global.cg.u32 %0, [%1];" : "=r"(x[0]) : "l"(addr)); -} - -DI void ldg(int32_t& x, const int32_t* const& addr) -{ - asm volatile("ld.global.cg.u32 %0, [%1];" : "=r"(x) : "l"(addr)); -} - -DI void ldg(uint8_t (&x)[4], const uint8_t* const& addr) -{ - uint32_t x_int[4]; - asm volatile("ld.global.cg.v4.u8 {%0, %1, %2, %3}, [%4];" - : "=r"(x_int[0]), "=r"(x_int[1]), "=r"(x_int[2]), "=r"(x_int[3]) - : "l"(addr)); - x[0] = x_int[0]; - x[1] = x_int[1]; - x[2] = x_int[2]; - x[3] = x_int[3]; -} - -DI void ldg(uint8_t (&x)[2], const uint8_t* const& addr) -{ - uint32_t x_int[2]; - asm volatile("ld.global.cg.v2.u8 {%0, %1}, [%2];" : "=r"(x_int[0]), "=r"(x_int[1]) : "l"(addr)); - x[0] = x_int[0]; - x[1] = x_int[1]; -} - -DI void ldg(uint8_t (&x)[1], const uint8_t* const& addr) -{ - uint32_t x_int; - asm volatile("ld.global.cg.u8 %0, [%1];" : "=r"(x_int) : "l"(addr)); - x[0] = x_int; -} - -DI void ldg(uint8_t& x, const uint8_t* const& addr) -{ - uint32_t x_int; - asm volatile("ld.global.cg.u8 %0, [%1];" : "=r"(x_int) : "l"(addr)); - x = x_int; -} - -DI void ldg(int8_t (&x)[4], const int8_t* const& addr) -{ - int x_int[4]; - asm volatile("ld.global.cg.v4.s8 {%0, %1, %2, %3}, [%4];" - : "=r"(x_int[0]), 
"=r"(x_int[1]), "=r"(x_int[2]), "=r"(x_int[3]) - : "l"(addr)); - x[0] = x_int[0]; - x[1] = x_int[1]; - x[2] = x_int[2]; - x[3] = x_int[3]; -} - -DI void ldg(int8_t (&x)[2], const int8_t* const& addr) -{ - int x_int[2]; - asm volatile("ld.global.cg.v2.s8 {%0, %1}, [%2];" : "=r"(x_int[0]), "=r"(x_int[1]) : "l"(addr)); - x[0] = x_int[0]; - x[1] = x_int[1]; -} - -DI void ldg(int8_t& x, const int8_t* const& addr) -{ - int x_int; - asm volatile("ld.global.cg.s8 %0, [%1];" : "=r"(x_int) : "l"(addr)); - x = x_int; -} - -DI void ldg(int8_t (&x)[1], const int8_t* const& addr) -{ - int x_int; - asm volatile("ld.global.cg.s8 %0, [%1];" : "=r"(x_int) : "l"(addr)); - x[0] = x_int; -} +#pragma once -/** @} */ +#pragma message(__FILE__ \ + " is deprecated and will be removed in a future release." \ + " Please use the raft/util version instead.") -} // namespace raft +#include diff --git a/cpp/include/raft/common/scatter.cuh b/cpp/include/raft/common/scatter.cuh index 9735ccdf2b..0e83f9a5cd 100644 --- a/cpp/include/raft/common/scatter.cuh +++ b/cpp/include/raft/common/scatter.cuh @@ -1,5 +1,5 @@ /* - * Copyright (c) 2019-2022, NVIDIA CORPORATION. + * Copyright (c) 2020-2022, NVIDIA CORPORATION. * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -13,56 +13,19 @@ * See the License for the specific language governing permissions and * limitations under the License. */ - -#pragma once - -#include -#include - -namespace raft { +/** + * This file is deprecated and will be removed in release 22.06. + * Please use the cuh version instead. 
+ */ /** - * @brief Performs scatter operation based on the input indexing array - * @tparam DataT data type whose array gets scattered - * @tparam IdxT indexing type - * @tparam TPB threads-per-block in the final kernel launched - * @tparam Lambda the device-lambda performing a unary operation on the loaded - * data before it gets scattered - * @param out the output array - * @param in the input array - * @param idx the indexing array - * @param len number of elements in the input array - * @param stream cuda stream where to launch work - * @param op the device-lambda with signature `DataT func(DataT, IdxT);`. This - * will be applied to every element before scattering it to the right location. - * The second param in this method will be the destination index. + * DISCLAIMER: this file is deprecated: use lap.cuh instead */ -template , int TPB = 256> -void scatter(DataT* out, - const DataT* in, - const IdxT* idx, - IdxT len, - cudaStream_t stream, - Lambda op = raft::Nop()) -{ - if (len <= 0) return; - constexpr size_t DataSize = sizeof(DataT); - constexpr size_t IdxSize = sizeof(IdxT); - constexpr size_t MaxPerElem = DataSize > IdxSize ? DataSize : IdxSize; - size_t bytes = len * MaxPerElem; - if (16 / MaxPerElem && bytes % 16 == 0) { - detail::scatterImpl(out, in, idx, len, op, stream); - } else if (8 / MaxPerElem && bytes % 8 == 0) { - detail::scatterImpl(out, in, idx, len, op, stream); - } else if (4 / MaxPerElem && bytes % 4 == 0) { - detail::scatterImpl(out, in, idx, len, op, stream); - } else if (2 / MaxPerElem && bytes % 2 == 0) { - detail::scatterImpl(out, in, idx, len, op, stream); - } else if (1 / MaxPerElem) { - detail::scatterImpl(out, in, idx, len, op, stream); - } else { - detail::scatterImpl(out, in, idx, len, op, stream); - } -} -} // namespace raft +#pragma once + +#pragma message(__FILE__ \ + " is deprecated and will be removed in a future release." 
\ + " Please use the raft/matrix version instead.") + +#include diff --git a/cpp/include/raft/common/seive.hpp b/cpp/include/raft/common/seive.hpp index e613f1e5c2..633c8dd3e1 100644 --- a/cpp/include/raft/common/seive.hpp +++ b/cpp/include/raft/common/seive.hpp @@ -1,5 +1,5 @@ /* - * Copyright (c) 2019-2022, NVIDIA CORPORATION. + * Copyright (c) 2020-2022, NVIDIA CORPORATION. * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -13,113 +13,19 @@ * See the License for the specific language governing permissions and * limitations under the License. */ -#pragma once - -#include -#include - -// Taken from: -// https://github.com/teju85/programming/blob/master/euler/include/seive.h - -namespace raft { -namespace common { - /** - * @brief Implementation of 'Seive of Eratosthenes' + * This file is deprecated and will be removed in release 22.06. + * Please use the cuh version instead. */ -class Seive { - public: - /** - * @param _num number of integers for which seive is needed - */ - Seive(unsigned _num) - { - N = _num; - generateSeive(); - } - - /** - * @brief Check whether a number is prime or not - * @param num number to be checked - * @return true if the 'num' is prime, else false - */ - bool isPrime(unsigned num) const - { - unsigned mask, pos; - if (num <= 1) { return false; } - if (num == 2) { return true; } - if (!(num & 1)) { return false; } - getMaskPos(num, mask, pos); - return (seive[pos] & mask); - } - private: - void generateSeive() - { - auto sqN = fastIntSqrt(N); - auto size = raft::ceildiv(N, sizeof(unsigned) * 8); - seive.resize(size); - // assume all to be primes initially - for (auto& itr : seive) { - itr = 0xffffffffu; - } - unsigned cid = 0; - unsigned cnum = getNum(cid); - while (cnum <= sqN) { - do { - ++cid; - cnum = getNum(cid); - if (isPrime(cnum)) { break; } - } while (cnum <= sqN); - auto cnum2 = cnum << 1; - // 'unmark' all the 'odd' multiples of the 
current prime - for (unsigned i = 3, num = i * cnum; num <= N; i += 2, num += cnum2) { - unmark(num); - } - } - } - - unsigned getId(unsigned num) const { return (num >> 1); } - - unsigned getNum(unsigned id) const - { - if (id == 0) { return 2; } - return ((id << 1) + 1); - } - - void getMaskPos(unsigned num, unsigned& mask, unsigned& pos) const - { - pos = getId(num); - mask = 1 << (pos & 0x1f); - pos >>= 5; - } +/** + * DISCLAIMER: this file is deprecated: use lap.cuh instead + */ - void unmark(unsigned num) - { - unsigned mask, pos; - getMaskPos(num, mask, pos); - seive[pos] &= ~mask; - } +#pragma once - // REF: http://www.azillionmonkeys.com/qed/ulerysqroot.pdf - unsigned fastIntSqrt(unsigned val) - { - unsigned g = 0; - auto bshft = 15u, b = 1u << bshft; - do { - unsigned temp = ((g << 1) + b) << bshft--; - if (val >= temp) { - g += b; - val -= temp; - } - } while (b >>= 1); - return g; - } +#pragma message(__FILE__ \ + " is deprecated and will be removed in a future release." \ + " Please use the raft/util version instead.") - /** find all primes till this number */ - unsigned N; - /** the seive */ - std::vector seive; -}; -}; // namespace common -}; // namespace raft +#include diff --git a/cpp/include/raft/comms/comms_test.hpp b/cpp/include/raft/comms/comms_test.hpp index f01060cb40..c7e5dd3ab6 100644 --- a/cpp/include/raft/comms/comms_test.hpp +++ b/cpp/include/raft/comms/comms_test.hpp @@ -19,7 +19,7 @@ #include #include -#include +#include namespace raft { namespace comms { diff --git a/cpp/include/raft/comms/detail/mpi_comms.hpp b/cpp/include/raft/comms/detail/mpi_comms.hpp index 3bf5438296..508a9ce717 100644 --- a/cpp/include/raft/comms/detail/mpi_comms.hpp +++ b/cpp/include/raft/comms/detail/mpi_comms.hpp @@ -28,9 +28,9 @@ #include #include -#include -#include -#include +#include +#include +#include #include #include diff --git a/cpp/include/raft/comms/detail/std_comms.hpp b/cpp/include/raft/comms/detail/std_comms.hpp index 2be1310c50..e64c6d9bf0 
100644 --- a/cpp/include/raft/comms/detail/std_comms.hpp +++ b/cpp/include/raft/comms/detail/std_comms.hpp @@ -20,13 +20,13 @@ #include #include -#include +#include #include #include -#include +#include -#include +#include #include diff --git a/cpp/include/raft/comms/detail/test.hpp b/cpp/include/raft/comms/detail/test.hpp index d81d7c80fb..6ba4be3886 100644 --- a/cpp/include/raft/comms/detail/test.hpp +++ b/cpp/include/raft/comms/detail/test.hpp @@ -17,7 +17,7 @@ #pragma once #include -#include +#include #include #include diff --git a/cpp/include/raft/comms/detail/ucp_helper.hpp b/cpp/include/raft/comms/detail/ucp_helper.hpp index 79976811ed..668acafae4 100644 --- a/cpp/include/raft/comms/detail/ucp_helper.hpp +++ b/cpp/include/raft/comms/detail/ucp_helper.hpp @@ -17,7 +17,7 @@ #pragma once #include -#include +#include #include #include #include diff --git a/cpp/include/raft/comms/detail/util.hpp b/cpp/include/raft/comms/detail/util.hpp index ff564603e1..969a8789dd 100644 --- a/cpp/include/raft/comms/detail/util.hpp +++ b/cpp/include/raft/comms/detail/util.hpp @@ -19,7 +19,7 @@ #include #include -#include +#include #include /** diff --git a/cpp/include/raft/comms/helper.hpp b/cpp/include/raft/comms/helper.hpp index b1aae86556..f6b63ac971 100644 --- a/cpp/include/raft/comms/helper.hpp +++ b/cpp/include/raft/comms/helper.hpp @@ -17,7 +17,7 @@ #pragma once #include -#include +#include #include #include diff --git a/cpp/include/raft/comms/std_comms.hpp b/cpp/include/raft/comms/std_comms.hpp index 7604606ba1..edace60fbd 100644 --- a/cpp/include/raft/comms/std_comms.hpp +++ b/cpp/include/raft/comms/std_comms.hpp @@ -16,7 +16,7 @@ #pragma once -#include +#include #include #include diff --git a/cpp/include/raft/core/comms.hpp b/cpp/include/raft/core/comms.hpp index 7f0aa74960..771f38fee3 100644 --- a/cpp/include/raft/core/comms.hpp +++ b/cpp/include/raft/core/comms.hpp @@ -17,7 +17,7 @@ #pragma once #include -#include +#include #include namespace raft { diff --git 
a/cpp/include/raft/core/cublas_macros.hpp b/cpp/include/raft/core/cublas_macros.hpp index f5de57677d..d2456433ab 100644 --- a/cpp/include/raft/core/cublas_macros.hpp +++ b/cpp/include/raft/core/cublas_macros.hpp @@ -20,7 +20,7 @@ #pragma once #include -#include +#include ///@todo: enable this once we have logger enabled //#include diff --git a/cpp/include/raft/core/cudart_utils.hpp b/cpp/include/raft/core/cudart_utils.hpp index e0957ea1f3..591f41629d 100644 --- a/cpp/include/raft/core/cudart_utils.hpp +++ b/cpp/include/raft/core/cudart_utils.hpp @@ -16,484 +16,8 @@ /** * This file is deprecated and will be removed in release 22.06. - * Please use raft_runtime/cudart_utils.hpp instead. + * Please use util/cudart_utils.hpp instead. */ -#ifndef __RAFT_RT_CUDART_UTILS_H -#define __RAFT_RT_CUDART_UTILS_H - #pragma once - -#include -#include -#include -#include -#include - -#include - -#include -#include -#include -#include -#include -#include -#include - -///@todo: enable once logging has been enabled in raft -//#include "logger.hpp" - -namespace raft { - -/** - * @brief Exception thrown when a CUDA error is encountered. - */ -struct cuda_error : public raft::exception { - explicit cuda_error(char const* const message) : raft::exception(message) {} - explicit cuda_error(std::string const& message) : raft::exception(message) {} -}; - -} // namespace raft - -/** - * @brief Error checking macro for CUDA runtime API functions. 
- * - * Invokes a CUDA runtime API function call, if the call does not return - * cudaSuccess, invokes cudaGetLastError() to clear the error and throws an - * exception detailing the CUDA error that occurred - * - */ -#define RAFT_CUDA_TRY(call) \ - do { \ - cudaError_t const status = call; \ - if (status != cudaSuccess) { \ - cudaGetLastError(); \ - std::string msg{}; \ - SET_ERROR_MSG(msg, \ - "CUDA error encountered at: ", \ - "call='%s', Reason=%s:%s", \ - #call, \ - cudaGetErrorName(status), \ - cudaGetErrorString(status)); \ - throw raft::cuda_error(msg); \ - } \ - } while (0) - -// FIXME: Remove after consumers rename -#ifndef CUDA_TRY -#define CUDA_TRY(call) RAFT_CUDA_TRY(call) -#endif - -/** - * @brief Debug macro to check for CUDA errors - * - * In a non-release build, this macro will synchronize the specified stream - * before error checking. In both release and non-release builds, this macro - * checks for any pending CUDA errors from previous calls. If an error is - * reported, an exception is thrown detailing the CUDA error that occurred. - * - * The intent of this macro is to provide a mechanism for synchronous and - * deterministic execution for debugging asynchronous CUDA execution. It should - * be used after any asynchronous CUDA call, e.g., cudaMemcpyAsync, or an - * asynchronous kernel launch. - */ -#ifndef NDEBUG -#define RAFT_CHECK_CUDA(stream) RAFT_CUDA_TRY(cudaStreamSynchronize(stream)); -#else -#define RAFT_CHECK_CUDA(stream) RAFT_CUDA_TRY(cudaPeekAtLastError()); -#endif - -// FIXME: Remove after consumers rename -#ifndef CHECK_CUDA -#define CHECK_CUDA(call) RAFT_CHECK_CUDA(call) -#endif - -/** FIXME: remove after cuml rename */ -#ifndef CUDA_CHECK -#define CUDA_CHECK(call) RAFT_CUDA_TRY(call) -#endif - -// /** -// * @brief check for cuda runtime API errors but log error instead of raising -// * exception. 
-// */ -#define RAFT_CUDA_TRY_NO_THROW(call) \ - do { \ - cudaError_t const status = call; \ - if (cudaSuccess != status) { \ - printf("CUDA call='%s' at file=%s line=%d failed with %s\n", \ - #call, \ - __FILE__, \ - __LINE__, \ - cudaGetErrorString(status)); \ - } \ - } while (0) - -// FIXME: Remove after cuml rename -#ifndef CUDA_CHECK_NO_THROW -#define CUDA_CHECK_NO_THROW(call) RAFT_CUDA_TRY_NO_THROW(call) -#endif - -/** - * Alias to raft scope for now. - * TODO: Rename original implementations in 22.04 to fix - * https://github.com/rapidsai/raft/issues/128 - */ - -namespace raft { - -/** Helper method to get to know warp size in device code */ -__host__ __device__ constexpr inline int warp_size() { return 32; } - -__host__ __device__ constexpr inline unsigned int warp_full_mask() { return 0xffffffff; } - -/** - * @brief A kernel grid configuration construction gadget for simple one-dimensional mapping - * elements to threads. - */ -class grid_1d_thread_t { - public: - int const block_size{0}; - int const num_blocks{0}; - - /** - * @param overall_num_elements The number of elements the kernel needs to handle/process - * @param num_threads_per_block The grid block size, determined according to the kernel's - * specific features (amount of shared memory necessary, SM functional units use pattern etc.); - * this can't be determined generically/automatically (as opposed to the number of blocks) - * @param max_num_blocks_1d maximum number of blocks in 1d grid - * @param elements_per_thread Typically, a single kernel thread processes more than a single - * element; this affects the number of threads the grid must contain - */ - grid_1d_thread_t(size_t overall_num_elements, - size_t num_threads_per_block, - size_t max_num_blocks_1d, - size_t elements_per_thread = 1) - : block_size(num_threads_per_block), - num_blocks( - std::min((overall_num_elements + (elements_per_thread * num_threads_per_block) - 1) / - (elements_per_thread * num_threads_per_block), - 
max_num_blocks_1d)) - { - RAFT_EXPECTS(overall_num_elements > 0, "overall_num_elements must be > 0"); - RAFT_EXPECTS(num_threads_per_block / warp_size() > 0, - "num_threads_per_block / warp_size() must be > 0"); - RAFT_EXPECTS(elements_per_thread > 0, "elements_per_thread must be > 0"); - } -}; - -/** - * @brief A kernel grid configuration construction gadget for simple one-dimensional mapping - * elements to warps. - */ -class grid_1d_warp_t { - public: - int const block_size{0}; - int const num_blocks{0}; - - /** - * @param overall_num_elements The number of elements the kernel needs to handle/process - * @param num_threads_per_block The grid block size, determined according to the kernel's - * specific features (amount of shared memory necessary, SM functional units use pattern etc.); - * this can't be determined generically/automatically (as opposed to the number of blocks) - * @param max_num_blocks_1d maximum number of blocks in 1d grid - */ - grid_1d_warp_t(size_t overall_num_elements, - size_t num_threads_per_block, - size_t max_num_blocks_1d) - : block_size(num_threads_per_block), - num_blocks(std::min((overall_num_elements + (num_threads_per_block / warp_size()) - 1) / - (num_threads_per_block / warp_size()), - max_num_blocks_1d)) - { - RAFT_EXPECTS(overall_num_elements > 0, "overall_num_elements must be > 0"); - RAFT_EXPECTS(num_threads_per_block / warp_size() > 0, - "num_threads_per_block / warp_size() must be > 0"); - } -}; - -/** - * @brief A kernel grid configuration construction gadget for simple one-dimensional mapping - * elements to blocks. 
- */ -class grid_1d_block_t { - public: - int const block_size{0}; - int const num_blocks{0}; - - /** - * @param overall_num_elements The number of elements the kernel needs to handle/process - * @param num_threads_per_block The grid block size, determined according to the kernel's - * specific features (amount of shared memory necessary, SM functional units use pattern etc.); - * this can't be determined generically/automatically (as opposed to the number of blocks) - * @param max_num_blocks_1d maximum number of blocks in 1d grid - */ - grid_1d_block_t(size_t overall_num_elements, - size_t num_threads_per_block, - size_t max_num_blocks_1d) - : block_size(num_threads_per_block), - num_blocks(std::min(overall_num_elements, max_num_blocks_1d)) - { - RAFT_EXPECTS(overall_num_elements > 0, "overall_num_elements must be > 0"); - RAFT_EXPECTS(num_threads_per_block / warp_size() > 0, - "num_threads_per_block / warp_size() must be > 0"); - } -}; - -/** - * @brief Generic copy method for all kinds of transfers - * @tparam Type data type - * @param dst destination pointer - * @param src source pointer - * @param len lenth of the src/dst buffers in terms of number of elements - * @param stream cuda stream - */ -template -void copy(Type* dst, const Type* src, size_t len, rmm::cuda_stream_view stream) -{ - CUDA_CHECK(cudaMemcpyAsync(dst, src, len * sizeof(Type), cudaMemcpyDefault, stream)); -} - -/** - * @defgroup Copy Copy methods - * These are here along with the generic 'copy' method in order to improve - * code readability using explicitly specified function names - * @{ - */ -/** performs a host to device copy */ -template -void update_device(Type* d_ptr, const Type* h_ptr, size_t len, rmm::cuda_stream_view stream) -{ - copy(d_ptr, h_ptr, len, stream); -} - -/** performs a device to host copy */ -template -void update_host(Type* h_ptr, const Type* d_ptr, size_t len, rmm::cuda_stream_view stream) -{ - copy(h_ptr, d_ptr, len, stream); -} - -template -void copy_async(Type* 
d_ptr1, const Type* d_ptr2, size_t len, rmm::cuda_stream_view stream) -{ - CUDA_CHECK(cudaMemcpyAsync(d_ptr1, d_ptr2, len * sizeof(Type), cudaMemcpyDeviceToDevice, stream)); -} -/** @} */ - -/** - * @defgroup Debug Utils for debugging host/device buffers - * @{ - */ -template -void print_host_vector(const char* variable_name, - const T* host_mem, - size_t componentsCount, - OutStream& out) -{ - out << variable_name << "=["; - for (size_t i = 0; i < componentsCount; ++i) { - if (i != 0) out << ","; - out << host_mem[i]; - } - out << "];" << std::endl; -} - -template -void print_device_vector(const char* variable_name, - const T* devMem, - size_t componentsCount, - OutStream& out) -{ - auto host_mem = std::make_unique(componentsCount); - CUDA_CHECK( - cudaMemcpy(host_mem.get(), devMem, componentsCount * sizeof(T), cudaMemcpyDeviceToHost)); - print_host_vector(variable_name, host_mem.get(), componentsCount, out); -} - -/** - * @brief Print an array given a device or a host pointer. - * - * @param[in] variable_name - * @param[in] ptr any pointer (device/host/managed, etc) - * @param[in] componentsCount array length - * @param out the output stream - */ -template -void print_vector(const char* variable_name, const T* ptr, size_t componentsCount, OutStream& out) -{ - cudaPointerAttributes attr; - RAFT_CUDA_TRY(cudaPointerGetAttributes(&attr, ptr)); - if (attr.hostPointer != nullptr) { - print_host_vector(variable_name, reinterpret_cast(attr.hostPointer), componentsCount, out); - } else if (attr.type == cudaMemoryTypeUnregistered) { - print_host_vector(variable_name, ptr, componentsCount, out); - } else { - print_device_vector(variable_name, ptr, componentsCount, out); - } -} -/** @} */ - -/** helper method to get max usable shared mem per block parameter */ -inline int getSharedMemPerBlock() -{ - int devId; - RAFT_CUDA_TRY(cudaGetDevice(&devId)); - int smemPerBlk; - RAFT_CUDA_TRY(cudaDeviceGetAttribute(&smemPerBlk, cudaDevAttrMaxSharedMemoryPerBlock, devId)); - return 
smemPerBlk; -} - -/** helper method to get multi-processor count parameter */ -inline int getMultiProcessorCount() -{ - int devId; - RAFT_CUDA_TRY(cudaGetDevice(&devId)); - int mpCount; - RAFT_CUDA_TRY(cudaDeviceGetAttribute(&mpCount, cudaDevAttrMultiProcessorCount, devId)); - return mpCount; -} - -/** helper method to convert an array on device to a string on host */ -template -std::string arr2Str(const T* arr, int size, std::string name, cudaStream_t stream, int width = 4) -{ - std::stringstream ss; - - T* arr_h = (T*)malloc(size * sizeof(T)); - update_host(arr_h, arr, size, stream); - RAFT_CUDA_TRY(cudaStreamSynchronize(stream)); - - ss << name << " = [ "; - for (int i = 0; i < size; i++) { - ss << std::setw(width) << arr_h[i]; - - if (i < size - 1) ss << ", "; - } - ss << " ]" << std::endl; - - free(arr_h); - - return ss.str(); -} - -/** this seems to be unused, but may be useful in the future */ -template -void ASSERT_DEVICE_MEM(T* ptr, std::string name) -{ - cudaPointerAttributes s_att; - cudaError_t s_err = cudaPointerGetAttributes(&s_att, ptr); - - if (s_err != 0 || s_att.device == -1) - std::cout << "Invalid device pointer encountered in " << name << ". device=" << s_att.device - << ", err=" << s_err << std::endl; -} - -inline uint32_t curTimeMillis() -{ - auto now = std::chrono::high_resolution_clock::now(); - auto duration = now.time_since_epoch(); - return std::chrono::duration_cast(duration).count(); -} - -/** Helper function to calculate need memory for allocate to store dense matrix. - * @param rows number of rows in matrix - * @param columns number of columns in matrix - * @return need number of items to allocate via allocate() - * @sa allocate() - */ -inline size_t allocLengthForMatrix(size_t rows, size_t columns) { return rows * columns; } - -/** Helper function to check alignment of pointer. 
- * @param ptr the pointer to check - * @param alignment to be checked for - * @return true if address in bytes is a multiple of alignment - */ -template -bool is_aligned(Type* ptr, size_t alignment) -{ - return reinterpret_cast(ptr) % alignment == 0; -} - -/** calculate greatest common divisor of two numbers - * @a integer - * @b integer - * @ return gcd of a and b - */ -template -IntType gcd(IntType a, IntType b) -{ - while (b != 0) { - IntType tmp = b; - b = a % b; - a = tmp; - } - return a; -} - -template -constexpr T lower_bound() -{ - if constexpr (std::numeric_limits::has_infinity && std::numeric_limits::is_signed) { - return -std::numeric_limits::infinity(); - } - return std::numeric_limits::lowest(); -} - -template -constexpr T upper_bound() -{ - if constexpr (std::numeric_limits::has_infinity) { return std::numeric_limits::infinity(); } - return std::numeric_limits::max(); -} - -/** - * @brief Get a pointer to a pooled memory resource within the scope of the lifetime of the returned - * unique pointer. - * - * This function is useful in the code where multiple repeated allocations/deallocations are - * expected. - * Use case example: - * @code{.cpp} - * void my_func(..., size_t n, rmm::mr::device_memory_resource* mr = nullptr) { - * auto pool_guard = raft::get_pool_memory_resource(mr, 2 * n * sizeof(float)); - * if (pool_guard){ - * RAFT_LOG_INFO("Created a pool %zu bytes", pool_guard->pool_size()); - * } else { - * RAFT_LOG_INFO("Using the current default or explicitly passed device memory resource"); - * } - * rmm::device_uvector x(n, stream, mr); - * rmm::device_uvector y(n, stream, mr); - * ... - * } - * @endcode - * Here, the new memory resource would be created within the function scope if the passed `mr` is - * null and the default resource is not a pool. After the call, `mr` contains a valid memory - * resource in any case. 
- * - * @param[inout] mr if not null do nothing; otherwise get the current device resource and wrap it - * into a `pool_memory_resource` if neccessary and return the pointer to the result. - * @param initial_size if a new memory pool is created, this would be its initial size (rounded up - * to 256 bytes). - * - * @return if a new memory pool is created, it returns a unique_ptr to it; - * this managed pointer controls the lifetime of the created memory resource. - */ -inline auto get_pool_memory_resource(rmm::mr::device_memory_resource*& mr, size_t initial_size) -{ - using pool_res_t = rmm::mr::pool_memory_resource; - std::unique_ptr pool_res{}; - if (mr) return pool_res; - mr = rmm::mr::get_current_device_resource(); - if (!dynamic_cast(mr) && - !dynamic_cast*>(mr) && - !dynamic_cast*>(mr)) { - pool_res = std::make_unique(mr, (initial_size + 255) & (~255)); - mr = pool_res.get(); - } - return pool_res; -} - -} // namespace raft - -#endif +#include diff --git a/cpp/include/raft/core/cusolver_macros.hpp b/cpp/include/raft/core/cusolver_macros.hpp index b41927f5fb..505485e6a0 100644 --- a/cpp/include/raft/core/cusolver_macros.hpp +++ b/cpp/include/raft/core/cusolver_macros.hpp @@ -23,7 +23,7 @@ #include ///@todo: enable this once logging is enabled //#include -#include +#include #include #define _CUSOLVER_ERR_TO_STR(err) \ diff --git a/cpp/include/raft/core/cusparse_macros.hpp b/cpp/include/raft/core/cusparse_macros.hpp index 10c7e8836c..cf5195582b 100644 --- a/cpp/include/raft/core/cusparse_macros.hpp +++ b/cpp/include/raft/core/cusparse_macros.hpp @@ -17,7 +17,7 @@ #pragma once #include -#include +#include ///@todo: enable this once logging is enabled //#include diff --git a/cpp/include/raft/common/detail/callback_sink.hpp b/cpp/include/raft/core/detail/callback_sink.hpp similarity index 100% rename from cpp/include/raft/common/detail/callback_sink.hpp rename to cpp/include/raft/core/detail/callback_sink.hpp diff --git a/cpp/include/raft/common/detail/logger.hpp 
b/cpp/include/raft/core/detail/logger.hpp similarity index 100% rename from cpp/include/raft/common/detail/logger.hpp rename to cpp/include/raft/core/detail/logger.hpp diff --git a/cpp/include/raft/common/detail/nvtx.hpp b/cpp/include/raft/core/detail/nvtx.hpp similarity index 100% rename from cpp/include/raft/common/detail/nvtx.hpp rename to cpp/include/raft/core/detail/nvtx.hpp diff --git a/cpp/include/raft/core/interruptible.hpp b/cpp/include/raft/core/interruptible.hpp index 55d272739f..76fb7aa7c3 100644 --- a/cpp/include/raft/core/interruptible.hpp +++ b/cpp/include/raft/core/interruptible.hpp @@ -22,8 +22,8 @@ #include #include #include -#include -#include +#include +#include #include #include #include diff --git a/cpp/include/raft/core/logger.hpp b/cpp/include/raft/core/logger.hpp index 22e4dd7a90..44c8263abf 100644 --- a/cpp/include/raft/core/logger.hpp +++ b/cpp/include/raft/core/logger.hpp @@ -31,8 +31,8 @@ #include #define SPDLOG_HEADER_ONLY -#include -#include +#include +#include #include // NOLINT #include // NOLINT diff --git a/cpp/include/raft/core/nvtx.hpp b/cpp/include/raft/core/nvtx.hpp index eb536b0e01..3dbe1dd511 100644 --- a/cpp/include/raft/core/nvtx.hpp +++ b/cpp/include/raft/core/nvtx.hpp @@ -17,7 +17,7 @@ #pragma once #include -#include +#include /** * \section Usage diff --git a/cpp/include/raft/cuda_utils.cuh b/cpp/include/raft/cuda_utils.cuh index 2f0d417f90..6ce414aceb 100644 --- a/cpp/include/raft/cuda_utils.cuh +++ b/cpp/include/raft/cuda_utils.cuh @@ -1,5 +1,5 @@ /* - * Copyright (c) 2018-2022, NVIDIA CORPORATION. + * Copyright (c) 2020-2022, NVIDIA CORPORATION. * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -13,782 +13,19 @@ * See the License for the specific language governing permissions and * limitations under the License. 
*/ - -#pragma once - -#include -#include - -#include - -#ifndef ENABLE_MEMCPY_ASYNC -// enable memcpy_async interface by default for newer GPUs -#if __CUDA_ARCH__ >= 800 -#define ENABLE_MEMCPY_ASYNC 1 -#endif -#else // ENABLE_MEMCPY_ASYNC -// disable memcpy_async for all older GPUs -#if __CUDA_ARCH__ < 800 -#define ENABLE_MEMCPY_ASYNC 0 -#endif -#endif // ENABLE_MEMCPY_ASYNC - -namespace raft { - -/** helper macro for device inlined functions */ -#define DI inline __device__ -#define HDI inline __host__ __device__ -#define HD __host__ __device__ - -/** - * @brief Provide a ceiling division operation ie. ceil(a / b) - * @tparam IntType supposed to be only integers for now! - */ -template -constexpr HDI IntType ceildiv(IntType a, IntType b) -{ - return (a + b - 1) / b; -} - -/** - * @brief Provide an alignment function ie. ceil(a / b) * b - * @tparam IntType supposed to be only integers for now! - */ -template -constexpr HDI IntType alignTo(IntType a, IntType b) -{ - return ceildiv(a, b) * b; -} - -/** - * @brief Provide an alignment function ie. (a / b) * b - * @tparam IntType supposed to be only integers for now! - */ -template -constexpr HDI IntType alignDown(IntType a, IntType b) -{ - return (a / b) * b; -} - -/** - * @brief Check if the input is a power of 2 - * @tparam IntType data type (checked only for integers) - */ -template -constexpr HDI bool isPo2(IntType num) -{ - return (num && !(num & (num - 1))); -} - -/** - * @brief Give logarithm of the number to base-2 - * @tparam IntType data type (checked only for integers) - */ -template -constexpr HDI IntType log2(IntType num, IntType ret = IntType(0)) -{ - return num <= IntType(1) ? 
ret : log2(num >> IntType(1), ++ret); -} - -/** Device function to apply the input lambda across threads in the grid */ -template -DI void forEach(int num, L lambda) -{ - int idx = (blockDim.x * blockIdx.x) + threadIdx.x; - const int numThreads = blockDim.x * gridDim.x; -#pragma unroll - for (int itr = 0; itr < ItemsPerThread; ++itr, idx += numThreads) { - if (idx < num) lambda(idx, itr); - } -} - -/** number of threads per warp */ -static const int WarpSize = 32; - -/** get the laneId of the current thread */ -DI int laneId() -{ - int id; - asm("mov.s32 %0, %%laneid;" : "=r"(id)); - return id; -} - -/** - * @brief Swap two values - * @tparam T the datatype of the values - * @param a first input - * @param b second input - */ -template -HDI void swapVals(T& a, T& b) -{ - T tmp = a; - a = b; - b = tmp; -} - -/** Device function to have atomic add support for older archs */ -template -DI void myAtomicAdd(Type* address, Type val) -{ - atomicAdd(address, val); -} - -#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ < 600) -// Ref: -// http://on-demand.gputechconf.com/gtc/2013/presentations/S3101-Atomic-Memory-Operations.pdf -template <> -DI void myAtomicAdd(double* address, double val) -{ - unsigned long long int* address_as_ull = (unsigned long long int*)address; - unsigned long long int old = *address_as_ull, assumed; - do { - assumed = old; - old = - atomicCAS(address_as_ull, assumed, __double_as_longlong(val + __longlong_as_double(assumed))); - } while (assumed != old); -} -#endif - -template -DI void myAtomicReduce(T* address, T val, ReduceLambda op); - -template -DI void myAtomicReduce(double* address, double val, ReduceLambda op) -{ - unsigned long long int* address_as_ull = (unsigned long long int*)address; - unsigned long long int old = *address_as_ull, assumed; - do { - assumed = old; - old = atomicCAS( - address_as_ull, assumed, __double_as_longlong(op(val, __longlong_as_double(assumed)))); - } while (assumed != old); -} - -template -DI void myAtomicReduce(float* 
address, float val, ReduceLambda op) -{ - unsigned int* address_as_uint = (unsigned int*)address; - unsigned int old = *address_as_uint, assumed; - do { - assumed = old; - old = atomicCAS(address_as_uint, assumed, __float_as_uint(op(val, __uint_as_float(assumed)))); - } while (assumed != old); -} - -template -DI void myAtomicReduce(int* address, int val, ReduceLambda op) -{ - int old = *address, assumed; - do { - assumed = old; - old = atomicCAS(address, assumed, op(val, assumed)); - } while (assumed != old); -} - -template -DI void myAtomicReduce(long long* address, long long val, ReduceLambda op) -{ - long long old = *address, assumed; - do { - assumed = old; - old = atomicCAS(address, assumed, op(val, assumed)); - } while (assumed != old); -} - -template -DI void myAtomicReduce(unsigned long long* address, unsigned long long val, ReduceLambda op) -{ - unsigned long long old = *address, assumed; - do { - assumed = old; - old = atomicCAS(address, assumed, op(val, assumed)); - } while (assumed != old); -} - -/** - * @brief Provide atomic min operation. - * @tparam T: data type for input data (float or double). - * @param[in] address: address to read old value from, and to atomically update w/ min(old value, - * val) - * @param[in] val: new value to compare with old - */ -template -DI T myAtomicMin(T* address, T val); - -/** - * @brief Provide atomic max operation. - * @tparam T: data type for input data (float or double). 
- * @param[in] address: address to read old value from, and to atomically update w/ max(old value, - * val) - * @param[in] val: new value to compare with old - */ -template -DI T myAtomicMax(T* address, T val); - -DI float myAtomicMin(float* address, float val) -{ - myAtomicReduce(address, val, fminf); - return *address; -} - -DI float myAtomicMax(float* address, float val) -{ - myAtomicReduce(address, val, fmaxf); - return *address; -} - -DI double myAtomicMin(double* address, double val) -{ - myAtomicReduce(address, val, fmin); - return *address; -} - -DI double myAtomicMax(double* address, double val) -{ - myAtomicReduce(address, val, fmax); - return *address; -} - -/** - * @defgroup Max maximum of two numbers - * @{ - */ -template -HDI T myMax(T x, T y); -template <> -HDI float myMax(float x, float y) -{ - return fmaxf(x, y); -} -template <> -HDI double myMax(double x, double y) -{ - return fmax(x, y); -} -/** @} */ - -/** - * @defgroup Min minimum of two numbers - * @{ - */ -template -HDI T myMin(T x, T y); -template <> -HDI float myMin(float x, float y) -{ - return fminf(x, y); -} -template <> -HDI double myMin(double x, double y) -{ - return fmin(x, y); -} -/** @} */ - -/** - * @brief Provide atomic min operation. - * @tparam T: data type for input data (float or double). - * @param[in] address: address to read old value from, and to atomically update w/ min(old value, - * val) - * @param[in] val: new value to compare with old - */ -template -DI T myAtomicMin(T* address, T val) -{ - myAtomicReduce(address, val, myMin); - return *address; -} - -/** - * @brief Provide atomic max operation. - * @tparam T: data type for input data (float or double). 
- * @param[in] address: address to read old value from, and to atomically update w/ max(old value, - * val) - * @param[in] val: new value to compare with old - */ -template -DI T myAtomicMax(T* address, T val) -{ - myAtomicReduce(address, val, myMax); - return *address; -} - -/** - * Sign function - */ -template -HDI int sgn(const T val) -{ - return (T(0) < val) - (val < T(0)); -} - -/** - * @defgroup Exp Exponential function - * @{ - */ -template -HDI T myExp(T x); -template <> -HDI float myExp(float x) -{ - return expf(x); -} -template <> -HDI double myExp(double x) -{ - return exp(x); -} -/** @} */ - -/** - * @defgroup Cuda infinity values - * @{ - */ -template -inline __device__ T myInf(); -template <> -inline __device__ float myInf() -{ - return CUDART_INF_F; -} -template <> -inline __device__ double myInf() -{ - return CUDART_INF; -} -/** @} */ - -/** - * @defgroup Log Natural logarithm - * @{ - */ -template -HDI T myLog(T x); -template <> -HDI float myLog(float x) -{ - return logf(x); -} -template <> -HDI double myLog(double x) -{ - return log(x); -} -/** @} */ - -/** - * @defgroup Sqrt Square root - * @{ - */ -template -HDI T mySqrt(T x); -template <> -HDI float mySqrt(float x) -{ - return sqrtf(x); -} -template <> -HDI double mySqrt(double x) -{ - return sqrt(x); -} -/** @} */ - -/** - * @defgroup SineCosine Sine and cosine calculation - * @{ - */ -template -DI void mySinCos(T x, T& s, T& c); -template <> -DI void mySinCos(float x, float& s, float& c) -{ - sincosf(x, &s, &c); -} -template <> -DI void mySinCos(double x, double& s, double& c) -{ - sincos(x, &s, &c); -} -/** @} */ - -/** - * @defgroup Sine Sine calculation - * @{ - */ -template -DI T mySin(T x); -template <> -DI float mySin(float x) -{ - return sinf(x); -} -template <> -DI double mySin(double x) -{ - return sin(x); -} -/** @} */ - -/** - * @defgroup Abs Absolute value - * @{ - */ -template -DI T myAbs(T x) -{ - return x < 0 ? 
-x : x; -} -template <> -DI float myAbs(float x) -{ - return fabsf(x); -} -template <> -DI double myAbs(double x) -{ - return fabs(x); -} -/** @} */ - -/** - * @defgroup Pow Power function - * @{ - */ -template -HDI T myPow(T x, T power); -template <> -HDI float myPow(float x, float power) -{ - return powf(x, power); -} -template <> -HDI double myPow(double x, double power) -{ - return pow(x, power); -} -/** @} */ - -/** - * @defgroup myTanh tanh function - * @{ - */ -template -HDI T myTanh(T x); -template <> -HDI float myTanh(float x) -{ - return tanhf(x); -} -template <> -HDI double myTanh(double x) -{ - return tanh(x); -} -/** @} */ - -/** - * @defgroup myATanh arctanh function - * @{ - */ -template -HDI T myATanh(T x); -template <> -HDI float myATanh(float x) -{ - return atanhf(x); -} -template <> -HDI double myATanh(double x) -{ - return atanh(x); -} -/** @} */ - /** - * @defgroup LambdaOps Lambda operations in reduction kernels - * @{ + * This file is deprecated and will be removed in release 22.06. + * Please use the cuh version instead. */ -// IdxType mostly to be used for MainLambda in *Reduction kernels -template -struct Nop { - HDI Type operator()(Type in, IdxType i = 0) { return in; } -}; - -template -struct L1Op { - HDI Type operator()(Type in, IdxType i = 0) { return myAbs(in); } -}; - -template -struct L2Op { - HDI Type operator()(Type in, IdxType i = 0) { return in * in; } -}; - -template -struct Sum { - HDI Type operator()(Type a, Type b) { return a + b; } -}; -/** @} */ /** - * @defgroup Sign Obtain sign value - * @brief Obtain sign of x - * @param x input - * @return +1 if x >= 0 and -1 otherwise - * @{ + * DISCLAIMER: this file is deprecated: use lap.cuh instead */ -template -DI T signPrim(T x) -{ - return x < 0 ? -1 : +1; -} -template <> -DI float signPrim(float x) -{ - return signbit(x) == true ? -1.0f : +1.0f; -} -template <> -DI double signPrim(double x) -{ - return signbit(x) == true ? 
-1.0 : +1.0; -} -/** @} */ -/** - * @defgroup Max maximum of two numbers - * @brief Obtain maximum of two values - * @param x one item - * @param y second item - * @return maximum of two items - * @{ - */ -template -DI T maxPrim(T x, T y) -{ - return x > y ? x : y; -} -template <> -DI float maxPrim(float x, float y) -{ - return fmaxf(x, y); -} -template <> -DI double maxPrim(double x, double y) -{ - return fmax(x, y); -} -/** @} */ - -/** apply a warp-wide fence (useful from Volta+ archs) */ -DI void warpFence() -{ -#if __CUDA_ARCH__ >= 700 - __syncwarp(); -#endif -} - -/** warp-wide any boolean aggregator */ -DI bool any(bool inFlag, uint32_t mask = 0xffffffffu) -{ -#if CUDART_VERSION >= 9000 - inFlag = __any_sync(mask, inFlag); -#else - inFlag = __any(inFlag); -#endif - return inFlag; -} - -/** warp-wide all boolean aggregator */ -DI bool all(bool inFlag, uint32_t mask = 0xffffffffu) -{ -#if CUDART_VERSION >= 9000 - inFlag = __all_sync(mask, inFlag); -#else - inFlag = __all(inFlag); -#endif - return inFlag; -} - -/** - * @brief Shuffle the data inside a warp - * @tparam T the data type (currently assumed to be 4B) - * @param val value to be shuffled - * @param srcLane lane from where to shuffle - * @param width lane width - * @param mask mask of participating threads (Volta+) - * @return the shuffled data - */ -template -DI T shfl(T val, int srcLane, int width = WarpSize, uint32_t mask = 0xffffffffu) -{ -#if CUDART_VERSION >= 9000 - return __shfl_sync(mask, val, srcLane, width); -#else - return __shfl(val, srcLane, width); -#endif -} - -/** - * @brief Shuffle the data inside a warp from lower lane IDs - * @tparam T the data type (currently assumed to be 4B) - * @param val value to be shuffled - * @param delta lower lane ID delta from where to shuffle - * @param width lane width - * @param mask mask of participating threads (Volta+) - * @return the shuffled data - */ -template -DI T shfl_up(T val, int delta, int width = WarpSize, uint32_t mask = 0xffffffffu) -{ 
-#if CUDART_VERSION >= 9000 - return __shfl_up_sync(mask, val, delta, width); -#else - return __shfl_up(val, delta, width); -#endif -} - -/** - * @brief Shuffle the data inside a warp - * @tparam T the data type (currently assumed to be 4B) - * @param val value to be shuffled - * @param laneMask mask to be applied in order to perform xor shuffle - * @param width lane width - * @param mask mask of participating threads (Volta+) - * @return the shuffled data - */ -template -DI T shfl_xor(T val, int laneMask, int width = WarpSize, uint32_t mask = 0xffffffffu) -{ -#if CUDART_VERSION >= 9000 - return __shfl_xor_sync(mask, val, laneMask, width); -#else - return __shfl_xor(val, laneMask, width); -#endif -} - -/** - * @brief Four-way byte dot product-accumulate. - * @tparam T Four-byte integer: int or unsigned int - * @tparam S Either same as T or a 4-byte vector of the same signedness. - * - * @param a - * @param b - * @param c - * @return dot(a, b) + c - */ -template -DI auto dp4a(S a, S b, T c) -> T; - -template <> -DI auto dp4a(char4 a, char4 b, int c) -> int -{ -#if __CUDA_ARCH__ >= 610 - return __dp4a(a, b, c); -#else - c += static_cast(a.x) * static_cast(b.x); - c += static_cast(a.y) * static_cast(b.y); - c += static_cast(a.z) * static_cast(b.z); - c += static_cast(a.w) * static_cast(b.w); - return c; -#endif -} - -template <> -DI auto dp4a(uchar4 a, uchar4 b, unsigned int c) -> unsigned int -{ -#if __CUDA_ARCH__ >= 610 - return __dp4a(a, b, c); -#else - c += static_cast(a.x) * static_cast(b.x); - c += static_cast(a.y) * static_cast(b.y); - c += static_cast(a.z) * static_cast(b.z); - c += static_cast(a.w) * static_cast(b.w); - return c; -#endif -} - -template <> -DI auto dp4a(int a, int b, int c) -> int -{ -#if __CUDA_ARCH__ >= 610 - return __dp4a(a, b, c); -#else - return dp4a(*reinterpret_cast(&a), *reinterpret_cast(&b), c); -#endif -} - -template <> -DI auto dp4a(unsigned int a, unsigned int b, unsigned int c) -> unsigned int -{ -#if __CUDA_ARCH__ >= 610 - return 
__dp4a(a, b, c); -#else - return dp4a(*reinterpret_cast(&a), *reinterpret_cast(&b), c); -#endif -} - -/** - * @brief Warp-level sum reduction - * @param val input value - * @tparam T Value type to be reduced - * @return Reduction result. All lanes will have the valid result. - * @note Why not cub? Because cub doesn't seem to allow working with arbitrary - * number of warps in a block. All threads in the warp must enter this - * function together - * @todo Expand this to support arbitrary reduction ops - */ -template -DI T warpReduce(T val) -{ -#pragma unroll - for (int i = WarpSize / 2; i > 0; i >>= 1) { - T tmp = shfl_xor(val, i); - val += tmp; - } - return val; -} - -/** - * @brief 1-D block-level sum reduction - * @param val input value - * @param smem shared memory region needed for storing intermediate results. It - * must alteast be of size: `sizeof(T) * nWarps` - * @return only the thread0 will contain valid reduced result - * @note Why not cub? Because cub doesn't seem to allow working with arbitrary - * number of warps in a block. All threads in the block must enter this - * function together - * @todo Expand this to support arbitrary reduction ops - */ -template -DI T blockReduce(T val, char* smem) -{ - auto* sTemp = reinterpret_cast(smem); - int nWarps = (blockDim.x + WarpSize - 1) / WarpSize; - int lid = laneId(); - int wid = threadIdx.x / WarpSize; - val = warpReduce(val); - if (lid == 0) sTemp[wid] = val; - __syncthreads(); - val = lid < nWarps ? sTemp[lid] : T(0); - return warpReduce(val); -} +#pragma once -/** - * @brief Simple utility function to determine whether user_stream or one of the - * internal streams should be used. 
- * @param user_stream main user stream - * @param int_streams array of internal streams - * @param n_int_streams number of internal streams - * @param idx the index for which to query the stream - */ -inline cudaStream_t select_stream(cudaStream_t user_stream, - cudaStream_t* int_streams, - int n_int_streams, - int idx) -{ - return n_int_streams > 0 ? int_streams[idx % n_int_streams] : user_stream; -} +#pragma message(__FILE__ \ + " is deprecated and will be removed in a future release." \ + " Please use the raft/util version instead.") -} // namespace raft +#include diff --git a/cpp/include/raft/cudart_utils.h b/cpp/include/raft/cudart_utils.h index b4549e11c9..591f41629d 100644 --- a/cpp/include/raft/cudart_utils.h +++ b/cpp/include/raft/cudart_utils.h @@ -16,8 +16,8 @@ /** * This file is deprecated and will be removed in release 22.06. - * Please use core/cudart_utils.hpp instead. + * Please use util/cudart_utils.hpp instead. */ #pragma once -#include +#include diff --git a/cpp/include/raft/detail/mdarray.hpp b/cpp/include/raft/detail/mdarray.hpp index dd813a7c18..b61e82aaec 100644 --- a/cpp/include/raft/detail/mdarray.hpp +++ b/cpp/include/raft/detail/mdarray.hpp @@ -22,8 +22,8 @@ */ #pragma once #include -#include #include // dynamic_extent +#include #include #include diff --git a/cpp/include/raft/device_atomics.cuh b/cpp/include/raft/device_atomics.cuh index 28f7516688..a8bfc4d778 100644 --- a/cpp/include/raft/device_atomics.cuh +++ b/cpp/include/raft/device_atomics.cuh @@ -1,5 +1,5 @@ /* - * Copyright (c) 2019-2022, NVIDIA CORPORATION. + * Copyright (c) 2020-2022, NVIDIA CORPORATION. * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -13,656 +13,19 @@ * See the License for the specific language governing permissions and * limitations under the License. 
*/ - -#pragma once - -/** - * @brief overloads for CUDA atomic operations - * @file device_atomics.cuh - * - * Provides the overloads for arithmetic data types, where CUDA atomic operations are, `atomicAdd`, - * `atomicMin`, `atomicMax`, and `atomicCAS`. - * `atomicAnd`, `atomicOr`, `atomicXor` are also supported for integer data types. - * Also provides `raft::genericAtomicOperation` which performs atomic operation with the given - * binary operator. - */ - -#include -#include - -namespace raft { - -namespace device_atomics { -namespace detail { - -// ------------------------------------------------------------------------------------------------- -// Binary operators - -/* @brief binary `sum` operator */ -struct DeviceSum { - template ::value>* = nullptr> - __device__ T operator()(const T& lhs, const T& rhs) - { - return lhs + rhs; - } -}; - -/* @brief binary `min` operator */ -struct DeviceMin { - template - __device__ T operator()(const T& lhs, const T& rhs) - { - return lhs < rhs ? lhs : rhs; - } -}; - -/* @brief binary `max` operator */ -struct DeviceMax { - template - __device__ T operator()(const T& lhs, const T& rhs) - { - return lhs > rhs ? lhs : rhs; - } -}; - -/* @brief binary `product` operator */ -struct DeviceProduct { - template ::value>* = nullptr> - __device__ T operator()(const T& lhs, const T& rhs) - { - return lhs * rhs; - } -}; - -/* @brief binary `and` operator */ -struct DeviceAnd { - template ::value>* = nullptr> - __device__ T operator()(const T& lhs, const T& rhs) - { - return (lhs & rhs); - } -}; - -/* @brief binary `or` operator */ -struct DeviceOr { - template ::value>* = nullptr> - __device__ T operator()(const T& lhs, const T& rhs) - { - return (lhs | rhs); - } -}; - -/* @brief binary `xor` operator */ -struct DeviceXor { - template ::value>* = nullptr> - __device__ T operator()(const T& lhs, const T& rhs) - { - return (lhs ^ rhs); - } -}; - -// FIXME: remove this if C++17 is supported. 
-// `static_assert` requires a string literal at C++14. -#define errmsg_cast "size mismatch." - -template -__forceinline__ __device__ T_output type_reinterpret(T_input value) -{ - static_assert(sizeof(T_output) == sizeof(T_input), "type_reinterpret for different size"); - return *(reinterpret_cast(&value)); -} - -// ------------------------------------------------------------------------------------------------- -// the implementation of `genericAtomicOperation` - -template -struct genericAtomicOperationImpl; - -// single byte atomic operation -template -struct genericAtomicOperationImpl { - __forceinline__ __device__ T operator()(T* addr, T const& update_value, Op op) - { - using T_int = unsigned int; - - T_int* address_uint32 = reinterpret_cast(addr - (reinterpret_cast(addr) & 3)); - T_int shift = ((reinterpret_cast(addr) & 3) * 8); - - T_int old = *address_uint32; - T_int assumed; - - do { - assumed = old; - T target_value = T((old >> shift) & 0xff); - uint8_t updating_value = type_reinterpret(op(target_value, update_value)); - T_int new_value = (old & ~(0x000000ff << shift)) | (T_int(updating_value) << shift); - old = atomicCAS(address_uint32, assumed, new_value); - } while (assumed != old); - - return T((old >> shift) & 0xff); - } -}; - -// 2 bytes atomic operation -template -struct genericAtomicOperationImpl { - __forceinline__ __device__ T operator()(T* addr, T const& update_value, Op op) - { - using T_int = unsigned int; - bool is_32_align = (reinterpret_cast(addr) & 2) ? false : true; - T_int* address_uint32 = - reinterpret_cast(reinterpret_cast(addr) - (is_32_align ? 0 : 2)); - - T_int old = *address_uint32; - T_int assumed; - - do { - assumed = old; - T target_value = (is_32_align) ? T(old & 0xffff) : T(old >> 16); - uint16_t updating_value = type_reinterpret(op(target_value, update_value)); - - T_int new_value = (is_32_align) ? 
(old & 0xffff0000) | updating_value - : (old & 0xffff) | (T_int(updating_value) << 16); - old = atomicCAS(address_uint32, assumed, new_value); - } while (assumed != old); - - return (is_32_align) ? T(old & 0xffff) : T(old >> 16); - ; - } -}; - -// 4 bytes atomic operation -template -struct genericAtomicOperationImpl { - __forceinline__ __device__ T operator()(T* addr, T const& update_value, Op op) - { - using T_int = unsigned int; - T old_value = *addr; - T assumed{old_value}; - - if constexpr (std::is_same{} && (std::is_same{})) { - if (isnan(update_value)) { return old_value; } - } - - do { - assumed = old_value; - const T new_value = op(old_value, update_value); - - T_int ret = atomicCAS(reinterpret_cast(addr), - type_reinterpret(assumed), - type_reinterpret(new_value)); - old_value = type_reinterpret(ret); - } while (assumed != old_value); - - return old_value; - } -}; - -// 4 bytes fp32 atomic Max operation -template <> -struct genericAtomicOperationImpl { - using T = float; - __forceinline__ __device__ T operator()(T* addr, T const& update_value, DeviceMax op) - { - if (isnan(update_value)) { return *addr; } - - T old = (update_value >= 0) - ? 
__int_as_float(atomicMax((int*)addr, __float_as_int(update_value))) - : __uint_as_float(atomicMin((unsigned int*)addr, __float_as_uint(update_value))); - - return old; - } -}; - -// 8 bytes atomic operation -template -struct genericAtomicOperationImpl { - __forceinline__ __device__ T operator()(T* addr, T const& update_value, Op op) - { - using T_int = unsigned long long int; - static_assert(sizeof(T) == sizeof(T_int), errmsg_cast); - - T old_value = *addr; - T assumed{old_value}; - - do { - assumed = old_value; - const T new_value = op(old_value, update_value); - - T_int ret = atomicCAS(reinterpret_cast(addr), - type_reinterpret(assumed), - type_reinterpret(new_value)); - old_value = type_reinterpret(ret); - - } while (assumed != old_value); - - return old_value; - } -}; - -// ------------------------------------------------------------------------------------------------- -// specialized functions for operators -// `atomicAdd` supports int, unsigned int, unsigend long long int, float, double (long long int is -// not supproted.) `atomicMin`, `atomicMax` support int, unsigned int, unsigned long long int -// `atomicAnd`, `atomicOr`, `atomicXor` support int, unsigned int, unsigned long long int - -// CUDA natively supports `unsigned long long int` for `atomicAdd`, -// but doesn't supports `long int`. -// However, since the signed integer is represented as Two's complement, -// the fundamental arithmetic operations of addition are identical to -// those for unsigned binary numbers. 
-// Then, this computes as `unsigned long long int` with `atomicAdd` -// @sa https://en.wikipedia.org/wiki/Two%27s_complement -template <> -struct genericAtomicOperationImpl { - using T = long int; - __forceinline__ __device__ T operator()(T* addr, T const& update_value, DeviceSum op) - { - using T_int = unsigned long long int; - static_assert(sizeof(T) == sizeof(T_int), errmsg_cast); - T_int ret = atomicAdd(reinterpret_cast(addr), type_reinterpret(update_value)); - return type_reinterpret(ret); - } -}; - -template <> -struct genericAtomicOperationImpl { - using T = unsigned long int; - __forceinline__ __device__ T operator()(T* addr, T const& update_value, DeviceSum op) - { - using T_int = unsigned long long int; - static_assert(sizeof(T) == sizeof(T_int), errmsg_cast); - T_int ret = atomicAdd(reinterpret_cast(addr), type_reinterpret(update_value)); - return type_reinterpret(ret); - } -}; - -// CUDA natively supports `unsigned long long int` for `atomicAdd`, -// but doesn't supports `long long int`. -// However, since the signed integer is represented as Two's complement, -// the fundamental arithmetic operations of addition are identical to -// those for unsigned binary numbers. 
-// Then, this computes as `unsigned long long int` with `atomicAdd` -// @sa https://en.wikipedia.org/wiki/Two%27s_complement -template <> -struct genericAtomicOperationImpl { - using T = long long int; - __forceinline__ __device__ T operator()(T* addr, T const& update_value, DeviceSum op) - { - using T_int = unsigned long long int; - static_assert(sizeof(T) == sizeof(T_int), errmsg_cast); - T_int ret = atomicAdd(reinterpret_cast(addr), type_reinterpret(update_value)); - return type_reinterpret(ret); - } -}; - -template <> -struct genericAtomicOperationImpl { - using T = unsigned long int; - __forceinline__ __device__ T operator()(T* addr, T const& update_value, DeviceMin op) - { - using T_int = unsigned long long int; - static_assert(sizeof(T) == sizeof(T_int), errmsg_cast); - T ret = atomicMin(reinterpret_cast(addr), type_reinterpret(update_value)); - return type_reinterpret(ret); - } -}; - -template <> -struct genericAtomicOperationImpl { - using T = unsigned long int; - __forceinline__ __device__ T operator()(T* addr, T const& update_value, DeviceMax op) - { - using T_int = unsigned long long int; - static_assert(sizeof(T) == sizeof(T_int), errmsg_cast); - T ret = atomicMax(reinterpret_cast(addr), type_reinterpret(update_value)); - return type_reinterpret(ret); - } -}; - -template -struct genericAtomicOperationImpl { - __forceinline__ __device__ T operator()(T* addr, T const& update_value, DeviceAnd op) - { - using T_int = unsigned long long int; - static_assert(sizeof(T) == sizeof(T_int), errmsg_cast); - T_int ret = atomicAnd(reinterpret_cast(addr), type_reinterpret(update_value)); - return type_reinterpret(ret); - } -}; - -template -struct genericAtomicOperationImpl { - __forceinline__ __device__ T operator()(T* addr, T const& update_value, DeviceOr op) - { - using T_int = unsigned long long int; - static_assert(sizeof(T) == sizeof(T_int), errmsg_cast); - T_int ret = atomicOr(reinterpret_cast(addr), type_reinterpret(update_value)); - return 
type_reinterpret(ret); - } -}; - -template -struct genericAtomicOperationImpl { - __forceinline__ __device__ T operator()(T* addr, T const& update_value, DeviceXor op) - { - using T_int = unsigned long long int; - static_assert(sizeof(T) == sizeof(T_int), errmsg_cast); - T_int ret = atomicXor(reinterpret_cast(addr), type_reinterpret(update_value)); - return type_reinterpret(ret); - } -}; - -// ------------------------------------------------------------------------------------------------- -// the implementation of `typesAtomicCASImpl` - -template -struct typesAtomicCASImpl; - -template -struct typesAtomicCASImpl { - __forceinline__ __device__ T operator()(T* addr, T const& compare, T const& update_value) - { - using T_int = unsigned int; - - T_int shift = ((reinterpret_cast(addr) & 3) * 8); - T_int* address_uint32 = reinterpret_cast(addr - (reinterpret_cast(addr) & 3)); - - // the 'target_value' in `old` can be different from `compare` - // because other thread may update the value - // before fetching a value from `address_uint32` in this function - T_int old = *address_uint32; - T_int assumed; - T target_value; - uint8_t u_val = type_reinterpret(update_value); - - do { - assumed = old; - target_value = T((old >> shift) & 0xff); - // have to compare `target_value` and `compare` before calling atomicCAS - // the `target_value` in `old` can be different with `compare` - if (target_value != compare) break; - - T_int new_value = (old & ~(0x000000ff << shift)) | (T_int(u_val) << shift); - old = atomicCAS(address_uint32, assumed, new_value); - } while (assumed != old); - - return target_value; - } -}; - -template -struct typesAtomicCASImpl { - __forceinline__ __device__ T operator()(T* addr, T const& compare, T const& update_value) - { - using T_int = unsigned int; - - bool is_32_align = (reinterpret_cast(addr) & 2) ? false : true; - T_int* address_uint32 = - reinterpret_cast(reinterpret_cast(addr) - (is_32_align ? 
0 : 2)); - - T_int old = *address_uint32; - T_int assumed; - T target_value; - uint16_t u_val = type_reinterpret(update_value); - - do { - assumed = old; - target_value = (is_32_align) ? T(old & 0xffff) : T(old >> 16); - if (target_value != compare) break; - - T_int new_value = - (is_32_align) ? (old & 0xffff0000) | u_val : (old & 0xffff) | (T_int(u_val) << 16); - old = atomicCAS(address_uint32, assumed, new_value); - } while (assumed != old); - - return target_value; - } -}; - -template -struct typesAtomicCASImpl { - __forceinline__ __device__ T operator()(T* addr, T const& compare, T const& update_value) - { - using T_int = unsigned int; - - T_int ret = atomicCAS(reinterpret_cast(addr), - type_reinterpret(compare), - type_reinterpret(update_value)); - return type_reinterpret(ret); - } -}; - -// 8 bytes atomic operation -template -struct typesAtomicCASImpl { - __forceinline__ __device__ T operator()(T* addr, T const& compare, T const& update_value) - { - using T_int = unsigned long long int; - static_assert(sizeof(T) == sizeof(T_int), errmsg_cast); - - T_int ret = atomicCAS(reinterpret_cast(addr), - type_reinterpret(compare), - type_reinterpret(update_value)); - - return type_reinterpret(ret); - } -}; - -} // namespace detail -} // namespace device_atomics - -/** -------------------------------------------------------------------------* - * @brief compute atomic binary operation - * reads the `old` located at the `address` in global or shared memory, - * computes 'BinaryOp'('old', 'update_value'), - * and stores the result back to memory at the same address. - * These three operations are performed in one atomic transaction. 
- * - * The supported cudf types for `genericAtomicOperation` are: - * int8_t, int16_t, int32_t, int64_t, float, double - * - * @param[in] address The address of old value in global or shared memory - * @param[in] update_value The value to be computed - * @param[in] op The binary operator used for compute - * - * @returns The old value at `address` - * -------------------------------------------------------------------------**/ -template -typename std::enable_if_t::value, T> __forceinline__ __device__ -genericAtomicOperation(T* address, T const& update_value, BinaryOp op) -{ - auto fun = raft::device_atomics::detail::genericAtomicOperationImpl{}; - return T(fun(address, update_value, op)); -} - -// specialization for bool types -template -__forceinline__ __device__ bool genericAtomicOperation(bool* address, - bool const& update_value, - BinaryOp op) -{ - using T = bool; - // don't use underlying type to apply operation for bool - auto fun = raft::device_atomics::detail::genericAtomicOperationImpl{}; - return T(fun(address, update_value, op)); -} - -} // namespace raft - /** - * @brief Overloads for `atomicAdd` - * - * reads the `old` located at the `address` in global or shared memory, computes (old + val), and - * stores the result back to memory at the same address. These three operations are performed in one - * atomic transaction. - * - * The supported types for `atomicAdd` are: integers are floating point numbers. - * CUDA natively supports `int`, `unsigned int`, `unsigned long long int`, `float`, `double. 
- * - * @param[in] address The address of old value in global or shared memory - * @param[in] val The value to be added - * - * @returns The old value at `address` - */ -template -__forceinline__ __device__ T atomicAdd(T* address, T val) -{ - return raft::genericAtomicOperation(address, val, raft::device_atomics::detail::DeviceSum{}); -} - -/** - * @brief Overloads for `atomicMin` - * - * reads the `old` located at the `address` in global or shared memory, computes the minimum of old - * and val, and stores the result back to memory at the same address. These three operations are - * performed in one atomic transaction. - * - * The supported types for `atomicMin` are: integers are floating point numbers. - * CUDA natively supports `int`, `unsigend int`, `unsigned long long int`. - * - * @param[in] address The address of old value in global or shared memory - * @param[in] val The value to be computed - * - * @returns The old value at `address` + * This file is deprecated and will be removed in release 22.06. + * Please use the cuh version instead. */ -template -__forceinline__ __device__ T atomicMin(T* address, T val) -{ - return raft::genericAtomicOperation(address, val, raft::device_atomics::detail::DeviceMin{}); -} /** - * @brief Overloads for `atomicMax` - * - * reads the `old` located at the `address` in global or shared memory, computes the maximum of old - * and val, and stores the result back to memory at the same address. These three operations are - * performed in one atomic transaction. - * - * The supported types for `atomicMax` are: integers are floating point numbers. - * CUDA natively supports `int`, `unsigend int`, `unsigned long long int`. 
- * - * @param[in] address The address of old value in global or shared memory - * @param[in] val The value to be computed - * - * @returns The old value at `address` + * DISCLAIMER: this file is deprecated: use lap.cuh instead */ -template -__forceinline__ __device__ T atomicMax(T* address, T val) -{ - return raft::genericAtomicOperation(address, val, raft::device_atomics::detail::DeviceMax{}); -} -/** - * @brief Overloads for `atomicCAS` - * - * reads the `old` located at the `address` in global or shared memory, computes - * (`old` == `compare` ? `val` : `old`), and stores the result back to memory at the same address. - * These three operations are performed in one atomic transaction. - * - * The supported types for `atomicCAS` are: integers are floating point numbers. - * CUDA natively supports `int`, `unsigned int`, `unsigned long long int`, `unsigned short int`. - * - * @param[in] address The address of old value in global or shared memory - * @param[in] compare The value to be compared - * @param[in] val The value to be computed - * - * @returns The old value at `address` - */ -template -__forceinline__ __device__ T atomicCAS(T* address, T compare, T val) -{ - return raft::device_atomics::detail::typesAtomicCASImpl()(address, compare, val); -} - -/** - * @brief Overloads for `atomicAnd` - * - * reads the `old` located at the `address` in global or shared memory, computes (old & val), and - * stores the result back to memory at the same address. These three operations are performed in - * one atomic transaction. - * - * The supported types for `atomicAnd` are: integers. - * CUDA natively supports `int`, `unsigned int`, `unsigned long long int`. 
- * - * @param[in] address The address of old value in global or shared memory - * @param[in] val The value to be computed - * - * @returns The old value at `address` - */ -template ::value, T>* = nullptr> -__forceinline__ __device__ T atomicAnd(T* address, T val) -{ - return raft::genericAtomicOperation(address, val, raft::device_atomics::detail::DeviceAnd{}); -} - -/** - * @brief Overloads for `atomicOr` - * - * reads the `old` located at the `address` in global or shared memory, computes (old | val), and - * stores the result back to memory at the same address. These three operations are performed in - * one atomic transaction. - * - * The supported types for `atomicOr` are: integers. - * CUDA natively supports `int`, `unsigned int`, `unsigned long long int`. - * - * @param[in] address The address of old value in global or shared memory - * @param[in] val The value to be computed - * - * @returns The old value at `address` - */ -template ::value, T>* = nullptr> -__forceinline__ __device__ T atomicOr(T* address, T val) -{ - return raft::genericAtomicOperation(address, val, raft::device_atomics::detail::DeviceOr{}); -} +#pragma once -/** - * @brief Overloads for `atomicXor` - * - * reads the `old` located at the `address` in global or shared memory, computes (old ^ val), and - * stores the result back to memory at the same address. These three operations are performed in - * one atomic transaction. - * - * The supported types for `atomicXor` are: integers. - * CUDA natively supports `int`, `unsigned int`, `unsigned long long int`. 
- * - * @param[in] address The address of old value in global or shared memory - * @param[in] val The value to be computed - * - * @returns The old value at `address` - */ -template ::value, T>* = nullptr> -__forceinline__ __device__ T atomicXor(T* address, T val) -{ - return raft::genericAtomicOperation(address, val, raft::device_atomics::detail::DeviceXor{}); -} +#pragma message(__FILE__ \ + " is deprecated and will be removed in a future release." \ + " Please use the raft/util version instead.") -/** - * @brief: Warp aggregated atomic increment - * - * increments an atomic counter using all active threads in a warp. The return - * value is the original value of the counter plus the rank of the calling - * thread. - * - * The use of atomicIncWarp is a performance optimization. It can reduce the - * amount of atomic memory traffic by a factor of 32. - * - * Adapted from: - * https://developer.nvidia.com/blog/cuda-pro-tip-optimized-filtering-warp-aggregated-atomics/ - * - * @tparam T An integral type - * @param[in,out] ctr The address of old value - * - * @return The old value of the counter plus the rank of the calling thread. - */ -template ::value, T>* = nullptr> -__device__ T atomicIncWarp(T* ctr) -{ - namespace cg = cooperative_groups; - auto g = cg::coalesced_threads(); - T warp_res; - if (g.thread_rank() == 0) { warp_res = atomicAdd(ctr, static_cast(g.size())); } - return g.shfl(warp_res, 0) + g.thread_rank(); -} +#include diff --git a/cpp/include/raft/device_utils.cuh b/cpp/include/raft/device_utils.cuh index d89a484109..5e6cf47c7d 100644 --- a/cpp/include/raft/device_utils.cuh +++ b/cpp/include/raft/device_utils.cuh @@ -1,5 +1,5 @@ /* - * Copyright (c) 2021-2022, NVIDIA CORPORATION. + * Copyright (c) 2020-2022, NVIDIA CORPORATION. * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. 
@@ -13,96 +13,19 @@ * See the License for the specific language governing permissions and * limitations under the License. */ - -#pragma once - -#include -#include // pair - -namespace raft { - -// TODO move to raft https://github.com/rapidsai/raft/issues/90 -/** helper method to get the compute capability version numbers */ -inline std::pair getDeviceCapability() -{ - int devId; - RAFT_CUDA_TRY(cudaGetDevice(&devId)); - int major, minor; - RAFT_CUDA_TRY(cudaDeviceGetAttribute(&major, cudaDevAttrComputeCapabilityMajor, devId)); - RAFT_CUDA_TRY(cudaDeviceGetAttribute(&minor, cudaDevAttrComputeCapabilityMinor, devId)); - return std::make_pair(major, minor); -} - /** - * @brief Batched warp-level sum reduction - * - * @tparam T data type - * @tparam NThreads Number of threads in the warp doing independent reductions - * - * @param[in] val input value - * @return for the first "group" of threads, the reduced value. All - * others will contain unusable values! - * - * @note Why not cub? Because cub doesn't seem to allow working with arbitrary - * number of warps in a block and also doesn't support this kind of - * batched reduction operation - * @note All threads in the warp must enter this function together - * - * @todo Expand this to support arbitrary reduction ops + * This file is deprecated and will be removed in release 22.06. + * Please use the cuh version instead. */ -template -DI T batchedWarpReduce(T val) -{ -#pragma unroll - for (int i = NThreads; i < raft::WarpSize; i <<= 1) { - val += raft::shfl(val, raft::laneId() + i); - } - return val; -} /** - * @brief 1-D block-level batched sum reduction - * - * @tparam T data type - * @tparam NThreads Number of threads in the warp doing independent reductions - * - * @param val input value - * @param smem shared memory region needed for storing intermediate results. It - * must alteast be of size: `sizeof(T) * nWarps * NThreads` - * @return for the first "group" of threads in the block, the reduced value. 
- * All others will contain unusable values! - * - * @note Why not cub? Because cub doesn't seem to allow working with arbitrary - * number of warps in a block and also doesn't support this kind of - * batched reduction operation - * @note All threads in the block must enter this function together - * - * @todo Expand this to support arbitrary reduction ops + * DISCLAIMER: this file is deprecated: use lap.cuh instead */ -template -DI T batchedBlockReduce(T val, char* smem) -{ - auto* sTemp = reinterpret_cast(smem); - constexpr int nGroupsPerWarp = raft::WarpSize / NThreads; - static_assert(raft::isPo2(nGroupsPerWarp), "nGroupsPerWarp must be a PO2!"); - const int nGroups = (blockDim.x + NThreads - 1) / NThreads; - const int lid = raft::laneId(); - const int lgid = lid % NThreads; - const int gid = threadIdx.x / NThreads; - const auto wrIdx = (gid / nGroupsPerWarp) * NThreads + lgid; - const auto rdIdx = gid * NThreads + lgid; - for (int i = nGroups; i > 0;) { - auto iAligned = ((i + nGroupsPerWarp - 1) / nGroupsPerWarp) * nGroupsPerWarp; - if (gid < iAligned) { - val = batchedWarpReduce(val); - if (lid < NThreads) sTemp[wrIdx] = val; - } - __syncthreads(); - i /= nGroupsPerWarp; - if (i > 0) { val = gid < i ? sTemp[rdIdx] : T(0); } - __syncthreads(); - } - return val; -} -} // namespace raft +#pragma once + +#pragma message(__FILE__ \ + " is deprecated and will be removed in a future release." 
\ + " Please use the raft/util version instead.") + +#include diff --git a/cpp/include/raft/distance/detail/correlation.cuh b/cpp/include/raft/distance/detail/correlation.cuh index c88d5afeab..2b77d280fe 100644 --- a/cpp/include/raft/distance/detail/correlation.cuh +++ b/cpp/include/raft/distance/detail/correlation.cuh @@ -15,9 +15,9 @@ */ #pragma once -#include #include #include +#include namespace raft { namespace distance { diff --git a/cpp/include/raft/distance/detail/distance.cuh b/cpp/include/raft/distance/detail/distance.cuh index 4782afe46e..fa0c7a48cc 100644 --- a/cpp/include/raft/distance/detail/distance.cuh +++ b/cpp/include/raft/distance/detail/distance.cuh @@ -17,7 +17,6 @@ #pragma once #include -#include #include #include #include @@ -30,7 +29,8 @@ #include #include #include -#include +#include +#include #include namespace raft { diff --git a/cpp/include/raft/distance/detail/fused_l2_nn.cuh b/cpp/include/raft/distance/detail/fused_l2_nn.cuh index 308f8a096a..f46338943f 100644 --- a/cpp/include/raft/distance/detail/fused_l2_nn.cuh +++ b/cpp/include/raft/distance/detail/fused_l2_nn.cuh @@ -18,9 +18,9 @@ #include #include -#include #include #include +#include #include namespace raft { diff --git a/cpp/include/raft/distance/detail/pairwise_distance_base.cuh b/cpp/include/raft/distance/detail/pairwise_distance_base.cuh index 9d203c0c4f..27e9935358 100644 --- a/cpp/include/raft/distance/detail/pairwise_distance_base.cuh +++ b/cpp/include/raft/distance/detail/pairwise_distance_base.cuh @@ -14,11 +14,11 @@ * limitations under the License. 
*/ #pragma once -#include -#include #include #include -#include +#include +#include +#include #include diff --git a/cpp/include/raft/distance/distance.cuh b/cpp/include/raft/distance/distance.cuh index 3db1749bb4..4f9667e449 100644 --- a/cpp/include/raft/distance/distance.cuh +++ b/cpp/include/raft/distance/distance.cuh @@ -18,9 +18,9 @@ #pragma once +#include #include -#include -#include +#include #include #include diff --git a/cpp/include/raft/distance/distance_type.hpp b/cpp/include/raft/distance/distance_type.hpp index f75263b00d..f6eb4614f9 100644 --- a/cpp/include/raft/distance/distance_type.hpp +++ b/cpp/include/raft/distance/distance_type.hpp @@ -1,5 +1,5 @@ /* - * Copyright (c) 2021-2022, NVIDIA CORPORATION. + * Copyright (c) 2018-2022, NVIDIA CORPORATION. * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -13,57 +13,15 @@ * See the License for the specific language governing permissions and * limitations under the License. */ +/** + * This file is deprecated and will be removed at some point in a future release. + * Please use `raft/distance/distance_types.hpp` instead. + */ #pragma once -namespace raft { -namespace distance { - -/** enum to tell how to compute distance */ -enum DistanceType : unsigned short { +#pragma message(__FILE__ \ + " is deprecated and will be removed in a future release." 
\ + " Please use distance_types.hpp instead.") - /** evaluate as dist_ij = sum(x_ik^2) + sum(y_ij)^2 - 2*sum(x_ik * y_jk) */ - L2Expanded = 0, - /** same as above, but inside the epilogue, perform square root operation */ - L2SqrtExpanded = 1, - /** cosine distance */ - CosineExpanded = 2, - /** L1 distance */ - L1 = 3, - /** evaluate as dist_ij += (x_ik - y-jk)^2 */ - L2Unexpanded = 4, - /** same as above, but inside the epilogue, perform square root operation */ - L2SqrtUnexpanded = 5, - /** basic inner product **/ - InnerProduct = 6, - /** Chebyshev (Linf) distance **/ - Linf = 7, - /** Canberra distance **/ - Canberra = 8, - /** Generalized Minkowski distance **/ - LpUnexpanded = 9, - /** Correlation distance **/ - CorrelationExpanded = 10, - /** Jaccard distance **/ - JaccardExpanded = 11, - /** Hellinger distance **/ - HellingerExpanded = 12, - /** Haversine distance **/ - Haversine = 13, - /** Bray-Curtis distance **/ - BrayCurtis = 14, - /** Jensen-Shannon distance**/ - JensenShannon = 15, - /** Hamming distance **/ - HammingUnexpanded = 16, - /** KLDivergence **/ - KLDivergence = 17, - /** RusselRao **/ - RusselRaoExpanded = 18, - /** Dice-Sorensen distance **/ - DiceExpanded = 19, - /** Precomputed (special value) **/ - Precomputed = 100 -}; -}; // namespace distance -}; // end namespace raft +#include \ No newline at end of file diff --git a/cpp/include/raft/distance/distance_types.hpp b/cpp/include/raft/distance/distance_types.hpp new file mode 100644 index 0000000000..f75263b00d --- /dev/null +++ b/cpp/include/raft/distance/distance_types.hpp @@ -0,0 +1,69 @@ +/* + * Copyright (c) 2021-2022, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. 
+ * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#pragma once + +namespace raft { +namespace distance { + +/** enum to tell how to compute distance */ +enum DistanceType : unsigned short { + + /** evaluate as dist_ij = sum(x_ik^2) + sum(y_ij)^2 - 2*sum(x_ik * y_jk) */ + L2Expanded = 0, + /** same as above, but inside the epilogue, perform square root operation */ + L2SqrtExpanded = 1, + /** cosine distance */ + CosineExpanded = 2, + /** L1 distance */ + L1 = 3, + /** evaluate as dist_ij += (x_ik - y-jk)^2 */ + L2Unexpanded = 4, + /** same as above, but inside the epilogue, perform square root operation */ + L2SqrtUnexpanded = 5, + /** basic inner product **/ + InnerProduct = 6, + /** Chebyshev (Linf) distance **/ + Linf = 7, + /** Canberra distance **/ + Canberra = 8, + /** Generalized Minkowski distance **/ + LpUnexpanded = 9, + /** Correlation distance **/ + CorrelationExpanded = 10, + /** Jaccard distance **/ + JaccardExpanded = 11, + /** Hellinger distance **/ + HellingerExpanded = 12, + /** Haversine distance **/ + Haversine = 13, + /** Bray-Curtis distance **/ + BrayCurtis = 14, + /** Jensen-Shannon distance**/ + JensenShannon = 15, + /** Hamming distance **/ + HammingUnexpanded = 16, + /** KLDivergence **/ + KLDivergence = 17, + /** RusselRao **/ + RusselRaoExpanded = 18, + /** Dice-Sorensen distance **/ + DiceExpanded = 19, + /** Precomputed (special value) **/ + Precomputed = 100 +}; +}; // namespace distance +}; // end namespace raft diff --git a/cpp/include/raft/distance/fused_l2_nn.cuh b/cpp/include/raft/distance/fused_l2_nn.cuh index 121ccbf60d..c1cf790203 100644 --- 
a/cpp/include/raft/distance/fused_l2_nn.cuh +++ b/cpp/include/raft/distance/fused_l2_nn.cuh @@ -21,10 +21,10 @@ #include #include -#include +#include #include -#include #include +#include #include namespace raft { diff --git a/cpp/include/raft/integer_utils.h b/cpp/include/raft/integer_utils.h index a2ce7598c6..8962c3d713 100644 --- a/cpp/include/raft/integer_utils.h +++ b/cpp/include/raft/integer_utils.h @@ -1,6 +1,4 @@ /* - * Copyright 2019 BlazingDB, Inc. - * Copyright 2019 Eyal Rozenberg * Copyright (c) 2020-2022, NVIDIA CORPORATION. * * Licensed under the Apache License, Version 2.0 (the "License"); @@ -15,170 +13,19 @@ * See the License for the specific language governing permissions and * limitations under the License. */ - -#pragma once - -/** - * @file Utility code involving integer arithmetic - * - */ - -#include -#include - -namespace raft { -//! Utility functions -/** - * Finds the smallest integer not less than `number_to_round` and modulo `S` is - * zero. This function assumes that `number_to_round` is non-negative and - * `modulus` is positive. - */ -template -inline S round_up_safe(S number_to_round, S modulus) -{ - auto remainder = number_to_round % modulus; - if (remainder == 0) { return number_to_round; } - auto rounded_up = number_to_round - remainder + modulus; - if (rounded_up < number_to_round) { - throw std::invalid_argument("Attempt to round up beyond the type's maximum value"); - } - return rounded_up; -} - -/** - * Finds the largest integer not greater than `number_to_round` and modulo `S` is - * zero. This function assumes that `number_to_round` is non-negative and - * `modulus` is positive. - */ -template -inline S round_down_safe(S number_to_round, S modulus) -{ - auto remainder = number_to_round % modulus; - auto rounded_down = number_to_round - remainder; - return rounded_down; -} - -/** - * Divides the left-hand-side by the right-hand-side, rounding up - * to an integral multiple of the right-hand-side, e.g. 
(9,5) -> 2 , (10,5) -> 2, (11,5) -> 3. - * - * @param dividend the number to divide - * @param divisor the number by which to divide - * @return The least integer multiple of {@link divisor} which is greater than or equal to - * the non-integral division dividend/divisor. - * - * @note sensitive to overflow, i.e. if dividend > std::numeric_limits::max() - divisor, - * the result will be incorrect - */ -template -constexpr inline S div_rounding_up_unsafe(const S& dividend, const T& divisor) noexcept -{ - return (dividend + divisor - 1) / divisor; -} - -namespace detail { -template -constexpr inline I div_rounding_up_safe(std::integral_constant, - I dividend, - I divisor) noexcept -{ - // TODO: This could probably be implemented faster - return (dividend > divisor) ? 1 + div_rounding_up_unsafe(dividend - divisor, divisor) - : (dividend > 0); -} - -template -constexpr inline I div_rounding_up_safe(std::integral_constant, - I dividend, - I divisor) noexcept -{ - auto quotient = dividend / divisor; - auto remainder = dividend % divisor; - return quotient + (remainder != 0); -} - -} // namespace detail - /** - * Divides the left-hand-side by the right-hand-side, rounding up - * to an integral multiple of the right-hand-side, e.g. (9,5) -> 2 , (10,5) -> 2, (11,5) -> 3. - * - * @param dividend the number to divide - * @param divisor the number of by which to divide - * @return The least integer multiple of {@link divisor} which is greater than or equal to - * the non-integral division dividend/divisor. - * - * @note will not overflow, and may _or may not_ be slower than the intuitive - * approach of using (dividend + divisor - 1) / divisor + * This file is deprecated and will be removed in release 22.06. + * Please use the cuh version instead. 
*/ -template -constexpr inline std::enable_if_t::value, I> div_rounding_up_safe( - I dividend, I divisor) noexcept -{ - using i_is_a_signed_type = std::integral_constant::value>; - return detail::div_rounding_up_safe(i_is_a_signed_type{}, dividend, divisor); -} - -template -constexpr inline std::enable_if_t::value, bool> is_a_power_of_two( - I val) noexcept -{ - return ((val - 1) & val) == 0; -} /** - * @brief Return the absolute value of a number. - * - * This calls `std::abs()` which performs equivalent: `(value < 0) ? -value : value`. - * - * This was created to prevent compile errors calling `std::abs()` with unsigned integers. - * An example compile error appears as follows: - * @code{.pseudo} - * error: more than one instance of overloaded function "std::abs" matches the argument list: - * function "abs(int)" - * function "std::abs(long)" - * function "std::abs(long long)" - * function "std::abs(double)" - * function "std::abs(float)" - * function "std::abs(long double)" - * argument types are: (uint64_t) - * @endcode - * - * Not all cases could be if-ed out using `std::is_signed::value` and satisfy the compiler. - * - * @param value Numeric value can be either integer or float type. - * @return Absolute value if value type is signed. - */ -template -std::enable_if_t::value, T> constexpr inline absolute_value(T value) -{ - return std::abs(value); -} -// Unsigned type just returns itself. 
-template -std::enable_if_t::value, T> constexpr inline absolute_value(T value) -{ - return value; -} - -/** - * @defgroup Check whether the numeric conversion is narrowing - * - * @tparam From source type - * @tparam To destination type - * @{ + * DISCLAIMER: this file is deprecated: use lap.cuh instead */ -template -struct is_narrowing : std::true_type { -}; -template -struct is_narrowing()})>> : std::false_type { -}; -/** @} */ +#pragma once -/** Check whether the numeric conversion is narrowing */ -template -inline constexpr bool is_narrowing_v = is_narrowing::value; // NOLINT +#pragma message(__FILE__ \ + " is deprecated and will be removed in a future release." \ + " Please use the raft/util version instead.") -} // namespace raft +#include diff --git a/cpp/include/raft/label/detail/classlabels.cuh b/cpp/include/raft/label/detail/classlabels.cuh index a941751d78..0af1c70b91 100644 --- a/cpp/include/raft/label/detail/classlabels.cuh +++ b/cpp/include/raft/label/detail/classlabels.cuh @@ -18,9 +18,9 @@ #include -#include -#include #include +#include +#include #include #include diff --git a/cpp/include/raft/label/detail/merge_labels.cuh b/cpp/include/raft/label/detail/merge_labels.cuh index 1f62b3f0d6..f93a97d52b 100644 --- a/cpp/include/raft/label/detail/merge_labels.cuh +++ b/cpp/include/raft/label/detail/merge_labels.cuh @@ -19,9 +19,9 @@ #include #include -#include -#include #include +#include +#include namespace raft { namespace label { diff --git a/cpp/include/raft/lap/lap.cuh b/cpp/include/raft/lap/lap.cuh index e9a862e45a..ca7d5e96a9 100644 --- a/cpp/include/raft/lap/lap.cuh +++ b/cpp/include/raft/lap/lap.cuh @@ -1,6 +1,5 @@ /* * Copyright (c) 2020-2022, NVIDIA CORPORATION. - * Copyright 2020 KETAN DATE & RAKESH NAGI * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -13,283 +12,27 @@ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
* See the License for the specific language governing permissions and * limitations under the License. - * - * CUDA Implementation of O(n^3) alternating tree Hungarian Algorithm - * Authors: Ketan Date and Rakesh Nagi - * - * Article reference: - * Date, Ketan, and Rakesh Nagi. "GPU-accelerated Hungarian algorithms - * for the Linear Assignment Problem." Parallel Computing 57 (2016): 52-72. - * + */ +/** + * This file is deprecated and will be removed in release 22.06. + * Please use the cuh version instead. */ -#ifndef __LAP_H -#define __LAP_H +/** + * DISCLAIMER: this file is deprecated: use lap.cuh instead + */ #pragma once -#include -#include - -#include -#include - -#include "detail/d_structs.h" -#include "detail/lap_functions.cuh" - -namespace raft { -namespace lap { - -template -class LinearAssignmentProblem { - vertex_t size_; - vertex_t batchsize_; - weight_t epsilon_; - - weight_t const* d_costs_; - - Vertices d_vertices_dev; - VertexData d_row_data_dev, d_col_data_dev; - - raft::handle_t const& handle_; - rmm::device_uvector row_covers_v; - rmm::device_uvector col_covers_v; - rmm::device_uvector row_duals_v; - rmm::device_uvector col_duals_v; - rmm::device_uvector col_slacks_v; - rmm::device_uvector row_is_visited_v; - rmm::device_uvector col_is_visited_v; - rmm::device_uvector row_parents_v; - rmm::device_uvector col_parents_v; - rmm::device_uvector row_children_v; - rmm::device_uvector col_children_v; - rmm::device_uvector obj_val_primal_v; - rmm::device_uvector obj_val_dual_v; - - public: - LinearAssignmentProblem(raft::handle_t const& handle, - vertex_t size, - vertex_t batchsize, - weight_t epsilon) - : handle_(handle), - size_(size), - batchsize_(batchsize), - epsilon_(epsilon), - d_costs_(nullptr), - row_covers_v(0, handle_.get_stream()), - col_covers_v(0, handle_.get_stream()), - row_duals_v(0, handle_.get_stream()), - col_duals_v(0, handle_.get_stream()), - col_slacks_v(0, handle_.get_stream()), - row_is_visited_v(0, handle_.get_stream()), - 
col_is_visited_v(0, handle_.get_stream()), - row_parents_v(0, handle_.get_stream()), - col_parents_v(0, handle_.get_stream()), - row_children_v(0, handle_.get_stream()), - col_children_v(0, handle_.get_stream()), - obj_val_primal_v(0, handle_.get_stream()), - obj_val_dual_v(0, handle_.get_stream()) - { - } - - // Executes Hungarian algorithm on the input cost matrix. - void solve(weight_t const* d_cost_matrix, vertex_t* d_row_assignment, vertex_t* d_col_assignment) - { - initializeDevice(); - - d_vertices_dev.row_assignments = d_row_assignment; - d_vertices_dev.col_assignments = d_col_assignment; - - d_costs_ = d_cost_matrix; - - int step = 0; - - while (step != 100) { - switch (step) { - case 0: step = hungarianStep0(); break; - case 1: step = hungarianStep1(); break; - case 2: step = hungarianStep2(); break; - case 3: step = hungarianStep3(); break; - case 4: step = hungarianStep4(); break; - case 5: step = hungarianStep5(); break; - case 6: step = hungarianStep6(); break; - } - } - - d_costs_ = nullptr; - } - - // Function for getting optimal row dual vector for subproblem spId. - std::pair getRowDualVector(int spId) const - { - return std::make_pair(row_duals_v.data() + spId * size_, size_); - } - - // Function for getting optimal col dual vector for subproblem spId. - std::pair getColDualVector(int spId) - { - return std::make_pair(col_duals_v.data() + spId * size_, size_); - } - - // Function for getting optimal primal objective value for subproblem spId. - weight_t getPrimalObjectiveValue(int spId) - { - weight_t result; - raft::update_host(&result, obj_val_primal_v.data() + spId, 1, handle_.get_stream()); - CHECK_CUDA(handle_.get_stream()); - return result; - } - - // Function for getting optimal dual objective value for subproblem spId. 
- weight_t getDualObjectiveValue(int spId) - { - weight_t result; - raft::update_host(&result, obj_val_dual_v.data() + spId, 1, handle_.get_stream()); - CHECK_CUDA(handle_.get_stream()); - return result; - } - - private: - // Helper function for initializing global variables and arrays on a single host. - void initializeDevice() - { - cudaStream_t stream = handle_.get_stream(); - row_covers_v.resize(batchsize_ * size_, stream); - col_covers_v.resize(batchsize_ * size_, stream); - row_duals_v.resize(batchsize_ * size_, stream); - col_duals_v.resize(batchsize_ * size_, stream); - col_slacks_v.resize(batchsize_ * size_, stream); - row_is_visited_v.resize(batchsize_ * size_, stream); - col_is_visited_v.resize(batchsize_ * size_, stream); - row_parents_v.resize(batchsize_ * size_, stream); - col_parents_v.resize(batchsize_ * size_, stream); - row_children_v.resize(batchsize_ * size_, stream); - col_children_v.resize(batchsize_ * size_, stream); - obj_val_primal_v.resize(batchsize_, stream); - obj_val_dual_v.resize(batchsize_, stream); - - d_vertices_dev.row_covers = row_covers_v.data(); - d_vertices_dev.col_covers = col_covers_v.data(); - - d_vertices_dev.row_duals = row_duals_v.data(); - d_vertices_dev.col_duals = col_duals_v.data(); - d_vertices_dev.col_slacks = col_slacks_v.data(); - - d_row_data_dev.is_visited = row_is_visited_v.data(); - d_col_data_dev.is_visited = col_is_visited_v.data(); - d_row_data_dev.parents = row_parents_v.data(); - d_row_data_dev.children = row_children_v.data(); - d_col_data_dev.parents = col_parents_v.data(); - d_col_data_dev.children = col_children_v.data(); - - thrust::fill(thrust::device, row_covers_v.begin(), row_covers_v.end(), int{0}); - thrust::fill(thrust::device, col_covers_v.begin(), col_covers_v.end(), int{0}); - thrust::fill(thrust::device, row_duals_v.begin(), row_duals_v.end(), weight_t{0}); - thrust::fill(thrust::device, col_duals_v.begin(), col_duals_v.end(), weight_t{0}); - } - - // Function for calculating initial zeros 
by subtracting row and column minima from each element. - int hungarianStep0() - { - detail::initialReduction(handle_, d_costs_, d_vertices_dev, batchsize_, size_); - - return 1; - } - - // Function for calculating initial zeros by subtracting row and column minima from each element. - int hungarianStep1() - { - detail::computeInitialAssignments( - handle_, d_costs_, d_vertices_dev, batchsize_, size_, epsilon_); - - int next = 2; - - while (true) { - if ((next = hungarianStep2()) == 6) break; - - if ((next = hungarianStep3()) == 5) break; - - hungarianStep4(); - } - - return next; - } - - // Function for checking optimality and constructing predicates and covers. - int hungarianStep2() - { - int cover_count = detail::computeRowCovers( - handle_, d_vertices_dev, d_row_data_dev, d_col_data_dev, batchsize_, size_); - - int next = (cover_count == batchsize_ * size_) ? 6 : 3; - - return next; - } - - // Function for building alternating tree rooted at unassigned rows. - int hungarianStep3() - { - int next; - - rmm::device_scalar flag_v(handle_.get_stream()); - - bool h_flag = false; - flag_v.set_value_async(h_flag, handle_.get_stream()); - - detail::executeZeroCover(handle_, - d_costs_, - d_vertices_dev, - d_row_data_dev, - d_col_data_dev, - flag_v.data(), - batchsize_, - size_, - epsilon_); - - h_flag = flag_v.value(handle_.get_stream()); - - next = h_flag ? 4 : 5; - - return next; - } - - // Function for augmenting the solution along multiple node-disjoint alternating trees. - int hungarianStep4() - { - detail::reversePass(handle_, d_row_data_dev, d_col_data_dev, batchsize_, size_); - - detail::augmentationPass( - handle_, d_vertices_dev, d_row_data_dev, d_col_data_dev, batchsize_, size_); - - return 2; - } - - // Function for updating dual solution to introduce new zero-cost arcs. 
- int hungarianStep5() - { - detail::dualUpdate( - handle_, d_vertices_dev, d_row_data_dev, d_col_data_dev, batchsize_, size_, epsilon_); - - return 3; - } - - // Function for calculating primal and dual objective values at optimality. - int hungarianStep6() - { - detail::calcObjValPrimal(handle_, - obj_val_primal_v.data(), - d_costs_, - d_vertices_dev.row_assignments, - batchsize_, - size_); - - detail::calcObjValDual(handle_, obj_val_dual_v.data(), d_vertices_dev, batchsize_, size_); +#pragma message(__FILE__ \ + " is deprecated and will be removed in a future release." \ + " Please use the raft/solver version instead.") - return 100; - } -}; +#include -} // namespace lap -} // namespace raft +using raft::solver::VertexData; +using raft::solver::Vertices; -#endif \ No newline at end of file +namespace raft::lap { +using raft::solver::LinearAssignmentProblem; +} diff --git a/cpp/include/raft/lap/lap.hpp b/cpp/include/raft/lap/lap.hpp index badafb8afd..30f2b53e52 100644 --- a/cpp/include/raft/lap/lap.hpp +++ b/cpp/include/raft/lap/lap.hpp @@ -28,4 +28,4 @@ " is deprecated and will be removed in a future release." 
\ " Please use the cuh version instead.") -#include "lap.cuh" +#include diff --git a/cpp/include/raft/linalg/binary_op.cuh b/cpp/include/raft/linalg/binary_op.cuh index a85bf698f7..c3827f79bf 100644 --- a/cpp/include/raft/linalg/binary_op.cuh +++ b/cpp/include/raft/linalg/binary_op.cuh @@ -20,7 +20,7 @@ #include "detail/binary_op.cuh" -#include +#include namespace raft { namespace linalg { diff --git a/cpp/include/raft/linalg/detail/add.cuh b/cpp/include/raft/linalg/detail/add.cuh index 288ac228c9..3cd583faa5 100644 --- a/cpp/include/raft/linalg/detail/add.cuh +++ b/cpp/include/raft/linalg/detail/add.cuh @@ -18,9 +18,9 @@ #include "functional.cuh" -#include #include #include +#include #include diff --git a/cpp/include/raft/linalg/detail/axpy.cuh b/cpp/include/raft/linalg/detail/axpy.cuh index c0ce398de9..f3e1a177c8 100644 --- a/cpp/include/raft/linalg/detail/axpy.cuh +++ b/cpp/include/raft/linalg/detail/axpy.cuh @@ -20,7 +20,7 @@ #include "cublas_wrappers.hpp" -#include +#include namespace raft::linalg::detail { diff --git a/cpp/include/raft/linalg/detail/binary_op.cuh b/cpp/include/raft/linalg/detail/binary_op.cuh index 6b1f8bc6d7..d073e164fd 100644 --- a/cpp/include/raft/linalg/detail/binary_op.cuh +++ b/cpp/include/raft/linalg/detail/binary_op.cuh @@ -16,7 +16,7 @@ #pragma once -#include +#include namespace raft { namespace linalg { diff --git a/cpp/include/raft/linalg/detail/cholesky_r1_update.cuh b/cpp/include/raft/linalg/detail/cholesky_r1_update.cuh index df1fb0a1f3..a1d6ebbe6e 100644 --- a/cpp/include/raft/linalg/detail/cholesky_r1_update.cuh +++ b/cpp/include/raft/linalg/detail/cholesky_r1_update.cuh @@ -18,7 +18,7 @@ #include "cublas_wrappers.hpp" #include "cusolver_wrappers.hpp" -#include +#include #include namespace raft { diff --git a/cpp/include/raft/linalg/detail/coalesced_reduction.cuh b/cpp/include/raft/linalg/detail/coalesced_reduction.cuh index 7e545e4932..cf1b8cf5a5 100644 --- a/cpp/include/raft/linalg/detail/coalesced_reduction.cuh +++ 
b/cpp/include/raft/linalg/detail/coalesced_reduction.cuh @@ -17,7 +17,7 @@ #pragma once #include -#include +#include namespace raft { namespace linalg { diff --git a/cpp/include/raft/linalg/detail/contractions.cuh b/cpp/include/raft/linalg/detail/contractions.cuh index 0261d1967e..5d83f88e71 100644 --- a/cpp/include/raft/linalg/detail/contractions.cuh +++ b/cpp/include/raft/linalg/detail/contractions.cuh @@ -16,7 +16,7 @@ #pragma once -#include +#include namespace raft { namespace linalg { diff --git a/cpp/include/raft/linalg/detail/cublas_wrappers.hpp b/cpp/include/raft/linalg/detail/cublas_wrappers.hpp index a55e1d6d7c..03975b1b7d 100644 --- a/cpp/include/raft/linalg/detail/cublas_wrappers.hpp +++ b/cpp/include/raft/linalg/detail/cublas_wrappers.hpp @@ -18,7 +18,7 @@ #include #include -#include +#include #include #include diff --git a/cpp/include/raft/linalg/detail/cusolver_wrappers.hpp b/cpp/include/raft/linalg/detail/cusolver_wrappers.hpp index e7da615748..3eff920dd8 100644 --- a/cpp/include/raft/linalg/detail/cusolver_wrappers.hpp +++ b/cpp/include/raft/linalg/detail/cusolver_wrappers.hpp @@ -19,7 +19,7 @@ #include #include #include -#include +#include #include namespace raft { diff --git a/cpp/include/raft/linalg/detail/eig.cuh b/cpp/include/raft/linalg/detail/eig.cuh index 1d9a6bfa8f..dfd6bd4f7c 100644 --- a/cpp/include/raft/linalg/detail/eig.cuh +++ b/cpp/include/raft/linalg/detail/eig.cuh @@ -18,9 +18,9 @@ #include "cusolver_wrappers.hpp" #include -#include -#include +#include #include +#include #include #include diff --git a/cpp/include/raft/linalg/detail/gemm.hpp b/cpp/include/raft/linalg/detail/gemm.hpp index 50a8be6018..5742048864 100644 --- a/cpp/include/raft/linalg/detail/gemm.hpp +++ b/cpp/include/raft/linalg/detail/gemm.hpp @@ -20,7 +20,7 @@ #include "cublas_wrappers.hpp" -#include +#include namespace raft { namespace linalg { diff --git a/cpp/include/raft/linalg/detail/gemv.hpp b/cpp/include/raft/linalg/detail/gemv.hpp index ad2e5275cb..38fcdcd82e 
100644 --- a/cpp/include/raft/linalg/detail/gemv.hpp +++ b/cpp/include/raft/linalg/detail/gemv.hpp @@ -20,7 +20,7 @@ #include "cublas_wrappers.hpp" -#include +#include namespace raft { namespace linalg { diff --git a/cpp/include/raft/linalg/detail/lanczos.cuh b/cpp/include/raft/linalg/detail/lanczos.cuh index 9fa0d79875..5a3c595512 100644 --- a/cpp/include/raft/linalg/detail/lanczos.cuh +++ b/cpp/include/raft/linalg/detail/lanczos.cuh @@ -26,11 +26,11 @@ #include #include "cublas_wrappers.hpp" -#include -#include +#include #include #include #include +#include namespace raft { namespace linalg { diff --git a/cpp/include/raft/linalg/detail/lstsq.cuh b/cpp/include/raft/linalg/detail/lstsq.cuh index 4ce8275e08..1273956b21 100644 --- a/cpp/include/raft/linalg/detail/lstsq.cuh +++ b/cpp/include/raft/linalg/detail/lstsq.cuh @@ -18,7 +18,6 @@ #include #include -#include #include #include #include @@ -30,6 +29,7 @@ #include #include #include +#include #include #include #include diff --git a/cpp/include/raft/linalg/detail/map.cuh b/cpp/include/raft/linalg/detail/map.cuh index 56f1dd6f19..2c73521887 100644 --- a/cpp/include/raft/linalg/detail/map.cuh +++ b/cpp/include/raft/linalg/detail/map.cuh @@ -17,9 +17,9 @@ #pragma once #include -#include -#include -#include +#include +#include +#include namespace raft { namespace linalg { diff --git a/cpp/include/raft/linalg/detail/map_then_reduce.cuh b/cpp/include/raft/linalg/detail/map_then_reduce.cuh index 281861b2f9..9c0a21ee5c 100644 --- a/cpp/include/raft/linalg/detail/map_then_reduce.cuh +++ b/cpp/include/raft/linalg/detail/map_then_reduce.cuh @@ -17,9 +17,9 @@ #pragma once #include -#include -#include -#include +#include +#include +#include namespace raft { namespace linalg { diff --git a/cpp/include/raft/linalg/detail/reduce.cuh b/cpp/include/raft/linalg/detail/reduce.cuh index 4d5fa87202..f64631689a 100644 --- a/cpp/include/raft/linalg/detail/reduce.cuh +++ b/cpp/include/raft/linalg/detail/reduce.cuh @@ -16,9 +16,9 @@ #pragma 
once -#include #include #include +#include namespace raft { namespace linalg { diff --git a/cpp/include/raft/linalg/detail/reduce_cols_by_key.cuh b/cpp/include/raft/linalg/detail/reduce_cols_by_key.cuh index 54cf9aa204..b956fa900e 100644 --- a/cpp/include/raft/linalg/detail/reduce_cols_by_key.cuh +++ b/cpp/include/raft/linalg/detail/reduce_cols_by_key.cuh @@ -18,7 +18,7 @@ #include #include -#include +#include #include namespace raft { diff --git a/cpp/include/raft/linalg/detail/reduce_rows_by_key.cuh b/cpp/include/raft/linalg/detail/reduce_rows_by_key.cuh index 7550ce2093..007c05c0d4 100644 --- a/cpp/include/raft/linalg/detail/reduce_rows_by_key.cuh +++ b/cpp/include/raft/linalg/detail/reduce_rows_by_key.cuh @@ -16,7 +16,7 @@ #pragma once -#include +#include #include diff --git a/cpp/include/raft/linalg/detail/rsvd.cuh b/cpp/include/raft/linalg/detail/rsvd.cuh index 5487aead19..f96598d9e6 100644 --- a/cpp/include/raft/linalg/detail/rsvd.cuh +++ b/cpp/include/raft/linalg/detail/rsvd.cuh @@ -16,7 +16,6 @@ #pragma once -#include #include #include #include @@ -25,6 +24,7 @@ #include #include #include +#include #include diff --git a/cpp/include/raft/linalg/detail/strided_reduction.cuh b/cpp/include/raft/linalg/detail/strided_reduction.cuh index f7af9e88d6..d72bd54a32 100644 --- a/cpp/include/raft/linalg/detail/strided_reduction.cuh +++ b/cpp/include/raft/linalg/detail/strided_reduction.cuh @@ -18,8 +18,8 @@ #include "unary_op.cuh" #include -#include #include +#include #include namespace raft { diff --git a/cpp/include/raft/linalg/detail/subtract.cuh b/cpp/include/raft/linalg/detail/subtract.cuh index 084c6d2fd3..ae0f09d2fe 100644 --- a/cpp/include/raft/linalg/detail/subtract.cuh +++ b/cpp/include/raft/linalg/detail/subtract.cuh @@ -16,9 +16,9 @@ #pragma once -#include #include #include +#include namespace raft { namespace linalg { diff --git a/cpp/include/raft/linalg/detail/svd.cuh b/cpp/include/raft/linalg/detail/svd.cuh index aa33dcb0a9..97200a9919 100644 --- 
a/cpp/include/raft/linalg/detail/svd.cuh +++ b/cpp/include/raft/linalg/detail/svd.cuh @@ -23,11 +23,11 @@ #include #include -#include -#include -#include +#include #include #include +#include +#include #include #include diff --git a/cpp/include/raft/linalg/detail/ternary_op.cuh b/cpp/include/raft/linalg/detail/ternary_op.cuh index bcfcc9df01..46a5385d51 100644 --- a/cpp/include/raft/linalg/detail/ternary_op.cuh +++ b/cpp/include/raft/linalg/detail/ternary_op.cuh @@ -16,8 +16,8 @@ #pragma once -#include -#include +#include +#include namespace raft { namespace linalg { diff --git a/cpp/include/raft/linalg/detail/transpose.cuh b/cpp/include/raft/linalg/detail/transpose.cuh index 242d3a3912..4f65544058 100644 --- a/cpp/include/raft/linalg/detail/transpose.cuh +++ b/cpp/include/raft/linalg/detail/transpose.cuh @@ -18,8 +18,8 @@ #include "cublas_wrappers.hpp" +#include #include -#include #include #include #include diff --git a/cpp/include/raft/linalg/detail/unary_op.cuh b/cpp/include/raft/linalg/detail/unary_op.cuh index 9ddfe79657..cdadc6f868 100644 --- a/cpp/include/raft/linalg/detail/unary_op.cuh +++ b/cpp/include/raft/linalg/detail/unary_op.cuh @@ -16,9 +16,9 @@ #pragma once -#include -#include -#include +#include +#include +#include namespace raft { namespace linalg { diff --git a/cpp/include/raft/linalg/lanczos.cuh b/cpp/include/raft/linalg/lanczos.cuh index a7157adfab..c9f3e0010e 100644 --- a/cpp/include/raft/linalg/lanczos.cuh +++ b/cpp/include/raft/linalg/lanczos.cuh @@ -13,150 +13,24 @@ * See the License for the specific language governing permissions and * limitations under the License. 
*/ -#ifndef __LANCZOS_H -#define __LANCZOS_H - -#pragma once - -#include "detail/lanczos.cuh" -#include - -namespace raft { -namespace linalg { - -// ========================================================= -// Eigensolver -// ========================================================= - /** - * @brief Compute smallest eigenvectors of symmetric matrix - * Computes eigenvalues and eigenvectors that are least - * positive. If matrix is positive definite or positive - * semidefinite, the computed eigenvalues are smallest in - * magnitude. - * The largest eigenvalue is estimated by performing several - * Lanczos iterations. An implicitly restarted Lanczos method is - * then applied to A+s*I, where s is negative the largest - * eigenvalue. - * @tparam index_type_t the type of data used for indexing. - * @tparam value_type_t the type of data used for weights, distances. - * @param handle the raft handle. - * @param A Matrix. - * @param nEigVecs Number of eigenvectors to compute. - * @param maxIter Maximum number of Lanczos steps. Does not include - * Lanczos steps used to estimate largest eigenvalue. - * @param restartIter Maximum size of Lanczos system before - * performing an implicit restart. Should be at least 4. - * @param tol Convergence tolerance. Lanczos iteration will - * terminate when the residual norm is less than tol*theta, where - * theta is an estimate for the smallest unwanted eigenvalue - * (i.e. the (nEigVecs+1)th smallest eigenvalue). - * @param reorthogonalize Whether to reorthogonalize Lanczos - * vectors. - * @param iter On exit, pointer to total number of Lanczos - * iterations performed. Does not include Lanczos steps used to - * estimate largest eigenvalue. - * @param eigVals_dev (Output, device memory, nEigVecs entries) - * Smallest eigenvalues of matrix. - * @param eigVecs_dev (Output, device memory, n*nEigVecs entries) - * Eigenvectors corresponding to smallest eigenvalues of - * matrix. 
Vectors are stored as columns of a column-major matrix - * with dimensions n x nEigVecs. - * @param seed random seed. - * @return error flag. + * This file is deprecated and will be removed in release 22.06. + * Please use the cuh version instead. */ -template -int computeSmallestEigenvectors( - handle_t const& handle, - spectral::matrix::sparse_matrix_t const& A, - index_type_t nEigVecs, - index_type_t maxIter, - index_type_t restartIter, - value_type_t tol, - bool reorthogonalize, - index_type_t& iter, - value_type_t* __restrict__ eigVals_dev, - value_type_t* __restrict__ eigVecs_dev, - unsigned long long seed = 1234567) -{ - return detail::computeSmallestEigenvectors(handle, - A, - nEigVecs, - maxIter, - restartIter, - tol, - reorthogonalize, - iter, - eigVals_dev, - eigVecs_dev, - seed); -} /** - * @brief Compute largest eigenvectors of symmetric matrix - * Computes eigenvalues and eigenvectors that are least - * positive. If matrix is positive definite or positive - * semidefinite, the computed eigenvalues are largest in - * magnitude. - * The largest eigenvalue is estimated by performing several - * Lanczos iterations. An implicitly restarted Lanczos method is - * then applied to A+s*I, where s is negative the largest - * eigenvalue. - * @tparam index_type_t the type of data used for indexing. - * @tparam value_type_t the type of data used for weights, distances. - * @param handle the raft handle. - * @param A Matrix. - * @param nEigVecs Number of eigenvectors to compute. - * @param maxIter Maximum number of Lanczos steps. Does not include - * Lanczos steps used to estimate largest eigenvalue. - * @param restartIter Maximum size of Lanczos system before - * performing an implicit restart. Should be at least 4. - * @param tol Convergence tolerance. Lanczos iteration will - * terminate when the residual norm is less than tol*theta, where - * theta is an estimate for the largest unwanted eigenvalue - * (i.e. the (nEigVecs+1)th largest eigenvalue). 
- * @param reorthogonalize Whether to reorthogonalize Lanczos - * vectors. - * @param iter On exit, pointer to total number of Lanczos - * iterations performed. Does not include Lanczos steps used to - * estimate largest eigenvalue. - * @param eigVals_dev (Output, device memory, nEigVecs entries) - * Largest eigenvalues of matrix. - * @param eigVecs_dev (Output, device memory, n*nEigVecs entries) - * Eigenvectors corresponding to largest eigenvalues of - * matrix. Vectors are stored as columns of a column-major matrix - * with dimensions n x nEigVecs. - * @param seed random seed. - * @return error flag. + * DISCLAIMER: this file is deprecated: use lanczos.cuh instead */ -template -int computeLargestEigenvectors( - handle_t const& handle, - spectral::matrix::sparse_matrix_t const& A, - index_type_t nEigVecs, - index_type_t maxIter, - index_type_t restartIter, - value_type_t tol, - bool reorthogonalize, - index_type_t& iter, - value_type_t* __restrict__ eigVals_dev, - value_type_t* __restrict__ eigVecs_dev, - unsigned long long seed = 123456) -{ - return detail::computeLargestEigenvectors(handle, - A, - nEigVecs, - maxIter, - restartIter, - tol, - reorthogonalize, - iter, - eigVals_dev, - eigVecs_dev, - seed); -} -} // namespace linalg -} // namespace raft +#pragma once + +#pragma message(__FILE__ \ + " is deprecated and will be removed in a future release." \ + " Please use the sparse solvers version instead.") + +#include -#endif \ No newline at end of file +namespace raft::linalg { +using raft::sparse::solver::computeLargestEigenvectors; +using raft::sparse::solver::computeSmallestEigenvectors; +} // namespace raft::linalg diff --git a/cpp/include/raft/linalg/lanczos.hpp b/cpp/include/raft/linalg/lanczos.hpp index 0529db6b5b..2141e4e908 100644 --- a/cpp/include/raft/linalg/lanczos.hpp +++ b/cpp/include/raft/linalg/lanczos.hpp @@ -26,6 +26,6 @@ #pragma message(__FILE__ \ " is deprecated and will be removed in a future release." 
\ - " Please use the cuh version instead.") + " Please use the sparse/solvers version instead.") -#include "lanczos.cuh" +#include diff --git a/cpp/include/raft/linalg/lstsq.cuh b/cpp/include/raft/linalg/lstsq.cuh index 255f1293f4..1a4c5cf704 100644 --- a/cpp/include/raft/linalg/lstsq.cuh +++ b/cpp/include/raft/linalg/lstsq.cuh @@ -18,7 +18,7 @@ #pragma once -#include +#include #include namespace raft { namespace linalg { diff --git a/cpp/include/raft/linalg/power.cuh b/cpp/include/raft/linalg/power.cuh index f94fcfc894..69f3e4d22b 100644 --- a/cpp/include/raft/linalg/power.cuh +++ b/cpp/include/raft/linalg/power.cuh @@ -18,9 +18,9 @@ #pragma once -#include #include #include +#include namespace raft { namespace linalg { diff --git a/cpp/include/raft/linalg/sqrt.cuh b/cpp/include/raft/linalg/sqrt.cuh index b58bc752ac..c81e38eace 100644 --- a/cpp/include/raft/linalg/sqrt.cuh +++ b/cpp/include/raft/linalg/sqrt.cuh @@ -18,8 +18,8 @@ #pragma once -#include #include +#include namespace raft { namespace linalg { diff --git a/cpp/include/raft/matrix/detail/columnWiseSort.cuh b/cpp/include/raft/matrix/detail/columnWiseSort.cuh index 65febcb6d8..97345aecb6 100644 --- a/cpp/include/raft/matrix/detail/columnWiseSort.cuh +++ b/cpp/include/raft/matrix/detail/columnWiseSort.cuh @@ -20,7 +20,7 @@ #include #include #include -#include +#include #define INST_BLOCK_SORT(keyIn, keyOut, valueInOut, rows, columns, blockSize, elemPT, stream) \ devKeyValSortColumnPerRow<<>>( \ diff --git a/cpp/include/raft/matrix/detail/linewise_op.cuh b/cpp/include/raft/matrix/detail/linewise_op.cuh index ee703c5138..15f5204382 100644 --- a/cpp/include/raft/matrix/detail/linewise_op.cuh +++ b/cpp/include/raft/matrix/detail/linewise_op.cuh @@ -16,9 +16,9 @@ #pragma once -#include -#include -#include +#include +#include +#include #include diff --git a/cpp/include/raft/matrix/detail/math.cuh b/cpp/include/raft/matrix/detail/math.cuh index 9e996e19d9..95953feca4 100644 --- 
a/cpp/include/raft/matrix/detail/math.cuh +++ b/cpp/include/raft/matrix/detail/math.cuh @@ -16,14 +16,14 @@ #pragma once -#include +#include #include -#include #include #include #include #include +#include #include #include diff --git a/cpp/include/raft/matrix/detail/matrix.cuh b/cpp/include/raft/matrix/detail/matrix.cuh index 3683132161..a8568b0859 100644 --- a/cpp/include/raft/matrix/detail/matrix.cuh +++ b/cpp/include/raft/matrix/detail/matrix.cuh @@ -16,8 +16,8 @@ #pragma once -#include -#include +#include +#include #include @@ -28,9 +28,9 @@ #include #include #include -#include -#include +#include #include +#include namespace raft { namespace matrix { diff --git a/cpp/include/raft/pow2_utils.cuh b/cpp/include/raft/pow2_utils.cuh index 93f81db1ac..f1ecabf0eb 100644 --- a/cpp/include/raft/pow2_utils.cuh +++ b/cpp/include/raft/pow2_utils.cuh @@ -1,5 +1,5 @@ /* - * Copyright (c) 2021, NVIDIA CORPORATION. + * Copyright (c) 2020-2022, NVIDIA CORPORATION. * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -13,152 +13,19 @@ * See the License for the specific language governing permissions and * limitations under the License. */ - -#pragma once - -#include "cuda_utils.cuh" - -namespace raft { - /** - * @brief Fast arithmetics and alignment checks for power-of-two values known at compile time. - * - * @tparam Value_ a compile-time value representable as a power-of-two. + * This file is deprecated and will be removed in release 22.06. + * Please use the raft/util version instead.
*/ -template -struct Pow2 { - typedef decltype(Value_) Type; - static constexpr Type Value = Value_; - static constexpr Type Log2 = log2(Value); - static constexpr Type Mask = Value - 1; - - static_assert(std::is_integral::value, "Value must be integral."); - static_assert(Value && !(Value & Mask), "Value must be power of two."); - -#define Pow2_FUNC_QUALIFIER static constexpr __host__ __device__ __forceinline__ -#define Pow2_WHEN_INTEGRAL(I) std::enable_if_t -#define Pow2_IS_REPRESENTABLE_AS(I) (std::is_integral::value && Type(I(Value)) == Value) - - /** - * Integer division by Value truncated toward zero - * (same as `x / Value` in C++). - * - * Invariant: `x = Value * quot(x) + rem(x)` - */ - template - Pow2_FUNC_QUALIFIER Pow2_WHEN_INTEGRAL(I) quot(I x) noexcept - { - if constexpr (std::is_signed::value) return (x >> I(Log2)) + (x < 0 && (x & I(Mask))); - if constexpr (std::is_unsigned::value) return x >> I(Log2); - } - /** - * Remainder of integer division by Value truncated toward zero - * (same as `x % Value` in C++). - * - * Invariant: `x = Value * quot(x) + rem(x)`. - */ - template - Pow2_FUNC_QUALIFIER Pow2_WHEN_INTEGRAL(I) rem(I x) noexcept - { - if constexpr (std::is_signed::value) return x < 0 ? -((-x) & I(Mask)) : (x & I(Mask)); - if constexpr (std::is_unsigned::value) return x & I(Mask); - } - - /** - * Integer division by Value truncated toward negative infinity - * (same as `x // Value` in Python). - * - * Invariant: `x = Value * div(x) + mod(x)`. - * - * Note, `div` and `mod` for negative values are slightly faster - * than `quot` and `rem`, but behave slightly different - * compared to normal C++ operators `/` and `%`. - */ - template - Pow2_FUNC_QUALIFIER Pow2_WHEN_INTEGRAL(I) div(I x) noexcept - { - return x >> I(Log2); - } - - /** - * x modulo Value operation (remainder of the `div(x)`) - * (same as `x % Value` in Python). - * - * Invariant: `mod(x) >= 0` - * Invariant: `x = Value * div(x) + mod(x)`. 
- * - * Note, `div` and `mod` for negative values are slightly faster - * than `quot` and `rem`, but behave slightly different - * compared to normal C++ operators `/` and `%`. - */ - template - Pow2_FUNC_QUALIFIER Pow2_WHEN_INTEGRAL(I) mod(I x) noexcept - { - return x & I(Mask); - } - -#define Pow2_CHECK_TYPE(T) \ - static_assert(std::is_pointer::value || std::is_integral::value, \ - "Only pointer or integral types make sense here") - - /** - * Tell whether the pointer or integral is Value-aligned. - * NB: for pointers, the alignment is checked in bytes, not in elements. - */ - template - Pow2_FUNC_QUALIFIER bool isAligned(PtrT p) noexcept - { - Pow2_CHECK_TYPE(PtrT); - if constexpr (Pow2_IS_REPRESENTABLE_AS(PtrT)) return mod(p) == 0; - if constexpr (!Pow2_IS_REPRESENTABLE_AS(PtrT)) return mod(reinterpret_cast(p)) == 0; - } - - /** Tell whether two pointers have the same address modulo Value. */ - template - Pow2_FUNC_QUALIFIER bool areSameAlignOffsets(PtrT a, PtrS b) noexcept - { - Pow2_CHECK_TYPE(PtrT); - Pow2_CHECK_TYPE(PtrS); - Type x, y; - if constexpr (Pow2_IS_REPRESENTABLE_AS(PtrT)) - x = Type(mod(a)); - else - x = mod(reinterpret_cast(a)); - if constexpr (Pow2_IS_REPRESENTABLE_AS(PtrS)) - y = Type(mod(b)); - else - y = mod(reinterpret_cast(b)); - return x == y; - } +/** + * DISCLAIMER: this file is deprecated: use the raft/util version instead + */ - /** Get this or next Value-aligned address (in bytes) or integral. */ - template - Pow2_FUNC_QUALIFIER PtrT roundUp(PtrT p) noexcept - { - Pow2_CHECK_TYPE(PtrT); - if constexpr (Pow2_IS_REPRESENTABLE_AS(PtrT)) return (p + PtrT(Mask)) & PtrT(~Mask); - if constexpr (!Pow2_IS_REPRESENTABLE_AS(PtrT)) { - auto x = reinterpret_cast(p); - return reinterpret_cast((x + Mask) & (~Mask)); - } - } +#pragma once - /** Get this or previous Value-aligned address (in bytes) or integral.
*/ - template - Pow2_FUNC_QUALIFIER PtrT roundDown(PtrT p) noexcept - { - Pow2_CHECK_TYPE(PtrT); - if constexpr (Pow2_IS_REPRESENTABLE_AS(PtrT)) return p & PtrT(~Mask); - if constexpr (!Pow2_IS_REPRESENTABLE_AS(PtrT)) { - auto x = reinterpret_cast(p); - return reinterpret_cast(x & (~Mask)); - } - } -#undef Pow2_CHECK_TYPE -#undef Pow2_IS_REPRESENTABLE_AS -#undef Pow2_FUNC_QUALIFIER -#undef Pow2_WHEN_INTEGRAL -}; +#pragma message(__FILE__ \ + " is deprecated and will be removed in a future release." \ + " Please use the raft/util version instead.") -}; // namespace raft +#include diff --git a/cpp/include/raft/random/detail/make_blobs.cuh b/cpp/include/raft/random/detail/make_blobs.cuh index f214abce58..212245a9bf 100644 --- a/cpp/include/raft/random/detail/make_blobs.cuh +++ b/cpp/include/raft/random/detail/make_blobs.cuh @@ -17,11 +17,11 @@ #pragma once #include "permute.cuh" -#include -#include #include #include #include +#include +#include #include #include diff --git a/cpp/include/raft/random/detail/make_regression.cuh b/cpp/include/raft/random/detail/make_regression.cuh index 5556abb8e8..f06e20d4a6 100644 --- a/cpp/include/raft/random/detail/make_regression.cuh +++ b/cpp/include/raft/random/detail/make_regression.cuh @@ -22,8 +22,7 @@ #include -#include -#include +#include #include #include #include @@ -32,6 +31,7 @@ #include #include #include +#include #include namespace raft::random { diff --git a/cpp/include/raft/random/detail/multi_variable_gaussian.cuh b/cpp/include/raft/random/detail/multi_variable_gaussian.cuh index 15789742fd..636d31c04e 100644 --- a/cpp/include/raft/random/detail/multi_variable_gaussian.cuh +++ b/cpp/include/raft/random/detail/multi_variable_gaussian.cuh @@ -17,13 +17,13 @@ #pragma once #include "curand_wrappers.hpp" #include -#include -#include -#include +#include #include #include #include #include +#include +#include #include // mvg.cuh takes in matrices that are colomn major (as in fortan) diff --git 
a/cpp/include/raft/random/detail/permute.cuh b/cpp/include/raft/random/detail/permute.cuh index 28eaf9136c..9582f69e34 100644 --- a/cpp/include/raft/random/detail/permute.cuh +++ b/cpp/include/raft/random/detail/permute.cuh @@ -18,9 +18,9 @@ #include #include -#include -#include -#include +#include +#include +#include namespace raft::random { namespace detail { diff --git a/cpp/include/raft/random/detail/rmat_rectangular_generator.cuh b/cpp/include/raft/random/detail/rmat_rectangular_generator.cuh index 8a1f23e785..ddb7214a1a 100644 --- a/cpp/include/raft/random/detail/rmat_rectangular_generator.cuh +++ b/cpp/include/raft/random/detail/rmat_rectangular_generator.cuh @@ -16,10 +16,10 @@ #pragma once -#include -#include #include #include +#include +#include namespace raft { namespace random { diff --git a/cpp/include/raft/random/detail/rng_device.cuh b/cpp/include/raft/random/detail/rng_device.cuh index f1e3389924..8f0bf9fe53 100644 --- a/cpp/include/raft/random/detail/rng_device.cuh +++ b/cpp/include/raft/random/detail/rng_device.cuh @@ -16,8 +16,8 @@ #pragma once -#include #include +#include #include diff --git a/cpp/include/raft/random/detail/rng_impl.cuh b/cpp/include/raft/random/detail/rng_impl.cuh index eead64942f..d4471a4560 100644 --- a/cpp/include/raft/random/detail/rng_impl.cuh +++ b/cpp/include/raft/random/detail/rng_impl.cuh @@ -16,11 +16,11 @@ #pragma once -#include -#include -#include #include #include +#include +#include +#include namespace raft { namespace random { diff --git a/cpp/include/raft/random/detail/rng_impl_deprecated.cuh b/cpp/include/raft/random/detail/rng_impl_deprecated.cuh index 29af59d502..f9b55dd9d0 100644 --- a/cpp/include/raft/random/detail/rng_impl_deprecated.cuh +++ b/cpp/include/raft/random/detail/rng_impl_deprecated.cuh @@ -23,11 +23,11 @@ #include "rng_device.cuh" #include -#include -#include -#include -#include +#include #include +#include +#include +#include #include #include diff --git 
a/cpp/include/raft/lap/detail/lap_functions.cuh b/cpp/include/raft/solver/detail/lap_functions.cuh similarity index 92% rename from cpp/include/raft/lap/detail/lap_functions.cuh rename to cpp/include/raft/solver/detail/lap_functions.cuh index 1c97392a87..cbfe12fd23 100644 --- a/cpp/include/raft/lap/detail/lap_functions.cuh +++ b/cpp/include/raft/solver/detail/lap_functions.cuh @@ -24,11 +24,11 @@ */ #pragma once -#include "d_structs.h" +#include -#include -#include -#include +#include +#include +#include #include #include @@ -39,9 +39,7 @@ #include -namespace raft { -namespace lap { -namespace detail { +namespace raft::solver::detail { const int BLOCKDIMX{64}; const int BLOCKDIMY{1}; @@ -110,8 +108,7 @@ inline void initialReduction(raft::handle_t const& handle, dim3 threads_per_block; int total_blocks = 0; - raft::lap::detail::calculateRectangularDims( - blocks_per_grid, threads_per_block, total_blocks, N, SP); + detail::calculateRectangularDims(blocks_per_grid, threads_per_block, total_blocks, N, SP); kernel_rowReduction<<>>( d_costs, d_vertices_dev.row_duals, SP, N, std::numeric_limits::max()); @@ -149,8 +146,7 @@ inline void computeInitialAssignments(raft::handle_t const& handle, thrust::fill_n(thrust::device, row_lock_v.data(), size, 0); thrust::fill_n(thrust::device, col_lock_v.data(), size, 0); - raft::lap::detail::calculateRectangularDims( - blocks_per_grid, threads_per_block, total_blocks, N, SP); + detail::calculateRectangularDims(blocks_per_grid, threads_per_block, total_blocks, N, SP); kernel_computeInitialAssignments<<>>( d_costs, @@ -191,8 +187,7 @@ inline int computeRowCovers(raft::handle_t const& handle, thrust::fill_n(thrust::device, d_col_data.parents, size, vertex_t{-1}); thrust::fill_n(thrust::device, d_col_data.children, size, vertex_t{-1}); - raft::lap::detail::calculateRectangularDims( - blocks_per_grid, threads_per_block, total_blocks, N, SP); + detail::calculateRectangularDims(blocks_per_grid, threads_per_block, total_blocks, N, SP); 
kernel_computeRowCovers<<>>( d_vertices.row_assignments, d_vertices.row_covers, d_row_data.is_visited, SP, N); @@ -219,8 +214,7 @@ inline void coverZeroAndExpand(raft::handle_t const& handle, dim3 blocks_per_grid; dim3 threads_per_block; - raft::lap::detail::calculateRectangularDims( - blocks_per_grid, threads_per_block, total_blocks, N, SP); + detail::calculateRectangularDims(blocks_per_grid, threads_per_block, total_blocks, N, SP); kernel_coverAndExpand<<>>( d_flag, @@ -266,8 +260,7 @@ inline vertex_t zeroCoverIteration(raft::handle_t const& handle, thrust::fill_n(thrust::device, csr_ptrs_v.data(), (SP + 1), vertex_t{-1}); - raft::lap::detail::calculateRectangularDims( - blocks_per_grid, threads_per_block, total_blocks, N, SP); + detail::calculateRectangularDims(blocks_per_grid, threads_per_block, total_blocks, N, SP); // construct predicate matrix for edges. kernel_rowPredicateConstructionCSR<< predicates_v(size, handle.get_stream()); rmm::device_uvector addresses_v(size, handle.get_stream()); @@ -375,8 +368,7 @@ inline void reversePass(raft::handle_t const& handle, int total_blocks_1 = 0; dim3 blocks_per_grid_1; dim3 threads_per_block_1; - raft::lap::detail::calculateLinearDims( - blocks_per_grid_1, threads_per_block_1, total_blocks_1, csr_size); + detail::calculateLinearDims(blocks_per_grid_1, threads_per_block_1, total_blocks_1, csr_size); rmm::device_uvector elements_v(csr_size, handle.get_stream()); @@ -403,7 +395,7 @@ inline void augmentationPass(raft::handle_t const& handle, int total_blocks = 0; dim3 blocks_per_grid; dim3 threads_per_block; - raft::lap::detail::calculateLinearDims(blocks_per_grid, threads_per_block, total_blocks, SP * N); + detail::calculateLinearDims(blocks_per_grid, threads_per_block, total_blocks, SP * N); rmm::device_uvector predicates_v(SP * N, handle.get_stream()); rmm::device_uvector addresses_v(SP * N, handle.get_stream()); @@ -432,7 +424,7 @@ inline void augmentationPass(raft::handle_t const& handle, int total_blocks_1 = 0; dim3 
blocks_per_grid_1; dim3 threads_per_block_1; - raft::lap::detail::calculateLinearDims( + detail::calculateLinearDims( blocks_per_grid_1, threads_per_block_1, total_blocks_1, row_ids_csr_size); rmm::device_uvector elements_v(row_ids_csr_size, handle.get_stream()); @@ -470,7 +462,7 @@ inline void dualUpdate(raft::handle_t const& handle, rmm::device_uvector sp_min_v(SP, handle.get_stream()); - raft::lap::detail::calculateLinearDims(blocks_per_grid, threads_per_block, total_blocks, SP); + detail::calculateLinearDims(blocks_per_grid, threads_per_block, total_blocks, SP); kernel_dualUpdate_1<<>>( sp_min_v.data(), d_vertices_dev.col_slacks, @@ -481,8 +473,7 @@ inline void dualUpdate(raft::handle_t const& handle, CHECK_CUDA(handle.get_stream()); - raft::lap::detail::calculateRectangularDims( - blocks_per_grid, threads_per_block, total_blocks, N, SP); + detail::calculateRectangularDims(blocks_per_grid, threads_per_block, total_blocks, N, SP); kernel_dualUpdate_2<<>>( sp_min_v.data(), d_vertices_dev.row_duals, @@ -512,7 +503,7 @@ inline void calcObjValDual(raft::handle_t const& handle, dim3 threads_per_block; int total_blocks = 0; - raft::lap::detail::calculateLinearDims(blocks_per_grid, threads_per_block, total_blocks, SP); + detail::calculateLinearDims(blocks_per_grid, threads_per_block, total_blocks, SP); kernel_calcObjValDual<<>>( d_obj_val, d_vertices_dev.row_duals, d_vertices_dev.col_duals, SP, N); @@ -533,7 +524,7 @@ inline void calcObjValPrimal(raft::handle_t const& handle, dim3 threads_per_block; int total_blocks = 0; - raft::lap::detail::calculateLinearDims(blocks_per_grid, threads_per_block, total_blocks, SP); + detail::calculateLinearDims(blocks_per_grid, threads_per_block, total_blocks, SP); kernel_calcObjValPrimal<<>>( d_obj_val, d_costs, d_row_assignments, SP, N); @@ -541,6 +532,4 @@ inline void calcObjValPrimal(raft::handle_t const& handle, CHECK_CUDA(handle.get_stream()); } -} // namespace detail -} // namespace lap -} // namespace raft +} // namespace 
raft::solver::detail diff --git a/cpp/include/raft/lap/detail/lap_kernels.cuh b/cpp/include/raft/solver/detail/lap_kernels.cuh similarity index 98% rename from cpp/include/raft/lap/detail/lap_kernels.cuh rename to cpp/include/raft/solver/detail/lap_kernels.cuh index 728acdf7df..d66a9d72d5 100644 --- a/cpp/include/raft/lap/detail/lap_kernels.cuh +++ b/cpp/include/raft/solver/detail/lap_kernels.cuh @@ -24,19 +24,16 @@ */ #pragma once -#include "d_structs.h" +#include "../linear_assignment_types.hpp" -#include -#include +#include +#include #include #include #include -namespace raft { -namespace lap { -namespace detail { - +namespace raft::solver::detail { const int DORMANT{0}; const int ACTIVE{1}; const int VISITED{2}; @@ -555,6 +552,4 @@ __global__ void kernel_calcObjValPrimal(weight_t* d_obj_val_primal, } } -} // namespace detail -} // namespace lap -} // namespace raft +} // namespace raft::solver::detail \ No newline at end of file diff --git a/cpp/include/raft/solver/linear_assignment.cuh b/cpp/include/raft/solver/linear_assignment.cuh new file mode 100644 index 0000000000..4c24dcbc29 --- /dev/null +++ b/cpp/include/raft/solver/linear_assignment.cuh @@ -0,0 +1,293 @@ +/* + * Copyright (c) 2020-2022, NVIDIA CORPORATION. + * Copyright 2020 KETAN DATE & RAKESH NAGI + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ * + * CUDA Implementation of O(n^3) alternating tree Hungarian Algorithm + * Authors: Ketan Date and Rakesh Nagi + * + * Article reference: + * Date, Ketan, and Rakesh Nagi. "GPU-accelerated Hungarian algorithms + * for the Linear Assignment Problem." Parallel Computing 57 (2016): 52-72. + * + */ + +#ifndef __LAP_H +#define __LAP_H + +#pragma once + +#include +#include + +#include +#include + +#include +#include + +namespace raft::solver { + +template +class LinearAssignmentProblem { + vertex_t size_; + vertex_t batchsize_; + weight_t epsilon_; + + weight_t const* d_costs_; + + Vertices d_vertices_dev; + VertexData d_row_data_dev, d_col_data_dev; + + raft::handle_t const& handle_; + rmm::device_uvector row_covers_v; + rmm::device_uvector col_covers_v; + rmm::device_uvector row_duals_v; + rmm::device_uvector col_duals_v; + rmm::device_uvector col_slacks_v; + rmm::device_uvector row_is_visited_v; + rmm::device_uvector col_is_visited_v; + rmm::device_uvector row_parents_v; + rmm::device_uvector col_parents_v; + rmm::device_uvector row_children_v; + rmm::device_uvector col_children_v; + rmm::device_uvector obj_val_primal_v; + rmm::device_uvector obj_val_dual_v; + + public: + LinearAssignmentProblem(raft::handle_t const& handle, + vertex_t size, + vertex_t batchsize, + weight_t epsilon) + : handle_(handle), + size_(size), + batchsize_(batchsize), + epsilon_(epsilon), + d_costs_(nullptr), + row_covers_v(0, handle_.get_stream()), + col_covers_v(0, handle_.get_stream()), + row_duals_v(0, handle_.get_stream()), + col_duals_v(0, handle_.get_stream()), + col_slacks_v(0, handle_.get_stream()), + row_is_visited_v(0, handle_.get_stream()), + col_is_visited_v(0, handle_.get_stream()), + row_parents_v(0, handle_.get_stream()), + col_parents_v(0, handle_.get_stream()), + row_children_v(0, handle_.get_stream()), + col_children_v(0, handle_.get_stream()), + obj_val_primal_v(0, handle_.get_stream()), + obj_val_dual_v(0, handle_.get_stream()) + { + } + + // Executes Hungarian 
algorithm on the input cost matrix. + void solve(weight_t const* d_cost_matrix, vertex_t* d_row_assignment, vertex_t* d_col_assignment) + { + initializeDevice(); + + d_vertices_dev.row_assignments = d_row_assignment; + d_vertices_dev.col_assignments = d_col_assignment; + + d_costs_ = d_cost_matrix; + + int step = 0; + + while (step != 100) { + switch (step) { + case 0: step = hungarianStep0(); break; + case 1: step = hungarianStep1(); break; + case 2: step = hungarianStep2(); break; + case 3: step = hungarianStep3(); break; + case 4: step = hungarianStep4(); break; + case 5: step = hungarianStep5(); break; + case 6: step = hungarianStep6(); break; + } + } + + d_costs_ = nullptr; + } + + // Function for getting optimal row dual vector for subproblem spId. + std::pair getRowDualVector(int spId) const + { + return std::make_pair(row_duals_v.data() + spId * size_, size_); + } + + // Function for getting optimal col dual vector for subproblem spId. + std::pair getColDualVector(int spId) + { + return std::make_pair(col_duals_v.data() + spId * size_, size_); + } + + // Function for getting optimal primal objective value for subproblem spId. + weight_t getPrimalObjectiveValue(int spId) + { + weight_t result; + raft::update_host(&result, obj_val_primal_v.data() + spId, 1, handle_.get_stream()); + CHECK_CUDA(handle_.get_stream()); + return result; + } + + // Function for getting optimal dual objective value for subproblem spId. + weight_t getDualObjectiveValue(int spId) + { + weight_t result; + raft::update_host(&result, obj_val_dual_v.data() + spId, 1, handle_.get_stream()); + CHECK_CUDA(handle_.get_stream()); + return result; + } + + private: + // Helper function for initializing global variables and arrays on a single host. 
+ void initializeDevice() + { + cudaStream_t stream = handle_.get_stream(); + row_covers_v.resize(batchsize_ * size_, stream); + col_covers_v.resize(batchsize_ * size_, stream); + row_duals_v.resize(batchsize_ * size_, stream); + col_duals_v.resize(batchsize_ * size_, stream); + col_slacks_v.resize(batchsize_ * size_, stream); + row_is_visited_v.resize(batchsize_ * size_, stream); + col_is_visited_v.resize(batchsize_ * size_, stream); + row_parents_v.resize(batchsize_ * size_, stream); + col_parents_v.resize(batchsize_ * size_, stream); + row_children_v.resize(batchsize_ * size_, stream); + col_children_v.resize(batchsize_ * size_, stream); + obj_val_primal_v.resize(batchsize_, stream); + obj_val_dual_v.resize(batchsize_, stream); + + d_vertices_dev.row_covers = row_covers_v.data(); + d_vertices_dev.col_covers = col_covers_v.data(); + + d_vertices_dev.row_duals = row_duals_v.data(); + d_vertices_dev.col_duals = col_duals_v.data(); + d_vertices_dev.col_slacks = col_slacks_v.data(); + + d_row_data_dev.is_visited = row_is_visited_v.data(); + d_col_data_dev.is_visited = col_is_visited_v.data(); + d_row_data_dev.parents = row_parents_v.data(); + d_row_data_dev.children = row_children_v.data(); + d_col_data_dev.parents = col_parents_v.data(); + d_col_data_dev.children = col_children_v.data(); + + thrust::fill(thrust::device, row_covers_v.begin(), row_covers_v.end(), int{0}); + thrust::fill(thrust::device, col_covers_v.begin(), col_covers_v.end(), int{0}); + thrust::fill(thrust::device, row_duals_v.begin(), row_duals_v.end(), weight_t{0}); + thrust::fill(thrust::device, col_duals_v.begin(), col_duals_v.end(), weight_t{0}); + } + + // Function for calculating initial zeros by subtracting row and column minima from each element. + int hungarianStep0() + { + detail::initialReduction(handle_, d_costs_, d_vertices_dev, batchsize_, size_); + + return 1; + } + + // Function for computing initial assignments and running the cover/augmentation loop (steps 2-4).
+ int hungarianStep1() + { + detail::computeInitialAssignments( + handle_, d_costs_, d_vertices_dev, batchsize_, size_, epsilon_); + + int next = 2; + + while (true) { + if ((next = hungarianStep2()) == 6) break; + + if ((next = hungarianStep3()) == 5) break; + + hungarianStep4(); + } + + return next; + } + + // Function for checking optimality and constructing predicates and covers. + int hungarianStep2() + { + int cover_count = detail::computeRowCovers( + handle_, d_vertices_dev, d_row_data_dev, d_col_data_dev, batchsize_, size_); + + int next = (cover_count == batchsize_ * size_) ? 6 : 3; + + return next; + } + + // Function for building alternating tree rooted at unassigned rows. + int hungarianStep3() + { + int next; + + rmm::device_scalar flag_v(handle_.get_stream()); + + bool h_flag = false; + flag_v.set_value_async(h_flag, handle_.get_stream()); + + detail::executeZeroCover(handle_, + d_costs_, + d_vertices_dev, + d_row_data_dev, + d_col_data_dev, + flag_v.data(), + batchsize_, + size_, + epsilon_); + + h_flag = flag_v.value(handle_.get_stream()); + + next = h_flag ? 4 : 5; + + return next; + } + + // Function for augmenting the solution along multiple node-disjoint alternating trees. + int hungarianStep4() + { + detail::reversePass(handle_, d_row_data_dev, d_col_data_dev, batchsize_, size_); + + detail::augmentationPass( + handle_, d_vertices_dev, d_row_data_dev, d_col_data_dev, batchsize_, size_); + + return 2; + } + + // Function for updating dual solution to introduce new zero-cost arcs. + int hungarianStep5() + { + detail::dualUpdate( + handle_, d_vertices_dev, d_row_data_dev, d_col_data_dev, batchsize_, size_, epsilon_); + + return 3; + } + + // Function for calculating primal and dual objective values at optimality. 
+ int hungarianStep6() + { + detail::calcObjValPrimal(handle_, + obj_val_primal_v.data(), + d_costs_, + d_vertices_dev.row_assignments, + batchsize_, + size_); + + detail::calcObjValDual(handle_, obj_val_dual_v.data(), d_vertices_dev, batchsize_, size_); + + return 100; + } +}; + +} // namespace raft::solver + +#endif \ No newline at end of file diff --git a/cpp/include/raft/lap/detail/d_structs.h b/cpp/include/raft/solver/linear_assignment_types.hpp similarity index 96% rename from cpp/include/raft/lap/detail/d_structs.h rename to cpp/include/raft/solver/linear_assignment_types.hpp index 74679d64ce..3f81d3898d 100644 --- a/cpp/include/raft/lap/detail/d_structs.h +++ b/cpp/include/raft/solver/linear_assignment_types.hpp @@ -24,6 +24,7 @@ */ #pragma once +namespace raft::solver { template struct Vertices { vertex_t* row_assignments; @@ -41,3 +42,4 @@ struct VertexData { vertex_t* children; int* is_visited; }; +} // namespace raft::solver diff --git a/cpp/include/raft/sparse/convert/detail/adj_to_csr.cuh b/cpp/include/raft/sparse/convert/detail/adj_to_csr.cuh index 4728574b55..4549fbe343 100644 --- a/cpp/include/raft/sparse/convert/detail/adj_to_csr.cuh +++ b/cpp/include/raft/sparse/convert/detail/adj_to_csr.cuh @@ -18,10 +18,10 @@ #include -#include -#include -#include -#include +#include +#include +#include +#include #include namespace raft { diff --git a/cpp/include/raft/sparse/convert/detail/coo.cuh b/cpp/include/raft/sparse/convert/detail/coo.cuh index 2d13bfa34e..7cc4770138 100644 --- a/cpp/include/raft/sparse/convert/detail/coo.cuh +++ b/cpp/include/raft/sparse/convert/detail/coo.cuh @@ -17,9 +17,9 @@ #pragma once #include -#include -#include #include +#include +#include #include #include diff --git a/cpp/include/raft/sparse/convert/detail/csr.cuh b/cpp/include/raft/sparse/convert/detail/csr.cuh index d945a3c785..acb77de358 100644 --- a/cpp/include/raft/sparse/convert/detail/csr.cuh +++ b/cpp/include/raft/sparse/convert/detail/csr.cuh @@ -18,10 +18,10 @@ 
#include -#include -#include -#include +#include #include +#include +#include #include #include diff --git a/cpp/include/raft/sparse/convert/detail/dense.cuh b/cpp/include/raft/sparse/convert/detail/dense.cuh index 4f97cee8b4..2be887e836 100644 --- a/cpp/include/raft/sparse/convert/detail/dense.cuh +++ b/cpp/include/raft/sparse/convert/detail/dense.cuh @@ -17,9 +17,9 @@ #pragma once #include -#include -#include #include +#include +#include #include #include diff --git a/cpp/include/raft/sparse/detail/coo.cuh b/cpp/include/raft/sparse/detail/coo.cuh index 38a3c8f351..cbcbee0139 100644 --- a/cpp/include/raft/sparse/detail/coo.cuh +++ b/cpp/include/raft/sparse/detail/coo.cuh @@ -1,5 +1,5 @@ /* - * Copyright (c) 2019-2021, NVIDIA CORPORATION. + * Copyright (c) 2019-2022, NVIDIA CORPORATION. * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -15,7 +15,7 @@ */ #include -#include +#include #include #pragma once diff --git a/cpp/include/raft/sparse/detail/csr.cuh b/cpp/include/raft/sparse/detail/csr.cuh index 1fd2bb9366..c0985779f4 100644 --- a/cpp/include/raft/sparse/detail/csr.cuh +++ b/cpp/include/raft/sparse/detail/csr.cuh @@ -17,9 +17,9 @@ #pragma once #include -#include -#include #include +#include +#include #include #include diff --git a/cpp/include/raft/sparse/detail/cusparse_wrappers.h b/cpp/include/raft/sparse/detail/cusparse_wrappers.h index b9c4a61850..041991521b 100644 --- a/cpp/include/raft/sparse/detail/cusparse_wrappers.h +++ b/cpp/include/raft/sparse/detail/cusparse_wrappers.h @@ -17,7 +17,7 @@ #pragma once #include -#include +#include #include namespace raft { diff --git a/cpp/include/raft/sparse/distance/common.h b/cpp/include/raft/sparse/distance/common.h index 29c823bcdb..a69352d74b 100644 --- a/cpp/include/raft/sparse/distance/common.h +++ b/cpp/include/raft/sparse/distance/common.h @@ -1,5 +1,5 @@ /* - * Copyright (c) 2021, NVIDIA CORPORATION. 
+ * Copyright (c) 2021-2022, NVIDIA CORPORATION. * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -16,7 +16,7 @@ #pragma once -#include +#include namespace raft { namespace sparse { diff --git a/cpp/include/raft/sparse/distance/detail/bin_distance.cuh b/cpp/include/raft/sparse/distance/detail/bin_distance.cuh index 7c1229b0d3..cdcb0b7322 100644 --- a/cpp/include/raft/sparse/distance/detail/bin_distance.cuh +++ b/cpp/include/raft/sparse/distance/detail/bin_distance.cuh @@ -18,12 +18,12 @@ #include -#include -#include -#include +#include #include #include #include +#include +#include #include #include diff --git a/cpp/include/raft/sparse/distance/detail/coo_spmv.cuh b/cpp/include/raft/sparse/distance/detail/coo_spmv.cuh index 9edd1305b3..53ef0326fb 100644 --- a/cpp/include/raft/sparse/distance/detail/coo_spmv.cuh +++ b/cpp/include/raft/sparse/distance/detail/coo_spmv.cuh @@ -19,9 +19,9 @@ #include "coo_spmv_strategies/dense_smem_strategy.cuh" #include "coo_spmv_strategies/hash_strategy.cuh" -#include -#include #include +#include +#include #include "../../csr.hpp" #include "../../detail/utils.h" diff --git a/cpp/include/raft/sparse/distance/detail/ip_distance.cuh b/cpp/include/raft/sparse/distance/detail/ip_distance.cuh index 0848d24bde..e791de10bb 100644 --- a/cpp/include/raft/sparse/distance/detail/ip_distance.cuh +++ b/cpp/include/raft/sparse/distance/detail/ip_distance.cuh @@ -17,10 +17,10 @@ #pragma once #include -#include -#include -#include +#include #include +#include +#include #include #include diff --git a/cpp/include/raft/sparse/distance/detail/l2_distance.cuh b/cpp/include/raft/sparse/distance/detail/l2_distance.cuh index 234b08e933..1f55dadc58 100644 --- a/cpp/include/raft/sparse/distance/detail/l2_distance.cuh +++ b/cpp/include/raft/sparse/distance/detail/l2_distance.cuh @@ -18,15 +18,15 @@ #include -#include -#include -#include +#include #include #include #include 
#include #include #include +#include +#include #include #include diff --git a/cpp/include/raft/sparse/distance/detail/lp_distance.cuh b/cpp/include/raft/sparse/distance/detail/lp_distance.cuh index c6ff32caf3..0707eb2a9b 100644 --- a/cpp/include/raft/sparse/distance/detail/lp_distance.cuh +++ b/cpp/include/raft/sparse/distance/detail/lp_distance.cuh @@ -18,9 +18,9 @@ #include -#include -#include -#include +#include +#include +#include #include #include diff --git a/cpp/include/raft/sparse/distance/detail/operators.cuh b/cpp/include/raft/sparse/distance/detail/operators.cuh index b2c2e2172b..138b21e85b 100644 --- a/cpp/include/raft/sparse/distance/detail/operators.cuh +++ b/cpp/include/raft/sparse/distance/detail/operators.cuh @@ -1,5 +1,5 @@ /* - * Copyright (c) 2021, NVIDIA CORPORATION. + * Copyright (c) 2021-2022, NVIDIA CORPORATION. * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -16,7 +16,7 @@ #pragma once -#include +#include namespace raft { namespace sparse { diff --git a/cpp/include/raft/sparse/distance/distance.cuh b/cpp/include/raft/sparse/distance/distance.cuh index ab189796ea..510e02822e 100644 --- a/cpp/include/raft/sparse/distance/distance.cuh +++ b/cpp/include/raft/sparse/distance/distance.cuh @@ -22,7 +22,7 @@ #include #include -#include +#include #include #include diff --git a/cpp/include/raft/sparse/hierarchy/common.h b/cpp/include/raft/sparse/hierarchy/common.h index 1738dd7498..5440ae4ae6 100644 --- a/cpp/include/raft/sparse/hierarchy/common.h +++ b/cpp/include/raft/sparse/hierarchy/common.h @@ -1,5 +1,5 @@ /* - * Copyright (c) 2021, NVIDIA CORPORATION. + * Copyright (c) 2021-2022, NVIDIA CORPORATION. * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -13,39 +13,22 @@ * See the License for the specific language governing permissions and * limitations under the License. 
*/ - -#pragma once - -namespace raft { -namespace hierarchy { - -enum LinkageDistance { PAIRWISE = 0, KNN_GRAPH = 1 }; - /** - * Simple POCO for consolidating linkage results. This closely - * mirrors the trained instance variables populated in - * Scikit-learn's AgglomerativeClustering estimator. - * @tparam value_idx - * @tparam value_t + * This file is deprecated and will be removed in release 22.06. + * Please use the cuh version instead. */ -template -class linkage_output { - public: - value_idx m; - value_idx n_clusters; - - value_idx n_leaves; - value_idx n_connected_components; - value_idx* labels; // size: m +#pragma once - value_idx* children; // size: (m-1, 2) -}; +#pragma message(__FILE__ \ + " is deprecated and will be removed in a future release." \ + " Please use raft/cluster/single_linkage_types.hpp instead.") -class linkage_output_int_float : public linkage_output { -}; -class linkage_output__int64_float : public linkage_output { -}; +#include -}; // namespace hierarchy -}; // namespace raft \ No newline at end of file +namespace raft::hierarchy { +using raft::cluster::linkage_output; +using raft::cluster::linkage_output__int64_float; +using raft::cluster::linkage_output_int_float; +using raft::cluster::LinkageDistance; +} // namespace raft::hierarchy \ No newline at end of file diff --git a/cpp/include/raft/sparse/hierarchy/single_linkage.cuh b/cpp/include/raft/sparse/hierarchy/single_linkage.cuh index 86940005b4..dbf353da73 100644 --- a/cpp/include/raft/sparse/hierarchy/single_linkage.cuh +++ b/cpp/include/raft/sparse/hierarchy/single_linkage.cuh @@ -13,53 +13,20 @@ * See the License for the specific language governing permissions and * limitations under the License. */ -#ifndef __SINGLE_LINKAGE_H -#define __SINGLE_LINKAGE_H +/** + * This file is deprecated and will be removed in release 22.06. + * Please use the cuh version instead. 
+ */ #pragma once -#include -#include - -namespace raft { -namespace hierarchy { +#pragma message(__FILE__ \ + " is deprecated and will be removed in a future release." \ + " Please use the raft/cluster version instead.") -/** - * Single-linkage clustering, capable of constructing a KNN graph to - * scale the algorithm beyond the n^2 memory consumption of implementations - * that use the fully-connected graph of pairwise distances by connecting - * a knn graph when k is not large enough to connect it. - - * @tparam value_idx - * @tparam value_t - * @tparam dist_type method to use for constructing connectivities graph - * @param[in] handle raft handle - * @param[in] X dense input matrix in row-major layout - * @param[in] m number of rows in X - * @param[in] n number of columns in X - * @param[in] metric distance metrix to use when constructing connectivities graph - * @param[out] out struct containing output dendrogram and cluster assignments - * @param[in] c a constant used when constructing connectivities from knn graph. Allows the indirect - control - * of k. 
The algorithm will set `k = log(n) + c` - * @param[in] n_clusters number of clusters to assign data samples - */ -template -void single_linkage(const raft::handle_t& handle, - const value_t* X, - size_t m, - size_t n, - raft::distance::DistanceType metric, - linkage_output* out, - int c, - size_t n_clusters) -{ - detail::single_linkage( - handle, X, m, n, metric, out, c, n_clusters); -} -}; // namespace hierarchy -}; // namespace raft +#include +#include -#endif \ No newline at end of file +namespace raft::hierarchy { +using raft::cluster::single_linkage; +} \ No newline at end of file diff --git a/cpp/include/raft/sparse/hierarchy/single_linkage.hpp b/cpp/include/raft/sparse/hierarchy/single_linkage.hpp index 80c3c3c521..72fe2e51a5 100644 --- a/cpp/include/raft/sparse/hierarchy/single_linkage.hpp +++ b/cpp/include/raft/sparse/hierarchy/single_linkage.hpp @@ -20,4 +20,8 @@ #pragma once -#include +#pragma message(__FILE__ \ + " is deprecated and will be removed in a future release." \ + " Please use the raft/cluster version instead.") + +#include diff --git a/cpp/include/raft/sparse/linalg/detail/add.cuh b/cpp/include/raft/sparse/linalg/detail/add.cuh index 5c3d07fc02..ea1356938e 100644 --- a/cpp/include/raft/sparse/linalg/detail/add.cuh +++ b/cpp/include/raft/sparse/linalg/detail/add.cuh @@ -18,9 +18,9 @@ #include -#include -#include #include +#include +#include #include #include diff --git a/cpp/include/raft/sparse/linalg/detail/degree.cuh b/cpp/include/raft/sparse/linalg/detail/degree.cuh index bf5484d3a4..86fcdb58d6 100644 --- a/cpp/include/raft/sparse/linalg/detail/degree.cuh +++ b/cpp/include/raft/sparse/linalg/detail/degree.cuh @@ -1,5 +1,5 @@ /* - * Copyright (c) 2019-2021, NVIDIA CORPORATION. + * Copyright (c) 2019-2022, NVIDIA CORPORATION. * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. 
@@ -16,8 +16,8 @@ #pragma once -#include -#include +#include +#include #include #include diff --git a/cpp/include/raft/sparse/linalg/detail/norm.cuh b/cpp/include/raft/sparse/linalg/detail/norm.cuh index ba0ecd5dcc..c2a8aa4246 100644 --- a/cpp/include/raft/sparse/linalg/detail/norm.cuh +++ b/cpp/include/raft/sparse/linalg/detail/norm.cuh @@ -17,9 +17,9 @@ #pragma once #include -#include -#include #include +#include +#include #include #include diff --git a/cpp/include/raft/sparse/linalg/detail/spectral.cuh b/cpp/include/raft/sparse/linalg/detail/spectral.cuh index c295932719..cdc0e62130 100644 --- a/cpp/include/raft/sparse/linalg/detail/spectral.cuh +++ b/cpp/include/raft/sparse/linalg/detail/spectral.cuh @@ -14,12 +14,12 @@ * limitations under the License. */ -#include +#include -#include #include #include #include +#include #include #include diff --git a/cpp/include/raft/sparse/linalg/detail/symmetrize.cuh b/cpp/include/raft/sparse/linalg/detail/symmetrize.cuh index 9143aac84f..358e7d6d29 100644 --- a/cpp/include/raft/sparse/linalg/detail/symmetrize.cuh +++ b/cpp/include/raft/sparse/linalg/detail/symmetrize.cuh @@ -18,14 +18,14 @@ #include -#include -#include #include +#include +#include #include #include -#include #include +#include #include #include diff --git a/cpp/include/raft/sparse/linalg/detail/transpose.h b/cpp/include/raft/sparse/linalg/detail/transpose.h index 4820b489d1..1484804348 100644 --- a/cpp/include/raft/sparse/linalg/detail/transpose.h +++ b/cpp/include/raft/sparse/linalg/detail/transpose.h @@ -18,9 +18,9 @@ #include -#include -#include #include +#include +#include #include #include diff --git a/cpp/include/raft/sparse/linalg/spectral.cuh b/cpp/include/raft/sparse/linalg/spectral.cuh index fe95d1414c..0a97619e87 100644 --- a/cpp/include/raft/sparse/linalg/spectral.cuh +++ b/cpp/include/raft/sparse/linalg/spectral.cuh @@ -16,7 +16,7 @@ #ifndef __SPARSE_SPECTRAL_H #define __SPARSE_SPECTRAL_H -#include +#include #include namespace raft { diff --git 
a/cpp/include/raft/sparse/linalg/transpose.cuh b/cpp/include/raft/sparse/linalg/transpose.cuh index 8f0105f512..fa0031aab6 100644 --- a/cpp/include/raft/sparse/linalg/transpose.cuh +++ b/cpp/include/raft/sparse/linalg/transpose.cuh @@ -18,7 +18,7 @@ #pragma once -#include +#include #include namespace raft { diff --git a/cpp/include/raft/sparse/mst/mst.cuh b/cpp/include/raft/sparse/mst/mst.cuh index 70a6ff521f..8f1a365f3f 100644 --- a/cpp/include/raft/sparse/mst/mst.cuh +++ b/cpp/include/raft/sparse/mst/mst.cuh @@ -14,44 +14,20 @@ * See the License for the specific language governing permissions and * limitations under the License. */ -#ifndef __MST_H -#define __MST_H +/** + * This file is deprecated and will be removed in release 22.06. + * Please use the cuh version instead. + */ #pragma once -#include "mst_solver.cuh" - -namespace raft { -namespace mst { - -template -raft::Graph_COO mst(const raft::handle_t& handle, - edge_t const* offsets, - vertex_t const* indices, - weight_t const* weights, - vertex_t const v, - edge_t const e, - vertex_t* color, - cudaStream_t stream, - bool symmetrize_output = true, - bool initialize_colors = true, - int iterations = 0) -{ - MST_solver mst_solver(handle, - offsets, - indices, - weights, - v, - e, - color, - stream, - symmetrize_output, - initialize_colors, - iterations); - return mst_solver.solve(); -} +#pragma message(__FILE__ \ + " is deprecated and will be removed in a future release." 
\ + " Please use the raft/sparse/solver version instead.") -} // namespace mst -} // namespace raft +#include +#include -#endif \ No newline at end of file +namespace raft::mst { +using raft::sparse::solver::mst; +} \ No newline at end of file diff --git a/cpp/include/raft/sparse/mst/mst.hpp b/cpp/include/raft/sparse/mst/mst.hpp index 5a66e8c815..1ad053d97c 100644 --- a/cpp/include/raft/sparse/mst/mst.hpp +++ b/cpp/include/raft/sparse/mst/mst.hpp @@ -21,4 +21,9 @@ */ #pragma once -#include "mst.cuh" \ No newline at end of file +#pragma message(__FILE__ \ + " is deprecated and will be removed in a future release." \ + " Please use the raft/sparse/solver version instead.") + +#include +#include diff --git a/cpp/include/raft/sparse/mst/mst_solver.cuh b/cpp/include/raft/sparse/mst/mst_solver.cuh index bae5d77d8e..6af2226b99 100644 --- a/cpp/include/raft/sparse/mst/mst_solver.cuh +++ b/cpp/include/raft/sparse/mst/mst_solver.cuh @@ -1,6 +1,6 @@ /* - * Copyright (c) 2020, NVIDIA CORPORATION. + * Copyright (c) 2020-2022, NVIDIA CORPORATION. * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -15,91 +15,22 @@ * limitations under the License. */ +/** + * This file is deprecated and will be removed in release 22.06. + * Please use the cuh version instead. 
+ */ #pragma once -#include -#include -#include - -namespace raft { - -template -struct Graph_COO { - rmm::device_uvector src; - rmm::device_uvector dst; - rmm::device_uvector weights; - edge_t n_edges; - - Graph_COO(vertex_t size, cudaStream_t stream) - : src(size, stream), dst(size, stream), weights(size, stream) - { - } -}; - -namespace mst { - -template -class MST_solver { - public: - MST_solver(const raft::handle_t& handle_, - const edge_t* offsets_, - const vertex_t* indices_, - const weight_t* weights_, - const vertex_t v_, - const edge_t e_, - vertex_t* color_, - cudaStream_t stream_, - bool symmetrize_output_, - bool initialize_colors_, - int iterations_); - - raft::Graph_COO solve(); +#pragma message(__FILE__ \ + " is deprecated and will be removed in a future release." \ + " Please use the raft/sparse/solver version instead.") - ~MST_solver() {} +#include - private: - const raft::handle_t& handle; - cudaStream_t stream; - bool symmetrize_output, initialize_colors; - int iterations; - - // CSR - const edge_t* offsets; - const vertex_t* indices; - const weight_t* weights; - const vertex_t v; - const edge_t e; - - vertex_t max_blocks; - vertex_t max_threads; - vertex_t sm_count; - - vertex_t* color_index; // represent each supervertex as a color - rmm::device_uvector min_edge_color; // minimum incident edge weight per color - rmm::device_uvector new_mst_edge; // new minimum edge per vertex - rmm::device_uvector altered_weights; // weights to be used for mst - rmm::device_scalar mst_edge_count; // total number of edges added after every iteration - rmm::device_scalar - prev_mst_edge_count; // total number of edges up to the previous iteration - rmm::device_uvector mst_edge; // mst output - true if the edge belongs in mst - rmm::device_uvector next_color; // next iteration color - rmm::device_uvector color; // index of color that vertex points to - - // new src-dst pairs found per iteration - rmm::device_uvector temp_src; - rmm::device_uvector temp_dst; - 
rmm::device_uvector temp_weights; - - void label_prop(vertex_t* mst_src, vertex_t* mst_dst); - void min_edge_per_vertex(); - void min_edge_per_supervertex(); - void check_termination(); - void alteration(); - alteration_t alteration_max(); - void append_src_dst_pair(vertex_t* mst_src, vertex_t* mst_dst, weight_t* mst_weights); -}; - -} // namespace mst -} // namespace raft +namespace raft { +using raft::sparse::solver::Graph_COO; +} -#include "detail/mst_solver_inl.cuh" +namespace raft::mst { +using raft::sparse::solver::MST_solver; +} \ No newline at end of file diff --git a/cpp/include/raft/sparse/op/detail/filter.cuh b/cpp/include/raft/sparse/op/detail/filter.cuh index ca0ffe8180..bcc0301318 100644 --- a/cpp/include/raft/sparse/op/detail/filter.cuh +++ b/cpp/include/raft/sparse/op/detail/filter.cuh @@ -18,9 +18,9 @@ #include -#include -#include #include +#include +#include #include #include diff --git a/cpp/include/raft/sparse/op/detail/reduce.cuh b/cpp/include/raft/sparse/op/detail/reduce.cuh index eb747cce1e..b4d8cb7db9 100644 --- a/cpp/include/raft/sparse/op/detail/reduce.cuh +++ b/cpp/include/raft/sparse/op/detail/reduce.cuh @@ -18,12 +18,12 @@ #include -#include -#include #include +#include +#include -#include #include +#include #include #include diff --git a/cpp/include/raft/sparse/op/detail/row_op.cuh b/cpp/include/raft/sparse/op/detail/row_op.cuh index 63c8cafaa7..5e7d2632a9 100644 --- a/cpp/include/raft/sparse/op/detail/row_op.cuh +++ b/cpp/include/raft/sparse/op/detail/row_op.cuh @@ -18,9 +18,9 @@ #include -#include -#include #include +#include +#include #include #include diff --git a/cpp/include/raft/sparse/op/detail/slice.cuh b/cpp/include/raft/sparse/op/detail/slice.cuh index 6bf6688076..193d246b4b 100644 --- a/cpp/include/raft/sparse/op/detail/slice.cuh +++ b/cpp/include/raft/sparse/op/detail/slice.cuh @@ -18,10 +18,10 @@ #include -#include -#include #include #include +#include +#include #include #include diff --git 
a/cpp/include/raft/sparse/op/detail/sort.h b/cpp/include/raft/sparse/op/detail/sort.h index 17dbf6a70d..2f73671132 100644 --- a/cpp/include/raft/sparse/op/detail/sort.h +++ b/cpp/include/raft/sparse/op/detail/sort.h @@ -16,11 +16,11 @@ #pragma once -#include -#include #include #include #include +#include +#include #include #include diff --git a/cpp/include/raft/sparse/op/filter.cuh b/cpp/include/raft/sparse/op/filter.cuh index 6c36538137..488d926fe9 100644 --- a/cpp/include/raft/sparse/op/filter.cuh +++ b/cpp/include/raft/sparse/op/filter.cuh @@ -18,7 +18,7 @@ #pragma once -#include +#include #include #include diff --git a/cpp/include/raft/sparse/op/reduce.cuh b/cpp/include/raft/sparse/op/reduce.cuh index fd860d2dc1..cd67e124ee 100644 --- a/cpp/include/raft/sparse/op/reduce.cuh +++ b/cpp/include/raft/sparse/op/reduce.cuh @@ -18,7 +18,7 @@ #pragma once -#include +#include #include #include diff --git a/cpp/include/raft/sparse/op/row_op.cuh b/cpp/include/raft/sparse/op/row_op.cuh index b31d3f29b6..d73d05785d 100644 --- a/cpp/include/raft/sparse/op/row_op.cuh +++ b/cpp/include/raft/sparse/op/row_op.cuh @@ -17,7 +17,7 @@ #define __SPARSE_ROW_OP_H #pragma once -#include +#include #include namespace raft { diff --git a/cpp/include/raft/sparse/op/slice.cuh b/cpp/include/raft/sparse/op/slice.cuh index cd7be1924b..30f7a97ffc 100644 --- a/cpp/include/raft/sparse/op/slice.cuh +++ b/cpp/include/raft/sparse/op/slice.cuh @@ -18,7 +18,7 @@ #pragma once -#include +#include #include namespace raft { diff --git a/cpp/include/raft/sparse/op/sort.cuh b/cpp/include/raft/sparse/op/sort.cuh index ae0e587c3b..ddb4b2830c 100644 --- a/cpp/include/raft/sparse/op/sort.cuh +++ b/cpp/include/raft/sparse/op/sort.cuh @@ -18,7 +18,7 @@ #pragma once -#include +#include #include namespace raft { diff --git a/cpp/include/raft/sparse/selection/connect_components.cuh b/cpp/include/raft/sparse/selection/connect_components.cuh index 28bb5aa74b..22d8d7e936 100644 --- 
a/cpp/include/raft/sparse/selection/connect_components.cuh +++ b/cpp/include/raft/sparse/selection/connect_components.cuh @@ -13,70 +13,25 @@ * See the License for the specific language governing permissions and * limitations under the License. */ -#ifndef __CONNECT_COMPONENTS_H -#define __CONNECT_COMPONENTS_H - -#include -#include -#include - -namespace raft { -namespace linkage { - -template -using FixConnectivitiesRedOp = detail::FixConnectivitiesRedOp; - /** - * Gets the number of unique components from array of - * colors or labels. This does not assume the components are - * drawn from a monotonically increasing set. - * @tparam value_idx - * @param[in] colors array of components - * @param[in] n_rows size of components array - * @param[in] stream cuda stream for which to order cuda operations - * @return total number of components + * This file is deprecated and will be removed in release 22.06. + * Please use the cuh version instead. */ -template -value_idx get_n_components(value_idx* colors, size_t n_rows, cudaStream_t stream) -{ - return detail::get_n_components(colors, n_rows, stream); -} /** - * Connects the components of an otherwise unconnected knn graph - * by computing a 1-nn to neighboring components of each data point - * (e.g. component(nn) != component(self)) and reducing the results to - * include the set of smallest destination components for each source - * component. The result will not necessarily contain - * n_components^2 - n_components number of elements because many components - * will likely not be contained in the neighborhoods of 1-nns. - * @tparam value_idx - * @tparam value_t - * @param[in] handle raft handle - * @param[out] out output edge list containing nearest cross-component - * edges. - * @param[in] X original (row-major) dense matrix for which knn graph should be constructed. 
- * @param[in] orig_colors array containing component number for each row of X - * @param[in] n_rows number of rows in X - * @param[in] n_cols number of cols in X - * @param[in] reduction_op - * @param[in] metric + * DISCLAIMER: this file is deprecated: use connect_components.cuh instead */ -template -void connect_components( - const raft::handle_t& handle, - raft::sparse::COO& out, - const value_t* X, - const value_idx* orig_colors, - size_t n_rows, - size_t n_cols, - red_op reduction_op, - raft::distance::DistanceType metric = raft::distance::DistanceType::L2SqrtExpanded) -{ - detail::connect_components(handle, out, X, orig_colors, n_rows, n_cols, reduction_op, metric); -} -}; // end namespace linkage -}; // end namespace raft +#pragma once + +#pragma message(__FILE__ \ + " is deprecated and will be removed in a future release." \ + " Please use the sparse/spatial version instead.") + +#include -#endif \ No newline at end of file +namespace raft::linkage { +using raft::sparse::spatial::connect_components; +using raft::sparse::spatial::FixConnectivitiesRedOp; +using raft::sparse::spatial::get_n_components; +} // namespace raft::linkage \ No newline at end of file diff --git a/cpp/include/raft/sparse/selection/connect_components.hpp b/cpp/include/raft/sparse/selection/connect_components.hpp index b6597babc8..393ed2d4e2 100644 --- a/cpp/include/raft/sparse/selection/connect_components.hpp +++ b/cpp/include/raft/sparse/selection/connect_components.hpp @@ -26,6 +26,6 @@ #pragma message(__FILE__ \ " is deprecated and will be removed in a future release." 
\ - " Please use the cuh version instead.") + " Please use the sparse/spatial version instead.") #include "connect_components.cuh" diff --git a/cpp/include/raft/sparse/selection/knn.cuh b/cpp/include/raft/sparse/selection/knn.cuh index fd9ab4ac3d..f6895addd1 100644 --- a/cpp/include/raft/sparse/selection/knn.cuh +++ b/cpp/include/raft/sparse/selection/knn.cuh @@ -13,90 +13,23 @@ * See the License for the specific language governing permissions and * limitations under the License. */ -#ifndef __SPARSE_KNN_H -#define __SPARSE_KNN_H - -#pragma once - -#include -#include -#include - -namespace raft { -namespace sparse { -namespace selection { +/** + * This file is deprecated and will be removed in release 22.06. + * Please use the cuh version instead. + */ /** - * Search the sparse kNN for the k-nearest neighbors of a set of sparse query vectors - * using some distance implementation - * @param[in] idxIndptr csr indptr of the index matrix (size n_idx_rows + 1) - * @param[in] idxIndices csr column indices array of the index matrix (size n_idx_nnz) - * @param[in] idxData csr data array of the index matrix (size idxNNZ) - * @param[in] idxNNZ number of non-zeros for sparse index matrix - * @param[in] n_idx_rows number of data samples in index matrix - * @param[in] n_idx_cols - * @param[in] queryIndptr csr indptr of the query matrix (size n_query_rows + 1) - * @param[in] queryIndices csr indices array of the query matrix (size queryNNZ) - * @param[in] queryData csr data array of the query matrix (size queryNNZ) - * @param[in] queryNNZ number of non-zeros for sparse query matrix - * @param[in] n_query_rows number of data samples in query matrix - * @param[in] n_query_cols number of features in query matrix - * @param[out] output_indices dense matrix for output indices (size n_query_rows * k) - * @param[out] output_dists dense matrix for output distances (size n_query_rows * k) - * @param[in] k the number of neighbors to query - * @param[in] handle CUDA handle.get_stream() to 
order operations with respect to - * @param[in] batch_size_index maximum number of rows to use from index matrix per batch - * @param[in] batch_size_query maximum number of rows to use from query matrix per batch - * @param[in] metric distance metric/measure to use - * @param[in] metricArg potential argument for metric (currently unused) + * DISCLAIMER: this file is deprecated: use knn.cuh instead */ -template -void brute_force_knn(const value_idx* idxIndptr, - const value_idx* idxIndices, - const value_t* idxData, - size_t idxNNZ, - int n_idx_rows, - int n_idx_cols, - const value_idx* queryIndptr, - const value_idx* queryIndices, - const value_t* queryData, - size_t queryNNZ, - int n_query_rows, - int n_query_cols, - value_idx* output_indices, - value_t* output_dists, - int k, - const raft::handle_t& handle, - size_t batch_size_index = 2 << 14, // approx 1M - size_t batch_size_query = 2 << 14, - raft::distance::DistanceType metric = raft::distance::DistanceType::L2Expanded, - float metricArg = 0) -{ - detail::sparse_knn_t(idxIndptr, - idxIndices, - idxData, - idxNNZ, - n_idx_rows, - n_idx_cols, - queryIndptr, - queryIndices, - queryData, - queryNNZ, - n_query_rows, - n_query_cols, - output_indices, - output_dists, - k, - handle, - batch_size_index, - batch_size_query, - metric, - metricArg) - .run(); -} -}; // namespace selection -}; // namespace sparse -}; // namespace raft +#pragma once + +#pragma message(__FILE__ \ + " is deprecated and will be removed in a future release." 
\ + " Please use the sparse/spatial version instead.") + +#include -#endif \ No newline at end of file +namespace raft::sparse::selection { +using raft::sparse::spatial::brute_force_knn; +} \ No newline at end of file diff --git a/cpp/include/raft/sparse/selection/knn.hpp b/cpp/include/raft/sparse/selection/knn.hpp index 6924e0b5a7..cd5e7b1fa3 100644 --- a/cpp/include/raft/sparse/selection/knn.hpp +++ b/cpp/include/raft/sparse/selection/knn.hpp @@ -26,6 +26,6 @@ #pragma message(__FILE__ \ " is deprecated and will be removed in a future release." \ - " Please use the cuh version instead.") + " Please use the sparse/spatial version instead.") #include "knn.cuh" diff --git a/cpp/include/raft/sparse/selection/knn_graph.cuh b/cpp/include/raft/sparse/selection/knn_graph.cuh index 7d342db43b..54cc52f4ae 100644 --- a/cpp/include/raft/sparse/selection/knn_graph.cuh +++ b/cpp/include/raft/sparse/selection/knn_graph.cuh @@ -13,51 +13,23 @@ * See the License for the specific language governing permissions and * limitations under the License. */ -#ifndef __KNN_GRAPH_H -#define __KNN_GRAPH_H +/** + * This file is deprecated and will be removed in release 22.06. + * Please use the cuh version instead. + */ -#pragma once +/** + * DISCLAIMER: this file is deprecated: use knn_graph.cuh instead + */ -#include -#include -#include +#pragma once -#include +#pragma message(__FILE__ \ + " is deprecated and will be removed in a future release." \ + " Please use the sparse/spatial version instead.") -namespace raft { -namespace sparse { -namespace selection { +#include -/** - * Constructs a (symmetrized) knn graph edge list from - * dense input vectors. - * - * Note: The resulting KNN graph is not guaranteed to be connected. 
- * - * @tparam value_idx - * @tparam value_t - * @param[in] handle raft handle - * @param[in] X dense matrix of input data samples and observations - * @param[in] m number of data samples (rows) in X - * @param[in] n number of observations (columns) in X - * @param[in] metric distance metric to use when constructing neighborhoods - * @param[out] out output edge list - * @param c - */ -template -void knn_graph(const handle_t& handle, - const value_t* X, - std::size_t m, - std::size_t n, - raft::distance::DistanceType metric, - raft::sparse::COO& out, - int c = 15) -{ - detail::knn_graph(handle, X, m, n, metric, out, c); +namespace raft::sparse::selection { +using raft::sparse::spatial::knn_graph; } - -}; // namespace selection -}; // namespace sparse -}; // end namespace raft - -#endif \ No newline at end of file diff --git a/cpp/include/raft/sparse/selection/knn_graph.hpp b/cpp/include/raft/sparse/selection/knn_graph.hpp index 833bdb61d2..e8236b1732 100644 --- a/cpp/include/raft/sparse/selection/knn_graph.hpp +++ b/cpp/include/raft/sparse/selection/knn_graph.hpp @@ -26,6 +26,6 @@ #pragma message(__FILE__ \ " is deprecated and will be removed in a future release." \ - " Please use the cuh version instead.") + " Please use the sparse/spatial version instead.") #include "knn_graph.cuh" diff --git a/cpp/include/raft/sparse/solver/detail/lanczos.cuh b/cpp/include/raft/sparse/solver/detail/lanczos.cuh new file mode 100644 index 0000000000..49f4e01362 --- /dev/null +++ b/cpp/include/raft/sparse/solver/detail/lanczos.cuh @@ -0,0 +1,1396 @@ +/* + * Copyright (c) 2022, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. 
+ * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#pragma once + +// for cmath: +#define _USE_MATH_DEFINES + +#include +#include + +#include +#include + +#include +#include +#include +#include +#include +#include + +namespace raft::sparse::solver::detail { + +// curandGeneratorNormalX +inline curandStatus_t curandGenerateNormalX( + curandGenerator_t generator, float* outputPtr, size_t n, float mean, float stddev) +{ + return curandGenerateNormal(generator, outputPtr, n, mean, stddev); +} +inline curandStatus_t curandGenerateNormalX( + curandGenerator_t generator, double* outputPtr, size_t n, double mean, double stddev) +{ + return curandGenerateNormalDouble(generator, outputPtr, n, mean, stddev); +} + +// ========================================================= +// Helper functions +// ========================================================= + +/** + * @brief Perform Lanczos iteration + * Lanczos iteration is performed on a shifted matrix A+shift*I. + * @tparam index_type_t the type of data used for indexing. + * @tparam value_type_t the type of data used for weights, distances. + * @param handle the raft handle. + * @param A Matrix. + * @param iter Pointer to current Lanczos iteration. On exit, the + * variable is set equal to the final Lanczos iteration. + * @param maxIter Maximum Lanczos iteration. This function will + * perform a maximum of maxIter-*iter iterations. + * @param shift Matrix shift. + * @param tol Convergence tolerance. Lanczos iteration will + * terminate when the residual norm (i.e. entry in beta_host) is + * less than tol. 
+ * @param reorthogonalize Whether to reorthogonalize Lanczos + * vectors. + * @param alpha_host (Output, host memory, maxIter entries) + * Diagonal entries of Lanczos system. + * @param beta_host (Output, host memory, maxIter entries) + * Off-diagonal entries of Lanczos system. + * @param lanczosVecs_dev (Input/output, device memory, + * n*(maxIter+1) entries) Lanczos vectors. Vectors are stored as + * columns of a column-major matrix with dimensions + * n x (maxIter+1). + * @param work_dev (Output, device memory, maxIter entries) + * Workspace. Not needed if full reorthogonalization is disabled. + * @return Zero if successful. Otherwise non-zero. + */ +template +int performLanczosIteration(handle_t const& handle, + spectral::matrix::sparse_matrix_t const* A, + index_type_t* iter, + index_type_t maxIter, + value_type_t shift, + value_type_t tol, + bool reorthogonalize, + value_type_t* __restrict__ alpha_host, + value_type_t* __restrict__ beta_host, + value_type_t* __restrict__ lanczosVecs_dev, + value_type_t* __restrict__ work_dev) +{ + // ------------------------------------------------------- + // Variable declaration + // ------------------------------------------------------- + + // Useful variables + constexpr value_type_t one = 1; + constexpr value_type_t negOne = -1; + constexpr value_type_t zero = 0; + value_type_t alpha; + + auto cublas_h = handle.get_cublas_handle(); + auto stream = handle.get_stream(); + + RAFT_EXPECTS(A != nullptr, "Null matrix pointer."); + + index_type_t n = A->nrows_; + + // ------------------------------------------------------- + // Compute second Lanczos vector + // ------------------------------------------------------- + if (*iter <= 0) { + *iter = 1; + + // Apply matrix + if (shift != 0) + RAFT_CUDA_TRY(cudaMemcpyAsync(lanczosVecs_dev + n, + lanczosVecs_dev, + n * sizeof(value_type_t), + cudaMemcpyDeviceToDevice, + stream)); + A->mv(1, lanczosVecs_dev, shift, lanczosVecs_dev + n); + + // Orthogonalize Lanczos vector + 
RAFT_CUBLAS_TRY(raft::linalg::detail::cublasdot( + cublas_h, n, lanczosVecs_dev, 1, lanczosVecs_dev + IDX(0, 1, n), 1, alpha_host, stream)); + + alpha = -alpha_host[0]; + RAFT_CUBLAS_TRY(raft::linalg::detail::cublasaxpy( + cublas_h, n, &alpha, lanczosVecs_dev, 1, lanczosVecs_dev + IDX(0, 1, n), 1, stream)); + RAFT_CUBLAS_TRY(raft::linalg::detail::cublasnrm2( + cublas_h, n, lanczosVecs_dev + IDX(0, 1, n), 1, beta_host, stream)); + + // Check if Lanczos has converged + if (beta_host[0] <= tol) return 0; + + // Normalize Lanczos vector + alpha = 1 / beta_host[0]; + RAFT_CUBLAS_TRY(raft::linalg::detail::cublasscal( + cublas_h, n, &alpha, lanczosVecs_dev + IDX(0, 1, n), 1, stream)); + } + + // ------------------------------------------------------- + // Compute remaining Lanczos vectors + // ------------------------------------------------------- + + while (*iter < maxIter) { + ++(*iter); + + // Apply matrix + if (shift != 0) + RAFT_CUDA_TRY(cudaMemcpyAsync(lanczosVecs_dev + (*iter) * n, + lanczosVecs_dev + (*iter - 1) * n, + n * sizeof(value_type_t), + cudaMemcpyDeviceToDevice, + stream)); + A->mv(1, lanczosVecs_dev + IDX(0, *iter - 1, n), shift, lanczosVecs_dev + IDX(0, *iter, n)); + + // Full reorthogonalization + // "Twice is enough" algorithm per Kahan and Parlett + if (reorthogonalize) { + RAFT_CUBLAS_TRY(raft::linalg::detail::cublasgemv(cublas_h, + CUBLAS_OP_T, + n, + *iter, + &one, + lanczosVecs_dev, + n, + lanczosVecs_dev + IDX(0, *iter, n), + 1, + &zero, + work_dev, + 1, + stream)); + + RAFT_CUBLAS_TRY(raft::linalg::detail::cublasgemv(cublas_h, + CUBLAS_OP_N, + n, + *iter, + &negOne, + lanczosVecs_dev, + n, + work_dev, + 1, + &one, + lanczosVecs_dev + IDX(0, *iter, n), + 1, + stream)); + + RAFT_CUDA_TRY(cudaMemcpyAsync(alpha_host + (*iter - 1), + work_dev + (*iter - 1), + sizeof(value_type_t), + cudaMemcpyDeviceToHost, + stream)); + + RAFT_CUBLAS_TRY(raft::linalg::detail::cublasgemv(cublas_h, + CUBLAS_OP_T, + n, + *iter, + &one, + lanczosVecs_dev, + n, + 
lanczosVecs_dev + IDX(0, *iter, n), + 1, + &zero, + work_dev, + 1, + stream)); + + RAFT_CUBLAS_TRY(raft::linalg::detail::cublasgemv(cublas_h, + CUBLAS_OP_N, + n, + *iter, + &negOne, + lanczosVecs_dev, + n, + work_dev, + 1, + &one, + lanczosVecs_dev + IDX(0, *iter, n), + 1, + stream)); + } + + // Orthogonalization with 3-term recurrence relation + else { + RAFT_CUBLAS_TRY(raft::linalg::detail::cublasdot(cublas_h, + n, + lanczosVecs_dev + IDX(0, *iter - 1, n), + 1, + lanczosVecs_dev + IDX(0, *iter, n), + 1, + alpha_host + (*iter - 1), + stream)); + + auto alpha = -alpha_host[*iter - 1]; + RAFT_CUBLAS_TRY(raft::linalg::detail::cublasaxpy(cublas_h, + n, + &alpha, + lanczosVecs_dev + IDX(0, *iter - 1, n), + 1, + lanczosVecs_dev + IDX(0, *iter, n), + 1, + stream)); + + alpha = -beta_host[*iter - 2]; + RAFT_CUBLAS_TRY(raft::linalg::detail::cublasaxpy(cublas_h, + n, + &alpha, + lanczosVecs_dev + IDX(0, *iter - 2, n), + 1, + lanczosVecs_dev + IDX(0, *iter, n), + 1, + stream)); + } + + // Compute residual + RAFT_CUBLAS_TRY(raft::linalg::detail::cublasnrm2( + cublas_h, n, lanczosVecs_dev + IDX(0, *iter, n), 1, beta_host + *iter - 1, stream)); + + // Check if Lanczos has converged + if (beta_host[*iter - 1] <= tol) break; + + // Normalize Lanczos vector + alpha = 1 / beta_host[*iter - 1]; + RAFT_CUBLAS_TRY(raft::linalg::detail::cublasscal( + cublas_h, n, &alpha, lanczosVecs_dev + IDX(0, *iter, n), 1, stream)); + } + + handle.sync_stream(stream); + + return 0; +} + +/** + * @brief Find Householder transform for 3-dimensional system + * Given an input vector v=[x,y,z]', this function finds a + * Householder transform P such that P*v is a multiple of + * e_1=[1,0,0]'. The input vector v is overwritten with the + * Householder vector such that P=I-2*v*v'. + * @tparam index_type_t the type of data used for indexing. + * @tparam value_type_t the type of data used for weights, distances. + * @param v (Input/output, host memory, 3 entries) Input + * 3-dimensional vector. 
On exit, the vector is set to the + * Householder vector. + * @param Pv (Output, host memory, 1 entry) First entry of P*v + * (here v is the input vector). Either equal to ||v||_2 or + * -||v||_2. + * @param P (Output, host memory, 9 entries) Householder transform + * matrix. Matrix dimensions are 3 x 3. + */ +template <typename index_type_t, typename value_type_t> +static void findHouseholder3(value_type_t* v, value_type_t* Pv, value_type_t* P) +{ + // Compute norm of vector + *Pv = std::sqrt(v[0] * v[0] + v[1] * v[1] + v[2] * v[2]); + + // Choose whether to reflect to e_1 or -e_1 + // This choice avoids catastrophic cancellation + if (v[0] >= 0) *Pv = -(*Pv); + v[0] -= *Pv; + + // Normalize Householder vector + value_type_t normHouseholder = std::sqrt(v[0] * v[0] + v[1] * v[1] + v[2] * v[2]); + if (normHouseholder != 0) { + v[0] /= normHouseholder; + v[1] /= normHouseholder; + v[2] /= normHouseholder; + } else { + v[0] = 0; + v[1] = 0; + v[2] = 0; + } + + // Construct Householder matrix + index_type_t i, j; + for (j = 0; j < 3; ++j) + for (i = 0; i < 3; ++i) + P[IDX(i, j, 3)] = -2 * v[i] * v[j]; + for (i = 0; i < 3; ++i) + P[IDX(i, i, 3)] += 1; +} + +/** + * @brief Apply 3-dimensional Householder transform to 4 x 4 matrix + * The Householder transform is pre-applied to the top three rows + * of the matrix and post-applied to the left three columns. The + * 4 x 4 matrix is intended to contain the bulge that is produced + * in the Francis QR algorithm. + * @tparam index_type_t the type of data used for indexing. + * @tparam value_type_t the type of data used for weights, distances. + * @param v (Input, host memory, 3 entries) Householder vector. + * @param A (Input/output, host memory, 16 entries) 4 x 4 matrix.
+ */ +template <typename index_type_t, typename value_type_t> +static void applyHouseholder3(const value_type_t* v, value_type_t* A) +{ + // Loop indices + index_type_t i, j; + // Dot product between Householder vector and matrix row/column + value_type_t vDotA; + + // Pre-apply Householder transform + for (j = 0; j < 4; ++j) { + vDotA = 0; + for (i = 0; i < 3; ++i) + vDotA += v[i] * A[IDX(i, j, 4)]; + for (i = 0; i < 3; ++i) + A[IDX(i, j, 4)] -= 2 * v[i] * vDotA; + } + + // Post-apply Householder transform + for (i = 0; i < 4; ++i) { + vDotA = 0; + for (j = 0; j < 3; ++j) + vDotA += A[IDX(i, j, 4)] * v[j]; + for (j = 0; j < 3; ++j) + A[IDX(i, j, 4)] -= 2 * vDotA * v[j]; + } +} + +/** + * @brief Perform one step of Francis QR algorithm + * Equivalent to two steps of the classical QR algorithm on a + * tridiagonal matrix. + * @tparam index_type_t the type of data used for indexing. + * @tparam value_type_t the type of data used for weights, distances. + * @param n Matrix dimension. + * @param shift1 QR algorithm shift. + * @param shift2 QR algorithm shift. + * @param alpha (Input/output, host memory, n entries) Diagonal + * entries of tridiagonal matrix. + * @param beta (Input/output, host memory, n-1 entries) + * Off-diagonal entries of tridiagonal matrix. + * @param V (Input/output, host memory, n*n entries) Orthonormal + * transforms from previous steps of QR algorithm. Matrix + * dimensions are n x n. On exit, the orthonormal transform from + * this Francis QR step is post-applied to the matrix. + * @param work (Output, host memory, 3*n entries) Workspace. + * @return Zero if successful. Otherwise non-zero.
+ */ +template <typename index_type_t, typename value_type_t> +static int francisQRIteration(index_type_t n, + value_type_t shift1, + value_type_t shift2, + value_type_t* alpha, + value_type_t* beta, + value_type_t* V, + value_type_t* work) +{ + // ------------------------------------------------------- + // Variable declaration + // ------------------------------------------------------- + + // Temporary storage of 4x4 bulge and Householder vector + value_type_t bulge[16]; + + // Householder vector + value_type_t householder[3]; + // Householder matrix + value_type_t householderMatrix[3 * 3]; + + // Shifts are roots of the polynomial p(x)=x^2+b*x+c + value_type_t b = -shift1 - shift2; + value_type_t c = shift1 * shift2; + + // Loop indices + index_type_t i, j, pos; + // Temporary variable + value_type_t temp; + + // ------------------------------------------------------- + // Implementation + // ------------------------------------------------------- + + // Compute initial Householder transform + householder[0] = alpha[0] * alpha[0] + beta[0] * beta[0] + b * alpha[0] + c; + householder[1] = beta[0] * (alpha[0] + alpha[1] + b); + householder[2] = beta[0] * beta[1]; + findHouseholder3<index_type_t, value_type_t>(householder, &temp, householderMatrix); + + // Apply initial Householder transform to create bulge + memset(bulge, 0, 16 * sizeof(value_type_t)); + for (i = 0; i < 4; ++i) + bulge[IDX(i, i, 4)] = alpha[i]; + for (i = 0; i < 3; ++i) { + bulge[IDX(i + 1, i, 4)] = beta[i]; + bulge[IDX(i, i + 1, 4)] = beta[i]; + } + applyHouseholder3<index_type_t, value_type_t>(householder, bulge); + Lapack<value_type_t>::gemm(false, false, n, 3, 3, 1, V, n, householderMatrix, 3, 0, work, n); + memcpy(V, work, 3 * n * sizeof(value_type_t)); + + // Chase bulge to bottom-right of matrix with Householder transforms + for (pos = 0; pos < n - 4; ++pos) { + // Move to next position + alpha[pos] = bulge[IDX(0, 0, 4)]; + householder[0] = bulge[IDX(1, 0, 4)]; + householder[1] = bulge[IDX(2, 0, 4)]; + householder[2] = bulge[IDX(3, 0, 4)]; + for (j = 0; j < 3; ++j) + for (i = 0; i < 3; ++i) +
bulge[IDX(i, j, 4)] = bulge[IDX(i + 1, j + 1, 4)]; + bulge[IDX(3, 0, 4)] = 0; + bulge[IDX(3, 1, 4)] = 0; + bulge[IDX(3, 2, 4)] = beta[pos + 3]; + bulge[IDX(0, 3, 4)] = 0; + bulge[IDX(1, 3, 4)] = 0; + bulge[IDX(2, 3, 4)] = beta[pos + 3]; + bulge[IDX(3, 3, 4)] = alpha[pos + 4]; + + // Apply Householder transform + findHouseholder3<index_type_t, value_type_t>(householder, beta + pos, householderMatrix); + applyHouseholder3<index_type_t, value_type_t>(householder, bulge); + Lapack<value_type_t>::gemm( + false, false, n, 3, 3, 1, V + IDX(0, pos + 1, n), n, householderMatrix, 3, 0, work, n); + memcpy(V + IDX(0, pos + 1, n), work, 3 * n * sizeof(value_type_t)); + } + + // Apply penultimate Householder transform + // Values in the last row and column are zero + alpha[n - 4] = bulge[IDX(0, 0, 4)]; + householder[0] = bulge[IDX(1, 0, 4)]; + householder[1] = bulge[IDX(2, 0, 4)]; + householder[2] = bulge[IDX(3, 0, 4)]; + for (j = 0; j < 3; ++j) + for (i = 0; i < 3; ++i) + bulge[IDX(i, j, 4)] = bulge[IDX(i + 1, j + 1, 4)]; + bulge[IDX(3, 0, 4)] = 0; + bulge[IDX(3, 1, 4)] = 0; + bulge[IDX(3, 2, 4)] = 0; + bulge[IDX(0, 3, 4)] = 0; + bulge[IDX(1, 3, 4)] = 0; + bulge[IDX(2, 3, 4)] = 0; + bulge[IDX(3, 3, 4)] = 0; + findHouseholder3<index_type_t, value_type_t>(householder, beta + n - 4, householderMatrix); + applyHouseholder3<index_type_t, value_type_t>(householder, bulge); + Lapack<value_type_t>::gemm( + false, false, n, 3, 3, 1, V + IDX(0, n - 3, n), n, householderMatrix, 3, 0, work, n); + memcpy(V + IDX(0, n - 3, n), work, 3 * n * sizeof(value_type_t)); + + // Apply final Householder transform + // Values in the last two rows and columns are zero + alpha[n - 3] = bulge[IDX(0, 0, 4)]; + householder[0] = bulge[IDX(1, 0, 4)]; + householder[1] = bulge[IDX(2, 0, 4)]; + householder[2] = 0; + for (j = 0; j < 3; ++j) + for (i = 0; i < 3; ++i) + bulge[IDX(i, j, 4)] = bulge[IDX(i + 1, j + 1, 4)]; + findHouseholder3<index_type_t, value_type_t>(householder, beta + n - 3, householderMatrix); + applyHouseholder3<index_type_t, value_type_t>(householder, bulge); + Lapack<value_type_t>::gemm( + false, false, n, 2, 2, 1, V + IDX(0, n - 2, n), n, householderMatrix, 3, 0, work, n); + memcpy(V + IDX(0, n -
2, n), work, 2 * n * sizeof(value_type_t)); + + // Bulge has been eliminated + alpha[n - 2] = bulge[IDX(0, 0, 4)]; + alpha[n - 1] = bulge[IDX(1, 1, 4)]; + beta[n - 2] = bulge[IDX(1, 0, 4)]; + + return 0; +} + +/** + * @brief Perform implicit restart of Lanczos algorithm + * Shifts are Chebyshev nodes of unwanted region of matrix spectrum. + * @tparam index_type_t the type of data used for indexing. + * @tparam value_type_t the type of data used for weights, distances. + * @param handle the raft handle. + * @param n Matrix dimension. + * @param iter Current Lanczos iteration. + * @param iter_new Lanczos iteration after restart. + * @param shiftUpper Pointer (host memory) to upper bound for unwanted + * region. Value is ignored if less than *shiftLower. If a + * stronger upper bound has been found, the value is updated on + * exit. + * @param shiftLower Pointer (host memory) to lower bound for unwanted + * region. Value is ignored if greater than *shiftUpper. If a + * stronger lower bound has been found, the value is updated on + * exit. + * @param alpha_host (Input/output, host memory, iter entries) + * Diagonal entries of Lanczos system. + * @param beta_host (Input/output, host memory, iter entries) + * Off-diagonal entries of Lanczos system. + * @param V_host (Output, host memory, iter*iter entries) + * Orthonormal transform used to obtain restarted system. Matrix + * dimensions are iter x iter. + * @param work_host (Output, host memory, 4*iter entries) + * Workspace. + * @param lanczosVecs_dev (Input/output, device memory, n*(iter+1) + * entries) Lanczos vectors. Vectors are stored as columns of a + * column-major matrix with dimensions n x (iter+1). + * @param work_dev (Output, device memory, (n+iter)*iter entries) + * Workspace. + * @param smallest_eig specifies whether smallest (true) or largest + * (false) eigenvalues are to be calculated. + * @return error flag. 
+ */ +template <typename index_type_t, typename value_type_t> +static int lanczosRestart(handle_t const& handle, + index_type_t n, + index_type_t iter, + index_type_t iter_new, + value_type_t* shiftUpper, + value_type_t* shiftLower, + value_type_t* __restrict__ alpha_host, + value_type_t* __restrict__ beta_host, + value_type_t* __restrict__ V_host, + value_type_t* __restrict__ work_host, + value_type_t* __restrict__ lanczosVecs_dev, + value_type_t* __restrict__ work_dev, + bool smallest_eig) +{ + // ------------------------------------------------------- + // Variable declaration + // ------------------------------------------------------- + + // Useful constants + constexpr value_type_t zero = 0; + constexpr value_type_t one = 1; + + auto cublas_h = handle.get_cublas_handle(); + auto stream = handle.get_stream(); + + // Loop index + index_type_t i; + + // Number of implicit restart steps + // Assumed to be even since each call to Francis algorithm is + // equivalent to two calls of QR algorithm + index_type_t restartSteps = iter - iter_new; + + // Ritz values from Lanczos method + value_type_t* ritzVals_host = work_host + 3 * iter; + // Shifts for implicit restart + value_type_t* shifts_host; + + // Orthonormal matrix for similarity transform + value_type_t* V_dev = work_dev + n * iter; + + // ------------------------------------------------------- + // Implementation + // ------------------------------------------------------- + + // Compute Ritz values + memcpy(ritzVals_host, alpha_host, iter * sizeof(value_type_t)); + memcpy(work_host, beta_host, (iter - 1) * sizeof(value_type_t)); + Lapack<value_type_t>::sterf(iter, ritzVals_host, work_host); + + // Debug: Print largest eigenvalues + // for (int i = iter-iter_new; i < iter; ++i) + // std::cout <<*(ritzVals_host+i)<< " "; + // std::cout <<std::endl; + + // Determine interval to suppress eigenvalues + if (smallest_eig) { + if (*shiftLower > *shiftUpper) { + *shiftUpper = ritzVals_host[iter - 1]; + *shiftLower = ritzVals_host[iter_new]; + } else { + *shiftUpper = std::max(*shiftUpper, ritzVals_host[iter - 1]); + *shiftLower = std::min(*shiftLower, ritzVals_host[iter_new]);
+ } + } else { + if (*shiftLower > *shiftUpper) { + *shiftUpper = ritzVals_host[iter - iter_new - 1]; + *shiftLower = ritzVals_host[0]; + } else { + *shiftUpper = std::max(*shiftUpper, ritzVals_host[iter - iter_new - 1]); + *shiftLower = std::min(*shiftLower, ritzVals_host[0]); + } + } + + // Calculate Chebyshev nodes as shifts + shifts_host = ritzVals_host; + for (i = 0; i < restartSteps; ++i) { + shifts_host[i] = cos((i + 0.5) * static_cast<value_type_t>(M_PI) / restartSteps); + shifts_host[i] *= 0.5 * ((*shiftUpper) - (*shiftLower)); + shifts_host[i] += 0.5 * ((*shiftUpper) + (*shiftLower)); + } + + // Apply Francis QR algorithm to implicitly restart Lanczos + for (i = 0; i < restartSteps; i += 2) + if (francisQRIteration( + iter, shifts_host[i], shifts_host[i + 1], alpha_host, beta_host, V_host, work_host)) + WARNING("error in implicitly shifted QR algorithm"); + + // Obtain new residual + RAFT_CUDA_TRY(cudaMemcpyAsync( + V_dev, V_host, iter * iter * sizeof(value_type_t), cudaMemcpyHostToDevice, stream)); + + beta_host[iter - 1] = beta_host[iter - 1] * V_host[IDX(iter - 1, iter_new - 1, iter)]; + RAFT_CUBLAS_TRY(raft::linalg::detail::cublasgemv(cublas_h, + CUBLAS_OP_N, + n, + iter, + beta_host + iter_new - 1, + lanczosVecs_dev, + n, + V_dev + IDX(0, iter_new, iter), + 1, + beta_host + iter - 1, + lanczosVecs_dev + IDX(0, iter, n), + 1, + stream)); + + // Obtain new Lanczos vectors + RAFT_CUBLAS_TRY(raft::linalg::detail::cublasgemm(cublas_h, + CUBLAS_OP_N, + CUBLAS_OP_N, + n, + iter_new, + iter, + &one, + lanczosVecs_dev, + n, + V_dev, + iter, + &zero, + work_dev, + n, + stream)); + + RAFT_CUDA_TRY(cudaMemcpyAsync(lanczosVecs_dev, + work_dev, + n * iter_new * sizeof(value_type_t), + cudaMemcpyDeviceToDevice, + stream)); + + // Normalize residual to obtain new Lanczos vector + RAFT_CUDA_TRY(cudaMemcpyAsync(lanczosVecs_dev + IDX(0, iter_new, n), + lanczosVecs_dev + IDX(0, iter, n), + n * sizeof(value_type_t), + cudaMemcpyDeviceToDevice, + stream)); + +
RAFT_CUBLAS_TRY(raft::linalg::detail::cublasnrm2( + cublas_h, n, lanczosVecs_dev + IDX(0, iter_new, n), 1, beta_host + iter_new - 1, stream)); + + auto h_beta = 1 / beta_host[iter_new - 1]; + RAFT_CUBLAS_TRY(raft::linalg::detail::cublasscal( + cublas_h, n, &h_beta, lanczosVecs_dev + IDX(0, iter_new, n), 1, stream)); + + return 0; +} + +/** + * @brief Compute smallest eigenvectors of symmetric matrix + * Computes eigenvalues and eigenvectors that are least + * positive. If matrix is positive definite or positive + * semidefinite, the computed eigenvalues are smallest in + * magnitude. + * The largest eigenvalue is estimated by performing several + * Lanczos iterations. An implicitly restarted Lanczos method is + * then applied to A+s*I, where s is the negative of the largest + * eigenvalue. + * @tparam index_type_t the type of data used for indexing. + * @tparam value_type_t the type of data used for weights, distances. + * @param handle the raft handle. + * @param A Matrix. + * @param nEigVecs Number of eigenvectors to compute. + * @param maxIter Maximum number of Lanczos steps. Does not include + * Lanczos steps used to estimate largest eigenvalue. + * @param restartIter Maximum size of Lanczos system before + * performing an implicit restart. Should be at least 4. + * @param tol Convergence tolerance. Lanczos iteration will + * terminate when the residual norm is less than tol*theta, where + * theta is an estimate for the smallest unwanted eigenvalue + * (i.e. the (nEigVecs+1)th smallest eigenvalue). + * @param reorthogonalize Whether to reorthogonalize Lanczos + * vectors. + * @param effIter On exit, pointer to final size of Lanczos system. + * @param totalIter On exit, pointer to total number of Lanczos + * iterations performed. Does not include Lanczos steps used to + * estimate largest eigenvalue. + * @param shift On exit, pointer to matrix shift (estimate for + * largest eigenvalue).
+ * @param alpha_host (Output, host memory, restartIter entries) + * Diagonal entries of Lanczos system. + * @param beta_host (Output, host memory, restartIter entries) + * Off-diagonal entries of Lanczos system. + * @param lanczosVecs_dev (Output, device memory, n*(restartIter+1) + * entries) Lanczos vectors. Vectors are stored as columns of a + * column-major matrix with dimensions n x (restartIter+1). + * @param work_dev (Output, device memory, + * (n+restartIter)*restartIter entries) Workspace. + * @param eigVals_dev (Output, device memory, nEigVecs entries) + * Smallest eigenvalues of matrix. + * @param eigVecs_dev (Output, device memory, n*nEigVecs entries) + * Eigenvectors corresponding to smallest eigenvalues of + * matrix. Vectors are stored as columns of a column-major matrix + * with dimensions n x nEigVecs. + * @param seed random seed. + * @return error flag. + */ +template <typename index_type_t, typename value_type_t> +int computeSmallestEigenvectors( + handle_t const& handle, + spectral::matrix::sparse_matrix_t<index_type_t, value_type_t> const* A, + index_type_t nEigVecs, + index_type_t maxIter, + index_type_t restartIter, + value_type_t tol, + bool reorthogonalize, + index_type_t* effIter, + index_type_t* totalIter, + value_type_t* shift, + value_type_t* __restrict__ alpha_host, + value_type_t* __restrict__ beta_host, + value_type_t* __restrict__ lanczosVecs_dev, + value_type_t* __restrict__ work_dev, + value_type_t* __restrict__ eigVals_dev, + value_type_t* __restrict__ eigVecs_dev, + unsigned long long seed) +{ + // Useful constants + constexpr value_type_t one = 1; + constexpr value_type_t zero = 0; + + // Matrix dimension + index_type_t n = A->nrows_; + + // Shift for implicit restart + value_type_t shiftUpper; + value_type_t shiftLower; + + // Lanczos iteration counters + index_type_t maxIter_curr = restartIter; // Maximum size of Lanczos system + + // Status flags + int status; + + // Loop index + index_type_t i; + + // Host memory + value_type_t* Z_host; // Eigenvectors in Lanczos basis + value_type_t* work_host; //
Workspace + + // ------------------------------------------------------- + // Check that parameters are valid + // ------------------------------------------------------- + RAFT_EXPECTS(nEigVecs > 0 && nEigVecs <= n, "Invalid number of eigenvectors."); + RAFT_EXPECTS(restartIter > 0, "Invalid restartIter."); + RAFT_EXPECTS(tol > 0, "Invalid tolerance."); + RAFT_EXPECTS(maxIter >= nEigVecs, "Invalid maxIter."); + RAFT_EXPECTS(restartIter >= nEigVecs, "Invalid restartIter."); + + auto cublas_h = handle.get_cublas_handle(); + auto stream = handle.get_stream(); + + // ------------------------------------------------------- + // Variable initialization + // ------------------------------------------------------- + + // Total number of Lanczos iterations + *totalIter = 0; + + // Allocate host memory + std::vector<value_type_t> Z_host_v(restartIter * restartIter); + std::vector<value_type_t> work_host_v(4 * restartIter); + + Z_host = Z_host_v.data(); + work_host = work_host_v.data(); + + // Initialize cuBLAS + RAFT_CUBLAS_TRY( + raft::linalg::detail::cublassetpointermode(cublas_h, CUBLAS_POINTER_MODE_HOST, stream)); + + // ------------------------------------------------------- + // Compute largest eigenvalue to determine shift + // ------------------------------------------------------- + + // Random number generator + curandGenerator_t randGen; + // Initialize random number generator + curandCreateGenerator(&randGen, CURAND_RNG_PSEUDO_PHILOX4_32_10); + + curandSetPseudoRandomGeneratorSeed(randGen, seed); + + // Initialize initial Lanczos vector + curandGenerateNormalX(randGen, lanczosVecs_dev, n + n % 2, zero, one); + value_type_t normQ1; + RAFT_CUBLAS_TRY( + raft::linalg::detail::cublasnrm2(cublas_h, n, lanczosVecs_dev, 1, &normQ1, stream)); + + auto h_val = 1 / normQ1; + RAFT_CUBLAS_TRY( + raft::linalg::detail::cublasscal(cublas_h, n, &h_val, lanczosVecs_dev, 1, stream)); + + // Obtain tridiagonal matrix with Lanczos + *effIter = 0; + *shift = 0; + status = performLanczosIteration<index_type_t, value_type_t>(handle, + A, +
effIter, + maxIter_curr, + *shift, + 0.0, + reorthogonalize, + alpha_host, + beta_host, + lanczosVecs_dev, + work_dev); + if (status) WARNING("error in Lanczos iteration"); + + // Determine largest eigenvalue + + Lapack<value_type_t>::sterf(*effIter, alpha_host, beta_host); + *shift = -alpha_host[*effIter - 1]; + + // ------------------------------------------------------- + // Compute eigenvectors of shifted matrix + // ------------------------------------------------------- + + // Obtain tridiagonal matrix with Lanczos + *effIter = 0; + + status = performLanczosIteration<index_type_t, value_type_t>(handle, + A, + effIter, + maxIter_curr, + *shift, + 0, + reorthogonalize, + alpha_host, + beta_host, + lanczosVecs_dev, + work_dev); + if (status) WARNING("error in Lanczos iteration"); + *totalIter += *effIter; + + // Apply Lanczos method until convergence + shiftLower = 1; + shiftUpper = -1; + while (*totalIter < maxIter && beta_host[*effIter - 1] > tol * shiftLower) { + // Determine number of restart steps + // Number of steps must be even due to Francis algorithm + index_type_t iter_new = nEigVecs + 1; + if (restartIter - (maxIter - *totalIter) > nEigVecs + 1) + iter_new = restartIter - (maxIter - *totalIter); + if ((restartIter - iter_new) % 2) iter_new -= 1; + if (iter_new == *effIter) break; + + // Implicit restart of Lanczos method + status = lanczosRestart(handle, + n, + *effIter, + iter_new, + &shiftUpper, + &shiftLower, + alpha_host, + beta_host, + Z_host, + work_host, + lanczosVecs_dev, + work_dev, + true); + if (status) WARNING("error in Lanczos implicit restart"); + *effIter = iter_new; + + // Check for convergence + if (beta_host[*effIter - 1] <= tol * fabs(shiftLower)) break; + + // Proceed with Lanczos method + + status = performLanczosIteration<index_type_t, value_type_t>(handle, + A, + effIter, + maxIter_curr, + *shift, + tol * fabs(shiftLower), + reorthogonalize, + alpha_host, + beta_host, + lanczosVecs_dev, + work_dev); + if (status) WARNING("error in Lanczos iteration"); + *totalIter += *effIter - iter_new; + } + +
// Warning if Lanczos has failed to converge + if (beta_host[*effIter - 1] > tol * fabs(shiftLower)) { + WARNING("implicitly restarted Lanczos failed to converge"); + } + + // Solve tridiagonal system + memcpy(work_host + 2 * (*effIter), alpha_host, (*effIter) * sizeof(value_type_t)); + memcpy(work_host + 3 * (*effIter), beta_host, (*effIter - 1) * sizeof(value_type_t)); + Lapack<value_type_t>::steqr('I', + *effIter, + work_host + 2 * (*effIter), + work_host + 3 * (*effIter), + Z_host, + *effIter, + work_host); + + // Obtain desired eigenvalues by applying shift + for (i = 0; i < *effIter; ++i) + work_host[i + 2 * (*effIter)] -= *shift; + for (i = *effIter; i < nEigVecs; ++i) + work_host[i + 2 * (*effIter)] = 0; + + // Copy results to device memory + RAFT_CUDA_TRY(cudaMemcpyAsync(eigVals_dev, + work_host + 2 * (*effIter), + nEigVecs * sizeof(value_type_t), + cudaMemcpyHostToDevice, + stream)); + + RAFT_CUDA_TRY(cudaMemcpyAsync(work_dev, + Z_host, + (*effIter) * nEigVecs * sizeof(value_type_t), + cudaMemcpyHostToDevice, + stream)); + CHECK_CUDA(stream); + + // Convert eigenvectors from Lanczos basis to standard basis + RAFT_CUBLAS_TRY(raft::linalg::detail::cublasgemm(cublas_h, + CUBLAS_OP_N, + CUBLAS_OP_N, + n, + nEigVecs, + *effIter, + &one, + lanczosVecs_dev, + n, + work_dev, + *effIter, + &zero, + eigVecs_dev, + n, + stream)); + + // Clean up and exit + curandDestroyGenerator(randGen); + return 0; +} + +template <typename index_type_t, typename value_type_t> +int computeSmallestEigenvectors( + handle_t const& handle, + spectral::matrix::sparse_matrix_t<index_type_t, value_type_t> const& A, + index_type_t nEigVecs, + index_type_t maxIter, + index_type_t restartIter, + value_type_t tol, + bool reorthogonalize, + index_type_t& iter, + value_type_t* __restrict__ eigVals_dev, + value_type_t* __restrict__ eigVecs_dev, + unsigned long long seed = 1234567) +{ + // Matrix dimension + index_type_t n = A.nrows_; + + // Check that parameters are valid + RAFT_EXPECTS(nEigVecs > 0 && nEigVecs <= n, "Invalid number of eigenvectors."); + RAFT_EXPECTS(restartIter >
0, "Invalid restartIter."); + RAFT_EXPECTS(tol > 0, "Invalid tolerance."); + RAFT_EXPECTS(maxIter >= nEigVecs, "Invalid maxIter."); + RAFT_EXPECTS(restartIter >= nEigVecs, "Invalid restartIter."); + + // Allocate memory + std::vector<value_type_t> alpha_host_v(restartIter); + std::vector<value_type_t> beta_host_v(restartIter); + + value_type_t* alpha_host = alpha_host_v.data(); + value_type_t* beta_host = beta_host_v.data(); + + spectral::matrix::vector_t<value_type_t> lanczosVecs_dev(handle, n * (restartIter + 1)); + spectral::matrix::vector_t<value_type_t> work_dev(handle, (n + restartIter) * restartIter); + + // Perform Lanczos method + index_type_t effIter; + value_type_t shift; + int status = computeSmallestEigenvectors(handle, + &A, + nEigVecs, + maxIter, + restartIter, + tol, + reorthogonalize, + &effIter, + &iter, + &shift, + alpha_host, + beta_host, + lanczosVecs_dev.raw(), + work_dev.raw(), + eigVals_dev, + eigVecs_dev, + seed); + + // Clean up and return + return status; +} + +/** + * @brief Compute largest eigenvectors of symmetric matrix + * Computes eigenvalues and eigenvectors that are most + * positive. If matrix is positive definite or positive + * semidefinite, the computed eigenvalues are largest in + * magnitude. + * The largest eigenvalue is estimated by performing several + * Lanczos iterations. An implicitly restarted Lanczos method is + * then applied. + * @tparam index_type_t the type of data used for indexing. + * @tparam value_type_t the type of data used for weights, distances. + * @param handle the raft handle. + * @param A Matrix. + * @param nEigVecs Number of eigenvectors to compute. + * @param maxIter Maximum number of Lanczos steps. + * @param restartIter Maximum size of Lanczos system before + * performing an implicit restart. Should be at least 4. + * @param tol Convergence tolerance. Lanczos iteration will + * terminate when the residual norm is less than tol*theta, where + * theta is an estimate for the largest unwanted eigenvalue + * (i.e. the (nEigVecs+1)th largest eigenvalue).
+ * @param reorthogonalize Whether to reorthogonalize Lanczos + * vectors. + * @param effIter On exit, pointer to final size of Lanczos system. + * @param totalIter On exit, pointer to total number of Lanczos + * iterations performed. + * @param alpha_host (Output, host memory, restartIter entries) + * Diagonal entries of Lanczos system. + * @param beta_host (Output, host memory, restartIter entries) + * Off-diagonal entries of Lanczos system. + * @param lanczosVecs_dev (Output, device memory, n*(restartIter+1) + * entries) Lanczos vectors. Vectors are stored as columns of a + * column-major matrix with dimensions n x (restartIter+1). + * @param work_dev (Output, device memory, + * (n+restartIter)*restartIter entries) Workspace. + * @param eigVals_dev (Output, device memory, nEigVecs entries) + * Largest eigenvalues of matrix. + * @param eigVecs_dev (Output, device memory, n*nEigVecs entries) + * Eigenvectors corresponding to largest eigenvalues of + * matrix. Vectors are stored as columns of a column-major matrix + * with dimensions n x nEigVecs. + * @param seed random seed. + * @return error flag. 
+ */ +template <typename index_type_t, typename value_type_t> +int computeLargestEigenvectors( + handle_t const& handle, + spectral::matrix::sparse_matrix_t<index_type_t, value_type_t> const* A, + index_type_t nEigVecs, + index_type_t maxIter, + index_type_t restartIter, + value_type_t tol, + bool reorthogonalize, + index_type_t* effIter, + index_type_t* totalIter, + value_type_t* __restrict__ alpha_host, + value_type_t* __restrict__ beta_host, + value_type_t* __restrict__ lanczosVecs_dev, + value_type_t* __restrict__ work_dev, + value_type_t* __restrict__ eigVals_dev, + value_type_t* __restrict__ eigVecs_dev, + unsigned long long seed) +{ + // Useful constants + constexpr value_type_t one = 1; + constexpr value_type_t zero = 0; + + // Matrix dimension + index_type_t n = A->nrows_; + + // Lanczos iteration counters + index_type_t maxIter_curr = restartIter; // Maximum size of Lanczos system + + // Status flags + int status; + + // Loop index + index_type_t i; + + // Host memory + value_type_t* Z_host; // Eigenvectors in Lanczos basis + value_type_t* work_host; // Workspace + + // ------------------------------------------------------- + // Check that LAPACK is enabled + // ------------------------------------------------------- + // Lapack<value_type_t>::check_lapack_enabled(); + + // ------------------------------------------------------- + // Check that parameters are valid + // ------------------------------------------------------- + RAFT_EXPECTS(nEigVecs > 0 && nEigVecs <= n, "Invalid number of eigenvectors."); + RAFT_EXPECTS(restartIter > 0, "Invalid restartIter."); + RAFT_EXPECTS(tol > 0, "Invalid tolerance."); + RAFT_EXPECTS(maxIter >= nEigVecs, "Invalid maxIter."); + RAFT_EXPECTS(restartIter >= nEigVecs, "Invalid restartIter."); + + auto cublas_h = handle.get_cublas_handle(); + auto stream = handle.get_stream(); + + // ------------------------------------------------------- + // Variable initialization + // ------------------------------------------------------- + + // Total number of Lanczos iterations + *totalIter = 0; + + // Allocate host
memory + std::vector<value_type_t> Z_host_v(restartIter * restartIter); + std::vector<value_type_t> work_host_v(4 * restartIter); + + Z_host = Z_host_v.data(); + work_host = work_host_v.data(); + + // Initialize cuBLAS + RAFT_CUBLAS_TRY( + raft::linalg::detail::cublassetpointermode(cublas_h, CUBLAS_POINTER_MODE_HOST, stream)); + + // ------------------------------------------------------- + // Compute largest eigenvalue + // ------------------------------------------------------- + + // Random number generator + curandGenerator_t randGen; + // Initialize random number generator + curandCreateGenerator(&randGen, CURAND_RNG_PSEUDO_PHILOX4_32_10); + curandSetPseudoRandomGeneratorSeed(randGen, seed); + // Initialize initial Lanczos vector + curandGenerateNormalX(randGen, lanczosVecs_dev, n + n % 2, zero, one); + value_type_t normQ1; + RAFT_CUBLAS_TRY( + raft::linalg::detail::cublasnrm2(cublas_h, n, lanczosVecs_dev, 1, &normQ1, stream)); + + auto h_val = 1 / normQ1; + RAFT_CUBLAS_TRY( + raft::linalg::detail::cublasscal(cublas_h, n, &h_val, lanczosVecs_dev, 1, stream)); + + // Obtain tridiagonal matrix with Lanczos + *effIter = 0; + value_type_t shift_val = 0.0; + value_type_t* shift = &shift_val; + + status = performLanczosIteration<index_type_t, value_type_t>(handle, + A, + effIter, + maxIter_curr, + *shift, + 0, + reorthogonalize, + alpha_host, + beta_host, + lanczosVecs_dev, + work_dev); + if (status) WARNING("error in Lanczos iteration"); + *totalIter += *effIter; + + // Apply Lanczos method until convergence + value_type_t shiftLower = 1; + value_type_t shiftUpper = -1; + while (*totalIter < maxIter && beta_host[*effIter - 1] > tol * shiftLower) { + // Determine number of restart steps + // Number of steps must be even due to Francis algorithm + index_type_t iter_new = nEigVecs + 1; + if (restartIter - (maxIter - *totalIter) > nEigVecs + 1) + iter_new = restartIter - (maxIter - *totalIter); + if ((restartIter - iter_new) % 2) iter_new -= 1; + if (iter_new == *effIter) break; + + // Implicit restart of Lanczos method +
status = lanczosRestart(handle, + n, + *effIter, + iter_new, + &shiftUpper, + &shiftLower, + alpha_host, + beta_host, + Z_host, + work_host, + lanczosVecs_dev, + work_dev, + false); + if (status) WARNING("error in Lanczos implicit restart"); + *effIter = iter_new; + + // Check for convergence + if (beta_host[*effIter - 1] <= tol * fabs(shiftLower)) break; + + // Proceed with Lanczos method + + status = performLanczosIteration(handle, + A, + effIter, + maxIter_curr, + *shift, + tol * fabs(shiftLower), + reorthogonalize, + alpha_host, + beta_host, + lanczosVecs_dev, + work_dev); + if (status) WARNING("error in Lanczos iteration"); + *totalIter += *effIter - iter_new; + } + + // Warning if Lanczos has failed to converge + if (beta_host[*effIter - 1] > tol * fabs(shiftLower)) { + WARNING("implicitly restarted Lanczos failed to converge"); + } + for (int i = 0; i < restartIter; ++i) { + for (int j = 0; j < restartIter; ++j) + Z_host[i * restartIter + j] = 0; + } + // Solve tridiagonal system + memcpy(work_host + 2 * (*effIter), alpha_host, (*effIter) * sizeof(value_type_t)); + memcpy(work_host + 3 * (*effIter), beta_host, (*effIter - 1) * sizeof(value_type_t)); + Lapack::steqr('I', + *effIter, + work_host + 2 * (*effIter), + work_host + 3 * (*effIter), + Z_host, + *effIter, + work_host); + + // note: We need to pick the top nEigVecs eigenvalues + // but effItter can be larger than nEigVecs + // hence we add an offset for that case, because we want to access top nEigVecs eigenpairs in the + // matrix of size effIter. 
remember the array is sorted, so it is not needed for smallest + // eigenvalues case because the first ones are the smallest ones + + index_type_t top_eigenparis_idx_offset = *effIter - nEigVecs; + + // Debug : print nEigVecs largest eigenvalues + // for (int i = top_eigenparis_idx_offset; i < *effIter; ++i) + // std::cout <<*(work_host+(2*(*effIter)+i))<< " "; + // std::cout < +int computeLargestEigenvectors( + handle_t const& handle, + spectral::matrix::sparse_matrix_t const& A, + index_type_t nEigVecs, + index_type_t maxIter, + index_type_t restartIter, + value_type_t tol, + bool reorthogonalize, + index_type_t& iter, + value_type_t* __restrict__ eigVals_dev, + value_type_t* __restrict__ eigVecs_dev, + unsigned long long seed = 123456) +{ + // Matrix dimension + index_type_t n = A.nrows_; + + // Check that parameters are valid + RAFT_EXPECTS(nEigVecs > 0 && nEigVecs <= n, "Invalid number of eigenvectors."); + RAFT_EXPECTS(restartIter > 0, "Invalid restartIter."); + RAFT_EXPECTS(tol > 0, "Invalid tolerance."); + RAFT_EXPECTS(maxIter >= nEigVecs, "Invalid maxIter."); + RAFT_EXPECTS(restartIter >= nEigVecs, "Invalid restartIter."); + + // Allocate memory + std::vector alpha_host_v(restartIter); + std::vector beta_host_v(restartIter); + + value_type_t* alpha_host = alpha_host_v.data(); + value_type_t* beta_host = beta_host_v.data(); + + spectral::matrix::vector_t lanczosVecs_dev(handle, n * (restartIter + 1)); + spectral::matrix::vector_t work_dev(handle, (n + restartIter) * restartIter); + + // Perform Lanczos method + index_type_t effIter; + int status = computeLargestEigenvectors(handle, + &A, + nEigVecs, + maxIter, + restartIter, + tol, + reorthogonalize, + &effIter, + &iter, + alpha_host, + beta_host, + lanczosVecs_dev.raw(), + work_dev.raw(), + eigVals_dev, + eigVecs_dev, + seed); + + // Clean up and return + return status; +} + +} // namespace raft::sparse::solver::detail diff --git a/cpp/include/raft/sparse/mst/detail/mst_kernels.cuh 
b/cpp/include/raft/sparse/solver/detail/mst_kernels.cuh similarity index 98% rename from cpp/include/raft/sparse/mst/detail/mst_kernels.cuh rename to cpp/include/raft/sparse/solver/detail/mst_kernels.cuh index 36d426029b..916690be67 100644 --- a/cpp/include/raft/sparse/mst/detail/mst_kernels.cuh +++ b/cpp/include/raft/sparse/solver/detail/mst_kernels.cuh @@ -1,6 +1,6 @@ /* - * Copyright (c) 2020, NVIDIA CORPORATION. + * Copyright (c) 2020-2022, NVIDIA CORPORATION. * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -17,15 +17,13 @@ #pragma once -#include "utils.cuh" +#include #include -#include +#include -namespace raft { -namespace mst { -namespace detail { +namespace raft::sparse::solver::detail { template __global__ void kernel_min_edge_per_vertex(const edge_t* offsets, @@ -332,6 +330,4 @@ __global__ void kernel_count_new_mst_edges(const vertex_t* mst_src, if (threadIdx.x == 0 && block_count > 0) { atomicAdd(mst_edge_count, block_count); } } -} // namespace detail -} // namespace mst -} // namespace raft +} // namespace raft::sparse::solver::detail diff --git a/cpp/include/raft/sparse/mst/detail/mst_solver_inl.cuh b/cpp/include/raft/sparse/solver/detail/mst_solver_inl.cuh similarity index 98% rename from cpp/include/raft/sparse/mst/detail/mst_solver_inl.cuh rename to cpp/include/raft/sparse/solver/detail/mst_solver_inl.cuh index fa8ecf2563..be8b696bca 100644 --- a/cpp/include/raft/sparse/mst/detail/mst_solver_inl.cuh +++ b/cpp/include/raft/sparse/solver/detail/mst_solver_inl.cuh @@ -18,10 +18,10 @@ #include -#include "mst_kernels.cuh" -#include "utils.cuh" +#include +#include -#include +#include #include #include @@ -43,8 +43,7 @@ #include -namespace raft { -namespace mst { +namespace raft::sparse::solver { // curand generator uniform inline curandStatus_t curand_generate_uniformX(curandGenerator_t generator, @@ -115,8 +114,7 @@ MST_solver::MST_solver(const raft::han } template 
-raft::Graph_COO -MST_solver::solve() +Graph_COO MST_solver::solve() { RAFT_EXPECTS(v > 0, "0 vertices"); RAFT_EXPECTS(e > 0, "0 edges"); @@ -409,6 +407,4 @@ void MST_solver::append_src_dst_pair( src_dst_zip_end, new_edges_functor()); } - -} // namespace mst -} // namespace raft +} // namespace raft::sparse::solver diff --git a/cpp/include/raft/sparse/mst/detail/utils.cuh b/cpp/include/raft/sparse/solver/detail/mst_utils.cuh similarity index 87% rename from cpp/include/raft/sparse/mst/detail/utils.cuh rename to cpp/include/raft/sparse/solver/detail/mst_utils.cuh index 94ddf4ed94..a33141192b 100644 --- a/cpp/include/raft/sparse/mst/detail/utils.cuh +++ b/cpp/include/raft/sparse/solver/detail/mst_utils.cuh @@ -20,9 +20,7 @@ #include #include -namespace raft { -namespace mst { -namespace detail { +namespace raft::sparse::solver::detail { template __device__ idx_t get_1D_idx() @@ -30,6 +28,4 @@ __device__ idx_t get_1D_idx() return blockIdx.x * blockDim.x + threadIdx.x; } -} // namespace detail -} // namespace mst -} // namespace raft +} // namespace raft::sparse::solver::detail diff --git a/cpp/include/raft/sparse/solver/lanczos.cuh b/cpp/include/raft/sparse/solver/lanczos.cuh new file mode 100644 index 0000000000..9b5301988a --- /dev/null +++ b/cpp/include/raft/sparse/solver/lanczos.cuh @@ -0,0 +1,160 @@ +/* + * Copyright (c) 2022, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +#ifndef __LANCZOS_H +#define __LANCZOS_H + +#pragma once + +#include +#include + +namespace raft::sparse::solver { + +// ========================================================= +// Eigensolver +// ========================================================= + +/** + * @brief Compute smallest eigenvectors of symmetric matrix + * Computes eigenvalues and eigenvectors that are least + * positive. If matrix is positive definite or positive + * semidefinite, the computed eigenvalues are smallest in + * magnitude. + * The largest eigenvalue is estimated by performing several + * Lanczos iterations. An implicitly restarted Lanczos method is + * then applied to A+s*I, where s is negative the largest + * eigenvalue. + * @tparam index_type_t the type of data used for indexing. + * @tparam value_type_t the type of data used for weights, distances. + * @param handle the raft handle. + * @param A Matrix. + * @param nEigVecs Number of eigenvectors to compute. + * @param maxIter Maximum number of Lanczos steps. Does not include + * Lanczos steps used to estimate largest eigenvalue. + * @param restartIter Maximum size of Lanczos system before + * performing an implicit restart. Should be at least 4. + * @param tol Convergence tolerance. Lanczos iteration will + * terminate when the residual norm is less than tol*theta, where + * theta is an estimate for the smallest unwanted eigenvalue + * (i.e. the (nEigVecs+1)th smallest eigenvalue). + * @param reorthogonalize Whether to reorthogonalize Lanczos + * vectors. + * @param iter On exit, pointer to total number of Lanczos + * iterations performed. Does not include Lanczos steps used to + * estimate largest eigenvalue. + * @param eigVals_dev (Output, device memory, nEigVecs entries) + * Smallest eigenvalues of matrix. + * @param eigVecs_dev (Output, device memory, n*nEigVecs entries) + * Eigenvectors corresponding to smallest eigenvalues of + * matrix. 
Vectors are stored as columns of a column-major matrix + * with dimensions n x nEigVecs. + * @param seed random seed. + * @return error flag. + */ +template +int computeSmallestEigenvectors( + handle_t const& handle, + raft::spectral::matrix::sparse_matrix_t const& A, + index_type_t nEigVecs, + index_type_t maxIter, + index_type_t restartIter, + value_type_t tol, + bool reorthogonalize, + index_type_t& iter, + value_type_t* __restrict__ eigVals_dev, + value_type_t* __restrict__ eigVecs_dev, + unsigned long long seed = 1234567) +{ + return detail::computeSmallestEigenvectors(handle, + A, + nEigVecs, + maxIter, + restartIter, + tol, + reorthogonalize, + iter, + eigVals_dev, + eigVecs_dev, + seed); +} + +/** + * @brief Compute largest eigenvectors of symmetric matrix + * Computes eigenvalues and eigenvectors that are least + * positive. If matrix is positive definite or positive + * semidefinite, the computed eigenvalues are largest in + * magnitude. + * The largest eigenvalue is estimated by performing several + * Lanczos iterations. An implicitly restarted Lanczos method is + * then applied to A+s*I, where s is negative the largest + * eigenvalue. + * @tparam index_type_t the type of data used for indexing. + * @tparam value_type_t the type of data used for weights, distances. + * @param handle the raft handle. + * @param A Matrix. + * @param nEigVecs Number of eigenvectors to compute. + * @param maxIter Maximum number of Lanczos steps. Does not include + * Lanczos steps used to estimate largest eigenvalue. + * @param restartIter Maximum size of Lanczos system before + * performing an implicit restart. Should be at least 4. + * @param tol Convergence tolerance. Lanczos iteration will + * terminate when the residual norm is less than tol*theta, where + * theta is an estimate for the largest unwanted eigenvalue + * (i.e. the (nEigVecs+1)th largest eigenvalue). + * @param reorthogonalize Whether to reorthogonalize Lanczos + * vectors. 
+ * @param iter On exit, pointer to total number of Lanczos + * iterations performed. Does not include Lanczos steps used to + * estimate largest eigenvalue. + * @param eigVals_dev (Output, device memory, nEigVecs entries) + * Largest eigenvalues of matrix. + * @param eigVecs_dev (Output, device memory, n*nEigVecs entries) + * Eigenvectors corresponding to largest eigenvalues of + * matrix. Vectors are stored as columns of a column-major matrix + * with dimensions n x nEigVecs. + * @param seed random seed. + * @return error flag. + */ +template +int computeLargestEigenvectors( + handle_t const& handle, + raft::spectral::matrix::sparse_matrix_t const& A, + index_type_t nEigVecs, + index_type_t maxIter, + index_type_t restartIter, + value_type_t tol, + bool reorthogonalize, + index_type_t& iter, + value_type_t* __restrict__ eigVals_dev, + value_type_t* __restrict__ eigVecs_dev, + unsigned long long seed = 123456) +{ + return detail::computeLargestEigenvectors(handle, + A, + nEigVecs, + maxIter, + restartIter, + tol, + reorthogonalize, + iter, + eigVals_dev, + eigVecs_dev, + seed); +} + +} // namespace raft::sparse::solver + +#endif \ No newline at end of file diff --git a/cpp/include/raft/sparse/solver/mst.cuh b/cpp/include/raft/sparse/solver/mst.cuh new file mode 100644 index 0000000000..33beeb1915 --- /dev/null +++ b/cpp/include/raft/sparse/solver/mst.cuh @@ -0,0 +1,50 @@ + +/* + * Copyright (c) 2020-2022, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+ * See the License for the specific language governing permissions and + * limitations under the License. + */ +#pragma once + +#include + +namespace raft::sparse::solver { + +template +Graph_COO mst(const raft::handle_t& handle, + edge_t const* offsets, + vertex_t const* indices, + weight_t const* weights, + vertex_t const v, + edge_t const e, + vertex_t* color, + cudaStream_t stream, + bool symmetrize_output = true, + bool initialize_colors = true, + int iterations = 0) +{ + MST_solver mst_solver(handle, + offsets, + indices, + weights, + v, + e, + color, + stream, + symmetrize_output, + initialize_colors, + iterations); + return mst_solver.solve(); +} + +} // end namespace raft::sparse::solver diff --git a/cpp/include/raft/sparse/solver/mst_solver.cuh b/cpp/include/raft/sparse/solver/mst_solver.cuh new file mode 100644 index 0000000000..a10b74d77b --- /dev/null +++ b/cpp/include/raft/sparse/solver/mst_solver.cuh @@ -0,0 +1,102 @@ + +/* + * Copyright (c) 2020-2022, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +#pragma once + +#include +#include +#include + +namespace raft::sparse::solver { + +template +struct Graph_COO { + rmm::device_uvector src; + rmm::device_uvector dst; + rmm::device_uvector weights; + edge_t n_edges; + + Graph_COO(vertex_t size, cudaStream_t stream) + : src(size, stream), dst(size, stream), weights(size, stream) + { + } +}; + +template +class MST_solver { + public: + MST_solver(const raft::handle_t& handle_, + const edge_t* offsets_, + const vertex_t* indices_, + const weight_t* weights_, + const vertex_t v_, + const edge_t e_, + vertex_t* color_, + cudaStream_t stream_, + bool symmetrize_output_, + bool initialize_colors_, + int iterations_); + + Graph_COO solve(); + + ~MST_solver() {} + + private: + const raft::handle_t& handle; + cudaStream_t stream; + bool symmetrize_output, initialize_colors; + int iterations; + + // CSR + const edge_t* offsets; + const vertex_t* indices; + const weight_t* weights; + const vertex_t v; + const edge_t e; + + vertex_t max_blocks; + vertex_t max_threads; + vertex_t sm_count; + + vertex_t* color_index; // represent each supervertex as a color + rmm::device_uvector min_edge_color; // minimum incident edge weight per color + rmm::device_uvector new_mst_edge; // new minimum edge per vertex + rmm::device_uvector altered_weights; // weights to be used for mst + rmm::device_scalar mst_edge_count; // total number of edges added after every iteration + rmm::device_scalar + prev_mst_edge_count; // total number of edges up to the previous iteration + rmm::device_uvector mst_edge; // mst output - true if the edge belongs in mst + rmm::device_uvector next_color; // next iteration color + rmm::device_uvector color; // index of color that vertex points to + + // new src-dst pairs found per iteration + rmm::device_uvector temp_src; + rmm::device_uvector temp_dst; + rmm::device_uvector temp_weights; + + void label_prop(vertex_t* mst_src, vertex_t* mst_dst); + void min_edge_per_vertex(); + void min_edge_per_supervertex(); + 
void check_termination(); + void alteration(); + alteration_t alteration_max(); + void append_src_dst_pair(vertex_t* mst_src, vertex_t* mst_dst, weight_t* mst_weights); +}; + +} // namespace raft::sparse::solver + +#include diff --git a/cpp/include/raft/sparse/spatial/connect_components.cuh b/cpp/include/raft/sparse/spatial/connect_components.cuh new file mode 100644 index 0000000000..60c0bba1de --- /dev/null +++ b/cpp/include/raft/sparse/spatial/connect_components.cuh @@ -0,0 +1,79 @@ +/* + * Copyright (c) 2018-2022, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#pragma once + +#include +#include +#include +#include + +namespace raft::sparse::spatial { + +template +using FixConnectivitiesRedOp = detail::FixConnectivitiesRedOp; + +/** + * Gets the number of unique components from array of + * colors or labels. This does not assume the components are + * drawn from a monotonically increasing set. + * @tparam value_idx + * @param[in] colors array of components + * @param[in] n_rows size of components array + * @param[in] stream cuda stream for which to order cuda operations + * @return total number of components + */ +template +value_idx get_n_components(value_idx* colors, size_t n_rows, cudaStream_t stream) +{ + return detail::get_n_components(colors, n_rows, stream); +} + +/** + * Connects the components of an otherwise unconnected knn graph + * by computing a 1-nn to neighboring components of each data point + * (e.g. 
component(nn) != component(self)) and reducing the results to + * include the set of smallest destination components for each source + * component. The result will not necessarily contain + * n_components^2 - n_components number of elements because many components + * will likely not be contained in the neighborhoods of 1-nns. + * @tparam value_idx + * @tparam value_t + * @param[in] handle raft handle + * @param[out] out output edge list containing nearest cross-component + * edges. + * @param[in] X original (row-major) dense matrix for which knn graph should be constructed. + * @param[in] orig_colors array containing component number for each row of X + * @param[in] n_rows number of rows in X + * @param[in] n_cols number of cols in X + * @param[in] reduction_op + * @param[in] metric + */ +template +void connect_components( + const raft::handle_t& handle, + raft::sparse::COO& out, + const value_t* X, + const value_idx* orig_colors, + size_t n_rows, + size_t n_cols, + red_op reduction_op, + raft::distance::DistanceType metric = raft::distance::DistanceType::L2SqrtExpanded) +{ + detail::connect_components(handle, out, X, orig_colors, n_rows, n_cols, reduction_op, metric); +} + +}; // end namespace raft::sparse::spatial \ No newline at end of file diff --git a/cpp/include/raft/sparse/selection/detail/connect_components.cuh b/cpp/include/raft/sparse/spatial/detail/connect_components.cuh similarity index 98% rename from cpp/include/raft/sparse/selection/detail/connect_components.cuh rename to cpp/include/raft/sparse/spatial/detail/connect_components.cuh index 92d06197cd..f515ab5739 100644 --- a/cpp/include/raft/sparse/selection/detail/connect_components.cuh +++ b/cpp/include/raft/sparse/spatial/detail/connect_components.cuh @@ -13,9 +13,11 @@ * See the License for the specific language governing permissions and * limitations under the License. 
*/ +#pragma once #include +#include #include #include #include @@ -24,7 +26,7 @@ #include #include -#include +#include #include #include @@ -42,10 +44,7 @@ #include -namespace raft { -namespace linkage { -namespace detail { - +namespace raft::sparse::spatial::detail { /** * \brief A key identifier paired with a corresponding value * @@ -438,6 +437,4 @@ void connect_components( handle, min_edges.rows(), min_edges.cols(), min_edges.vals(), n_rows, n_rows, size, out); } -}; // end namespace detail -}; // end namespace linkage -}; // end namespace raft +}; // end namespace raft::sparse::spatial::detail diff --git a/cpp/include/raft/sparse/selection/detail/knn.cuh b/cpp/include/raft/sparse/spatial/detail/knn.cuh similarity index 98% rename from cpp/include/raft/sparse/selection/detail/knn.cuh rename to cpp/include/raft/sparse/spatial/detail/knn.cuh index b1dd6116e7..aa933cd680 100644 --- a/cpp/include/raft/sparse/selection/detail/knn.cuh +++ b/cpp/include/raft/sparse/spatial/detail/knn.cuh @@ -18,11 +18,11 @@ #include -#include -#include -#include +#include #include #include +#include +#include #include #include @@ -33,10 +33,7 @@ #include -namespace raft { -namespace sparse { -namespace selection { -namespace detail { +namespace raft::sparse::spatial::detail { template struct csr_batcher_t { @@ -428,7 +425,4 @@ class sparse_knn_t { const raft::handle_t& handle; }; -}; // namespace detail -}; // namespace selection -}; // namespace sparse -}; // namespace raft +}; // namespace raft::sparse::spatial::detail \ No newline at end of file diff --git a/cpp/include/raft/sparse/selection/detail/knn_graph.cuh b/cpp/include/raft/sparse/spatial/detail/knn_graph.cuh similarity index 94% rename from cpp/include/raft/sparse/selection/detail/knn_graph.cuh rename to cpp/include/raft/sparse/spatial/detail/knn_graph.cuh index 32b7fd3c63..1331393719 100644 --- a/cpp/include/raft/sparse/selection/detail/knn_graph.cuh +++ b/cpp/include/raft/sparse/spatial/detail/knn_graph.cuh @@ -16,15 
+16,15 @@ #pragma once -#include -#include +#include +#include #include #include #include -#include +#include #include #include @@ -35,10 +35,7 @@ #include #include -namespace raft { -namespace sparse { -namespace selection { -namespace detail { +namespace raft::sparse::spatial::detail { /** * Fills indices array of pairwise distance array @@ -150,7 +147,4 @@ void knn_graph(const handle_t& handle, handle, rows.data(), indices.data(), data.data(), m, k, nnz, out); } -}; // namespace detail -}; // namespace selection -}; // namespace sparse -}; // end namespace raft +}; // namespace raft::sparse::spatial::detail diff --git a/cpp/include/raft/sparse/spatial/knn.cuh b/cpp/include/raft/sparse/spatial/knn.cuh new file mode 100644 index 0000000000..1e8a08ec96 --- /dev/null +++ b/cpp/include/raft/sparse/spatial/knn.cuh @@ -0,0 +1,93 @@ +/* + * Copyright (c) 2020-2022, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +#pragma once + +#include +#include +#include + +namespace raft::sparse::spatial { + +/** + * Search the sparse kNN for the k-nearest neighbors of a set of sparse query vectors + * using some distance implementation + * @param[in] idxIndptr csr indptr of the index matrix (size n_idx_rows + 1) + * @param[in] idxIndices csr column indices array of the index matrix (size n_idx_nnz) + * @param[in] idxData csr data array of the index matrix (size idxNNZ) + * @param[in] idxNNZ number of non-zeros for sparse index matrix + * @param[in] n_idx_rows number of data samples in index matrix + * @param[in] n_idx_cols + * @param[in] queryIndptr csr indptr of the query matrix (size n_query_rows + 1) + * @param[in] queryIndices csr indices array of the query matrix (size queryNNZ) + * @param[in] queryData csr data array of the query matrix (size queryNNZ) + * @param[in] queryNNZ number of non-zeros for sparse query matrix + * @param[in] n_query_rows number of data samples in query matrix + * @param[in] n_query_cols number of features in query matrix + * @param[out] output_indices dense matrix for output indices (size n_query_rows * k) + * @param[out] output_dists dense matrix for output distances (size n_query_rows * k) + * @param[in] k the number of neighbors to query + * @param[in] handle CUDA handle.get_stream() to order operations with respect to + * @param[in] batch_size_index maximum number of rows to use from index matrix per batch + * @param[in] batch_size_query maximum number of rows to use from query matrix per batch + * @param[in] metric distance metric/measure to use + * @param[in] metricArg potential argument for metric (currently unused) + */ +template +void brute_force_knn(const value_idx* idxIndptr, + const value_idx* idxIndices, + const value_t* idxData, + size_t idxNNZ, + int n_idx_rows, + int n_idx_cols, + const value_idx* queryIndptr, + const value_idx* queryIndices, + const value_t* queryData, + size_t queryNNZ, + int n_query_rows, + int n_query_cols, + 
value_idx* output_indices, + value_t* output_dists, + int k, + const raft::handle_t& handle, + size_t batch_size_index = 2 << 14, // approx 1M + size_t batch_size_query = 2 << 14, + raft::distance::DistanceType metric = raft::distance::DistanceType::L2Expanded, + float metricArg = 0) +{ + detail::sparse_knn_t(idxIndptr, + idxIndices, + idxData, + idxNNZ, + n_idx_rows, + n_idx_cols, + queryIndptr, + queryIndices, + queryData, + queryNNZ, + n_query_rows, + n_query_cols, + output_indices, + output_dists, + k, + handle, + batch_size_index, + batch_size_query, + metric, + metricArg) + .run(); +} + +}; // namespace raft::sparse::spatial diff --git a/cpp/include/raft/sparse/spatial/knn_graph.cuh b/cpp/include/raft/sparse/spatial/knn_graph.cuh new file mode 100644 index 0000000000..9694e6a293 --- /dev/null +++ b/cpp/include/raft/sparse/spatial/knn_graph.cuh @@ -0,0 +1,55 @@ +/* + * Copyright (c) 2021-2022, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#pragma once + +#include +#include +#include + +#include + +namespace raft::sparse::spatial { + +/** + * Constructs a (symmetrized) knn graph edge list from + * dense input vectors. + * + * Note: The resulting KNN graph is not guaranteed to be connected. 
+ * + * @tparam value_idx + * @tparam value_t + * @param[in] handle raft handle + * @param[in] X dense matrix of input data samples and observations + * @param[in] m number of data samples (rows) in X + * @param[in] n number of observations (columns) in X + * @param[in] metric distance metric to use when constructing neighborhoods + * @param[out] out output edge list + * @param c + */ +template +void knn_graph(const handle_t& handle, + const value_t* X, + std::size_t m, + std::size_t n, + raft::distance::DistanceType metric, + raft::sparse::COO& out, + int c = 15) +{ + detail::knn_graph(handle, X, m, n, metric, out, c); +} + +}; // namespace raft::sparse::spatial diff --git a/cpp/include/raft/spatial/knn/ann_common.h b/cpp/include/raft/spatial/knn/ann_common.h index 45867dbfee..a0d79a1b77 100644 --- a/cpp/include/raft/spatial/knn/ann_common.h +++ b/cpp/include/raft/spatial/knn/ann_common.h @@ -23,7 +23,7 @@ #include "detail/processing.hpp" #include "ivf_flat_types.hpp" -#include +#include #include #include diff --git a/cpp/include/raft/spatial/knn/ann_types.hpp b/cpp/include/raft/spatial/knn/ann_types.hpp new file mode 100644 index 0000000000..6e9a00bc0c --- /dev/null +++ b/cpp/include/raft/spatial/knn/ann_types.hpp @@ -0,0 +1,47 @@ +/* + * Copyright (c) 2022, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#pragma once + +#include + +namespace raft::spatial::knn { + +/** The base for approximate KNN index structures. 
*/ +struct index { +}; + +/** The base for KNN index parameters. */ +struct index_params { + /** Distance type. */ + raft::distance::DistanceType metric = distance::DistanceType::L2Expanded; + /** The argument used by some distance metrics. */ + float metric_arg = 2.0f; + /** + * Whether to add the dataset content to the index, i.e.: + * + * - `true` means the index is filled with the dataset vectors and ready to search after calling + * `build`. + * - `false` means `build` only trains the underlying model (e.g. quantizer or clustering), but + * the index is left empty; you'd need to call `extend` on the index afterwards to populate it. + */ + bool add_data_on_build = true; +}; + +struct search_params { +}; + +}; // namespace raft::spatial::knn diff --git a/cpp/include/raft/spatial/knn/ball_cover.cuh b/cpp/include/raft/spatial/knn/ball_cover.cuh index 62cd5aa45c..a354f6d5a4 100644 --- a/cpp/include/raft/spatial/knn/ball_cover.cuh +++ b/cpp/include/raft/spatial/knn/ball_cover.cuh @@ -20,10 +20,10 @@ #include -#include "ball_cover_common.h" +#include "ball_cover_types.hpp" #include "detail/ball_cover.cuh" #include "detail/ball_cover/common.cuh" -#include +#include #include namespace raft { diff --git a/cpp/include/raft/spatial/knn/ball_cover_common.h b/cpp/include/raft/spatial/knn/ball_cover_common.h index a2234abf26..9b775bbb82 100644 --- a/cpp/include/raft/spatial/knn/ball_cover_common.h +++ b/cpp/include/raft/spatial/knn/ball_cover_common.h @@ -1,5 +1,5 @@ /* - * Copyright (c) 2021-2022, NVIDIA CORPORATION. + * Copyright (c) 2020-2022, NVIDIA CORPORATION. * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -16,90 +16,8 @@ #pragma once -#include -#include -#include -#include +#pragma message(__FILE__ \ + " is deprecated and will be removed in a future release." 
\ + " Please use the ball_cover_types.hpp version instead.") -namespace raft { -namespace spatial { -namespace knn { - -/** - * Stores raw index data points, sampled landmarks, the 1-nns of index points - * to their closest landmarks, and the ball radii of each landmark. This - * class is intended to be constructed once and reused across subsequent - * queries. - * @tparam value_idx - * @tparam value_t - * @tparam value_int - */ -template -class BallCoverIndex { - public: - explicit BallCoverIndex(const raft::handle_t& handle_, - const value_t* X_, - value_int m_, - value_int n_, - raft::distance::DistanceType metric_) - : handle(handle_), - X(X_), - m(m_), - n(n_), - metric(metric_), - /** - * the sqrt() here makes the sqrt(m)^2 a linear-time lower bound - * - * Total memory footprint of index: (2 * sqrt(m)) + (n * sqrt(m)) + (2 * m) - */ - n_landmarks(sqrt(m_)), - R_indptr(sqrt(m_) + 1, handle.get_stream()), - R_1nn_cols(m_, handle.get_stream()), - R_1nn_dists(m_, handle.get_stream()), - R_closest_landmark_dists(m_, handle.get_stream()), - R(sqrt(m_) * n_, handle.get_stream()), - R_radius(sqrt(m_), handle.get_stream()), - index_trained(false) - { - } - - value_idx* get_R_indptr() { return R_indptr.data(); } - value_idx* get_R_1nn_cols() { return R_1nn_cols.data(); } - value_t* get_R_1nn_dists() { return R_1nn_dists.data(); } - value_t* get_R_radius() { return R_radius.data(); } - value_t* get_R() { return R.data(); } - value_t* get_R_closest_landmark_dists() { return R_closest_landmark_dists.data(); } - const value_t* get_X() { return X; } - - bool is_index_trained() const { return index_trained; }; - - // This should only be set by internal functions - void set_index_trained() { index_trained = true; } - - const raft::handle_t& handle; - - const value_int m; - const value_int n; - const value_int n_landmarks; - - const value_t* X; - - raft::distance::DistanceType metric; - - private: - // CSR storing the neighborhoods for each data point - rmm::device_uvector 
R_indptr; - rmm::device_uvector R_1nn_cols; - rmm::device_uvector R_1nn_dists; - rmm::device_uvector R_closest_landmark_dists; - - rmm::device_uvector R_radius; - - rmm::device_uvector R; - - protected: - bool index_trained; -}; -} // namespace knn -} // namespace spatial -} // namespace raft +#include diff --git a/cpp/include/raft/spatial/knn/ball_cover_types.hpp b/cpp/include/raft/spatial/knn/ball_cover_types.hpp new file mode 100644 index 0000000000..9870217011 --- /dev/null +++ b/cpp/include/raft/spatial/knn/ball_cover_types.hpp @@ -0,0 +1,105 @@ +/* + * Copyright (c) 2021-2022, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#pragma once + +#include +#include +#include +#include + +namespace raft { +namespace spatial { +namespace knn { + +/** + * Stores raw index data points, sampled landmarks, the 1-nns of index points + * to their closest landmarks, and the ball radii of each landmark. This + * class is intended to be constructed once and reused across subsequent + * queries. 
+ * @tparam value_idx + * @tparam value_t + * @tparam value_int + */ +template +class BallCoverIndex { + public: + explicit BallCoverIndex(const raft::handle_t& handle_, + const value_t* X_, + value_int m_, + value_int n_, + raft::distance::DistanceType metric_) + : handle(handle_), + X(X_), + m(m_), + n(n_), + metric(metric_), + /** + * the sqrt() here makes the sqrt(m)^2 a linear-time lower bound + * + * Total memory footprint of index: (2 * sqrt(m)) + (n * sqrt(m)) + (2 * m) + */ + n_landmarks(sqrt(m_)), + R_indptr(sqrt(m_) + 1, handle.get_stream()), + R_1nn_cols(m_, handle.get_stream()), + R_1nn_dists(m_, handle.get_stream()), + R_closest_landmark_dists(m_, handle.get_stream()), + R(sqrt(m_) * n_, handle.get_stream()), + R_radius(sqrt(m_), handle.get_stream()), + index_trained(false) + { + } + + value_idx* get_R_indptr() { return R_indptr.data(); } + value_idx* get_R_1nn_cols() { return R_1nn_cols.data(); } + value_t* get_R_1nn_dists() { return R_1nn_dists.data(); } + value_t* get_R_radius() { return R_radius.data(); } + value_t* get_R() { return R.data(); } + value_t* get_R_closest_landmark_dists() { return R_closest_landmark_dists.data(); } + const value_t* get_X() { return X; } + + bool is_index_trained() const { return index_trained; }; + + // This should only be set by internal functions + void set_index_trained() { index_trained = true; } + + const raft::handle_t& handle; + + const value_int m; + const value_int n; + const value_int n_landmarks; + + const value_t* X; + + raft::distance::DistanceType metric; + + private: + // CSR storing the neighborhoods for each data point + rmm::device_uvector R_indptr; + rmm::device_uvector R_1nn_cols; + rmm::device_uvector R_1nn_dists; + rmm::device_uvector R_closest_landmark_dists; + + rmm::device_uvector R_radius; + + rmm::device_uvector R; + + protected: + bool index_trained; +}; +} // namespace knn +} // namespace spatial +} // namespace raft diff --git a/cpp/include/raft/spatial/knn/common.hpp 
b/cpp/include/raft/spatial/knn/common.hpp index caaa951a66..5c444bf7a7 100644 --- a/cpp/include/raft/spatial/knn/common.hpp +++ b/cpp/include/raft/spatial/knn/common.hpp @@ -1,5 +1,5 @@ /* - * Copyright (c) 2022, NVIDIA CORPORATION. + * Copyright (c) 2021-2022, NVIDIA CORPORATION. * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -13,35 +13,11 @@ * See the License for the specific language governing permissions and * limitations under the License. */ +/** + * This file is deprecated and will be removed in a future release. + * Please use the ann_types.hpp version instead. + */ #pragma once -#include - -namespace raft::spatial::knn { - -/** The base for approximate KNN index structures. */ -struct index { -}; - -/** The base for KNN index parameters. */ -struct index_params { - /** Distance type. */ - raft::distance::DistanceType metric = distance::DistanceType::L2Expanded; - /** The argument used by some distance metrics. */ - float metric_arg = 2.0f; - /** - * Whether to add the dataset content to the index, i.e.: - * - * - `true` means the index is filled with the dataset vectors and ready to search after calling - * `build`. - * - `false` means `build` only trains the underlying model (e.g. quantizer or clustering), but - * the index is left empty; you'd need to call `extend` on the index afterwards to populate it. 
- */ - bool add_data_on_build = true; -}; - -struct search_params { -}; - -}; // namespace raft::spatial::knn +#include diff --git a/cpp/include/raft/spatial/knn/detail/ann_kmeans_balanced.cuh b/cpp/include/raft/spatial/knn/detail/ann_kmeans_balanced.cuh index 0e1beaf07f..122306639f 100644 --- a/cpp/include/raft/spatial/knn/detail/ann_kmeans_balanced.cuh +++ b/cpp/include/raft/spatial/knn/detail/ann_kmeans_balanced.cuh @@ -21,12 +21,12 @@ #include #include #include -#include #include -#include +#include #include #include #include +#include #include #include diff --git a/cpp/include/raft/spatial/knn/detail/ann_quantized.cuh b/cpp/include/raft/spatial/knn/detail/ann_quantized.cuh index 5a56a84fe3..e5900ffd69 100644 --- a/cpp/include/raft/spatial/knn/detail/ann_quantized.cuh +++ b/cpp/include/raft/spatial/knn/detail/ann_quantized.cuh @@ -22,11 +22,11 @@ #include "common_faiss.h" #include "processing.cuh" -#include -#include +#include +#include #include -#include +#include #include #include diff --git a/cpp/include/raft/spatial/knn/detail/ann_utils.cuh b/cpp/include/raft/spatial/knn/detail/ann_utils.cuh index 6b7ab16eb2..6021e98221 100644 --- a/cpp/include/raft/spatial/knn/detail/ann_utils.cuh +++ b/cpp/include/raft/spatial/knn/detail/ann_utils.cuh @@ -16,10 +16,10 @@ #pragma once -#include -#include #include -#include +#include +#include +#include #include diff --git a/cpp/include/raft/spatial/knn/detail/ball_cover.cuh b/cpp/include/raft/spatial/knn/detail/ball_cover.cuh index 2f7c76a11d..457e1f495a 100644 --- a/cpp/include/raft/spatial/knn/detail/ball_cover.cuh +++ b/cpp/include/raft/spatial/knn/detail/ball_cover.cuh @@ -16,7 +16,7 @@ #pragma once -#include +#include #include "../ball_cover_common.h" #include "ball_cover/common.cuh" @@ -29,7 +29,7 @@ #include #include -#include +#include #include #include diff --git a/cpp/include/raft/spatial/knn/detail/ball_cover/registers.cuh b/cpp/include/raft/spatial/knn/detail/ball_cover/registers.cuh index 
32f55c7931..88f5aa3460 100644 --- a/cpp/include/raft/spatial/knn/detail/ball_cover/registers.cuh +++ b/cpp/include/raft/spatial/knn/detail/ball_cover/registers.cuh @@ -26,7 +26,7 @@ #include #include -#include +#include #include #include diff --git a/cpp/include/raft/spatial/knn/detail/common_faiss.h b/cpp/include/raft/spatial/knn/detail/common_faiss.h index aca1571de2..b098d0991d 100644 --- a/cpp/include/raft/spatial/knn/detail/common_faiss.h +++ b/cpp/include/raft/spatial/knn/detail/common_faiss.h @@ -16,11 +16,11 @@ #pragma once -#include -#include +#include +#include #include -#include +#include namespace raft { namespace spatial { diff --git a/cpp/include/raft/spatial/knn/detail/haversine_distance.cuh b/cpp/include/raft/spatial/knn/detail/haversine_distance.cuh index 5d703bdb8d..b5ae9e7d5e 100644 --- a/cpp/include/raft/spatial/knn/detail/haversine_distance.cuh +++ b/cpp/include/raft/spatial/knn/detail/haversine_distance.cuh @@ -16,8 +16,8 @@ #pragma once -#include -#include +#include +#include #include #include @@ -25,8 +25,8 @@ #include #include -#include -#include +#include +#include #include namespace raft { diff --git a/cpp/include/raft/spatial/knn/detail/ivf_flat_build.cuh b/cpp/include/raft/spatial/knn/detail/ivf_flat_build.cuh index f1949ae307..2bc91d1b3b 100644 --- a/cpp/include/raft/spatial/knn/detail/ivf_flat_build.cuh +++ b/cpp/include/raft/spatial/knn/detail/ivf_flat_build.cuh @@ -24,7 +24,7 @@ #include #include #include -#include +#include #include diff --git a/cpp/include/raft/spatial/knn/detail/ivf_flat_search.cuh b/cpp/include/raft/spatial/knn/detail/ivf_flat_search.cuh index d8219a48f9..201cca5afe 100644 --- a/cpp/include/raft/spatial/knn/detail/ivf_flat_search.cuh +++ b/cpp/include/raft/spatial/knn/detail/ivf_flat_search.cuh @@ -21,16 +21,16 @@ #include "topk/radix_topk.cuh" #include "topk/warpsort_topk.cuh" -#include #include #include #include #include -#include #include -#include -#include -#include +#include +#include +#include +#include 
+#include #include #include @@ -1257,7 +1257,7 @@ inline void search(const handle_t& handle, case raft::distance::DistanceType::CosineExpanded: case raft::distance::DistanceType::CorrelationExpanded: // Similarity metrics have the opposite meaning, i.e. nearest neigbours are those with larger - // similarity (See the same logic at cpp/include/raft/sparse/selection/detail/knn.cuh:362 + // similarity (See the same logic at cpp/include/raft/sparse/spatial/detail/knn.cuh:362 // {perform_k_selection}) select_min = false; break; diff --git a/cpp/include/raft/spatial/knn/detail/knn_brute_force_faiss.cuh b/cpp/include/raft/spatial/knn/detail/knn_brute_force_faiss.cuh index 7cefeffea2..0c33c3f38f 100644 --- a/cpp/include/raft/spatial/knn/detail/knn_brute_force_faiss.cuh +++ b/cpp/include/raft/spatial/knn/detail/knn_brute_force_faiss.cuh @@ -16,8 +16,8 @@ #pragma once -#include -#include +#include +#include #include #include @@ -30,8 +30,8 @@ #include #include -#include -#include +#include +#include #include #include #include diff --git a/cpp/include/raft/spatial/knn/detail/processing.cuh b/cpp/include/raft/spatial/knn/detail/processing.cuh index a88b55e803..a80c1c1935 100644 --- a/cpp/include/raft/spatial/knn/detail/processing.cuh +++ b/cpp/include/raft/spatial/knn/detail/processing.cuh @@ -17,7 +17,7 @@ #include "processing.hpp" -#include +#include #include #include #include diff --git a/cpp/include/raft/spatial/knn/detail/selection_faiss.cuh b/cpp/include/raft/spatial/knn/detail/selection_faiss.cuh index 010bd5aaac..239379aad5 100644 --- a/cpp/include/raft/spatial/knn/detail/selection_faiss.cuh +++ b/cpp/include/raft/spatial/knn/detail/selection_faiss.cuh @@ -16,9 +16,9 @@ #pragma once -#include -#include #include +#include +#include #include #include diff --git a/cpp/include/raft/spatial/knn/detail/topk/bitonic_sort.cuh b/cpp/include/raft/spatial/knn/detail/topk/bitonic_sort.cuh index 44ffe6bc50..40ac7b0b92 100644 --- 
a/cpp/include/raft/spatial/knn/detail/topk/bitonic_sort.cuh +++ b/cpp/include/raft/spatial/knn/detail/topk/bitonic_sort.cuh @@ -16,7 +16,7 @@ #pragma once -#include +#include namespace raft::spatial::knn::detail::topk { diff --git a/cpp/include/raft/spatial/knn/detail/topk/radix_topk.cuh b/cpp/include/raft/spatial/knn/detail/topk/radix_topk.cuh index 53d88ff366..4cbad8e906 100644 --- a/cpp/include/raft/spatial/knn/detail/topk/radix_topk.cuh +++ b/cpp/include/raft/spatial/knn/detail/topk/radix_topk.cuh @@ -18,8 +18,8 @@ #include #include -#include -#include +#include +#include #include #include diff --git a/cpp/include/raft/spatial/knn/detail/topk/warpsort_topk.cuh b/cpp/include/raft/spatial/knn/detail/topk/warpsort_topk.cuh index 23448b6dc4..dfbe8a735d 100644 --- a/cpp/include/raft/spatial/knn/detail/topk/warpsort_topk.cuh +++ b/cpp/include/raft/spatial/knn/detail/topk/warpsort_topk.cuh @@ -19,8 +19,8 @@ #include "bitonic_sort.cuh" #include -#include -#include +#include +#include #include #include diff --git a/cpp/include/raft/spatial/knn/ivf_flat_types.hpp b/cpp/include/raft/spatial/knn/ivf_flat_types.hpp index 02c4e30c1f..b9d8db0404 100644 --- a/cpp/include/raft/spatial/knn/ivf_flat_types.hpp +++ b/cpp/include/raft/spatial/knn/ivf_flat_types.hpp @@ -16,11 +16,11 @@ #pragma once -#include "common.hpp" +#include "ann_types.hpp" #include #include -#include +#include #include #include diff --git a/cpp/include/raft/spatial/knn/knn.cuh b/cpp/include/raft/spatial/knn/knn.cuh index 52e7e31cc2..deed59195b 100644 --- a/cpp/include/raft/spatial/knn/knn.cuh +++ b/cpp/include/raft/spatial/knn/knn.cuh @@ -22,7 +22,7 @@ #include "detail/topk/radix_topk.cuh" #include "detail/topk/warpsort_topk.cuh" -#include +#include namespace raft::spatial::knn { diff --git a/cpp/include/raft/spatial/knn/specializations/ball_cover.cuh b/cpp/include/raft/spatial/knn/specializations/ball_cover.cuh index 033862c2f1..0c35bf4b9c 100644 --- 
a/cpp/include/raft/spatial/knn/specializations/ball_cover.cuh +++ b/cpp/include/raft/spatial/knn/specializations/ball_cover.cuh @@ -17,7 +17,7 @@ #pragma once #include -#include +#include #include #include diff --git a/cpp/include/raft/spectral/detail/lapack.hpp b/cpp/include/raft/spectral/detail/lapack.hpp index fa9cabf6a3..1bc930baf4 100644 --- a/cpp/include/raft/spectral/detail/lapack.hpp +++ b/cpp/include/raft/spectral/detail/lapack.hpp @@ -17,7 +17,7 @@ #pragma once #include -#include +#include #include #include diff --git a/cpp/include/raft/spectral/detail/matrix_wrappers.hpp b/cpp/include/raft/spectral/detail/matrix_wrappers.hpp index 7fcb912886..40388eea84 100644 --- a/cpp/include/raft/spectral/detail/matrix_wrappers.hpp +++ b/cpp/include/raft/spectral/detail/matrix_wrappers.hpp @@ -15,10 +15,10 @@ */ #pragma once -#include -#include +#include #include #include +#include #include #include diff --git a/cpp/include/raft/spectral/detail/spectral_util.cuh b/cpp/include/raft/spectral/detail/spectral_util.cuh index bb8e94b764..3a0ad1f96f 100644 --- a/cpp/include/raft/spectral/detail/spectral_util.cuh +++ b/cpp/include/raft/spectral/detail/spectral_util.cuh @@ -16,10 +16,10 @@ #pragma once -#include -#include +#include #include #include +#include #include #include diff --git a/cpp/include/raft/spectral/matrix_wrappers.hpp b/cpp/include/raft/spectral/matrix_wrappers.hpp index 952dac0715..1081d1a340 100644 --- a/cpp/include/raft/spectral/matrix_wrappers.hpp +++ b/cpp/include/raft/spectral/matrix_wrappers.hpp @@ -30,24 +30,19 @@ using size_type = int; // for now; TODO: move it in appropriate header // specifies type of algorithm used // for SpMv: // -using sparse_mv_alg_t = detail::sparse_mv_alg_t; +using detail::sparse_mv_alg_t; // Vector "view"-like aggregate for linear algebra purposes // -template -using vector_view_t = detail::vector_view_t; +using detail::vector_view_t; -template -using vector_t = detail::vector_t; +using detail::vector_t; -template -using 
sparse_matrix_t = detail::sparse_matrix_t; +using detail::sparse_matrix_t; -template -using laplacian_matrix_t = detail::laplacian_matrix_t; +using detail::laplacian_matrix_t; -template -using modularity_matrix_t = detail::modularity_matrix_t; +using detail::modularity_matrix_t; } // namespace matrix } // namespace spectral diff --git a/cpp/include/raft/stats/common.hpp b/cpp/include/raft/stats/common.hpp index da3f44a0fa..8392bd50fe 100644 --- a/cpp/include/raft/stats/common.hpp +++ b/cpp/include/raft/stats/common.hpp @@ -16,7 +16,7 @@ #pragma once -#include +#include // This file is a shameless amalgamation of independent works done by // Lars Nyland and Andy Adinets diff --git a/cpp/include/raft/stats/detail/adjusted_rand_index.cuh b/cpp/include/raft/stats/detail/adjusted_rand_index.cuh index ae33b9d1ac..120d4d1686 100644 --- a/cpp/include/raft/stats/detail/adjusted_rand_index.cuh +++ b/cpp/include/raft/stats/detail/adjusted_rand_index.cuh @@ -25,11 +25,11 @@ #include "contingencyMatrix.cuh" #include #include -#include -#include #include #include #include +#include +#include #include #include diff --git a/cpp/include/raft/stats/detail/batched/silhouette_score.cuh b/cpp/include/raft/stats/detail/batched/silhouette_score.cuh index 2f65d873b8..e3b56d2183 100644 --- a/cpp/include/raft/stats/detail/batched/silhouette_score.cuh +++ b/cpp/include/raft/stats/detail/batched/silhouette_score.cuh @@ -17,8 +17,8 @@ #pragma once #include "../silhouette_score.cuh" -#include -#include +#include +#include #include #include #include diff --git a/cpp/include/raft/stats/detail/contingencyMatrix.cuh b/cpp/include/raft/stats/detail/contingencyMatrix.cuh index 86d56a3d98..27dcb96247 100644 --- a/cpp/include/raft/stats/detail/contingencyMatrix.cuh +++ b/cpp/include/raft/stats/detail/contingencyMatrix.cuh @@ -16,8 +16,8 @@ #pragma once -#include -#include +#include +#include #include #include diff --git a/cpp/include/raft/stats/detail/dispersion.cuh 
b/cpp/include/raft/stats/detail/dispersion.cuh index 0c4d25b9aa..bca48045da 100644 --- a/cpp/include/raft/stats/detail/dispersion.cuh +++ b/cpp/include/raft/stats/detail/dispersion.cuh @@ -18,10 +18,10 @@ #include #include -#include -#include #include #include +#include +#include #include namespace raft { diff --git a/cpp/include/raft/stats/detail/entropy.cuh b/cpp/include/raft/stats/detail/entropy.cuh index d36fa1d7ba..fc4fc5fb6b 100644 --- a/cpp/include/raft/stats/detail/entropy.cuh +++ b/cpp/include/raft/stats/detail/entropy.cuh @@ -22,10 +22,10 @@ #pragma once #include #include -#include -#include #include #include +#include +#include #include #include diff --git a/cpp/include/raft/stats/detail/histogram.cuh b/cpp/include/raft/stats/detail/histogram.cuh index 65241f524f..54fe683b77 100644 --- a/cpp/include/raft/stats/detail/histogram.cuh +++ b/cpp/include/raft/stats/detail/histogram.cuh @@ -17,10 +17,10 @@ #pragma once #include -#include -#include #include -#include +#include +#include +#include #include // This file is a shameless amalgamation of independent works done by diff --git a/cpp/include/raft/stats/detail/kl_divergence.cuh b/cpp/include/raft/stats/detail/kl_divergence.cuh index 1a95aff531..d396d95206 100644 --- a/cpp/include/raft/stats/detail/kl_divergence.cuh +++ b/cpp/include/raft/stats/detail/kl_divergence.cuh @@ -22,9 +22,9 @@ #pragma once #include -#include -#include #include +#include +#include #include namespace raft { diff --git a/cpp/include/raft/stats/detail/mean.cuh b/cpp/include/raft/stats/detail/mean.cuh index a55b7b4cd1..49532e1c82 100644 --- a/cpp/include/raft/stats/detail/mean.cuh +++ b/cpp/include/raft/stats/detail/mean.cuh @@ -16,8 +16,8 @@ #pragma once -#include #include +#include #include diff --git a/cpp/include/raft/stats/detail/mean_center.cuh b/cpp/include/raft/stats/detail/mean_center.cuh index 1a4fc20c51..61017511b1 100644 --- a/cpp/include/raft/stats/detail/mean_center.cuh +++ b/cpp/include/raft/stats/detail/mean_center.cuh 
@@ -16,9 +16,9 @@ #pragma once -#include #include -#include +#include +#include namespace raft { namespace stats { diff --git a/cpp/include/raft/stats/detail/meanvar.cuh b/cpp/include/raft/stats/detail/meanvar.cuh index 1d4e1f95bd..a5cb315678 100644 --- a/cpp/include/raft/stats/detail/meanvar.cuh +++ b/cpp/include/raft/stats/detail/meanvar.cuh @@ -16,8 +16,8 @@ #pragma once -#include #include +#include namespace raft::stats::detail { diff --git a/cpp/include/raft/stats/detail/minmax.cuh b/cpp/include/raft/stats/detail/minmax.cuh index 2a4a9bff93..1ccd725189 100644 --- a/cpp/include/raft/stats/detail/minmax.cuh +++ b/cpp/include/raft/stats/detail/minmax.cuh @@ -16,8 +16,8 @@ #pragma once -#include -#include +#include +#include #include diff --git a/cpp/include/raft/stats/detail/mutual_info_score.cuh b/cpp/include/raft/stats/detail/mutual_info_score.cuh index c730ac0362..fb454ee6ad 100644 --- a/cpp/include/raft/stats/detail/mutual_info_score.cuh +++ b/cpp/include/raft/stats/detail/mutual_info_score.cuh @@ -27,11 +27,11 @@ #include #include -#include -#include #include #include #include +#include +#include #include #include diff --git a/cpp/include/raft/stats/detail/rand_index.cuh b/cpp/include/raft/stats/detail/rand_index.cuh index 19f8e56121..a827427d8f 100644 --- a/cpp/include/raft/stats/detail/rand_index.cuh +++ b/cpp/include/raft/stats/detail/rand_index.cuh @@ -54,9 +54,9 @@ #include #include -#include -#include #include +#include +#include #include namespace raft { diff --git a/cpp/include/raft/stats/detail/scores.cuh b/cpp/include/raft/stats/detail/scores.cuh index 85fd8290b3..6bad1f9159 100644 --- a/cpp/include/raft/stats/detail/scores.cuh +++ b/cpp/include/raft/stats/detail/scores.cuh @@ -17,13 +17,13 @@ #pragma once #include -#include #include #include #include #include #include #include +#include #include #include #include diff --git a/cpp/include/raft/stats/detail/silhouette_score.cuh b/cpp/include/raft/stats/detail/silhouette_score.cuh index 
aa100f7299..f2e138ed6f 100644 --- a/cpp/include/raft/stats/detail/silhouette_score.cuh +++ b/cpp/include/raft/stats/detail/silhouette_score.cuh @@ -21,15 +21,15 @@ #include #include #include -#include #include -#include +#include #include #include #include #include #include #include +#include #include namespace raft { diff --git a/cpp/include/raft/stats/detail/stddev.cuh b/cpp/include/raft/stats/detail/stddev.cuh index b9149b5a9f..ccea2ea5da 100644 --- a/cpp/include/raft/stats/detail/stddev.cuh +++ b/cpp/include/raft/stats/detail/stddev.cuh @@ -16,8 +16,8 @@ #pragma once -#include #include +#include #include diff --git a/cpp/include/raft/stats/detail/sum.cuh b/cpp/include/raft/stats/detail/sum.cuh index 3652a852de..b6d5b8a30d 100644 --- a/cpp/include/raft/stats/detail/sum.cuh +++ b/cpp/include/raft/stats/detail/sum.cuh @@ -16,8 +16,8 @@ #pragma once -#include #include +#include #include diff --git a/cpp/include/raft/stats/detail/weighted_mean.cuh b/cpp/include/raft/stats/detail/weighted_mean.cuh index 9c17d2ed0f..e8f85b4af3 100644 --- a/cpp/include/raft/stats/detail/weighted_mean.cuh +++ b/cpp/include/raft/stats/detail/weighted_mean.cuh @@ -16,9 +16,9 @@ #pragma once -#include #include #include +#include namespace raft { namespace stats { diff --git a/cpp/include/raft/stats/mean.cuh b/cpp/include/raft/stats/mean.cuh index eed3159d5d..976b58c048 100644 --- a/cpp/include/raft/stats/mean.cuh +++ b/cpp/include/raft/stats/mean.cuh @@ -21,7 +21,7 @@ #include "detail/mean.cuh" -#include +#include namespace raft { namespace stats { diff --git a/cpp/include/raft/stats/minmax.cuh b/cpp/include/raft/stats/minmax.cuh index 62533b1a00..431d06ec6f 100644 --- a/cpp/include/raft/stats/minmax.cuh +++ b/cpp/include/raft/stats/minmax.cuh @@ -18,9 +18,9 @@ #pragma once -#include -#include #include +#include +#include #include diff --git a/cpp/include/raft/stats/stddev.cuh b/cpp/include/raft/stats/stddev.cuh index 72df090939..3fc41ebc8c 100644 --- a/cpp/include/raft/stats/stddev.cuh 
+++ b/cpp/include/raft/stats/stddev.cuh @@ -20,7 +20,7 @@ #include "detail/stddev.cuh" -#include +#include namespace raft { namespace stats { diff --git a/cpp/include/raft/stats/sum.cuh b/cpp/include/raft/stats/sum.cuh index 2e07e9aafa..89135dd076 100644 --- a/cpp/include/raft/stats/sum.cuh +++ b/cpp/include/raft/stats/sum.cuh @@ -21,7 +21,7 @@ #include "detail/sum.cuh" -#include +#include namespace raft { namespace stats { diff --git a/cpp/include/raft/util/cache_util.cuh b/cpp/include/raft/util/cache_util.cuh new file mode 100644 index 0000000000..2d6f49eb19 --- /dev/null +++ b/cpp/include/raft/util/cache_util.cuh @@ -0,0 +1,368 @@ +/* + * Copyright (c) 2019-2022, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#pragma once + +#include +#include + +namespace raft { +namespace cache { + +/** + * @brief Collect vectors of data from the cache into a contiguous memory buffer. + * + * We assume contiguous memory layout for the output buffer, i.e. we get + * column vectors into a column major out buffer, or row vectors into a row + * major output buffer. + * + * On exit, the output array is filled the following way: + * out[i + n_vec*k] = cache[i + n_vec * cache_idx[k]]), where i=0..n_vec-1, and + * k = 0..n-1 where cache_idx[k] >= 0 + * + * We ignore vectors where cache_idx[k] < 0. 
+ * + * @param [in] cache stores the cached data, size [n_vec x n_cached_vectors] + * @param [in] n_vec number of elements in a cached vector + * @param [in] cache_idx cache indices, size [n] + * @param [in] n the number of elements that need to be collected + * @param [out] out vectors collected from the cache, size [n_vec * n] + */ +template +__global__ void get_vecs( + const math_t* cache, int_t n_vec, const idx_t* cache_idx, int_t n, math_t* out) +{ + int tid = threadIdx.x + blockIdx.x * blockDim.x; + int row = tid % n_vec; // row idx + if (tid < n_vec * n) { + size_t out_col = tid / n_vec; // col idx + size_t cache_col = cache_idx[out_col]; + if (cache_idx[out_col] >= 0) { + if (row + out_col * n_vec < (size_t)n_vec * n) { out[tid] = cache[row + cache_col * n_vec]; } + } + } +} + +/** + * @brief Store vectors of data into the cache. + * + * Elements within a vector should be contiguous in memory (i.e. column vectors + * for column major data storage, or row vectors of row major data). + * + * If tile_idx==nullptr then the operation is the opposite of get_vecs, + * i.e. we store + * cache[i + cache_idx[k]*n_vec] = tile[i + k*n_vec], for i=0..n_vec-1, k=0..n-1 + * + * If tile_idx != nullptr, then we permute the vectors from tile according + * to tile_idx. 
This allows storing vectors from a buffer where the individual + * vectors are not stored contiguously (but the elements of each vector shall + * be contiguous): + * cache[i + cache_idx[k]*n_vec] = tile[i + tile_idx[k]*n_vec], + * for i=0..n_vec-1, k=0..n-1 + * + * @param [in] tile stores the data to be cached, size [n_vec x n_tile] + * @param [in] n_tile number of vectors in the input tile + * @param [in] n_vec number of elements in a cached vector + * @param [in] tile_idx indices of vectors that we want to store + * @param [in] n number of vectors that we want to store (n <= n_tile) + * @param [in] cache_idx cache indices, size [n], negative values are ignored + * @param [inout] cache updated cache + * @param [in] n_cache_vecs + */ +template +__global__ void store_vecs(const math_t* tile, + int n_tile, + int n_vec, + const int* tile_idx, + int n, + const int* cache_idx, + math_t* cache, + int n_cache_vecs) +{ + int tid = threadIdx.x + blockIdx.x * blockDim.x; + int row = tid % n_vec; // row idx + if (tid < n_vec * n) { + int tile_col = tid / n_vec; // col idx + int data_col = tile_idx ? tile_idx[tile_col] : tile_col; + int cache_col = cache_idx[tile_col]; + + // We ignore negative values. The rest of the checks should be fulfilled + // if the cache is used properly + if (cache_col >= 0 && cache_col < n_cache_vecs && data_col < n_tile) { + cache[row + (size_t)cache_col * n_vec] = tile[row + (size_t)data_col * n_vec]; + } + } +} + +/** + * @brief Map a key to a cache set. + * + * @param key key to be hashed + * @param n_cache_sets number of cache sets + * @return index of the cache set [0..n_cache_set) + */ +int DI hash(int key, int n_cache_sets) { return key % n_cache_sets; } + +/** + * @brief Binary search to find the first element in the array which is greater + * than or equal to a given value.
+ * @param [in] array sorted array of n numbers + * @param [in] n length of the array + * @param [in] val the value to search for + * @return the index of the first element in the array for which + * array[idx] >= value. If there is no such value, then return n. + */ +int DI arg_first_ge(const int* array, int n, int val) +{ + int start = 0; + int end = n - 1; + if (array[0] == val) return 0; + if (array[end] < val) return n; + while (start + 1 < end) { + int q = (start + end + 1) / 2; + // invariants: + // start < end + // start < q <=end + // array[start] < val && array[end] <=val + // at every iteration d = end-start is decreasing + // when d==0, then array[end] will be the first element >= val. + if (array[q] >= val) { + end = q; + } else { + start = q; + } + } + return end; +} +/** + * @brief Find the k-th occurrence of value in a sorted array. + * + * Assume that array is [0, 1, 1, 1, 2, 2, 4, 4, 4, 4, 6, 7] + * then find_nth_occurrence(cset, 12, 4, 2) == 7, because cset_array[7] stores + * the second element with value = 4. + * If there are fewer than k values in the array, then return -1 + * + * @param [in] array sorted array of numbers, size [n] + * @param [in] n number of elements in the array + * @param [in] val the value we are searching for + * @param [in] k + * @return the idx of the k-th occurrence of val in array, or -1 if + * the value is not found. + */ +int DI find_nth_occurrence(const int* array, int n, int val, int k) +{ + int q = arg_first_ge(array, n, val); + if (q + k < n && array[q + k] == val) { + q += k; + } else { + q = -1; + } + return q; +} + +/** + * @brief Rank the entries in a cache set according to the time stamp, return + * the indices that would sort the time stamp in ascending order.
+ * + * Assume we have a single cache set with time stamps as: + * key (threadIdx.x): 0 1 2 3 + * val (time stamp): 8 6 7 5 + * + * The corresponding sorted key-value pairs: + * key: 3 1 2 0 + * val: 5 6 7 8 + * rank: 0th 1st 2nd 3rd + * + * On return, the rank is assigned for each thread: + * threadIdx.x: 0 1 2 3 + * rank: 3 1 2 0 + * + * For multiple cache sets, launch one block per cache set. + * + * @tparam nthreads number of threads per block (nthreads <= associativity) + * @tparam associativity number of items in a cache set + * + * @param [in] cache_time time stamp of caching the data, + size [associativity * n_cache_sets] + * @param [in] n_cache_sets number of cache sets + * @param [out] rank within the cache set size [nthreads * items_per_thread] + * Each block should give a different pointer for rank. + */ +template +DI void rank_set_entries(const int* cache_time, int n_cache_sets, int* rank) +{ + const int items_per_thread = raft::ceildiv(associativity, nthreads); + typedef cub::BlockRadixSort BlockRadixSort; + __shared__ typename BlockRadixSort::TempStorage temp_storage; + + int key[items_per_thread]; + int val[items_per_thread]; + + int block_offset = blockIdx.x * associativity; + + for (int j = 0; j < items_per_thread; j++) { + int k = threadIdx.x + j * nthreads; + int t = (k < associativity) ? cache_time[block_offset + k] : 32768; + key[j] = t; + val[j] = k; + } + + BlockRadixSort(temp_storage).Sort(key, val); + + for (int j = 0; j < items_per_thread; j++) { + if (val[j] < associativity) { rank[val[j]] = threadIdx.x * items_per_thread + j; } + } + __syncthreads(); +} + +/** + * @brief Assign cache location to a set of keys using LRU replacement policy. + * + * The keys and the corresponding cache_set arrays shall be sorted according + * to cache_set in ascending order. One block should be launched for every cache + * set. 
+ * + * Each cache set is sorted according to time_stamp, and values from keys + * are filled in starting at the oldest time stamp. Entries that were accessed + * at the current time are not reassigned. + * + * @tparam nthreads number of threads per block + * @tparam associativity number of keys in a cache set + * + * @param [in] keys that we want to cache size [n] + * @param [in] n number of keys + * @param [in] cache_set assigned to keys, size [n] + * @param [inout] cached_keys keys of already cached vectors, + * size [n_cache_sets*associativity], on exit it will be updated with the + * cached elements from keys. + * @param [in] n_cache_sets number of cache sets + * @param [inout] cache_time will be updated to "time" for those elements that + * could be assigned to a cache location, size [n_cache_sets*associativity] + * @param [in] time time stamp + * @param [out] cache_idx the cache idx assigned to the input, or -1 if it could + * not be cached, size [n] + */ +template +__global__ void assign_cache_idx(const int* keys, + int n, + const int* cache_set, + int* cached_keys, + int n_cache_sets, + int* cache_time, + int time, + int* cache_idx) +{ + int block_offset = blockIdx.x * associativity; + + const int items_per_thread = raft::ceildiv(associativity, nthreads); + + // the size of rank limits how large associativity can be used in practice + __shared__ int rank[items_per_thread * nthreads]; + rank_set_entries(cache_time, n_cache_sets, rank); + + // Each thread will fill items_per_thread items in the cache. + // It uses a place, only if it was not updated at the current time step + // (cache_time != time). + // We rank the places according to the time stamp, least recently used + // elements come to the front. + // We fill the least recently used elements with the working set. + // there might be elements which cannot be assigned to cache loc. + // these elements are assigned -1. 
+ + for (int j = 0; j < items_per_thread; j++) { + int i = threadIdx.x + j * nthreads; + int t_idx = block_offset + i; + bool mask = (i < associativity); + // whether this slot is available for writing + mask = mask && (cache_time[t_idx] != time); + + // rank[i] tells which element to store by this thread + // we look up where is the corresponding key stored in the input array + if (mask) { + int k = find_nth_occurrence(cache_set, n, blockIdx.x, rank[i]); + if (k > -1) { + int key_val = keys[k]; + cached_keys[t_idx] = key_val; + cache_idx[k] = t_idx; + cache_time[t_idx] = time; + } + } + } +} + +/* Unnamed namespace is used to avoid multiple definition error for the + following non-template function */ +namespace { +/** + * @brief Get the cache indices for keys stored in the cache. + * + * For every key, we look up the corresponding cache position. + * If keys[k] is stored in the cache, then is_cached[k] is set to true, and + * cache_idx[k] stores the corresponding cache idx. + * + * If keys[k] is not stored in the cache, then we assign a cache set to it. + * This cache set is stored in cache_idx[k], and is_cached[k] is set to false. + * In this case AssignCacheIdx should be called, to get an assigned position + * within the cache set. + * + * Cache_time is assigned to the time input argument for all elements in idx. 
+ * + * @param [in] keys array of keys that we want to look up in the cache, size [n] + * @param [in] n number of keys to look up + * @param [inout] cached_keys keys stored in the cache, size [n_cache_sets * associativity] + * @param [in] n_cache_sets number of cache sets + * @param [in] associativity number of keys in cache set + * @param [inout] cache_time time stamp when the indices were cached, size [n_cache_sets * + * associativity] + * @param [out] cache_idx cache indices of the working set elements, size [n] + * @param [out] is_cached whether the element is cached size[n] + * @param [in] time iteration counter (used for time stamping) + */ +__global__ void get_cache_idx(int* keys, + int n, + int* cached_keys, + int n_cache_sets, + int associativity, + int* cache_time, + int* cache_idx, + bool* is_cached, + int time) +{ + int tid = threadIdx.x + blockIdx.x * blockDim.x; + if (tid < n) { + int widx = keys[tid]; + int sidx = hash(widx, n_cache_sets); + int cidx = sidx * associativity; + int i = 0; + bool found = false; + // search for empty spot and the least recently used spot + while (i < associativity && !found) { + found = (cache_time[cidx + i] > 0 && cached_keys[cidx + i] == widx); + i++; + } + is_cached[tid] = found; + if (found) { + cidx = cidx + i - 1; + cache_time[cidx] = time; // update time stamp + cache_idx[tid] = cidx; // exact cache idx + } else { + cache_idx[tid] = sidx; // assign cache set + } + } +} +}; // end unnamed namespace +}; // namespace cache +}; // namespace raft diff --git a/cpp/include/raft/util/cuda_utils.cuh b/cpp/include/raft/util/cuda_utils.cuh new file mode 100644 index 0000000000..1d1c82eb94 --- /dev/null +++ b/cpp/include/raft/util/cuda_utils.cuh @@ -0,0 +1,794 @@ +/* + * Copyright (c) 2018-2022, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. 
+ * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#pragma once + +#include +#include + +#include + +#ifndef ENABLE_MEMCPY_ASYNC +// enable memcpy_async interface by default for newer GPUs +#if __CUDA_ARCH__ >= 800 +#define ENABLE_MEMCPY_ASYNC 1 +#endif +#else // ENABLE_MEMCPY_ASYNC +// disable memcpy_async for all older GPUs +#if __CUDA_ARCH__ < 800 +#define ENABLE_MEMCPY_ASYNC 0 +#endif +#endif // ENABLE_MEMCPY_ASYNC + +namespace raft { + +/** helper macro for device inlined functions */ +#define DI inline __device__ +#define HDI inline __host__ __device__ +#define HD __host__ __device__ + +/** + * @brief Provide a ceiling division operation ie. ceil(a / b) + * @tparam IntType supposed to be only integers for now! + */ +template +constexpr HDI IntType ceildiv(IntType a, IntType b) +{ + return (a + b - 1) / b; +} + +/** + * @brief Provide an alignment function ie. ceil(a / b) * b + * @tparam IntType supposed to be only integers for now! + */ +template +constexpr HDI IntType alignTo(IntType a, IntType b) +{ + return ceildiv(a, b) * b; +} + +/** + * @brief Provide an alignment function ie. (a / b) * b + * @tparam IntType supposed to be only integers for now! 
+ */ +template +constexpr HDI IntType alignDown(IntType a, IntType b) +{ + return (a / b) * b; +} + +/** + * @brief Check if the input is a power of 2 + * @tparam IntType data type (checked only for integers) + */ +template +constexpr HDI bool isPo2(IntType num) +{ + return (num && !(num & (num - 1))); +} + +/** + * @brief Give logarithm of the number to base-2 + * @tparam IntType data type (checked only for integers) + */ +template +constexpr HDI IntType log2(IntType num, IntType ret = IntType(0)) +{ + return num <= IntType(1) ? ret : log2(num >> IntType(1), ++ret); +} + +/** Device function to apply the input lambda across threads in the grid */ +template +DI void forEach(int num, L lambda) +{ + int idx = (blockDim.x * blockIdx.x) + threadIdx.x; + const int numThreads = blockDim.x * gridDim.x; +#pragma unroll + for (int itr = 0; itr < ItemsPerThread; ++itr, idx += numThreads) { + if (idx < num) lambda(idx, itr); + } +} + +/** number of threads per warp */ +static const int WarpSize = 32; + +/** get the laneId of the current thread */ +DI int laneId() +{ + int id; + asm("mov.s32 %0, %%laneid;" : "=r"(id)); + return id; +} + +/** + * @brief Swap two values + * @tparam T the datatype of the values + * @param a first input + * @param b second input + */ +template +HDI void swapVals(T& a, T& b) +{ + T tmp = a; + a = b; + b = tmp; +} + +/** Device function to have atomic add support for older archs */ +template +DI void myAtomicAdd(Type* address, Type val) +{ + atomicAdd(address, val); +} + +#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ < 600) +// Ref: +// http://on-demand.gputechconf.com/gtc/2013/presentations/S3101-Atomic-Memory-Operations.pdf +template <> +DI void myAtomicAdd(double* address, double val) +{ + unsigned long long int* address_as_ull = (unsigned long long int*)address; + unsigned long long int old = *address_as_ull, assumed; + do { + assumed = old; + old = + atomicCAS(address_as_ull, assumed, __double_as_longlong(val + __longlong_as_double(assumed))); + 
} while (assumed != old); +} +#endif + +template +DI void myAtomicReduce(T* address, T val, ReduceLambda op); + +template +DI void myAtomicReduce(double* address, double val, ReduceLambda op) +{ + unsigned long long int* address_as_ull = (unsigned long long int*)address; + unsigned long long int old = *address_as_ull, assumed; + do { + assumed = old; + old = atomicCAS( + address_as_ull, assumed, __double_as_longlong(op(val, __longlong_as_double(assumed)))); + } while (assumed != old); +} + +template +DI void myAtomicReduce(float* address, float val, ReduceLambda op) +{ + unsigned int* address_as_uint = (unsigned int*)address; + unsigned int old = *address_as_uint, assumed; + do { + assumed = old; + old = atomicCAS(address_as_uint, assumed, __float_as_uint(op(val, __uint_as_float(assumed)))); + } while (assumed != old); +} + +template +DI void myAtomicReduce(int* address, int val, ReduceLambda op) +{ + int old = *address, assumed; + do { + assumed = old; + old = atomicCAS(address, assumed, op(val, assumed)); + } while (assumed != old); +} + +template +DI void myAtomicReduce(long long* address, long long val, ReduceLambda op) +{ + long long old = *address, assumed; + do { + assumed = old; + old = atomicCAS(address, assumed, op(val, assumed)); + } while (assumed != old); +} + +template +DI void myAtomicReduce(unsigned long long* address, unsigned long long val, ReduceLambda op) +{ + unsigned long long old = *address, assumed; + do { + assumed = old; + old = atomicCAS(address, assumed, op(val, assumed)); + } while (assumed != old); +} + +/** + * @brief Provide atomic min operation. + * @tparam T: data type for input data (float or double). + * @param[in] address: address to read old value from, and to atomically update w/ min(old value, + * val) + * @param[in] val: new value to compare with old + */ +template +DI T myAtomicMin(T* address, T val); + +/** + * @brief Provide atomic max operation. + * @tparam T: data type for input data (float or double). 
+ * @param[in] address: address to read old value from, and to atomically update w/ max(old value, + * val) + * @param[in] val: new value to compare with old + */ +template +DI T myAtomicMax(T* address, T val); + +DI float myAtomicMin(float* address, float val) +{ + myAtomicReduce(address, val, fminf); + return *address; +} + +DI float myAtomicMax(float* address, float val) +{ + myAtomicReduce(address, val, fmaxf); + return *address; +} + +DI double myAtomicMin(double* address, double val) +{ + myAtomicReduce(address, val, fmin); + return *address; +} + +DI double myAtomicMax(double* address, double val) +{ + myAtomicReduce(address, val, fmax); + return *address; +} + +/** + * @defgroup Max maximum of two numbers + * @{ + */ +template +HDI T myMax(T x, T y); +template <> +HDI float myMax(float x, float y) +{ + return fmaxf(x, y); +} +template <> +HDI double myMax(double x, double y) +{ + return fmax(x, y); +} +/** @} */ + +/** + * @defgroup Min minimum of two numbers + * @{ + */ +template +HDI T myMin(T x, T y); +template <> +HDI float myMin(float x, float y) +{ + return fminf(x, y); +} +template <> +HDI double myMin(double x, double y) +{ + return fmin(x, y); +} +/** @} */ + +/** + * @brief Provide atomic min operation. + * @tparam T: data type for input data (float or double). + * @param[in] address: address to read old value from, and to atomically update w/ min(old value, + * val) + * @param[in] val: new value to compare with old + */ +template +DI T myAtomicMin(T* address, T val) +{ + myAtomicReduce(address, val, myMin); + return *address; +} + +/** + * @brief Provide atomic max operation. + * @tparam T: data type for input data (float or double). 
+ * @param[in] address: address to read old value from, and to atomically update w/ max(old value, + * val) + * @param[in] val: new value to compare with old + */ +template +DI T myAtomicMax(T* address, T val) +{ + myAtomicReduce(address, val, myMax); + return *address; +} + +/** + * Sign function + */ +template +HDI int sgn(const T val) +{ + return (T(0) < val) - (val < T(0)); +} + +/** + * @defgroup Exp Exponential function + * @{ + */ +template +HDI T myExp(T x); +template <> +HDI float myExp(float x) +{ + return expf(x); +} +template <> +HDI double myExp(double x) +{ + return exp(x); +} +/** @} */ + +/** + * @defgroup Cuda infinity values + * @{ + */ +template +inline __device__ T myInf(); +template <> +inline __device__ float myInf() +{ + return CUDART_INF_F; +} +template <> +inline __device__ double myInf() +{ + return CUDART_INF; +} +/** @} */ + +/** + * @defgroup Log Natural logarithm + * @{ + */ +template +HDI T myLog(T x); +template <> +HDI float myLog(float x) +{ + return logf(x); +} +template <> +HDI double myLog(double x) +{ + return log(x); +} +/** @} */ + +/** + * @defgroup Sqrt Square root + * @{ + */ +template +HDI T mySqrt(T x); +template <> +HDI float mySqrt(float x) +{ + return sqrtf(x); +} +template <> +HDI double mySqrt(double x) +{ + return sqrt(x); +} +/** @} */ + +/** + * @defgroup SineCosine Sine and cosine calculation + * @{ + */ +template +DI void mySinCos(T x, T& s, T& c); +template <> +DI void mySinCos(float x, float& s, float& c) +{ + sincosf(x, &s, &c); +} +template <> +DI void mySinCos(double x, double& s, double& c) +{ + sincos(x, &s, &c); +} +/** @} */ + +/** + * @defgroup Sine Sine calculation + * @{ + */ +template +DI T mySin(T x); +template <> +DI float mySin(float x) +{ + return sinf(x); +} +template <> +DI double mySin(double x) +{ + return sin(x); +} +/** @} */ + +/** + * @defgroup Abs Absolute value + * @{ + */ +template +DI T myAbs(T x) +{ + return x < 0 ? 
-x : x; +} +template <> +DI float myAbs(float x) +{ + return fabsf(x); +} +template <> +DI double myAbs(double x) +{ + return fabs(x); +} +/** @} */ + +/** + * @defgroup Pow Power function + * @{ + */ +template +HDI T myPow(T x, T power); +template <> +HDI float myPow(float x, float power) +{ + return powf(x, power); +} +template <> +HDI double myPow(double x, double power) +{ + return pow(x, power); +} +/** @} */ + +/** + * @defgroup myTanh tanh function + * @{ + */ +template +HDI T myTanh(T x); +template <> +HDI float myTanh(float x) +{ + return tanhf(x); +} +template <> +HDI double myTanh(double x) +{ + return tanh(x); +} +/** @} */ + +/** + * @defgroup myATanh arctanh function + * @{ + */ +template +HDI T myATanh(T x); +template <> +HDI float myATanh(float x) +{ + return atanhf(x); +} +template <> +HDI double myATanh(double x) +{ + return atanh(x); +} +/** @} */ + +/** + * @defgroup LambdaOps Lambda operations in reduction kernels + * @{ + */ +// IdxType mostly to be used for MainLambda in *Reduction kernels +template +struct Nop { + HDI Type operator()(Type in, IdxType i = 0) { return in; } +}; + +template +struct L1Op { + HDI Type operator()(Type in, IdxType i = 0) { return myAbs(in); } +}; + +template +struct L2Op { + HDI Type operator()(Type in, IdxType i = 0) { return in * in; } +}; + +template +struct Sum { + HDI Type operator()(Type a, Type b) { return a + b; } +}; +/** @} */ + +/** + * @defgroup Sign Obtain sign value + * @brief Obtain sign of x + * @param x input + * @return +1 if x >= 0 and -1 otherwise + * @{ + */ +template +DI T signPrim(T x) +{ + return x < 0 ? -1 : +1; +} +template <> +DI float signPrim(float x) +{ + return signbit(x) == true ? -1.0f : +1.0f; +} +template <> +DI double signPrim(double x) +{ + return signbit(x) == true ? 
-1.0 : +1.0; +} +/** @} */ + +/** + * @defgroup Max maximum of two numbers + * @brief Obtain maximum of two values + * @param x one item + * @param y second item + * @return maximum of two items + * @{ + */ +template +DI T maxPrim(T x, T y) +{ + return x > y ? x : y; +} +template <> +DI float maxPrim(float x, float y) +{ + return fmaxf(x, y); +} +template <> +DI double maxPrim(double x, double y) +{ + return fmax(x, y); +} +/** @} */ + +/** apply a warp-wide fence (useful from Volta+ archs) */ +DI void warpFence() +{ +#if __CUDA_ARCH__ >= 700 + __syncwarp(); +#endif +} + +/** warp-wide any boolean aggregator */ +DI bool any(bool inFlag, uint32_t mask = 0xffffffffu) +{ +#if CUDART_VERSION >= 9000 + inFlag = __any_sync(mask, inFlag); +#else + inFlag = __any(inFlag); +#endif + return inFlag; +} + +/** warp-wide all boolean aggregator */ +DI bool all(bool inFlag, uint32_t mask = 0xffffffffu) +{ +#if CUDART_VERSION >= 9000 + inFlag = __all_sync(mask, inFlag); +#else + inFlag = __all(inFlag); +#endif + return inFlag; +} + +/** + * @brief Shuffle the data inside a warp + * @tparam T the data type (currently assumed to be 4B) + * @param val value to be shuffled + * @param srcLane lane from where to shuffle + * @param width lane width + * @param mask mask of participating threads (Volta+) + * @return the shuffled data + */ +template +DI T shfl(T val, int srcLane, int width = WarpSize, uint32_t mask = 0xffffffffu) +{ +#if CUDART_VERSION >= 9000 + return __shfl_sync(mask, val, srcLane, width); +#else + return __shfl(val, srcLane, width); +#endif +} + +/** + * @brief Shuffle the data inside a warp from lower lane IDs + * @tparam T the data type (currently assumed to be 4B) + * @param val value to be shuffled + * @param delta lower lane ID delta from where to shuffle + * @param width lane width + * @param mask mask of participating threads (Volta+) + * @return the shuffled data + */ +template +DI T shfl_up(T val, int delta, int width = WarpSize, uint32_t mask = 0xffffffffu) +{ 
+#if CUDART_VERSION >= 9000 + return __shfl_up_sync(mask, val, delta, width); +#else + return __shfl_up(val, delta, width); +#endif +} + +/** + * @brief Shuffle the data inside a warp + * @tparam T the data type (currently assumed to be 4B) + * @param val value to be shuffled + * @param laneMask mask to be applied in order to perform xor shuffle + * @param width lane width + * @param mask mask of participating threads (Volta+) + * @return the shuffled data + */ +template +DI T shfl_xor(T val, int laneMask, int width = WarpSize, uint32_t mask = 0xffffffffu) +{ +#if CUDART_VERSION >= 9000 + return __shfl_xor_sync(mask, val, laneMask, width); +#else + return __shfl_xor(val, laneMask, width); +#endif +} + +/** + * @brief Four-way byte dot product-accumulate. + * @tparam T Four-byte integer: int or unsigned int + * @tparam S Either same as T or a 4-byte vector of the same signedness. + * + * @param a + * @param b + * @param c + * @return dot(a, b) + c + */ +template +DI auto dp4a(S a, S b, T c) -> T; + +template <> +DI auto dp4a(char4 a, char4 b, int c) -> int +{ +#if __CUDA_ARCH__ >= 610 + return __dp4a(a, b, c); +#else + c += static_cast(a.x) * static_cast(b.x); + c += static_cast(a.y) * static_cast(b.y); + c += static_cast(a.z) * static_cast(b.z); + c += static_cast(a.w) * static_cast(b.w); + return c; +#endif +} + +template <> +DI auto dp4a(uchar4 a, uchar4 b, unsigned int c) -> unsigned int +{ +#if __CUDA_ARCH__ >= 610 + return __dp4a(a, b, c); +#else + c += static_cast(a.x) * static_cast(b.x); + c += static_cast(a.y) * static_cast(b.y); + c += static_cast(a.z) * static_cast(b.z); + c += static_cast(a.w) * static_cast(b.w); + return c; +#endif +} + +template <> +DI auto dp4a(int a, int b, int c) -> int +{ +#if __CUDA_ARCH__ >= 610 + return __dp4a(a, b, c); +#else + return dp4a(*reinterpret_cast(&a), *reinterpret_cast(&b), c); +#endif +} + +template <> +DI auto dp4a(unsigned int a, unsigned int b, unsigned int c) -> unsigned int +{ +#if __CUDA_ARCH__ >= 610 + return 
__dp4a(a, b, c); +#else + return dp4a(*reinterpret_cast<uchar4*>(&a), *reinterpret_cast<uchar4*>(&b), c); +#endif +} + +/** + * @brief Warp-level sum reduction + * @param val input value + * @tparam T Value type to be reduced + * @return Reduction result. All lanes will have the valid result. + * @note Why not cub? Because cub doesn't seem to allow working with arbitrary + * number of warps in a block. All threads in the warp must enter this + * function together + * @todo Expand this to support arbitrary reduction ops + */ +template <typename T> +DI T warpReduce(T val) +{ +#pragma unroll + for (int i = WarpSize / 2; i > 0; i >>= 1) { + T tmp = shfl_xor(val, i); + val += tmp; + } + return val; +} + +/** + * @brief 1-D block-level sum reduction + * @param val input value + * @param smem shared memory region needed for storing intermediate results. It + * must be at least of size: `sizeof(T) * nWarps` + * @return only thread 0 will contain the valid reduced result + * @note Why not cub? Because cub doesn't seem to allow working with arbitrary + * number of warps in a block. All threads in the block must enter this + * function together + * @todo Expand this to support arbitrary reduction ops + */ +template <typename T> +DI T blockReduce(T val, char* smem) +{ + auto* sTemp = reinterpret_cast<T*>(smem); + int nWarps = (blockDim.x + WarpSize - 1) / WarpSize; + int lid = laneId(); + int wid = threadIdx.x / WarpSize; + val = warpReduce(val); + if (lid == 0) sTemp[wid] = val; + __syncthreads(); + val = lid < nWarps ? sTemp[lid] : T(0); + return warpReduce(val); +} + +/** + * @brief Simple utility function to determine whether user_stream or one of the + * internal streams should be used.
+ * @param user_stream main user stream + * @param int_streams array of internal streams + * @param n_int_streams number of internal streams + * @param idx the index for which to query the stream + */ +inline cudaStream_t select_stream(cudaStream_t user_stream, + cudaStream_t* int_streams, + int n_int_streams, + int idx) +{ + return n_int_streams > 0 ? int_streams[idx % n_int_streams] : user_stream; +} + +} // namespace raft diff --git a/cpp/include/raft/util/cudart_utils.hpp b/cpp/include/raft/util/cudart_utils.hpp new file mode 100644 index 0000000000..2bff802afa --- /dev/null +++ b/cpp/include/raft/util/cudart_utils.hpp @@ -0,0 +1,499 @@ +/* + * Copyright (c) 2019-2022, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +/** + * This file is deprecated and will be removed in release 22.06. + * Please use raft_runtime/cudart_utils.hpp instead. + */ + +#ifndef __RAFT_RT_CUDART_UTILS_H +#define __RAFT_RT_CUDART_UTILS_H + +#pragma once + +#include +#include +#include +#include +#include + +#include + +#include +#include +#include +#include +#include +#include +#include + +///@todo: enable once logging has been enabled in raft +//#include "logger.hpp" + +namespace raft { + +/** + * @brief Exception thrown when a CUDA error is encountered. 
+ */ +struct cuda_error : public raft::exception { + explicit cuda_error(char const* const message) : raft::exception(message) {} + explicit cuda_error(std::string const& message) : raft::exception(message) {} +}; + +} // namespace raft + +/** + * @brief Error checking macro for CUDA runtime API functions. + * + * Invokes a CUDA runtime API function call, if the call does not return + * cudaSuccess, invokes cudaGetLastError() to clear the error and throws an + * exception detailing the CUDA error that occurred + * + */ +#define RAFT_CUDA_TRY(call) \ + do { \ + cudaError_t const status = call; \ + if (status != cudaSuccess) { \ + cudaGetLastError(); \ + std::string msg{}; \ + SET_ERROR_MSG(msg, \ + "CUDA error encountered at: ", \ + "call='%s', Reason=%s:%s", \ + #call, \ + cudaGetErrorName(status), \ + cudaGetErrorString(status)); \ + throw raft::cuda_error(msg); \ + } \ + } while (0) + +// FIXME: Remove after consumers rename +#ifndef CUDA_TRY +#define CUDA_TRY(call) RAFT_CUDA_TRY(call) +#endif + +/** + * @brief Debug macro to check for CUDA errors + * + * In a non-release build, this macro will synchronize the specified stream + * before error checking. In both release and non-release builds, this macro + * checks for any pending CUDA errors from previous calls. If an error is + * reported, an exception is thrown detailing the CUDA error that occurred. + * + * The intent of this macro is to provide a mechanism for synchronous and + * deterministic execution for debugging asynchronous CUDA execution. It should + * be used after any asynchronous CUDA call, e.g., cudaMemcpyAsync, or an + * asynchronous kernel launch. 
+ */ +#ifndef NDEBUG +#define RAFT_CHECK_CUDA(stream) RAFT_CUDA_TRY(cudaStreamSynchronize(stream)); +#else +#define RAFT_CHECK_CUDA(stream) RAFT_CUDA_TRY(cudaPeekAtLastError()); +#endif + +// FIXME: Remove after consumers rename +#ifndef CHECK_CUDA +#define CHECK_CUDA(call) RAFT_CHECK_CUDA(call) +#endif + +/** FIXME: remove after cuml rename */ +#ifndef CUDA_CHECK +#define CUDA_CHECK(call) RAFT_CUDA_TRY(call) +#endif + +// /** +// * @brief check for cuda runtime API errors but log error instead of raising +// * exception. +// */ +#define RAFT_CUDA_TRY_NO_THROW(call) \ + do { \ + cudaError_t const status = call; \ + if (cudaSuccess != status) { \ + printf("CUDA call='%s' at file=%s line=%d failed with %s\n", \ + #call, \ + __FILE__, \ + __LINE__, \ + cudaGetErrorString(status)); \ + } \ + } while (0) + +// FIXME: Remove after cuml rename +#ifndef CUDA_CHECK_NO_THROW +#define CUDA_CHECK_NO_THROW(call) RAFT_CUDA_TRY_NO_THROW(call) +#endif + +/** + * Alias to raft scope for now. + * TODO: Rename original implementations in 22.04 to fix + * https://github.com/rapidsai/raft/issues/128 + */ + +namespace raft { + +/** Helper method to get to know warp size in device code */ +__host__ __device__ constexpr inline int warp_size() { return 32; } + +__host__ __device__ constexpr inline unsigned int warp_full_mask() { return 0xffffffff; } + +/** + * @brief A kernel grid configuration construction gadget for simple one-dimensional mapping + * elements to threads. 
+ */ +class grid_1d_thread_t { + public: + int const block_size{0}; + int const num_blocks{0}; + + /** + * @param overall_num_elements The number of elements the kernel needs to handle/process + * @param num_threads_per_block The grid block size, determined according to the kernel's + * specific features (amount of shared memory necessary, SM functional units use pattern etc.); + * this can't be determined generically/automatically (as opposed to the number of blocks) + * @param max_num_blocks_1d maximum number of blocks in 1d grid + * @param elements_per_thread Typically, a single kernel thread processes more than a single + * element; this affects the number of threads the grid must contain + */ + grid_1d_thread_t(size_t overall_num_elements, + size_t num_threads_per_block, + size_t max_num_blocks_1d, + size_t elements_per_thread = 1) + : block_size(num_threads_per_block), + num_blocks( + std::min((overall_num_elements + (elements_per_thread * num_threads_per_block) - 1) / + (elements_per_thread * num_threads_per_block), + max_num_blocks_1d)) + { + RAFT_EXPECTS(overall_num_elements > 0, "overall_num_elements must be > 0"); + RAFT_EXPECTS(num_threads_per_block / warp_size() > 0, + "num_threads_per_block / warp_size() must be > 0"); + RAFT_EXPECTS(elements_per_thread > 0, "elements_per_thread must be > 0"); + } +}; + +/** + * @brief A kernel grid configuration construction gadget for simple one-dimensional mapping + * elements to warps. 
+ */ +class grid_1d_warp_t { + public: + int const block_size{0}; + int const num_blocks{0}; + + /** + * @param overall_num_elements The number of elements the kernel needs to handle/process + * @param num_threads_per_block The grid block size, determined according to the kernel's + * specific features (amount of shared memory necessary, SM functional units use pattern etc.); + * this can't be determined generically/automatically (as opposed to the number of blocks) + * @param max_num_blocks_1d maximum number of blocks in 1d grid + */ + grid_1d_warp_t(size_t overall_num_elements, + size_t num_threads_per_block, + size_t max_num_blocks_1d) + : block_size(num_threads_per_block), + num_blocks(std::min((overall_num_elements + (num_threads_per_block / warp_size()) - 1) / + (num_threads_per_block / warp_size()), + max_num_blocks_1d)) + { + RAFT_EXPECTS(overall_num_elements > 0, "overall_num_elements must be > 0"); + RAFT_EXPECTS(num_threads_per_block / warp_size() > 0, + "num_threads_per_block / warp_size() must be > 0"); + } +}; + +/** + * @brief A kernel grid configuration construction gadget for simple one-dimensional mapping + * elements to blocks. 
+ */ +class grid_1d_block_t { + public: + int const block_size{0}; + int const num_blocks{0}; + + /** + * @param overall_num_elements The number of elements the kernel needs to handle/process + * @param num_threads_per_block The grid block size, determined according to the kernel's + * specific features (amount of shared memory necessary, SM functional units use pattern etc.); + * this can't be determined generically/automatically (as opposed to the number of blocks) + * @param max_num_blocks_1d maximum number of blocks in 1d grid + */ + grid_1d_block_t(size_t overall_num_elements, + size_t num_threads_per_block, + size_t max_num_blocks_1d) + : block_size(num_threads_per_block), + num_blocks(std::min(overall_num_elements, max_num_blocks_1d)) + { + RAFT_EXPECTS(overall_num_elements > 0, "overall_num_elements must be > 0"); + RAFT_EXPECTS(num_threads_per_block / warp_size() > 0, + "num_threads_per_block / warp_size() must be > 0"); + } +}; + +/** + * @brief Generic copy method for all kinds of transfers + * @tparam Type data type + * @param dst destination pointer + * @param src source pointer + * @param len length of the src/dst buffers in terms of number of elements + * @param stream cuda stream + */ +template <typename Type> +void copy(Type* dst, const Type* src, size_t len, rmm::cuda_stream_view stream) +{ + CUDA_CHECK(cudaMemcpyAsync(dst, src, len * sizeof(Type), cudaMemcpyDefault, stream)); +} + +/** + * @defgroup Copy Copy methods + * These are here along with the generic 'copy' method in order to improve + * code readability using explicitly specified function names + * @{ + */ +/** performs a host to device copy */ +template <typename Type> +void update_device(Type* d_ptr, const Type* h_ptr, size_t len, rmm::cuda_stream_view stream) +{ + copy(d_ptr, h_ptr, len, stream); +} + +/** performs a device to host copy */ +template <typename Type> +void update_host(Type* h_ptr, const Type* d_ptr, size_t len, rmm::cuda_stream_view stream) +{ + copy(h_ptr, d_ptr, len, stream); +} + +template <typename Type> +void copy_async(Type* 
d_ptr1, const Type* d_ptr2, size_t len, rmm::cuda_stream_view stream) +{ + CUDA_CHECK(cudaMemcpyAsync(d_ptr1, d_ptr2, len * sizeof(Type), cudaMemcpyDeviceToDevice, stream)); +} +/** @} */ + +/** + * @defgroup Debug Utils for debugging host/device buffers + * @{ + */ +template +void print_host_vector(const char* variable_name, + const T* host_mem, + size_t componentsCount, + OutStream& out) +{ + out << variable_name << "=["; + for (size_t i = 0; i < componentsCount; ++i) { + if (i != 0) out << ","; + out << host_mem[i]; + } + out << "];" << std::endl; +} + +template +void print_device_vector(const char* variable_name, + const T* devMem, + size_t componentsCount, + OutStream& out) +{ + auto host_mem = std::make_unique(componentsCount); + CUDA_CHECK( + cudaMemcpy(host_mem.get(), devMem, componentsCount * sizeof(T), cudaMemcpyDeviceToHost)); + print_host_vector(variable_name, host_mem.get(), componentsCount, out); +} + +/** + * @brief Print an array given a device or a host pointer. + * + * @param[in] variable_name + * @param[in] ptr any pointer (device/host/managed, etc) + * @param[in] componentsCount array length + * @param out the output stream + */ +template +void print_vector(const char* variable_name, const T* ptr, size_t componentsCount, OutStream& out) +{ + cudaPointerAttributes attr; + RAFT_CUDA_TRY(cudaPointerGetAttributes(&attr, ptr)); + if (attr.hostPointer != nullptr) { + print_host_vector(variable_name, reinterpret_cast(attr.hostPointer), componentsCount, out); + } else if (attr.type == cudaMemoryTypeUnregistered) { + print_host_vector(variable_name, ptr, componentsCount, out); + } else { + print_device_vector(variable_name, ptr, componentsCount, out); + } +} +/** @} */ + +/** helper method to get max usable shared mem per block parameter */ +inline int getSharedMemPerBlock() +{ + int devId; + RAFT_CUDA_TRY(cudaGetDevice(&devId)); + int smemPerBlk; + RAFT_CUDA_TRY(cudaDeviceGetAttribute(&smemPerBlk, cudaDevAttrMaxSharedMemoryPerBlock, devId)); + return 
smemPerBlk; +} + +/** helper method to get multi-processor count parameter */ +inline int getMultiProcessorCount() +{ + int devId; + RAFT_CUDA_TRY(cudaGetDevice(&devId)); + int mpCount; + RAFT_CUDA_TRY(cudaDeviceGetAttribute(&mpCount, cudaDevAttrMultiProcessorCount, devId)); + return mpCount; +} + +/** helper method to convert an array on device to a string on host */ +template <typename T> +std::string arr2Str(const T* arr, int size, std::string name, cudaStream_t stream, int width = 4) +{ + std::stringstream ss; + + T* arr_h = (T*)malloc(size * sizeof(T)); + update_host(arr_h, arr, size, stream); + RAFT_CUDA_TRY(cudaStreamSynchronize(stream)); + + ss << name << " = [ "; + for (int i = 0; i < size; i++) { + ss << std::setw(width) << arr_h[i]; + + if (i < size - 1) ss << ", "; + } + ss << " ]" << std::endl; + + free(arr_h); + + return ss.str(); +} + +/** this seems to be unused, but may be useful in the future */ +template <typename T> +void ASSERT_DEVICE_MEM(T* ptr, std::string name) +{ + cudaPointerAttributes s_att; + cudaError_t s_err = cudaPointerGetAttributes(&s_att, ptr); + + if (s_err != 0 || s_att.device == -1) + std::cout << "Invalid device pointer encountered in " << name << ". device=" << s_att.device + << ", err=" << s_err << std::endl; +} + +inline uint32_t curTimeMillis() +{ + auto now = std::chrono::high_resolution_clock::now(); + auto duration = now.time_since_epoch(); + return std::chrono::duration_cast<std::chrono::milliseconds>(duration).count(); +} + +/** Helper function to calculate the number of items needed to store a dense matrix. + * @param rows number of rows in matrix + * @param columns number of columns in matrix + * @return number of items to allocate via allocate() + * @sa allocate() + */ +inline size_t allocLengthForMatrix(size_t rows, size_t columns) { return rows * columns; } + +/** Helper function to check alignment of pointer. 
+ * @param ptr the pointer to check + * @param alignment to be checked for + * @return true if address in bytes is a multiple of alignment + */ +template +bool is_aligned(Type* ptr, size_t alignment) +{ + return reinterpret_cast(ptr) % alignment == 0; +} + +/** calculate greatest common divisor of two numbers + * @param a integer + * @param b integer + * @return gcd of a and b + */ +template +IntType gcd(IntType a, IntType b) +{ + while (b != 0) { + IntType tmp = b; + b = a % b; + a = tmp; + } + return a; +} + +template +constexpr T lower_bound() +{ + if constexpr (std::numeric_limits::has_infinity && std::numeric_limits::is_signed) { + return -std::numeric_limits::infinity(); + } + return std::numeric_limits::lowest(); +} + +template +constexpr T upper_bound() +{ + if constexpr (std::numeric_limits::has_infinity) { return std::numeric_limits::infinity(); } + return std::numeric_limits::max(); +} + +/** + * @brief Get a pointer to a pooled memory resource within the scope of the lifetime of the returned + * unique pointer. + * + * This function is useful in code where multiple repeated allocations/deallocations are + * expected. + * Use case example: + * @code{.cpp} + * void my_func(..., size_t n, rmm::mr::device_memory_resource* mr = nullptr) { + * auto pool_guard = raft::get_pool_memory_resource(mr, 2 * n * sizeof(float)); + * if (pool_guard){ + * RAFT_LOG_INFO("Created a pool %zu bytes", pool_guard->pool_size()); + * } else { + * RAFT_LOG_INFO("Using the current default or explicitly passed device memory resource"); + * } + * rmm::device_uvector x(n, stream, mr); + * rmm::device_uvector y(n, stream, mr); + * ... + * } + * @endcode + * Here, the new memory resource would be created within the function scope if the passed `mr` is + * null and the default resource is not a pool. After the call, `mr` contains a valid memory + * resource in any case. 
+ * + * @param[inout] mr if not null do nothing; otherwise get the current device resource and wrap it + * into a `pool_memory_resource` if necessary and return the pointer to the result. + * @param initial_size if a new memory pool is created, this would be its initial size (rounded up + * to a multiple of 256 bytes). + * + * @return if a new memory pool is created, it returns a unique_ptr to it; + * this managed pointer controls the lifetime of the created memory resource. + */ +inline auto get_pool_memory_resource(rmm::mr::device_memory_resource*& mr, size_t initial_size) +{ + using pool_res_t = rmm::mr::pool_memory_resource; + std::unique_ptr pool_res{}; + if (mr) return pool_res; + mr = rmm::mr::get_current_device_resource(); + if (!dynamic_cast(mr) && + !dynamic_cast*>(mr) && + !dynamic_cast*>(mr)) { + pool_res = std::make_unique(mr, (initial_size + 255) & (~255)); + mr = pool_res.get(); + } + return pool_res; +} + +} // namespace raft + +#endif diff --git a/cpp/include/raft/util/detail/cub_wrappers.cuh b/cpp/include/raft/util/detail/cub_wrappers.cuh new file mode 100644 index 0000000000..8c70331165 --- /dev/null +++ b/cpp/include/raft/util/detail/cub_wrappers.cuh @@ -0,0 +1,53 @@ +/* + * Copyright (c) 2019-2022, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +#pragma once + +#include +#include + +namespace raft { + +/** + * @brief Convenience wrapper over cub's SortPairs method + * @tparam KeyT key type + * @tparam ValueT value type + * @param workspace workspace buffer which will get resized if not enough space + * @param inKeys input keys array + * @param outKeys output keys array + * @param inVals input values array + * @param outVals output values array + * @param len array length + * @param stream cuda stream + */ +template +void sortPairs(rmm::device_uvector& workspace, + const KeyT* inKeys, + KeyT* outKeys, + const ValueT* inVals, + ValueT* outVals, + int len, + cudaStream_t stream) +{ + size_t worksize; + cub::DeviceRadixSort::SortPairs( + nullptr, worksize, inKeys, outKeys, inVals, outVals, len, 0, sizeof(KeyT) * 8, stream); + workspace.resize(worksize, stream); + cub::DeviceRadixSort::SortPairs( + workspace.data(), worksize, inKeys, outKeys, inVals, outVals, len, 0, sizeof(KeyT) * 8, stream); +} + +} // namespace raft diff --git a/cpp/include/raft/util/detail/scatter.cuh b/cpp/include/raft/util/detail/scatter.cuh new file mode 100644 index 0000000000..87a8826aa6 --- /dev/null +++ b/cpp/include/raft/util/detail/scatter.cuh @@ -0,0 +1,52 @@ +/* + * Copyright (c) 2019-2022, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +#pragma once + +#include +#include + +namespace raft::detail { + +template +__global__ void scatterKernel(DataT* out, const DataT* in, const IdxT* idx, IdxT len, Lambda op) +{ + typedef TxN_t DataVec; + typedef TxN_t IdxVec; + IdxT tid = threadIdx.x + ((IdxT)blockIdx.x * blockDim.x); + tid *= VecLen; + if (tid >= len) return; + IdxVec idxIn; + idxIn.load(idx, tid); + DataVec dataIn; +#pragma unroll + for (int i = 0; i < VecLen; ++i) { + auto inPos = idxIn.val.data[i]; + dataIn.val.data[i] = op(in[inPos], tid + i); + } + dataIn.store(out, tid); +} + +template +void scatterImpl( + DataT* out, const DataT* in, const IdxT* idx, IdxT len, Lambda op, cudaStream_t stream) +{ + const IdxT nblks = raft::ceildiv(VecLen ? len / VecLen : len, (IdxT)TPB); + scatterKernel<<>>(out, in, idx, len, op); + RAFT_CUDA_TRY(cudaGetLastError()); +} + +} // namespace raft::detail diff --git a/cpp/include/raft/util/device_atomics.cuh b/cpp/include/raft/util/device_atomics.cuh new file mode 100644 index 0000000000..28f7516688 --- /dev/null +++ b/cpp/include/raft/util/device_atomics.cuh @@ -0,0 +1,668 @@ +/* + * Copyright (c) 2019-2022, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#pragma once + +/** + * @brief overloads for CUDA atomic operations + * @file device_atomics.cuh + * + * Provides the overloads for arithmetic data types, where CUDA atomic operations are, `atomicAdd`, + * `atomicMin`, `atomicMax`, and `atomicCAS`. 
+ * `atomicAnd`, `atomicOr`, `atomicXor` are also supported for integer data types. + * Also provides `raft::genericAtomicOperation` which performs atomic operation with the given + * binary operator. + */ + +#include +#include + +namespace raft { + +namespace device_atomics { +namespace detail { + +// ------------------------------------------------------------------------------------------------- +// Binary operators + +/* @brief binary `sum` operator */ +struct DeviceSum { + template ::value>* = nullptr> + __device__ T operator()(const T& lhs, const T& rhs) + { + return lhs + rhs; + } +}; + +/* @brief binary `min` operator */ +struct DeviceMin { + template + __device__ T operator()(const T& lhs, const T& rhs) + { + return lhs < rhs ? lhs : rhs; + } +}; + +/* @brief binary `max` operator */ +struct DeviceMax { + template + __device__ T operator()(const T& lhs, const T& rhs) + { + return lhs > rhs ? lhs : rhs; + } +}; + +/* @brief binary `product` operator */ +struct DeviceProduct { + template ::value>* = nullptr> + __device__ T operator()(const T& lhs, const T& rhs) + { + return lhs * rhs; + } +}; + +/* @brief binary `and` operator */ +struct DeviceAnd { + template ::value>* = nullptr> + __device__ T operator()(const T& lhs, const T& rhs) + { + return (lhs & rhs); + } +}; + +/* @brief binary `or` operator */ +struct DeviceOr { + template ::value>* = nullptr> + __device__ T operator()(const T& lhs, const T& rhs) + { + return (lhs | rhs); + } +}; + +/* @brief binary `xor` operator */ +struct DeviceXor { + template ::value>* = nullptr> + __device__ T operator()(const T& lhs, const T& rhs) + { + return (lhs ^ rhs); + } +}; + +// FIXME: remove this if C++17 is supported. +// `static_assert` requires a string literal at C++14. +#define errmsg_cast "size mismatch." 
+ +template +__forceinline__ __device__ T_output type_reinterpret(T_input value) +{ + static_assert(sizeof(T_output) == sizeof(T_input), "type_reinterpret for different size"); + return *(reinterpret_cast(&value)); +} + +// ------------------------------------------------------------------------------------------------- +// the implementation of `genericAtomicOperation` + +template +struct genericAtomicOperationImpl; + +// single byte atomic operation +template +struct genericAtomicOperationImpl { + __forceinline__ __device__ T operator()(T* addr, T const& update_value, Op op) + { + using T_int = unsigned int; + + T_int* address_uint32 = reinterpret_cast(addr - (reinterpret_cast(addr) & 3)); + T_int shift = ((reinterpret_cast(addr) & 3) * 8); + + T_int old = *address_uint32; + T_int assumed; + + do { + assumed = old; + T target_value = T((old >> shift) & 0xff); + uint8_t updating_value = type_reinterpret(op(target_value, update_value)); + T_int new_value = (old & ~(0x000000ff << shift)) | (T_int(updating_value) << shift); + old = atomicCAS(address_uint32, assumed, new_value); + } while (assumed != old); + + return T((old >> shift) & 0xff); + } +}; + +// 2 bytes atomic operation +template +struct genericAtomicOperationImpl { + __forceinline__ __device__ T operator()(T* addr, T const& update_value, Op op) + { + using T_int = unsigned int; + bool is_32_align = (reinterpret_cast(addr) & 2) ? false : true; + T_int* address_uint32 = + reinterpret_cast(reinterpret_cast(addr) - (is_32_align ? 0 : 2)); + + T_int old = *address_uint32; + T_int assumed; + + do { + assumed = old; + T target_value = (is_32_align) ? T(old & 0xffff) : T(old >> 16); + uint16_t updating_value = type_reinterpret(op(target_value, update_value)); + + T_int new_value = (is_32_align) ? (old & 0xffff0000) | updating_value + : (old & 0xffff) | (T_int(updating_value) << 16); + old = atomicCAS(address_uint32, assumed, new_value); + } while (assumed != old); + + return (is_32_align) ? 
T(old & 0xffff) : T(old >> 16); + ; + } +}; + +// 4 bytes atomic operation +template +struct genericAtomicOperationImpl { + __forceinline__ __device__ T operator()(T* addr, T const& update_value, Op op) + { + using T_int = unsigned int; + T old_value = *addr; + T assumed{old_value}; + + if constexpr (std::is_same{} && (std::is_same{})) { + if (isnan(update_value)) { return old_value; } + } + + do { + assumed = old_value; + const T new_value = op(old_value, update_value); + + T_int ret = atomicCAS(reinterpret_cast(addr), + type_reinterpret(assumed), + type_reinterpret(new_value)); + old_value = type_reinterpret(ret); + } while (assumed != old_value); + + return old_value; + } +}; + +// 4 bytes fp32 atomic Max operation +template <> +struct genericAtomicOperationImpl { + using T = float; + __forceinline__ __device__ T operator()(T* addr, T const& update_value, DeviceMax op) + { + if (isnan(update_value)) { return *addr; } + + T old = (update_value >= 0) + ? __int_as_float(atomicMax((int*)addr, __float_as_int(update_value))) + : __uint_as_float(atomicMin((unsigned int*)addr, __float_as_uint(update_value))); + + return old; + } +}; + +// 8 bytes atomic operation +template +struct genericAtomicOperationImpl { + __forceinline__ __device__ T operator()(T* addr, T const& update_value, Op op) + { + using T_int = unsigned long long int; + static_assert(sizeof(T) == sizeof(T_int), errmsg_cast); + + T old_value = *addr; + T assumed{old_value}; + + do { + assumed = old_value; + const T new_value = op(old_value, update_value); + + T_int ret = atomicCAS(reinterpret_cast(addr), + type_reinterpret(assumed), + type_reinterpret(new_value)); + old_value = type_reinterpret(ret); + + } while (assumed != old_value); + + return old_value; + } +}; + +// ------------------------------------------------------------------------------------------------- +// specialized functions for operators +// `atomicAdd` supports int, unsigned int, unsigned long long int, float, double (long long int is 
+// not supported.) `atomicMin`, `atomicMax` support int, unsigned int, unsigned long long int +// `atomicAnd`, `atomicOr`, `atomicXor` support int, unsigned int, unsigned long long int + +// CUDA natively supports `unsigned long long int` for `atomicAdd`, +// but doesn't support `long int`. +// However, since the signed integer is represented as Two's complement, +// the fundamental arithmetic operations of addition are identical to +// those for unsigned binary numbers. +// Then, this computes as `unsigned long long int` with `atomicAdd` +// @sa https://en.wikipedia.org/wiki/Two%27s_complement +template <> +struct genericAtomicOperationImpl { + using T = long int; + __forceinline__ __device__ T operator()(T* addr, T const& update_value, DeviceSum op) + { + using T_int = unsigned long long int; + static_assert(sizeof(T) == sizeof(T_int), errmsg_cast); + T_int ret = atomicAdd(reinterpret_cast(addr), type_reinterpret(update_value)); + return type_reinterpret(ret); + } +}; + +template <> +struct genericAtomicOperationImpl { + using T = unsigned long int; + __forceinline__ __device__ T operator()(T* addr, T const& update_value, DeviceSum op) + { + using T_int = unsigned long long int; + static_assert(sizeof(T) == sizeof(T_int), errmsg_cast); + T_int ret = atomicAdd(reinterpret_cast(addr), type_reinterpret(update_value)); + return type_reinterpret(ret); + } +}; + +// CUDA natively supports `unsigned long long int` for `atomicAdd`, +// but doesn't support `long long int`. +// However, since the signed integer is represented as Two's complement, +// the fundamental arithmetic operations of addition are identical to +// those for unsigned binary numbers. 
+// Then, this computes as `unsigned long long int` with `atomicAdd` +// @sa https://en.wikipedia.org/wiki/Two%27s_complement +template <> +struct genericAtomicOperationImpl { + using T = long long int; + __forceinline__ __device__ T operator()(T* addr, T const& update_value, DeviceSum op) + { + using T_int = unsigned long long int; + static_assert(sizeof(T) == sizeof(T_int), errmsg_cast); + T_int ret = atomicAdd(reinterpret_cast(addr), type_reinterpret(update_value)); + return type_reinterpret(ret); + } +}; + +template <> +struct genericAtomicOperationImpl { + using T = unsigned long int; + __forceinline__ __device__ T operator()(T* addr, T const& update_value, DeviceMin op) + { + using T_int = unsigned long long int; + static_assert(sizeof(T) == sizeof(T_int), errmsg_cast); + T ret = atomicMin(reinterpret_cast(addr), type_reinterpret(update_value)); + return type_reinterpret(ret); + } +}; + +template <> +struct genericAtomicOperationImpl { + using T = unsigned long int; + __forceinline__ __device__ T operator()(T* addr, T const& update_value, DeviceMax op) + { + using T_int = unsigned long long int; + static_assert(sizeof(T) == sizeof(T_int), errmsg_cast); + T ret = atomicMax(reinterpret_cast(addr), type_reinterpret(update_value)); + return type_reinterpret(ret); + } +}; + +template +struct genericAtomicOperationImpl { + __forceinline__ __device__ T operator()(T* addr, T const& update_value, DeviceAnd op) + { + using T_int = unsigned long long int; + static_assert(sizeof(T) == sizeof(T_int), errmsg_cast); + T_int ret = atomicAnd(reinterpret_cast(addr), type_reinterpret(update_value)); + return type_reinterpret(ret); + } +}; + +template +struct genericAtomicOperationImpl { + __forceinline__ __device__ T operator()(T* addr, T const& update_value, DeviceOr op) + { + using T_int = unsigned long long int; + static_assert(sizeof(T) == sizeof(T_int), errmsg_cast); + T_int ret = atomicOr(reinterpret_cast(addr), type_reinterpret(update_value)); + return 
type_reinterpret(ret); + } +}; + +template +struct genericAtomicOperationImpl { + __forceinline__ __device__ T operator()(T* addr, T const& update_value, DeviceXor op) + { + using T_int = unsigned long long int; + static_assert(sizeof(T) == sizeof(T_int), errmsg_cast); + T_int ret = atomicXor(reinterpret_cast(addr), type_reinterpret(update_value)); + return type_reinterpret(ret); + } +}; + +// ------------------------------------------------------------------------------------------------- +// the implementation of `typesAtomicCASImpl` + +template +struct typesAtomicCASImpl; + +template +struct typesAtomicCASImpl { + __forceinline__ __device__ T operator()(T* addr, T const& compare, T const& update_value) + { + using T_int = unsigned int; + + T_int shift = ((reinterpret_cast(addr) & 3) * 8); + T_int* address_uint32 = reinterpret_cast(addr - (reinterpret_cast(addr) & 3)); + + // the 'target_value' in `old` can be different from `compare` + // because other thread may update the value + // before fetching a value from `address_uint32` in this function + T_int old = *address_uint32; + T_int assumed; + T target_value; + uint8_t u_val = type_reinterpret(update_value); + + do { + assumed = old; + target_value = T((old >> shift) & 0xff); + // have to compare `target_value` and `compare` before calling atomicCAS + // the `target_value` in `old` can be different with `compare` + if (target_value != compare) break; + + T_int new_value = (old & ~(0x000000ff << shift)) | (T_int(u_val) << shift); + old = atomicCAS(address_uint32, assumed, new_value); + } while (assumed != old); + + return target_value; + } +}; + +template +struct typesAtomicCASImpl { + __forceinline__ __device__ T operator()(T* addr, T const& compare, T const& update_value) + { + using T_int = unsigned int; + + bool is_32_align = (reinterpret_cast(addr) & 2) ? false : true; + T_int* address_uint32 = + reinterpret_cast(reinterpret_cast(addr) - (is_32_align ? 
0 : 2)); + + T_int old = *address_uint32; + T_int assumed; + T target_value; + uint16_t u_val = type_reinterpret(update_value); + + do { + assumed = old; + target_value = (is_32_align) ? T(old & 0xffff) : T(old >> 16); + if (target_value != compare) break; + + T_int new_value = + (is_32_align) ? (old & 0xffff0000) | u_val : (old & 0xffff) | (T_int(u_val) << 16); + old = atomicCAS(address_uint32, assumed, new_value); + } while (assumed != old); + + return target_value; + } +}; + +template +struct typesAtomicCASImpl { + __forceinline__ __device__ T operator()(T* addr, T const& compare, T const& update_value) + { + using T_int = unsigned int; + + T_int ret = atomicCAS(reinterpret_cast(addr), + type_reinterpret(compare), + type_reinterpret(update_value)); + return type_reinterpret(ret); + } +}; + +// 8 bytes atomic operation +template +struct typesAtomicCASImpl { + __forceinline__ __device__ T operator()(T* addr, T const& compare, T const& update_value) + { + using T_int = unsigned long long int; + static_assert(sizeof(T) == sizeof(T_int), errmsg_cast); + + T_int ret = atomicCAS(reinterpret_cast(addr), + type_reinterpret(compare), + type_reinterpret(update_value)); + + return type_reinterpret(ret); + } +}; + +} // namespace detail +} // namespace device_atomics + +/** -------------------------------------------------------------------------* + * @brief compute atomic binary operation + * reads the `old` located at the `address` in global or shared memory, + * computes 'BinaryOp'('old', 'update_value'), + * and stores the result back to memory at the same address. + * These three operations are performed in one atomic transaction. 
+ * + * The supported cudf types for `genericAtomicOperation` are: + * int8_t, int16_t, int32_t, int64_t, float, double + * + * @param[in] address The address of old value in global or shared memory + * @param[in] update_value The value to be computed + * @param[in] op The binary operator used for compute + * + * @returns The old value at `address` + * -------------------------------------------------------------------------**/ +template +typename std::enable_if_t::value, T> __forceinline__ __device__ +genericAtomicOperation(T* address, T const& update_value, BinaryOp op) +{ + auto fun = raft::device_atomics::detail::genericAtomicOperationImpl{}; + return T(fun(address, update_value, op)); +} + +// specialization for bool types +template +__forceinline__ __device__ bool genericAtomicOperation(bool* address, + bool const& update_value, + BinaryOp op) +{ + using T = bool; + // don't use underlying type to apply operation for bool + auto fun = raft::device_atomics::detail::genericAtomicOperationImpl{}; + return T(fun(address, update_value, op)); +} + +} // namespace raft + +/** + * @brief Overloads for `atomicAdd` + * + * reads the `old` located at the `address` in global or shared memory, computes (old + val), and + * stores the result back to memory at the same address. These three operations are performed in one + * atomic transaction. + * + * The supported types for `atomicAdd` are: integers and floating point numbers. + * CUDA natively supports `int`, `unsigned int`, `unsigned long long int`, `float`, `double`. 
+ * + * @param[in] address The address of old value in global or shared memory + * @param[in] val The value to be added + * + * @returns The old value at `address` + */ +template +__forceinline__ __device__ T atomicAdd(T* address, T val) +{ + return raft::genericAtomicOperation(address, val, raft::device_atomics::detail::DeviceSum{}); +} + +/** + * @brief Overloads for `atomicMin` + * + * reads the `old` located at the `address` in global or shared memory, computes the minimum of old + * and val, and stores the result back to memory at the same address. These three operations are + * performed in one atomic transaction. + * + * The supported types for `atomicMin` are: integers and floating point numbers. + * CUDA natively supports `int`, `unsigned int`, `unsigned long long int`. + * + * @param[in] address The address of old value in global or shared memory + * @param[in] val The value to be computed + * + * @returns The old value at `address` + */ +template +__forceinline__ __device__ T atomicMin(T* address, T val) +{ + return raft::genericAtomicOperation(address, val, raft::device_atomics::detail::DeviceMin{}); +} + +/** + * @brief Overloads for `atomicMax` + * + * reads the `old` located at the `address` in global or shared memory, computes the maximum of old + * and val, and stores the result back to memory at the same address. These three operations are + * performed in one atomic transaction. + * + * The supported types for `atomicMax` are: integers and floating point numbers. + * CUDA natively supports `int`, `unsigned int`, `unsigned long long int`. 
+ * + * @param[in] address The address of old value in global or shared memory + * @param[in] val The value to be computed + * + * @returns The old value at `address` + */ +template +__forceinline__ __device__ T atomicMax(T* address, T val) +{ + return raft::genericAtomicOperation(address, val, raft::device_atomics::detail::DeviceMax{}); +} + +/** + * @brief Overloads for `atomicCAS` + * + * reads the `old` located at the `address` in global or shared memory, computes + * (`old` == `compare` ? `val` : `old`), and stores the result back to memory at the same address. + * These three operations are performed in one atomic transaction. + * + * The supported types for `atomicCAS` are: integers and floating point numbers. + * CUDA natively supports `int`, `unsigned int`, `unsigned long long int`, `unsigned short int`. + * + * @param[in] address The address of old value in global or shared memory + * @param[in] compare The value to be compared + * @param[in] val The value to be computed + * + * @returns The old value at `address` + */ +template +__forceinline__ __device__ T atomicCAS(T* address, T compare, T val) +{ + return raft::device_atomics::detail::typesAtomicCASImpl()(address, compare, val); +} + +/** + * @brief Overloads for `atomicAnd` + * + * reads the `old` located at the `address` in global or shared memory, computes (old & val), and + * stores the result back to memory at the same address. These three operations are performed in + * one atomic transaction. + * + * The supported types for `atomicAnd` are: integers. + * CUDA natively supports `int`, `unsigned int`, `unsigned long long int`. 
+ * + * @param[in] address The address of old value in global or shared memory + * @param[in] val The value to be computed + * + * @returns The old value at `address` + */ +template ::value, T>* = nullptr> +__forceinline__ __device__ T atomicAnd(T* address, T val) +{ + return raft::genericAtomicOperation(address, val, raft::device_atomics::detail::DeviceAnd{}); +} + +/** + * @brief Overloads for `atomicOr` + * + * reads the `old` located at the `address` in global or shared memory, computes (old | val), and + * stores the result back to memory at the same address. These three operations are performed in + * one atomic transaction. + * + * The supported types for `atomicOr` are: integers. + * CUDA natively supports `int`, `unsigned int`, `unsigned long long int`. + * + * @param[in] address The address of old value in global or shared memory + * @param[in] val The value to be computed + * + * @returns The old value at `address` + */ +template ::value, T>* = nullptr> +__forceinline__ __device__ T atomicOr(T* address, T val) +{ + return raft::genericAtomicOperation(address, val, raft::device_atomics::detail::DeviceOr{}); +} + +/** + * @brief Overloads for `atomicXor` + * + * reads the `old` located at the `address` in global or shared memory, computes (old ^ val), and + * stores the result back to memory at the same address. These three operations are performed in + * one atomic transaction. + * + * The supported types for `atomicXor` are: integers. + * CUDA natively supports `int`, `unsigned int`, `unsigned long long int`. 
+ * + * @param[in] address The address of old value in global or shared memory + * @param[in] val The value to be computed + * + * @returns The old value at `address` + */ +template ::value, T>* = nullptr> +__forceinline__ __device__ T atomicXor(T* address, T val) +{ + return raft::genericAtomicOperation(address, val, raft::device_atomics::detail::DeviceXor{}); +} + +/** + * @brief: Warp aggregated atomic increment + * + * increments an atomic counter using all active threads in a warp. The return + * value is the original value of the counter plus the rank of the calling + * thread. + * + * The use of atomicIncWarp is a performance optimization. It can reduce the + * amount of atomic memory traffic by a factor of 32. + * + * Adapted from: + * https://developer.nvidia.com/blog/cuda-pro-tip-optimized-filtering-warp-aggregated-atomics/ + * + * @tparam T An integral type + * @param[in,out] ctr The address of old value + * + * @return The old value of the counter plus the rank of the calling thread. + */ +template ::value, T>* = nullptr> +__device__ T atomicIncWarp(T* ctr) +{ + namespace cg = cooperative_groups; + auto g = cg::coalesced_threads(); + T warp_res; + if (g.thread_rank() == 0) { warp_res = atomicAdd(ctr, static_cast(g.size())); } + return g.shfl(warp_res, 0) + g.thread_rank(); +} diff --git a/cpp/include/raft/util/device_loads_stores.cuh b/cpp/include/raft/util/device_loads_stores.cuh new file mode 100644 index 0000000000..2b87c44d60 --- /dev/null +++ b/cpp/include/raft/util/device_loads_stores.cuh @@ -0,0 +1,538 @@ +/* + * Copyright (c) 2021-2022, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. 
+ * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#pragma once + +#include + +namespace raft { + +/** + * @defgroup SmemStores Shared memory store operations + * @{ + * @brief Stores to shared memory (both vectorized and non-vectorized forms) + * requires the given shmem pointer to be aligned by the vector + length, like for float4 lds/sts shmem pointer should be aligned + by 16 bytes else it might silently fail or can also give + runtime error. + * @param[out] addr shared memory address (should be aligned to vector size) + * @param[in] x data to be stored at this address + */ +DI void sts(uint8_t* addr, const uint8_t& x) +{ + uint32_t x_int; + x_int = x; + auto s1 = __cvta_generic_to_shared(reinterpret_cast(addr)); + asm volatile("st.shared.u8 [%0], {%1};" : : "l"(s1), "r"(x_int)); +} +DI void sts(uint8_t* addr, const uint8_t (&x)[1]) +{ + uint32_t x_int[1]; + x_int[0] = x[0]; + auto s1 = __cvta_generic_to_shared(reinterpret_cast(addr)); + asm volatile("st.shared.u8 [%0], {%1};" : : "l"(s1), "r"(x_int[0])); +} +DI void sts(uint8_t* addr, const uint8_t (&x)[2]) +{ + uint32_t x_int[2]; + x_int[0] = x[0]; + x_int[1] = x[1]; + auto s2 = __cvta_generic_to_shared(reinterpret_cast(addr)); + asm volatile("st.shared.v2.u8 [%0], {%1, %2};" : : "l"(s2), "r"(x_int[0]), "r"(x_int[1])); +} +DI void sts(uint8_t* addr, const uint8_t (&x)[4]) +{ + uint32_t x_int[4]; + x_int[0] = x[0]; + x_int[1] = x[1]; + x_int[2] = x[2]; + x_int[3] = x[3]; + auto s4 = __cvta_generic_to_shared(reinterpret_cast(addr)); + asm volatile("st.shared.v4.u8 [%0], {%1, %2, %3, %4};" + : + : "l"(s4), "r"(x_int[0]), "r"(x_int[1]), 
"r"(x_int[2]), "r"(x_int[3])); +} + +DI void sts(int8_t* addr, const int8_t& x) +{ + int32_t x_int = x; + auto s1 = __cvta_generic_to_shared(reinterpret_cast(addr)); + asm volatile("st.shared.s8 [%0], {%1};" : : "l"(s1), "r"(x_int)); +} +DI void sts(int8_t* addr, const int8_t (&x)[1]) +{ + int32_t x_int[1]; + x_int[0] = x[0]; + auto s1 = __cvta_generic_to_shared(reinterpret_cast(addr)); + asm volatile("st.shared.s8 [%0], {%1};" : : "l"(s1), "r"(x_int[0])); +} +DI void sts(int8_t* addr, const int8_t (&x)[2]) +{ + int32_t x_int[2]; + x_int[0] = x[0]; + x_int[1] = x[1]; + auto s2 = __cvta_generic_to_shared(reinterpret_cast(addr)); + asm volatile("st.shared.v2.s8 [%0], {%1, %2};" : : "l"(s2), "r"(x_int[0]), "r"(x_int[1])); +} +DI void sts(int8_t* addr, const int8_t (&x)[4]) +{ + int32_t x_int[4]; + x_int[0] = x[0]; + x_int[1] = x[1]; + x_int[2] = x[2]; + x_int[3] = x[3]; + auto s4 = __cvta_generic_to_shared(reinterpret_cast(addr)); + asm volatile("st.shared.v4.s8 [%0], {%1, %2, %3, %4};" + : + : "l"(s4), "r"(x_int[0]), "r"(x_int[1]), "r"(x_int[2]), "r"(x_int[3])); +} + +DI void sts(uint32_t* addr, const uint32_t& x) +{ + auto s1 = __cvta_generic_to_shared(reinterpret_cast(addr)); + asm volatile("st.shared.u32 [%0], {%1};" : : "l"(s1), "r"(x)); +} +DI void sts(uint32_t* addr, const uint32_t (&x)[1]) +{ + auto s1 = __cvta_generic_to_shared(reinterpret_cast(addr)); + asm volatile("st.shared.u32 [%0], {%1};" : : "l"(s1), "r"(x[0])); +} +DI void sts(uint32_t* addr, const uint32_t (&x)[2]) +{ + auto s2 = __cvta_generic_to_shared(reinterpret_cast(addr)); + asm volatile("st.shared.v2.u32 [%0], {%1, %2};" : : "l"(s2), "r"(x[0]), "r"(x[1])); +} +DI void sts(uint32_t* addr, const uint32_t (&x)[4]) +{ + auto s4 = __cvta_generic_to_shared(reinterpret_cast(addr)); + asm volatile("st.shared.v4.u32 [%0], {%1, %2, %3, %4};" + : + : "l"(s4), "r"(x[0]), "r"(x[1]), "r"(x[2]), "r"(x[3])); +} + +DI void sts(int32_t* addr, const int32_t& x) +{ + auto s1 = 
__cvta_generic_to_shared(reinterpret_cast(addr)); + asm volatile("st.shared.u32 [%0], {%1};" : : "l"(s1), "r"(x)); +} +DI void sts(int32_t* addr, const int32_t (&x)[1]) +{ + auto s1 = __cvta_generic_to_shared(reinterpret_cast(addr)); + asm volatile("st.shared.u32 [%0], {%1};" : : "l"(s1), "r"(x[0])); +} +DI void sts(int32_t* addr, const int32_t (&x)[2]) +{ + auto s2 = __cvta_generic_to_shared(reinterpret_cast(addr)); + asm volatile("st.shared.v2.u32 [%0], {%1, %2};" : : "l"(s2), "r"(x[0]), "r"(x[1])); +} +DI void sts(int32_t* addr, const int32_t (&x)[4]) +{ + auto s4 = __cvta_generic_to_shared(reinterpret_cast(addr)); + asm volatile("st.shared.v4.u32 [%0], {%1, %2, %3, %4};" + : + : "l"(s4), "r"(x[0]), "r"(x[1]), "r"(x[2]), "r"(x[3])); +} + +DI void sts(float* addr, const float& x) +{ + auto s1 = __cvta_generic_to_shared(reinterpret_cast(addr)); + asm volatile("st.shared.f32 [%0], {%1};" : : "l"(s1), "f"(x)); +} +DI void sts(float* addr, const float (&x)[1]) +{ + auto s1 = __cvta_generic_to_shared(reinterpret_cast(addr)); + asm volatile("st.shared.f32 [%0], {%1};" : : "l"(s1), "f"(x[0])); +} +DI void sts(float* addr, const float (&x)[2]) +{ + auto s2 = __cvta_generic_to_shared(reinterpret_cast(addr)); + asm volatile("st.shared.v2.f32 [%0], {%1, %2};" : : "l"(s2), "f"(x[0]), "f"(x[1])); +} +DI void sts(float* addr, const float (&x)[4]) +{ + auto s4 = __cvta_generic_to_shared(reinterpret_cast(addr)); + asm volatile("st.shared.v4.f32 [%0], {%1, %2, %3, %4};" + : + : "l"(s4), "f"(x[0]), "f"(x[1]), "f"(x[2]), "f"(x[3])); +} + +DI void sts(double* addr, const double& x) +{ + auto s1 = __cvta_generic_to_shared(reinterpret_cast(addr)); + asm volatile("st.shared.f64 [%0], {%1};" : : "l"(s1), "d"(x)); +} +DI void sts(double* addr, const double (&x)[1]) +{ + auto s1 = __cvta_generic_to_shared(reinterpret_cast(addr)); + asm volatile("st.shared.f64 [%0], {%1};" : : "l"(s1), "d"(x[0])); +} +DI void sts(double* addr, const double (&x)[2]) +{ + auto s2 = 
__cvta_generic_to_shared(reinterpret_cast(addr)); + asm volatile("st.shared.v2.f64 [%0], {%1, %2};" : : "l"(s2), "d"(x[0]), "d"(x[1])); +} +/** @} */ + +/** + * @defgroup SmemLoads Shared memory load operations + * @{ + * @brief Loads from shared memory (both vectorized and non-vectorized forms). + * The given shmem pointer must be aligned to the vector length; + * e.g. for float4 lds/sts the shmem pointer must be aligned to + * 16 bytes, otherwise the access might silently fail or raise a + * runtime error. + * @param[out] x the data to be loaded + * @param[in] addr shared memory address from where to load + * (should be aligned to vector size) + */ + +DI void lds(uint8_t& x, const uint8_t* addr) +{ + uint32_t x_int; + auto s1 = __cvta_generic_to_shared(reinterpret_cast(addr)); + asm volatile("ld.shared.u8 {%0}, [%1];" : "=r"(x_int) : "l"(s1)); + x = x_int; +} +DI void lds(uint8_t (&x)[1], const uint8_t* addr) +{ + uint32_t x_int[1]; + auto s1 = __cvta_generic_to_shared(reinterpret_cast(addr)); + asm volatile("ld.shared.u8 {%0}, [%1];" : "=r"(x_int[0]) : "l"(s1)); + x[0] = x_int[0]; +} +DI void lds(uint8_t (&x)[2], const uint8_t* addr) +{ + uint32_t x_int[2]; + auto s2 = __cvta_generic_to_shared(reinterpret_cast(addr)); + asm volatile("ld.shared.v2.u8 {%0, %1}, [%2];" : "=r"(x_int[0]), "=r"(x_int[1]) : "l"(s2)); + x[0] = x_int[0]; + x[1] = x_int[1]; +} +DI void lds(uint8_t (&x)[4], const uint8_t* addr) +{ + uint32_t x_int[4]; + auto s4 = __cvta_generic_to_shared(reinterpret_cast(addr)); + asm volatile("ld.shared.v4.u8 {%0, %1, %2, %3}, [%4];" + : "=r"(x_int[0]), "=r"(x_int[1]), "=r"(x_int[2]), "=r"(x_int[3]) + : "l"(s4)); + x[0] = x_int[0]; + x[1] = x_int[1]; + x[2] = x_int[2]; + x[3] = x_int[3]; +} + +DI void lds(int8_t& x, const int8_t* addr) +{ + int32_t x_int; + auto s1 = __cvta_generic_to_shared(reinterpret_cast(addr)); + asm volatile("ld.shared.s8 {%0}, [%1];" : "=r"(x_int) : "l"(s1)); + x = x_int; +} +DI void lds(int8_t (&x)[1], const int8_t* addr) +{ + int32_t 
x_int[1]; + auto s1 = __cvta_generic_to_shared(reinterpret_cast(addr)); + asm volatile("ld.shared.s8 {%0}, [%1];" : "=r"(x_int[0]) : "l"(s1)); + x[0] = x_int[0]; +} +DI void lds(int8_t (&x)[2], const int8_t* addr) +{ + int32_t x_int[2]; + auto s2 = __cvta_generic_to_shared(reinterpret_cast(addr)); + asm volatile("ld.shared.v2.s8 {%0, %1}, [%2];" : "=r"(x_int[0]), "=r"(x_int[1]) : "l"(s2)); + x[0] = x_int[0]; + x[1] = x_int[1]; +} +DI void lds(int8_t (&x)[4], const int8_t* addr) +{ + int32_t x_int[4]; + auto s4 = __cvta_generic_to_shared(reinterpret_cast(addr)); + asm volatile("ld.shared.v4.s8 {%0, %1, %2, %3}, [%4];" + : "=r"(x_int[0]), "=r"(x_int[1]), "=r"(x_int[2]), "=r"(x_int[3]) + : "l"(s4)); + x[0] = x_int[0]; + x[1] = x_int[1]; + x[2] = x_int[2]; + x[3] = x_int[3]; +} + +DI void lds(uint32_t (&x)[4], const uint32_t* addr) +{ + auto s4 = __cvta_generic_to_shared(reinterpret_cast(addr)); + asm volatile("ld.shared.v4.u32 {%0, %1, %2, %3}, [%4];" + : "=r"(x[0]), "=r"(x[1]), "=r"(x[2]), "=r"(x[3]) + : "l"(s4)); +} + +DI void lds(uint32_t (&x)[2], const uint32_t* addr) +{ + auto s2 = __cvta_generic_to_shared(reinterpret_cast(addr)); + asm volatile("ld.shared.v2.u32 {%0, %1}, [%2];" : "=r"(x[0]), "=r"(x[1]) : "l"(s2)); +} + +DI void lds(uint32_t (&x)[1], const uint32_t* addr) +{ + auto s1 = __cvta_generic_to_shared(reinterpret_cast(addr)); + asm volatile("ld.shared.u32 {%0}, [%1];" : "=r"(x[0]) : "l"(s1)); +} + +DI void lds(uint32_t& x, const uint32_t* addr) +{ + auto s1 = __cvta_generic_to_shared(reinterpret_cast(addr)); + asm volatile("ld.shared.u32 {%0}, [%1];" : "=r"(x) : "l"(s1)); +} + +DI void lds(int32_t (&x)[4], const int32_t* addr) +{ + auto s4 = __cvta_generic_to_shared(reinterpret_cast(addr)); + asm volatile("ld.shared.v4.u32 {%0, %1, %2, %3}, [%4];" + : "=r"(x[0]), "=r"(x[1]), "=r"(x[2]), "=r"(x[3]) + : "l"(s4)); +} + +DI void lds(int32_t (&x)[2], const int32_t* addr) +{ + auto s2 = __cvta_generic_to_shared(reinterpret_cast(addr)); + asm 
volatile("ld.shared.v2.u32 {%0, %1}, [%2];" : "=r"(x[0]), "=r"(x[1]) : "l"(s2)); +} + +DI void lds(int32_t (&x)[1], const int32_t* addr) +{ + auto s1 = __cvta_generic_to_shared(reinterpret_cast(addr)); + asm volatile("ld.shared.u32 {%0}, [%1];" : "=r"(x[0]) : "l"(s1)); +} + +DI void lds(int32_t& x, const int32_t* addr) +{ + auto s1 = __cvta_generic_to_shared(reinterpret_cast(addr)); + asm volatile("ld.shared.u32 {%0}, [%1];" : "=r"(x) : "l"(s1)); +} + +DI void lds(float& x, const float* addr) +{ + auto s1 = __cvta_generic_to_shared(reinterpret_cast(addr)); + asm volatile("ld.shared.f32 {%0}, [%1];" : "=f"(x) : "l"(s1)); +} +DI void lds(float (&x)[1], const float* addr) +{ + auto s1 = __cvta_generic_to_shared(reinterpret_cast(addr)); + asm volatile("ld.shared.f32 {%0}, [%1];" : "=f"(x[0]) : "l"(s1)); +} +DI void lds(float (&x)[2], const float* addr) +{ + auto s2 = __cvta_generic_to_shared(reinterpret_cast(addr)); + asm volatile("ld.shared.v2.f32 {%0, %1}, [%2];" : "=f"(x[0]), "=f"(x[1]) : "l"(s2)); +} +DI void lds(float (&x)[4], const float* addr) +{ + auto s4 = __cvta_generic_to_shared(reinterpret_cast(addr)); + asm volatile("ld.shared.v4.f32 {%0, %1, %2, %3}, [%4];" + : "=f"(x[0]), "=f"(x[1]), "=f"(x[2]), "=f"(x[3]) + : "l"(s4)); +} + +DI void lds(float& x, float* addr) +{ + auto s1 = __cvta_generic_to_shared(reinterpret_cast(addr)); + asm volatile("ld.shared.f32 {%0}, [%1];" : "=f"(x) : "l"(s1)); +} +DI void lds(float (&x)[1], float* addr) +{ + auto s1 = __cvta_generic_to_shared(reinterpret_cast(addr)); + asm volatile("ld.shared.f32 {%0}, [%1];" : "=f"(x[0]) : "l"(s1)); +} +DI void lds(float (&x)[2], float* addr) +{ + auto s2 = __cvta_generic_to_shared(reinterpret_cast(addr)); + asm volatile("ld.shared.v2.f32 {%0, %1}, [%2];" : "=f"(x[0]), "=f"(x[1]) : "l"(s2)); +} +DI void lds(float (&x)[4], float* addr) +{ + auto s4 = __cvta_generic_to_shared(reinterpret_cast(addr)); + asm volatile("ld.shared.v4.f32 {%0, %1, %2, %3}, [%4];" + : "=f"(x[0]), "=f"(x[1]), 
"=f"(x[2]), "=f"(x[3]) + : "l"(s4)); +} +DI void lds(double& x, double* addr) +{ + auto s1 = __cvta_generic_to_shared(reinterpret_cast(addr)); + asm volatile("ld.shared.f64 {%0}, [%1];" : "=d"(x) : "l"(s1)); +} +DI void lds(double (&x)[1], double* addr) +{ + auto s1 = __cvta_generic_to_shared(reinterpret_cast(addr)); + asm volatile("ld.shared.f64 {%0}, [%1];" : "=d"(x[0]) : "l"(s1)); +} +DI void lds(double (&x)[2], double* addr) +{ + auto s2 = __cvta_generic_to_shared(reinterpret_cast(addr)); + asm volatile("ld.shared.v2.f64 {%0, %1}, [%2];" : "=d"(x[0]), "=d"(x[1]) : "l"(s2)); +} +/** @} */ + +/** + * @defgroup GlobalLoads Global cached load operations + * @{ + * @brief Load from global memory with caching at L1 level + * @param[out] x data to be loaded from global memory + * @param[in] addr address in global memory from where to load + */ +DI void ldg(float& x, const float* addr) +{ + asm volatile("ld.global.cg.f32 %0, [%1];" : "=f"(x) : "l"(addr)); +} +DI void ldg(float (&x)[1], const float* addr) +{ + asm volatile("ld.global.cg.f32 %0, [%1];" : "=f"(x[0]) : "l"(addr)); +} +DI void ldg(float (&x)[2], const float* addr) +{ + asm volatile("ld.global.cg.v2.f32 {%0, %1}, [%2];" : "=f"(x[0]), "=f"(x[1]) : "l"(addr)); +} +DI void ldg(float (&x)[4], const float* addr) +{ + asm volatile("ld.global.cg.v4.f32 {%0, %1, %2, %3}, [%4];" + : "=f"(x[0]), "=f"(x[1]), "=f"(x[2]), "=f"(x[3]) + : "l"(addr)); +} +DI void ldg(double& x, const double* addr) +{ + asm volatile("ld.global.cg.f64 %0, [%1];" : "=d"(x) : "l"(addr)); +} +DI void ldg(double (&x)[1], const double* addr) +{ + asm volatile("ld.global.cg.f64 %0, [%1];" : "=d"(x[0]) : "l"(addr)); +} +DI void ldg(double (&x)[2], const double* addr) +{ + asm volatile("ld.global.cg.v2.f64 {%0, %1}, [%2];" : "=d"(x[0]), "=d"(x[1]) : "l"(addr)); +} + +DI void ldg(uint32_t (&x)[4], const uint32_t* const& addr) +{ + asm volatile("ld.global.cg.v4.u32 {%0, %1, %2, %3}, [%4];" + : "=r"(x[0]), "=r"(x[1]), "=r"(x[2]), "=r"(x[3]) + : 
"l"(addr)); +} + +DI void ldg(uint32_t (&x)[2], const uint32_t* const& addr) +{ + asm volatile("ld.global.cg.v2.u32 {%0, %1}, [%2];" : "=r"(x[0]), "=r"(x[1]) : "l"(addr)); +} + +DI void ldg(uint32_t (&x)[1], const uint32_t* const& addr) +{ + asm volatile("ld.global.cg.u32 %0, [%1];" : "=r"(x[0]) : "l"(addr)); +} + +DI void ldg(uint32_t& x, const uint32_t* const& addr) +{ + asm volatile("ld.global.cg.u32 %0, [%1];" : "=r"(x) : "l"(addr)); +} + +DI void ldg(int32_t (&x)[4], const int32_t* const& addr) +{ + asm volatile("ld.global.cg.v4.u32 {%0, %1, %2, %3}, [%4];" + : "=r"(x[0]), "=r"(x[1]), "=r"(x[2]), "=r"(x[3]) + : "l"(addr)); +} + +DI void ldg(int32_t (&x)[2], const int32_t* const& addr) +{ + asm volatile("ld.global.cg.v2.u32 {%0, %1}, [%2];" : "=r"(x[0]), "=r"(x[1]) : "l"(addr)); +} + +DI void ldg(int32_t (&x)[1], const int32_t* const& addr) +{ + asm volatile("ld.global.cg.u32 %0, [%1];" : "=r"(x[0]) : "l"(addr)); +} + +DI void ldg(int32_t& x, const int32_t* const& addr) +{ + asm volatile("ld.global.cg.u32 %0, [%1];" : "=r"(x) : "l"(addr)); +} + +DI void ldg(uint8_t (&x)[4], const uint8_t* const& addr) +{ + uint32_t x_int[4]; + asm volatile("ld.global.cg.v4.u8 {%0, %1, %2, %3}, [%4];" + : "=r"(x_int[0]), "=r"(x_int[1]), "=r"(x_int[2]), "=r"(x_int[3]) + : "l"(addr)); + x[0] = x_int[0]; + x[1] = x_int[1]; + x[2] = x_int[2]; + x[3] = x_int[3]; +} + +DI void ldg(uint8_t (&x)[2], const uint8_t* const& addr) +{ + uint32_t x_int[2]; + asm volatile("ld.global.cg.v2.u8 {%0, %1}, [%2];" : "=r"(x_int[0]), "=r"(x_int[1]) : "l"(addr)); + x[0] = x_int[0]; + x[1] = x_int[1]; +} + +DI void ldg(uint8_t (&x)[1], const uint8_t* const& addr) +{ + uint32_t x_int; + asm volatile("ld.global.cg.u8 %0, [%1];" : "=r"(x_int) : "l"(addr)); + x[0] = x_int; +} + +DI void ldg(uint8_t& x, const uint8_t* const& addr) +{ + uint32_t x_int; + asm volatile("ld.global.cg.u8 %0, [%1];" : "=r"(x_int) : "l"(addr)); + x = x_int; +} + +DI void ldg(int8_t (&x)[4], const int8_t* const& addr) +{ + int 
x_int[4]; + asm volatile("ld.global.cg.v4.s8 {%0, %1, %2, %3}, [%4];" + : "=r"(x_int[0]), "=r"(x_int[1]), "=r"(x_int[2]), "=r"(x_int[3]) + : "l"(addr)); + x[0] = x_int[0]; + x[1] = x_int[1]; + x[2] = x_int[2]; + x[3] = x_int[3]; +} + +DI void ldg(int8_t (&x)[2], const int8_t* const& addr) +{ + int x_int[2]; + asm volatile("ld.global.cg.v2.s8 {%0, %1}, [%2];" : "=r"(x_int[0]), "=r"(x_int[1]) : "l"(addr)); + x[0] = x_int[0]; + x[1] = x_int[1]; +} + +DI void ldg(int8_t& x, const int8_t* const& addr) +{ + int x_int; + asm volatile("ld.global.cg.s8 %0, [%1];" : "=r"(x_int) : "l"(addr)); + x = x_int; +} + +DI void ldg(int8_t (&x)[1], const int8_t* const& addr) +{ + int x_int; + asm volatile("ld.global.cg.s8 %0, [%1];" : "=r"(x_int) : "l"(addr)); + x[0] = x_int; +} + +/** @} */ + +} // namespace raft diff --git a/cpp/include/raft/util/device_utils.cuh b/cpp/include/raft/util/device_utils.cuh new file mode 100644 index 0000000000..757c60a731 --- /dev/null +++ b/cpp/include/raft/util/device_utils.cuh @@ -0,0 +1,108 @@ +/* + * Copyright (c) 2021-2022, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +#pragma once + +#include +#include // pair + +namespace raft { + +// TODO move to raft https://github.com/rapidsai/raft/issues/90 +/** helper method to get the compute capability version numbers */ +inline std::pair getDeviceCapability() +{ + int devId; + RAFT_CUDA_TRY(cudaGetDevice(&devId)); + int major, minor; + RAFT_CUDA_TRY(cudaDeviceGetAttribute(&major, cudaDevAttrComputeCapabilityMajor, devId)); + RAFT_CUDA_TRY(cudaDeviceGetAttribute(&minor, cudaDevAttrComputeCapabilityMinor, devId)); + return std::make_pair(major, minor); +} + +/** + * @brief Batched warp-level sum reduction + * + * @tparam T data type + * @tparam NThreads Number of threads in the warp doing independent reductions + * + * @param[in] val input value + * @return for the first "group" of threads, the reduced value. All + * others will contain unusable values! + * + * @note Why not cub? Because cub doesn't seem to allow working with arbitrary + * number of warps in a block and also doesn't support this kind of + * batched reduction operation + * @note All threads in the warp must enter this function together + * + * @todo Expand this to support arbitrary reduction ops + */ +template +DI T batchedWarpReduce(T val) +{ +#pragma unroll + for (int i = NThreads; i < raft::WarpSize; i <<= 1) { + val += raft::shfl(val, raft::laneId() + i); + } + return val; +} + +/** + * @brief 1-D block-level batched sum reduction + * + * @tparam T data type + * @tparam NThreads Number of threads in the warp doing independent reductions + * + * @param val input value + * @param smem shared memory region needed for storing intermediate results. It + * must be at least of size: `sizeof(T) * nWarps * NThreads` + * @return for the first "group" of threads in the block, the reduced value. + * All others will contain unusable values! + * + * @note Why not cub? 
Because cub doesn't seem to allow working with arbitrary + * number of warps in a block and also doesn't support this kind of + * batched reduction operation + * @note All threads in the block must enter this function together + * + * @todo Expand this to support arbitrary reduction ops + */ +template +DI T batchedBlockReduce(T val, char* smem) +{ + auto* sTemp = reinterpret_cast(smem); + constexpr int nGroupsPerWarp = raft::WarpSize / NThreads; + static_assert(raft::isPo2(nGroupsPerWarp), "nGroupsPerWarp must be a PO2!"); + const int nGroups = (blockDim.x + NThreads - 1) / NThreads; + const int lid = raft::laneId(); + const int lgid = lid % NThreads; + const int gid = threadIdx.x / NThreads; + const auto wrIdx = (gid / nGroupsPerWarp) * NThreads + lgid; + const auto rdIdx = gid * NThreads + lgid; + for (int i = nGroups; i > 0;) { + auto iAligned = ((i + nGroupsPerWarp - 1) / nGroupsPerWarp) * nGroupsPerWarp; + if (gid < iAligned) { + val = batchedWarpReduce(val); + if (lid < NThreads) sTemp[wrIdx] = val; + } + __syncthreads(); + i /= nGroupsPerWarp; + if (i > 0) { val = gid < i ? sTemp[rdIdx] : T(0); } + __syncthreads(); + } + return val; +} + +} // namespace raft diff --git a/cpp/include/raft/util/integer_utils.hpp b/cpp/include/raft/util/integer_utils.hpp new file mode 100644 index 0000000000..e893ff0904 --- /dev/null +++ b/cpp/include/raft/util/integer_utils.hpp @@ -0,0 +1,184 @@ +/* + * Copyright 2019 BlazingDB, Inc. + * Copyright 2019 Eyal Rozenberg + * Copyright (c) 2020-2022, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+ * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#pragma once + +/** + * Utility code involving integer arithmetic + * + */ + +#include +#include + +namespace raft { +//! Utility functions +/** + * Finds the smallest integer not less than `number_to_round` that is a + * multiple of `modulus`. This function assumes that `number_to_round` is non-negative and + * `modulus` is positive. + */ +template +inline S round_up_safe(S number_to_round, S modulus) +{ + auto remainder = number_to_round % modulus; + if (remainder == 0) { return number_to_round; } + auto rounded_up = number_to_round - remainder + modulus; + if (rounded_up < number_to_round) { + throw std::invalid_argument("Attempt to round up beyond the type's maximum value"); + } + return rounded_up; +} + +/** + * Finds the largest integer not greater than `number_to_round` that is a + * multiple of `modulus`. This function assumes that `number_to_round` is non-negative and + * `modulus` is positive. + */ +template +inline S round_down_safe(S number_to_round, S modulus) +{ + auto remainder = number_to_round % modulus; + auto rounded_down = number_to_round - remainder; + return rounded_down; +} + +/** + * Divides the left-hand-side by the right-hand-side, rounding the quotient up + * to the nearest integer, e.g. (9,5) -> 2, (10,5) -> 2, (11,5) -> 3. + * + * @param dividend the number to divide + * @param divisor the number by which to divide + * @return The least integer which is greater than or equal to + * the non-integral quotient dividend/divisor. + * + * @note sensitive to overflow, i.e. 
if dividend > std::numeric_limits::max() - divisor, + * the result will be incorrect + */ +template +constexpr inline S div_rounding_up_unsafe(const S& dividend, const T& divisor) noexcept +{ + return (dividend + divisor - 1) / divisor; +} + +namespace detail { +template +constexpr inline I div_rounding_up_safe(std::integral_constant, + I dividend, + I divisor) noexcept +{ + // TODO: This could probably be implemented faster + return (dividend > divisor) ? 1 + div_rounding_up_unsafe(dividend - divisor, divisor) + : (dividend > 0); +} + +template +constexpr inline I div_rounding_up_safe(std::integral_constant, + I dividend, + I divisor) noexcept +{ + auto quotient = dividend / divisor; + auto remainder = dividend % divisor; + return quotient + (remainder != 0); +} + +} // namespace detail + +/** + * Divides the left-hand-side by the right-hand-side, rounding the quotient up + * to the nearest integer, e.g. (9,5) -> 2, (10,5) -> 2, (11,5) -> 3. + * + * @param dividend the number to divide + * @param divisor the number by which to divide + * @return The least integer which is greater than or equal to + * the non-integral quotient dividend/divisor. + * + * @note will not overflow, and may _or may not_ be slower than the intuitive + * approach of using (dividend + divisor - 1) / divisor + */ +template +constexpr inline std::enable_if_t::value, I> div_rounding_up_safe( + I dividend, I divisor) noexcept +{ + using i_is_a_signed_type = std::integral_constant::value>; + return detail::div_rounding_up_safe(i_is_a_signed_type{}, dividend, divisor); +} + +template +constexpr inline std::enable_if_t::value, bool> is_a_power_of_two( + I val) noexcept +{ + return ((val - 1) & val) == 0; +} + +/** + * @brief Return the absolute value of a number. + * + * This calls `std::abs()`, which performs the equivalent of: `(value < 0) ? -value : value`. + * + * This was created to prevent compile errors when calling `std::abs()` with unsigned integers. 
+ * An example compile error appears as follows: + * @code{.pseudo} + * error: more than one instance of overloaded function "std::abs" matches the argument list: + * function "abs(int)" + * function "std::abs(long)" + * function "std::abs(long long)" + * function "std::abs(double)" + * function "std::abs(float)" + * function "std::abs(long double)" + * argument types are: (uint64_t) + * @endcode + * + * Not all cases could be if-ed out using std::is_signed::value and satisfy the compiler. + * + * @param val Numeric value can be either integer or float type. + * @return Absolute value if value type is signed. + */ +template +std::enable_if_t::value, T> constexpr inline absolute_value(T val) +{ + return std::abs(val); +} +// Unsigned type just returns itself. +template +std::enable_if_t::value, T> constexpr inline absolute_value(T val) +{ + return val; +} + +/** + * @defgroup Check whether the numeric conversion is narrowing + * + * @tparam From source type + * @tparam To destination type + * @{ + */ +template +struct is_narrowing : std::true_type { +}; + +template +struct is_narrowing()})>> : std::false_type { +}; +/** @} */ + +/** Check whether the numeric conversion is narrowing */ +template +inline constexpr bool is_narrowing_v = is_narrowing::value; // NOLINT + +} // namespace raft diff --git a/cpp/include/raft/util/pow2_utils.cuh b/cpp/include/raft/util/pow2_utils.cuh new file mode 100644 index 0000000000..3b42682816 --- /dev/null +++ b/cpp/include/raft/util/pow2_utils.cuh @@ -0,0 +1,164 @@ +/* + * Copyright (c) 2021-2022, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. 
+ * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#pragma once + +#include + +namespace raft { + +/** + * @brief Fast arithmetics and alignment checks for power-of-two values known at compile time. + * + * @tparam Value_ a compile-time value representable as a power-of-two. + */ +template +struct Pow2 { + typedef decltype(Value_) Type; + static constexpr Type Value = Value_; + static constexpr Type Log2 = log2(Value); + static constexpr Type Mask = Value - 1; + + static_assert(std::is_integral::value, "Value must be integral."); + static_assert(Value && !(Value & Mask), "Value must be power of two."); + +#define Pow2_FUNC_QUALIFIER static constexpr __host__ __device__ __forceinline__ +#define Pow2_WHEN_INTEGRAL(I) std::enable_if_t +#define Pow2_IS_REPRESENTABLE_AS(I) (std::is_integral::value && Type(I(Value)) == Value) + + /** + * Integer division by Value truncated toward zero + * (same as `x / Value` in C++). + * + * Invariant: `x = Value * quot(x) + rem(x)` + */ + template + Pow2_FUNC_QUALIFIER Pow2_WHEN_INTEGRAL(I) quot(I x) noexcept + { + if constexpr (std::is_signed::value) return (x >> I(Log2)) + (x < 0 && (x & I(Mask))); + if constexpr (std::is_unsigned::value) return x >> I(Log2); + } + + /** + * Remainder of integer division by Value truncated toward zero + * (same as `x % Value` in C++). + * + * Invariant: `x = Value * quot(x) + rem(x)`. + */ + template + Pow2_FUNC_QUALIFIER Pow2_WHEN_INTEGRAL(I) rem(I x) noexcept + { + if constexpr (std::is_signed::value) return x < 0 ? 
-((-x) & I(Mask)) : (x & I(Mask)); + if constexpr (std::is_unsigned::value) return x & I(Mask); + } + + /** + * Integer division by Value truncated toward negative infinity + * (same as `x // Value` in Python). + * + * Invariant: `x = Value * div(x) + mod(x)`. + * + * Note, `div` and `mod` for negative values are slightly faster + * than `quot` and `rem`, but behave slightly differently + * from the normal C++ operators `/` and `%`. + */ + template + Pow2_FUNC_QUALIFIER Pow2_WHEN_INTEGRAL(I) div(I x) noexcept + { + return x >> I(Log2); + } + + /** + * x modulo Value operation (remainder of the `div(x)`) + * (same as `x % Value` in Python). + * + * Invariant: `mod(x) >= 0` + * Invariant: `x = Value * div(x) + mod(x)`. + * + * Note, `div` and `mod` for negative values are slightly faster + * than `quot` and `rem`, but behave slightly differently + * from the normal C++ operators `/` and `%`. + */ + template + Pow2_FUNC_QUALIFIER Pow2_WHEN_INTEGRAL(I) mod(I x) noexcept + { + return x & I(Mask); + } + +#define Pow2_CHECK_TYPE(T) \ + static_assert(std::is_pointer::value || std::is_integral::value, \ + "Only pointer or integral types make sense here") + + /** + * Tell whether the pointer or integral is Value-aligned. + * NB: for pointers, the alignment is checked in bytes, not in elements. + */ + template + Pow2_FUNC_QUALIFIER bool isAligned(PtrT p) noexcept + { + Pow2_CHECK_TYPE(PtrT); + if constexpr (Pow2_IS_REPRESENTABLE_AS(PtrT)) return mod(p) == 0; + if constexpr (!Pow2_IS_REPRESENTABLE_AS(PtrT)) return mod(reinterpret_cast(p)) == 0; + } + + /** Tell whether two pointers have the same address modulo Value. 
*/ + template + Pow2_FUNC_QUALIFIER bool areSameAlignOffsets(PtrT a, PtrS b) noexcept + { + Pow2_CHECK_TYPE(PtrT); + Pow2_CHECK_TYPE(PtrS); + Type x, y; + if constexpr (Pow2_IS_REPRESENTABLE_AS(PtrT)) + x = Type(mod(a)); + else + x = mod(reinterpret_cast(a)); + if constexpr (Pow2_IS_REPRESENTABLE_AS(PtrS)) + y = Type(mod(b)); + else + y = mod(reinterpret_cast(b)); + return x == y; + } + + /** Get this or next Value-aligned address (in bytes) or integral. */ + template + Pow2_FUNC_QUALIFIER PtrT roundUp(PtrT p) noexcept + { + Pow2_CHECK_TYPE(PtrT); + if constexpr (Pow2_IS_REPRESENTABLE_AS(PtrT)) return (p + PtrT(Mask)) & PtrT(~Mask); + if constexpr (!Pow2_IS_REPRESENTABLE_AS(PtrT)) { + auto x = reinterpret_cast(p); + return reinterpret_cast((x + Mask) & (~Mask)); + } + } + + /** Get this or previous Value-aligned address (in bytes) or integral. */ + template + Pow2_FUNC_QUALIFIER PtrT roundDown(PtrT p) noexcept + { + Pow2_CHECK_TYPE(PtrT); + if constexpr (Pow2_IS_REPRESENTABLE_AS(PtrT)) return p & PtrT(~Mask); + if constexpr (!Pow2_IS_REPRESENTABLE_AS(PtrT)) { + auto x = reinterpret_cast(p); + return reinterpret_cast(x & (~Mask)); + } + } +#undef Pow2_CHECK_TYPE +#undef Pow2_IS_REPRESENTABLE_AS +#undef Pow2_FUNC_QUALIFIER +#undef Pow2_WHEN_INTEGRAL +}; + +}; // namespace raft diff --git a/cpp/include/raft/util/scatter.cuh b/cpp/include/raft/util/scatter.cuh new file mode 100644 index 0000000000..c20afa5454 --- /dev/null +++ b/cpp/include/raft/util/scatter.cuh @@ -0,0 +1,68 @@ +/* + * Copyright (c) 2019-2022, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. 
+ * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#pragma once + +#include +#include + +namespace raft { + +/** + * @brief Performs scatter operation based on the input indexing array + * @tparam DataT data type whose array gets scattered + * @tparam IdxT indexing type + * @tparam TPB threads-per-block in the final kernel launched + * @tparam Lambda the device-lambda performing a unary operation on the loaded + * data before it gets scattered + * @param out the output array + * @param in the input array + * @param idx the indexing array + * @param len number of elements in the input array + * @param stream cuda stream where to launch work + * @param op the device-lambda with signature `DataT func(DataT, IdxT);`. This + * will be applied to every element before scattering it to the right location. + * The second param in this method will be the destination index. + */ +template , int TPB = 256> +void scatter(DataT* out, + const DataT* in, + const IdxT* idx, + IdxT len, + cudaStream_t stream, + Lambda op = raft::Nop()) +{ + if (len <= 0) return; + constexpr size_t DataSize = sizeof(DataT); + constexpr size_t IdxSize = sizeof(IdxT); + constexpr size_t MaxPerElem = DataSize > IdxSize ? 
DataSize : IdxSize; + size_t bytes = len * MaxPerElem; + if (16 / MaxPerElem && bytes % 16 == 0) { + detail::scatterImpl(out, in, idx, len, op, stream); + } else if (8 / MaxPerElem && bytes % 8 == 0) { + detail::scatterImpl(out, in, idx, len, op, stream); + } else if (4 / MaxPerElem && bytes % 4 == 0) { + detail::scatterImpl(out, in, idx, len, op, stream); + } else if (2 / MaxPerElem && bytes % 2 == 0) { + detail::scatterImpl(out, in, idx, len, op, stream); + } else if (1 / MaxPerElem) { + detail::scatterImpl(out, in, idx, len, op, stream); + } else { + detail::scatterImpl(out, in, idx, len, op, stream); + } +} + +} // namespace raft diff --git a/cpp/include/raft/util/seive.hpp b/cpp/include/raft/util/seive.hpp new file mode 100644 index 0000000000..ab7c77ac85 --- /dev/null +++ b/cpp/include/raft/util/seive.hpp @@ -0,0 +1,125 @@ +/* + * Copyright (c) 2019-2022, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +#pragma once + +#include +#include + +// Taken from: +// https://github.com/teju85/programming/blob/master/euler/include/seive.h + +namespace raft { +namespace common { + +/** + * @brief Implementation of 'Seive of Eratosthenes' + */ +class Seive { + public: + /** + * @param _num number of integers for which seive is needed + */ + Seive(unsigned _num) + { + N = _num; + generateSeive(); + } + + /** + * @brief Check whether a number is prime or not + * @param num number to be checked + * @return true if the 'num' is prime, else false + */ + bool isPrime(unsigned num) const + { + unsigned mask, pos; + if (num <= 1) { return false; } + if (num == 2) { return true; } + if (!(num & 1)) { return false; } + getMaskPos(num, mask, pos); + return (seive[pos] & mask); + } + + private: + void generateSeive() + { + auto sqN = fastIntSqrt(N); + auto size = raft::ceildiv(N, sizeof(unsigned) * 8); + seive.resize(size); + // assume all to be primes initially + for (auto& itr : seive) { + itr = 0xffffffffu; + } + unsigned cid = 0; + unsigned cnum = getNum(cid); + while (cnum <= sqN) { + do { + ++cid; + cnum = getNum(cid); + if (isPrime(cnum)) { break; } + } while (cnum <= sqN); + auto cnum2 = cnum << 1; + // 'unmark' all the 'odd' multiples of the current prime + for (unsigned i = 3, num = i * cnum; num <= N; i += 2, num += cnum2) { + unmark(num); + } + } + } + + unsigned getId(unsigned num) const { return (num >> 1); } + + unsigned getNum(unsigned id) const + { + if (id == 0) { return 2; } + return ((id << 1) + 1); + } + + void getMaskPos(unsigned num, unsigned& mask, unsigned& pos) const + { + pos = getId(num); + mask = 1 << (pos & 0x1f); + pos >>= 5; + } + + void unmark(unsigned num) + { + unsigned mask, pos; + getMaskPos(num, mask, pos); + seive[pos] &= ~mask; + } + + // REF: http://www.azillionmonkeys.com/qed/ulerysqroot.pdf + unsigned fastIntSqrt(unsigned val) + { + unsigned g = 0; + auto bshft = 15u, b = 1u << bshft; + do { + unsigned temp = ((g << 1) + b) << bshft--; + 
if (val >= temp) { + g += b; + val -= temp; + } + } while (b >>= 1); + return g; + } + + /** find all primes till this number */ + unsigned N; + /** the seive */ + std::vector seive; +}; +}; // namespace common +}; // namespace raft diff --git a/cpp/include/raft/util/vectorized.cuh b/cpp/include/raft/util/vectorized.cuh new file mode 100644 index 0000000000..21c44d2c93 --- /dev/null +++ b/cpp/include/raft/util/vectorized.cuh @@ -0,0 +1,358 @@ +/* + * Copyright (c) 2018-2022, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +#pragma once + +#include +#include + +namespace raft { + +template +struct IOType { +}; +template <> +struct IOType { + static_assert(sizeof(bool) == sizeof(int8_t), "IOType bool size assumption failed"); + typedef int8_t Type; +}; +template <> +struct IOType { + typedef int16_t Type; +}; +template <> +struct IOType { + typedef int32_t Type; +}; +template <> +struct IOType { + typedef int2 Type; +}; +template <> +struct IOType { + typedef int4 Type; +}; +template <> +struct IOType { + typedef int8_t Type; +}; +template <> +struct IOType { + typedef int16_t Type; +}; +template <> +struct IOType { + typedef int32_t Type; +}; +template <> +struct IOType { + typedef int2 Type; +}; +template <> +struct IOType { + typedef int4 Type; +}; +template <> +struct IOType { + typedef uint8_t Type; +}; +template <> +struct IOType { + typedef uint16_t Type; +}; +template <> +struct IOType { + typedef uint32_t Type; +}; +template <> +struct IOType { + typedef uint2 Type; +}; +template <> +struct IOType { + typedef uint4 Type; +}; +template <> +struct IOType { + typedef int16_t Type; +}; +template <> +struct IOType { + typedef int32_t Type; +}; +template <> +struct IOType { + typedef int2 Type; +}; +template <> +struct IOType { + typedef int4 Type; +}; +template <> +struct IOType { + typedef uint16_t Type; +}; +template <> +struct IOType { + typedef uint32_t Type; +}; +template <> +struct IOType { + typedef uint2 Type; +}; +template <> +struct IOType { + typedef uint4 Type; +}; +template <> +struct IOType<__half, 1> { + typedef __half Type; +}; +template <> +struct IOType<__half, 2> { + typedef __half2 Type; +}; +template <> +struct IOType<__half, 4> { + typedef uint2 Type; +}; +template <> +struct IOType<__half, 8> { + typedef uint4 Type; +}; +template <> +struct IOType<__half2, 1> { + typedef __half2 Type; +}; +template <> +struct IOType<__half2, 2> { + typedef uint2 Type; +}; +template <> +struct IOType<__half2, 4> { + typedef uint4 Type; +}; +template <> +struct IOType { 
+ typedef int32_t Type; +}; +template <> +struct IOType { + typedef uint2 Type; +}; +template <> +struct IOType { + typedef uint4 Type; +}; +template <> +struct IOType { + typedef uint32_t Type; +}; +template <> +struct IOType { + typedef uint2 Type; +}; +template <> +struct IOType { + typedef uint4 Type; +}; +template <> +struct IOType { + typedef float Type; +}; +template <> +struct IOType { + typedef float2 Type; +}; +template <> +struct IOType { + typedef float4 Type; +}; +template <> +struct IOType { + typedef int64_t Type; +}; +template <> +struct IOType { + typedef uint4 Type; +}; +template <> +struct IOType { + typedef uint64_t Type; +}; +template <> +struct IOType { + typedef uint4 Type; +}; +template <> +struct IOType { + typedef unsigned long long Type; +}; +template <> +struct IOType { + typedef uint4 Type; +}; +template <> +struct IOType { + typedef double Type; +}; +template <> +struct IOType { + typedef double2 Type; +}; + +/** + * @struct TxN_t + * + * @brief Internal data structure that is used to define a facade for vectorized + * loads/stores across the most common POD types. The goal of his file is to + * provide with CUDA programmers, an easy way to have compiler issue vectorized + * load or store instructions to memory (either global or shared). Vectorized + * accesses to memory are important as they'll utilize its resources + * efficiently, + * when compared to their non-vectorized counterparts. Obviously, for whatever + * reasons if one is unable to issue such vectorized operations, one can always + * fallback to using POD types. + * + * Concept of vectorized accesses : Threads process multiple elements + * to speed up processing. These are loaded in a single read thanks + * to type promotion. It is then reinterpreted as a vector elements + * to perform the kernel's work. + * + * Caution : vectorized accesses requires input adresses to be memory aligned + * according not to the input type but to the promoted type used for reading. 
+ * + * Example demonstrating the use of load operations, performing math on such + * loaded data and finally storing it back. + * @code{.cu} + * TxN_t mydata1, mydata2; + * int idx = (threadIdx.x + (blockIdx.x * blockDim.x)) * mydata1.Ratio; + * mydata1.load(ptr1, idx); + * mydata2.load(ptr2, idx); + * #pragma unroll + * for(int i=0;i type. + * Only change required is to replace variable declaration appropriately. + * + * Obviously, it's caller's responsibility to take care of pointer alignment! + * + * @tparam math_ the data-type in which the compute/math needs to happen + * @tparam veclen_ the number of 'math_' types to be loaded/stored per + * instruction + */ +template +struct TxN_t { + /** underlying math data type */ + typedef math_ math_t; + /** internal storage data type */ + typedef typename IOType::Type io_t; + + /** defines the number of 'math_t' types stored by this struct */ + static const int Ratio = veclen_; + + struct alignas(io_t) { + /** the vectorized data that is used for subsequent operations */ + math_t data[Ratio]; + } val; + + __device__ auto* vectorized_data() { return reinterpret_cast(val.data); } + + ///@todo: add default constructor + + /** + * @brief Fill the contents of this structure with a constant value + * @param _val the constant to be filled + */ + DI void fill(math_t _val) + { +#pragma unroll + for (int i = 0; i < Ratio; ++i) { + val.data[i] = _val; + } + } + + ///@todo: how to handle out-of-bounds!!? + + /** + * @defgroup LoadsStores Global/Shared vectored loads or stores + * + * @brief Perform vectored loads/stores on this structure + * @tparam idx_t index data type + * @param ptr base pointer from where to load (or store) the data. It must + * be aligned to 'sizeof(io_t)'! + * @param idx the offset from the base pointer which will be loaded + * (or stored) by the current thread. This must be aligned to 'Ratio'! 
+ * + * @note: In case of loads, after a successful execution, the val.data will + * be populated with the desired data loaded from the pointer location. In + * case of stores, the data in the val.data will be stored to that location. + * @{ + */ + template + DI void load(const math_t* ptr, idx_t idx) + { + const io_t* bptr = reinterpret_cast(&ptr[idx]); + *vectorized_data() = __ldg(bptr); + } + + template + DI void load(math_t* ptr, idx_t idx) + { + io_t* bptr = reinterpret_cast(&ptr[idx]); + *vectorized_data() = *bptr; + } + + template + DI void store(math_t* ptr, idx_t idx) + { + io_t* bptr = reinterpret_cast(&ptr[idx]); + *bptr = *vectorized_data(); + } + /** @} */ +}; + +/** this is just to keep the compiler happy! */ +template +struct TxN_t { + typedef math_ math_t; + static const int Ratio = 1; + + struct { + math_t data[1]; + } val; + + DI void fill(math_t _val) {} + template + DI void load(const math_t* ptr, idx_t idx) + { + } + template + DI void load(math_t* ptr, idx_t idx) + { + } + template + DI void store(math_t* ptr, idx_t idx) + { + } +}; + +} // namespace raft diff --git a/cpp/include/raft/vectorized.cuh b/cpp/include/raft/vectorized.cuh index 6f22d740ca..a472ee6191 100644 --- a/cpp/include/raft/vectorized.cuh +++ b/cpp/include/raft/vectorized.cuh @@ -1,5 +1,5 @@ /* - * Copyright (c) 2018-2022, NVIDIA CORPORATION. + * Copyright (c) 2020-2022, NVIDIA CORPORATION. * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -13,346 +13,19 @@ * See the License for the specific language governing permissions and * limitations under the License. 
*/ - -#pragma once - -#include "cuda_utils.cuh" -#include - -namespace raft { - -template -struct IOType { -}; -template <> -struct IOType { - static_assert(sizeof(bool) == sizeof(int8_t), "IOType bool size assumption failed"); - typedef int8_t Type; -}; -template <> -struct IOType { - typedef int16_t Type; -}; -template <> -struct IOType { - typedef int32_t Type; -}; -template <> -struct IOType { - typedef int2 Type; -}; -template <> -struct IOType { - typedef int4 Type; -}; -template <> -struct IOType { - typedef int8_t Type; -}; -template <> -struct IOType { - typedef int16_t Type; -}; -template <> -struct IOType { - typedef int32_t Type; -}; -template <> -struct IOType { - typedef int2 Type; -}; -template <> -struct IOType { - typedef int4 Type; -}; -template <> -struct IOType { - typedef uint8_t Type; -}; -template <> -struct IOType { - typedef uint16_t Type; -}; -template <> -struct IOType { - typedef uint32_t Type; -}; -template <> -struct IOType { - typedef uint2 Type; -}; -template <> -struct IOType { - typedef uint4 Type; -}; -template <> -struct IOType { - typedef int16_t Type; -}; -template <> -struct IOType { - typedef int32_t Type; -}; -template <> -struct IOType { - typedef int2 Type; -}; -template <> -struct IOType { - typedef int4 Type; -}; -template <> -struct IOType { - typedef uint16_t Type; -}; -template <> -struct IOType { - typedef uint32_t Type; -}; -template <> -struct IOType { - typedef uint2 Type; -}; -template <> -struct IOType { - typedef uint4 Type; -}; -template <> -struct IOType<__half, 1> { - typedef __half Type; -}; -template <> -struct IOType<__half, 2> { - typedef __half2 Type; -}; -template <> -struct IOType<__half, 4> { - typedef uint2 Type; -}; -template <> -struct IOType<__half, 8> { - typedef uint4 Type; -}; -template <> -struct IOType<__half2, 1> { - typedef __half2 Type; -}; -template <> -struct IOType<__half2, 2> { - typedef uint2 Type; -}; -template <> -struct IOType<__half2, 4> { - typedef uint4 Type; -}; -template <> 
-struct IOType { - typedef int32_t Type; -}; -template <> -struct IOType { - typedef uint2 Type; -}; -template <> -struct IOType { - typedef uint4 Type; -}; -template <> -struct IOType { - typedef uint32_t Type; -}; -template <> -struct IOType { - typedef uint2 Type; -}; -template <> -struct IOType { - typedef uint4 Type; -}; -template <> -struct IOType { - typedef float Type; -}; -template <> -struct IOType { - typedef float2 Type; -}; -template <> -struct IOType { - typedef float4 Type; -}; -template <> -struct IOType { - typedef int64_t Type; -}; -template <> -struct IOType { - typedef uint4 Type; -}; -template <> -struct IOType { - typedef uint64_t Type; -}; -template <> -struct IOType { - typedef uint4 Type; -}; -template <> -struct IOType { - typedef unsigned long long Type; -}; -template <> -struct IOType { - typedef uint4 Type; -}; -template <> -struct IOType { - typedef double Type; -}; -template <> -struct IOType { - typedef double2 Type; -}; - /** - * @struct TxN_t - * - * @brief Internal data structure that is used to define a facade for vectorized - * loads/stores across the most common POD types. The goal of his file is to - * provide with CUDA programmers, an easy way to have compiler issue vectorized - * load or store instructions to memory (either global or shared). Vectorized - * accesses to memory are important as they'll utilize its resources - * efficiently, - * when compared to their non-vectorized counterparts. Obviously, for whatever - * reasons if one is unable to issue such vectorized operations, one can always - * fallback to using POD types. - * - * Concept of vectorized accesses : Threads process multiple elements - * to speed up processing. These are loaded in a single read thanks - * to type promotion. It is then reinterpreted as a vector elements - * to perform the kernel's work. 
- * - * Caution : vectorized accesses requires input adresses to be memory aligned - * according not to the input type but to the promoted type used for reading. - * - * Example demonstrating the use of load operations, performing math on such - * loaded data and finally storing it back. - * @code{.cu} - * TxN_t mydata1, mydata2; - * int idx = (threadIdx.x + (blockIdx.x * blockDim.x)) * mydata1.Ratio; - * mydata1.load(ptr1, idx); - * mydata2.load(ptr2, idx); - * #pragma unroll - * for(int i=0;i type. - * Only change required is to replace variable declaration appropriately. - * - * Obviously, it's caller's responsibility to take care of pointer alignment! - * - * @tparam math_ the data-type in which the compute/math needs to happen - * @tparam veclen_ the number of 'math_' types to be loaded/stored per - * instruction + * This file is deprecated and will be removed in release 22.06. + * Please use the cuh version instead. */ -template -struct TxN_t { - /** underlying math data type */ - typedef math_ math_t; - /** internal storage data type */ - typedef typename IOType::Type io_t; - /** defines the number of 'math_t' types stored by this struct */ - static const int Ratio = veclen_; - - struct alignas(io_t) { - /** the vectorized data that is used for subsequent operations */ - math_t data[Ratio]; - } val; - - __device__ auto* vectorized_data() { return reinterpret_cast(val.data); } - - ///@todo: add default constructor - - /** - * @brief Fill the contents of this structure with a constant value - * @param _val the constant to be filled - */ - DI void fill(math_t _val) - { -#pragma unroll - for (int i = 0; i < Ratio; ++i) { - val.data[i] = _val; - } - } - - ///@todo: how to handle out-of-bounds!!? - - /** - * @defgroup LoadsStores Global/Shared vectored loads or stores - * - * @brief Perform vectored loads/stores on this structure - * @tparam idx_t index data type - * @param ptr base pointer from where to load (or store) the data. 
It must - * be aligned to 'sizeof(io_t)'! - * @param idx the offset from the base pointer which will be loaded - * (or stored) by the current thread. This must be aligned to 'Ratio'! - * - * @note: In case of loads, after a successful execution, the val.data will - * be populated with the desired data loaded from the pointer location. In - * case of stores, the data in the val.data will be stored to that location. - * @{ - */ - template - DI void load(const math_t* ptr, idx_t idx) - { - const io_t* bptr = reinterpret_cast(&ptr[idx]); - *vectorized_data() = __ldg(bptr); - } - - template - DI void load(math_t* ptr, idx_t idx) - { - io_t* bptr = reinterpret_cast(&ptr[idx]); - *vectorized_data() = *bptr; - } - - template - DI void store(math_t* ptr, idx_t idx) - { - io_t* bptr = reinterpret_cast(&ptr[idx]); - *bptr = *vectorized_data(); - } - /** @} */ -}; - -/** this is just to keep the compiler happy! */ -template -struct TxN_t { - typedef math_ math_t; - static const int Ratio = 1; +/** + * DISCLAIMER: this file is deprecated: use lap.cuh instead + */ - struct { - math_t data[1]; - } val; +#pragma once - DI void fill(math_t _val) {} - template - DI void load(const math_t* ptr, idx_t idx) - { - } - template - DI void load(math_t* ptr, idx_t idx) - { - } - template - DI void store(math_t* ptr, idx_t idx) - { - } -}; +#pragma message(__FILE__ \ + " is deprecated and will be removed in a future release." \ + " Please use the raft/util version instead.") -} // namespace raft +#include diff --git a/cpp/include/raft_distance/pairwise_distance.hpp b/cpp/include/raft_distance/pairwise_distance.hpp index 50fdbbdd8c..e91ef5de20 100644 --- a/cpp/include/raft_distance/pairwise_distance.hpp +++ b/cpp/include/raft_distance/pairwise_distance.hpp @@ -14,7 +14,7 @@ * limitations under the License. 
*/ -#include +#include namespace raft::distance::runtime { void pairwise_distance(raft::handle_t const& handle, diff --git a/cpp/src/distance/pairwise_distance.cu b/cpp/src/distance/pairwise_distance.cu index 3a9ff469a1..71133c5f84 100644 --- a/cpp/src/distance/pairwise_distance.cu +++ b/cpp/src/distance/pairwise_distance.cu @@ -15,7 +15,7 @@ */ #include -#include +#include #include #include diff --git a/cpp/src/nn/specializations/ball_cover.cu b/cpp/src/nn/specializations/ball_cover.cu index d142a49264..87796752d9 100644 --- a/cpp/src/nn/specializations/ball_cover.cu +++ b/cpp/src/nn/specializations/ball_cover.cu @@ -15,7 +15,7 @@ */ #include -#include +#include // Ignore upstream specializations to avoid unnecessary recompiling #include diff --git a/cpp/test/cluster/kmeans.cu b/cpp/test/cluster/kmeans.cu index 24fe2c03cd..8a01ef6fe9 100644 --- a/cpp/test/cluster/kmeans.cu +++ b/cpp/test/cluster/kmeans.cu @@ -23,9 +23,9 @@ #include #include #include -#include #include #include +#include #include #include diff --git a/cpp/test/cluster_solvers.cu b/cpp/test/cluster_solvers.cu index 0c74b81e99..26fbfec011 100644 --- a/cpp/test/cluster_solvers.cu +++ b/cpp/test/cluster_solvers.cu @@ -17,7 +17,7 @@ #include #include #include -#include +#include #if defined RAFT_DISTANCE_COMPILED && defined RAFT_NN_COMPILED #include diff --git a/cpp/test/cluster_solvers_deprecated.cu b/cpp/test/cluster_solvers_deprecated.cu index d169d4a7b9..1e9ec0c15b 100644 --- a/cpp/test/cluster_solvers_deprecated.cu +++ b/cpp/test/cluster_solvers_deprecated.cu @@ -17,7 +17,7 @@ #include #include #include -#include +#include #include #include diff --git a/cpp/test/cudart_utils.cpp b/cpp/test/cudart_utils.cpp index 9df8600527..8c47372c4f 100644 --- a/cpp/test/cudart_utils.cpp +++ b/cpp/test/cudart_utils.cpp @@ -1,5 +1,5 @@ /* - * Copyright (c) 2020, NVIDIA CORPORATION. + * Copyright (c) 2020-2022, NVIDIA CORPORATION. 
* * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -14,7 +14,7 @@ * limitations under the License. */ -#include +#include #include diff --git a/cpp/test/device_atomics.cu b/cpp/test/device_atomics.cu index 8ecedbe7af..4e56b8d486 100644 --- a/cpp/test/device_atomics.cu +++ b/cpp/test/device_atomics.cu @@ -21,8 +21,8 @@ #include #include #include -#include -#include +#include +#include #include #include #include diff --git a/cpp/test/distance/dist_adj.cu b/cpp/test/distance/dist_adj.cu index 16c6e11719..72906af1b2 100644 --- a/cpp/test/distance/dist_adj.cu +++ b/cpp/test/distance/dist_adj.cu @@ -16,10 +16,10 @@ #include "../test_utils.h" #include -#include -#include #include #include +#include +#include #include namespace raft { diff --git a/cpp/test/distance/distance_base.cuh b/cpp/test/distance/distance_base.cuh index 16d3526b17..19d449c18b 100644 --- a/cpp/test/distance/distance_base.cuh +++ b/cpp/test/distance/distance_base.cuh @@ -17,10 +17,10 @@ #include "../test_utils.h" #include #include -#include -#include #include #include +#include +#include #if defined RAFT_DISTANCE_COMPILED #include #endif diff --git a/cpp/test/distance/fused_l2_nn.cu b/cpp/test/distance/fused_l2_nn.cu index 2a5b30e01f..2838a2209e 100644 --- a/cpp/test/distance/fused_l2_nn.cu +++ b/cpp/test/distance/fused_l2_nn.cu @@ -16,12 +16,12 @@ #include "../test_utils.h" #include -#include -#include #include #include #include #include +#include +#include // TODO: Once fusedL2NN is specialized in the raft_distance shared library, add // the following: diff --git a/cpp/test/eigen_solvers.cu b/cpp/test/eigen_solvers.cu index 635908240b..68b431b894 100644 --- a/cpp/test/eigen_solvers.cu +++ b/cpp/test/eigen_solvers.cu @@ -15,7 +15,7 @@ */ #include -#include +#include #include #include diff --git a/cpp/test/handle.cpp b/cpp/test/handle.cpp index d594a49e83..2ebc38d03a 100644 --- a/cpp/test/handle.cpp +++ 
b/cpp/test/handle.cpp @@ -18,7 +18,7 @@ #include #include #include -#include +#include namespace raft { diff --git a/cpp/test/label/label.cu b/cpp/test/label/label.cu index 06f25cb308..02b3191c4d 100644 --- a/cpp/test/label/label.cu +++ b/cpp/test/label/label.cu @@ -19,8 +19,8 @@ #include #include "../test_utils.h" -#include -#include +#include +#include #include #include diff --git a/cpp/test/label/merge_labels.cu b/cpp/test/label/merge_labels.cu index cab8c44969..184ab4922f 100644 --- a/cpp/test/label/merge_labels.cu +++ b/cpp/test/label/merge_labels.cu @@ -18,8 +18,8 @@ #include #include "../test_utils.h" -#include -#include +#include +#include #include #include #include diff --git a/cpp/test/lap/lap.cu b/cpp/test/lap/lap.cu index 1f847ceef3..58fd94f343 100644 --- a/cpp/test/lap/lap.cu +++ b/cpp/test/lap/lap.cu @@ -28,7 +28,7 @@ #include #include -#include +#include #include #define PROBLEMSIZE 1000 // Number of rows/columns @@ -85,7 +85,7 @@ void hungarian_test(int problemsize, float start = omp_get_wtime(); // Create an instance of LinearAssignmentProblem using problem size, number of subproblems - raft::lap::LinearAssignmentProblem lpx( + raft::solver::LinearAssignmentProblem lpx( handle, problemsize, batchsize, epsilon); // Solve LAP(s) for given cost matrix diff --git a/cpp/test/linalg/add.cu b/cpp/test/linalg/add.cu index ba9dac5ac2..d9a90321e1 100644 --- a/cpp/test/linalg/add.cu +++ b/cpp/test/linalg/add.cu @@ -17,9 +17,9 @@ #include "../test_utils.h" #include "add.cuh" #include -#include #include #include +#include namespace raft { namespace linalg { diff --git a/cpp/test/linalg/add.cuh b/cpp/test/linalg/add.cuh index 215b4d3805..c33a1d66e0 100644 --- a/cpp/test/linalg/add.cuh +++ b/cpp/test/linalg/add.cuh @@ -16,8 +16,8 @@ #pragma once -#include #include +#include namespace raft { namespace linalg { diff --git a/cpp/test/linalg/binary_op.cu b/cpp/test/linalg/binary_op.cu index cd4340f5cd..25383c5ca1 100644 --- a/cpp/test/linalg/binary_op.cu +++ 
b/cpp/test/linalg/binary_op.cu @@ -17,9 +17,9 @@ #include "../test_utils.h" #include "binary_op.cuh" #include -#include #include #include +#include #include namespace raft { diff --git a/cpp/test/linalg/binary_op.cuh b/cpp/test/linalg/binary_op.cuh index 763398aff1..62820ddb97 100644 --- a/cpp/test/linalg/binary_op.cuh +++ b/cpp/test/linalg/binary_op.cuh @@ -17,8 +17,8 @@ #pragma once #include "../test_utils.h" -#include #include +#include namespace raft { namespace linalg { diff --git a/cpp/test/linalg/cholesky_r1.cu b/cpp/test/linalg/cholesky_r1.cu index c057c20403..ec8a31fa34 100644 --- a/cpp/test/linalg/cholesky_r1.cu +++ b/cpp/test/linalg/cholesky_r1.cu @@ -15,10 +15,10 @@ */ #include -#include -#include +#include #include #include +#include #include #include diff --git a/cpp/test/linalg/coalesced_reduction.cu b/cpp/test/linalg/coalesced_reduction.cu index ac79518942..8e28b35cef 100644 --- a/cpp/test/linalg/coalesced_reduction.cu +++ b/cpp/test/linalg/coalesced_reduction.cu @@ -17,10 +17,10 @@ #include "../test_utils.h" #include "reduce.cuh" #include -#include -#include #include #include +#include +#include namespace raft { namespace linalg { diff --git a/cpp/test/linalg/divide.cu b/cpp/test/linalg/divide.cu index d620979c2f..11f451cb84 100644 --- a/cpp/test/linalg/divide.cu +++ b/cpp/test/linalg/divide.cu @@ -17,9 +17,9 @@ #include "../test_utils.h" #include "unary_op.cuh" #include -#include #include #include +#include namespace raft { namespace linalg { diff --git a/cpp/test/linalg/eig.cu b/cpp/test/linalg/eig.cu index ca38e854ce..e05cb8f5fd 100644 --- a/cpp/test/linalg/eig.cu +++ b/cpp/test/linalg/eig.cu @@ -16,10 +16,10 @@ #include "../test_utils.h" #include -#include -#include #include #include +#include +#include namespace raft { namespace linalg { diff --git a/cpp/test/linalg/eig_sel.cu b/cpp/test/linalg/eig_sel.cu index 23ded35174..cc1dd589d0 100644 --- a/cpp/test/linalg/eig_sel.cu +++ b/cpp/test/linalg/eig_sel.cu @@ -18,9 +18,9 @@ #include 
"../test_utils.h" #include -#include -#include #include +#include +#include namespace raft { namespace linalg { diff --git a/cpp/test/linalg/eltwise.cu b/cpp/test/linalg/eltwise.cu index e2bc80eefe..07ded5ec79 100644 --- a/cpp/test/linalg/eltwise.cu +++ b/cpp/test/linalg/eltwise.cu @@ -16,9 +16,9 @@ #include "../test_utils.h" #include -#include #include #include +#include namespace raft { namespace linalg { diff --git a/cpp/test/linalg/gemm_layout.cu b/cpp/test/linalg/gemm_layout.cu index 967c792b6b..4b05004ccf 100644 --- a/cpp/test/linalg/gemm_layout.cu +++ b/cpp/test/linalg/gemm_layout.cu @@ -16,9 +16,9 @@ #include "../test_utils.h" #include -#include #include #include +#include namespace raft { namespace linalg { diff --git a/cpp/test/linalg/gemv.cu b/cpp/test/linalg/gemv.cu index 97f5f6de94..f4c437bdfc 100644 --- a/cpp/test/linalg/gemv.cu +++ b/cpp/test/linalg/gemv.cu @@ -16,9 +16,9 @@ #include "../test_utils.h" #include -#include #include #include +#include namespace raft { namespace linalg { diff --git a/cpp/test/linalg/map.cu b/cpp/test/linalg/map.cu index bcaacb3c8f..6fa26456e3 100644 --- a/cpp/test/linalg/map.cu +++ b/cpp/test/linalg/map.cu @@ -16,10 +16,10 @@ #include "../test_utils.h" #include -#include #include #include #include +#include namespace raft { namespace linalg { diff --git a/cpp/test/linalg/map_then_reduce.cu b/cpp/test/linalg/map_then_reduce.cu index 94bb5023c8..170962006f 100644 --- a/cpp/test/linalg/map_then_reduce.cu +++ b/cpp/test/linalg/map_then_reduce.cu @@ -17,9 +17,9 @@ #include "../test_utils.h" #include #include -#include #include #include +#include #include #include diff --git a/cpp/test/linalg/matrix_vector_op.cu b/cpp/test/linalg/matrix_vector_op.cu index b01b3a1ca1..74ba250f86 100644 --- a/cpp/test/linalg/matrix_vector_op.cu +++ b/cpp/test/linalg/matrix_vector_op.cu @@ -17,8 +17,8 @@ #include "../test_utils.h" #include "matrix_vector_op.cuh" #include -#include #include +#include namespace raft { namespace linalg { diff --git 
a/cpp/test/linalg/matrix_vector_op.cuh b/cpp/test/linalg/matrix_vector_op.cuh index 1e5812ba89..f46d70eaa3 100644 --- a/cpp/test/linalg/matrix_vector_op.cuh +++ b/cpp/test/linalg/matrix_vector_op.cuh @@ -15,8 +15,8 @@ */ #include "../test_utils.h" -#include #include +#include namespace raft { namespace linalg { diff --git a/cpp/test/linalg/multiply.cu b/cpp/test/linalg/multiply.cu index e91201aa12..852b869676 100644 --- a/cpp/test/linalg/multiply.cu +++ b/cpp/test/linalg/multiply.cu @@ -17,9 +17,9 @@ #include "../test_utils.h" #include "unary_op.cuh" #include -#include #include #include +#include namespace raft { namespace linalg { diff --git a/cpp/test/linalg/norm.cu b/cpp/test/linalg/norm.cu index 83ded7d052..a07e5a8a7a 100644 --- a/cpp/test/linalg/norm.cu +++ b/cpp/test/linalg/norm.cu @@ -16,9 +16,9 @@ #include "../test_utils.h" #include -#include #include #include +#include namespace raft { namespace linalg { diff --git a/cpp/test/linalg/power.cu b/cpp/test/linalg/power.cu index 7c93b52d59..e66aa4b4ae 100644 --- a/cpp/test/linalg/power.cu +++ b/cpp/test/linalg/power.cu @@ -16,9 +16,9 @@ #include "../test_utils.h" #include -#include #include #include +#include namespace raft { namespace linalg { diff --git a/cpp/test/linalg/reduce.cu b/cpp/test/linalg/reduce.cu index 19d6130df9..674cb24069 100644 --- a/cpp/test/linalg/reduce.cu +++ b/cpp/test/linalg/reduce.cu @@ -17,10 +17,10 @@ #include "../test_utils.h" #include "reduce.cuh" #include -#include -#include #include #include +#include +#include namespace raft { namespace linalg { diff --git a/cpp/test/linalg/reduce.cuh b/cpp/test/linalg/reduce.cuh index 130f10b1cc..162bf9f2c1 100644 --- a/cpp/test/linalg/reduce.cuh +++ b/cpp/test/linalg/reduce.cuh @@ -17,9 +17,9 @@ #pragma once #include -#include #include #include +#include #include #include diff --git a/cpp/test/linalg/reduce_cols_by_key.cu b/cpp/test/linalg/reduce_cols_by_key.cu index 6682f54ace..5d4ea359a3 100644 --- a/cpp/test/linalg/reduce_cols_by_key.cu +++ 
b/cpp/test/linalg/reduce_cols_by_key.cu @@ -16,10 +16,10 @@ #include "../test_utils.h" #include -#include #include #include #include +#include namespace raft { namespace linalg { diff --git a/cpp/test/linalg/reduce_rows_by_key.cu b/cpp/test/linalg/reduce_rows_by_key.cu index 5ebf6c5daa..e8baeb5887 100644 --- a/cpp/test/linalg/reduce_rows_by_key.cu +++ b/cpp/test/linalg/reduce_rows_by_key.cu @@ -17,9 +17,9 @@ #include "../test_utils.h" #include #include -#include #include #include +#include namespace raft { namespace linalg { diff --git a/cpp/test/linalg/rsvd.cu b/cpp/test/linalg/rsvd.cu index 568ab504a2..01736615eb 100644 --- a/cpp/test/linalg/rsvd.cu +++ b/cpp/test/linalg/rsvd.cu @@ -16,11 +16,11 @@ #include "../test_utils.h" #include -#include -#include -#include +#include #include #include +#include +#include #include #include diff --git a/cpp/test/linalg/sqrt.cu b/cpp/test/linalg/sqrt.cu index b9fff65a80..bb78d9f754 100644 --- a/cpp/test/linalg/sqrt.cu +++ b/cpp/test/linalg/sqrt.cu @@ -16,9 +16,9 @@ #include "../test_utils.h" #include -#include #include #include +#include namespace raft { namespace linalg { diff --git a/cpp/test/linalg/strided_reduction.cu b/cpp/test/linalg/strided_reduction.cu index a1b6f9de0d..c4f02310a5 100644 --- a/cpp/test/linalg/strided_reduction.cu +++ b/cpp/test/linalg/strided_reduction.cu @@ -17,9 +17,9 @@ #include "../test_utils.h" #include "reduce.cuh" #include -#include #include #include +#include namespace raft { namespace linalg { diff --git a/cpp/test/linalg/subtract.cu b/cpp/test/linalg/subtract.cu index beb2b81677..455f5e6c30 100644 --- a/cpp/test/linalg/subtract.cu +++ b/cpp/test/linalg/subtract.cu @@ -16,9 +16,9 @@ #include "../test_utils.h" #include -#include #include #include +#include namespace raft { namespace linalg { diff --git a/cpp/test/linalg/svd.cu b/cpp/test/linalg/svd.cu index d78a21f6a6..292793478c 100644 --- a/cpp/test/linalg/svd.cu +++ b/cpp/test/linalg/svd.cu @@ -16,10 +16,10 @@ #include "../test_utils.h" 
#include -#include -#include #include #include +#include +#include namespace raft { namespace linalg { diff --git a/cpp/test/linalg/ternary_op.cu b/cpp/test/linalg/ternary_op.cu index a34274a412..21573eff48 100644 --- a/cpp/test/linalg/ternary_op.cu +++ b/cpp/test/linalg/ternary_op.cu @@ -16,9 +16,9 @@ #include "../test_utils.h" #include -#include #include #include +#include namespace raft { namespace linalg { diff --git a/cpp/test/linalg/transpose.cu b/cpp/test/linalg/transpose.cu index 98f6d5e7e4..432ff093f6 100644 --- a/cpp/test/linalg/transpose.cu +++ b/cpp/test/linalg/transpose.cu @@ -16,9 +16,9 @@ #include "../test_utils.h" -#include -#include #include +#include +#include #include diff --git a/cpp/test/linalg/unary_op.cu b/cpp/test/linalg/unary_op.cu index 8d4725b72f..4174056170 100644 --- a/cpp/test/linalg/unary_op.cu +++ b/cpp/test/linalg/unary_op.cu @@ -17,9 +17,9 @@ #include "../test_utils.h" #include "unary_op.cuh" #include -#include #include #include +#include namespace raft { namespace linalg { diff --git a/cpp/test/linalg/unary_op.cuh b/cpp/test/linalg/unary_op.cuh index 625fe7ab00..190d531a9f 100644 --- a/cpp/test/linalg/unary_op.cuh +++ b/cpp/test/linalg/unary_op.cuh @@ -17,8 +17,8 @@ #pragma once #include "../test_utils.h" -#include #include +#include namespace raft { namespace linalg { diff --git a/cpp/test/matrix/columnSort.cu b/cpp/test/matrix/columnSort.cu index dbfaacaa9a..325ed0204b 100644 --- a/cpp/test/matrix/columnSort.cu +++ b/cpp/test/matrix/columnSort.cu @@ -18,8 +18,8 @@ #include #include #include -#include #include +#include #include namespace raft { diff --git a/cpp/test/matrix/gather.cu b/cpp/test/matrix/gather.cu index 2baeb81881..adedaacc81 100644 --- a/cpp/test/matrix/gather.cu +++ b/cpp/test/matrix/gather.cu @@ -16,9 +16,9 @@ #include #include -#include #include #include +#include #include #include diff --git a/cpp/test/matrix/linewise_op.cu b/cpp/test/matrix/linewise_op.cu index 5f1df13aec..16e2ceb29a 100644 --- 
a/cpp/test/matrix/linewise_op.cu +++ b/cpp/test/matrix/linewise_op.cu @@ -19,10 +19,10 @@ #include #include #include -#include #include #include #include +#include #include namespace raft { diff --git a/cpp/test/matrix/math.cu b/cpp/test/matrix/math.cu index 30a6ed7083..d550852150 100644 --- a/cpp/test/matrix/math.cu +++ b/cpp/test/matrix/math.cu @@ -16,9 +16,9 @@ #include "../test_utils.h" #include -#include #include #include +#include namespace raft { namespace matrix { diff --git a/cpp/test/matrix/matrix.cu b/cpp/test/matrix/matrix.cu index 654043ba41..6ccd7aa335 100644 --- a/cpp/test/matrix/matrix.cu +++ b/cpp/test/matrix/matrix.cu @@ -16,9 +16,9 @@ #include "../test_utils.h" #include -#include #include #include +#include #include #include diff --git a/cpp/test/mdarray.cu b/cpp/test/mdarray.cu index af7bb7adf3..271aaaab72 100644 --- a/cpp/test/mdarray.cu +++ b/cpp/test/mdarray.cu @@ -16,8 +16,8 @@ #include #include #include -#include -#include +#include +#include #include #include #include diff --git a/cpp/test/mr/device/buffer.cpp b/cpp/test/mr/device/buffer.cpp index b060568981..447ab77b35 100644 --- a/cpp/test/mr/device/buffer.cpp +++ b/cpp/test/mr/device/buffer.cpp @@ -17,7 +17,7 @@ #include #include #include -#include +#include #include #include diff --git a/cpp/test/mst.cu b/cpp/test/mst.cu index 6b42e4b328..544ca80a46 100644 --- a/cpp/test/mst.cu +++ b/cpp/test/mst.cu @@ -22,9 +22,9 @@ #include #include -#include -#include +#include #include +#include #include diff --git a/cpp/test/random/make_blobs.cu b/cpp/test/random/make_blobs.cu index 3f75a4cf0a..bdfe6f94b4 100644 --- a/cpp/test/random/make_blobs.cu +++ b/cpp/test/random/make_blobs.cu @@ -17,10 +17,10 @@ #include "../test_utils.h" #include #include -#include -#include #include #include +#include +#include namespace raft { namespace random { diff --git a/cpp/test/random/make_regression.cu b/cpp/test/random/make_regression.cu index 84dadf1e24..65d4c4cb31 100644 --- a/cpp/test/random/make_regression.cu 
+++ b/cpp/test/random/make_regression.cu @@ -20,12 +20,12 @@ #include #include "../test_utils.h" -#include -#include #include #include #include #include +#include +#include namespace raft::random { diff --git a/cpp/test/random/multi_variable_gaussian.cu b/cpp/test/random/multi_variable_gaussian.cu index 58fbed7eb2..c346fbf426 100644 --- a/cpp/test/random/multi_variable_gaussian.cu +++ b/cpp/test/random/multi_variable_gaussian.cu @@ -18,8 +18,8 @@ #include #include #include -#include #include +#include #include #include diff --git a/cpp/test/random/permute.cu b/cpp/test/random/permute.cu index 6b0ee0457e..a0e9f2f25f 100644 --- a/cpp/test/random/permute.cu +++ b/cpp/test/random/permute.cu @@ -16,10 +16,10 @@ #include "../test_utils.h" #include -#include -#include #include #include +#include +#include #include namespace raft { diff --git a/cpp/test/random/rmat_rectangular_generator.cu b/cpp/test/random/rmat_rectangular_generator.cu index 4ccda9c1fa..194f89dd65 100644 --- a/cpp/test/random/rmat_rectangular_generator.cu +++ b/cpp/test/random/rmat_rectangular_generator.cu @@ -21,10 +21,10 @@ #include "../test_utils.h" -#include -#include #include #include +#include +#include namespace raft { namespace random { diff --git a/cpp/test/random/rng.cu b/cpp/test/random/rng.cu index 58107293ee..d778555076 100644 --- a/cpp/test/random/rng.cu +++ b/cpp/test/random/rng.cu @@ -19,11 +19,11 @@ #include "../test_utils.h" #include #include -#include -#include #include #include #include +#include +#include namespace raft { namespace random { diff --git a/cpp/test/random/rng_int.cu b/cpp/test/random/rng_int.cu index 118d4f070b..8efd9cd5af 100644 --- a/cpp/test/random/rng_int.cu +++ b/cpp/test/random/rng_int.cu @@ -17,9 +17,9 @@ #include "../test_utils.h" #include #include -#include -#include #include +#include +#include namespace raft { namespace random { diff --git a/cpp/test/random/sample_without_replacement.cu b/cpp/test/random/sample_without_replacement.cu index 
63f0b20df4..653a9f9bc9 100644 --- a/cpp/test/random/sample_without_replacement.cu +++ b/cpp/test/random/sample_without_replacement.cu @@ -16,9 +16,9 @@ #include "../test_utils.h" #include -#include -#include #include +#include +#include #include #include diff --git a/cpp/test/span.cu b/cpp/test/span.cu index dcde9b5432..d63d0046dc 100644 --- a/cpp/test/span.cu +++ b/cpp/test/span.cu @@ -16,9 +16,9 @@ #include "test_span.hpp" #include #include // iota -#include -#include #include +#include +#include #include #include diff --git a/cpp/test/sparse/add.cu b/cpp/test/sparse/add.cu index eefe8da385..862cbffdc7 100644 --- a/cpp/test/sparse/add.cu +++ b/cpp/test/sparse/add.cu @@ -16,12 +16,12 @@ #include -#include +#include #include #include #include "../test_utils.h" -#include +#include #include #include diff --git a/cpp/test/sparse/connect_components.cu b/cpp/test/sparse/connect_components.cu index 167c88e264..6278e7ef80 100644 --- a/cpp/test/sparse/connect_components.cu +++ b/cpp/test/sparse/connect_components.cu @@ -18,15 +18,16 @@ #include -#include -#include +#include +#include #include #include #include -#include +#include +#include -#include +#include #include #include #include @@ -74,13 +75,13 @@ class ConnectComponentsTest */ raft::sparse::COO knn_graph_coo(stream); - raft::sparse::selection::knn_graph(handle, - data.data(), - params.n_row, - params.n_col, - raft::distance::DistanceType::L2SqrtExpanded, - knn_graph_coo, - params.c); + raft::sparse::spatial::knn_graph(handle, + data.data(), + params.n_row, + params.n_col, + raft::distance::DistanceType::L2SqrtExpanded, + knn_graph_coo, + params.c); raft::sparse::convert::sorted_coo_to_csr( knn_graph_coo.rows(), knn_graph_coo.nnz, indptr.data(), params.n_row + 1, stream); diff --git a/cpp/test/sparse/convert_coo.cu b/cpp/test/sparse/convert_coo.cu index e9ba4f4448..1142c6f3f2 100644 --- a/cpp/test/sparse/convert_coo.cu +++ b/cpp/test/sparse/convert_coo.cu @@ -16,11 +16,11 @@ #include -#include +#include #include 
#include -#include +#include #include "../test_utils.h" diff --git a/cpp/test/sparse/convert_csr.cu b/cpp/test/sparse/convert_csr.cu index a217a90e19..007cbd7fdb 100644 --- a/cpp/test/sparse/convert_csr.cu +++ b/cpp/test/sparse/convert_csr.cu @@ -16,7 +16,7 @@ #include "../test_utils.h" #include -#include +#include #include #include diff --git a/cpp/test/sparse/csr_row_slice.cu b/cpp/test/sparse/csr_row_slice.cu index fa2b88cdef..39b235d5f1 100644 --- a/cpp/test/sparse/csr_row_slice.cu +++ b/cpp/test/sparse/csr_row_slice.cu @@ -15,8 +15,8 @@ */ #include -#include -#include +#include +#include #include #include diff --git a/cpp/test/sparse/csr_to_dense.cu b/cpp/test/sparse/csr_to_dense.cu index fbc3708b37..5811c5c22b 100644 --- a/cpp/test/sparse/csr_to_dense.cu +++ b/cpp/test/sparse/csr_to_dense.cu @@ -15,8 +15,8 @@ */ #include -#include -#include +#include +#include #include #include diff --git a/cpp/test/sparse/csr_transpose.cu b/cpp/test/sparse/csr_transpose.cu index d06a365b15..bea8f903cd 100644 --- a/cpp/test/sparse/csr_transpose.cu +++ b/cpp/test/sparse/csr_transpose.cu @@ -18,10 +18,10 @@ #include -#include -#include +#include #include #include +#include #include "../test_utils.h" diff --git a/cpp/test/sparse/degree.cu b/cpp/test/sparse/degree.cu index 9dce582781..a4af021c05 100644 --- a/cpp/test/sparse/degree.cu +++ b/cpp/test/sparse/degree.cu @@ -16,7 +16,7 @@ #include "../test_utils.h" #include -#include +#include #include diff --git a/cpp/test/sparse/dist_coo_spmv.cu b/cpp/test/sparse/dist_coo_spmv.cu index 1ccff3532f..c004aeaef0 100644 --- a/cpp/test/sparse/dist_coo_spmv.cu +++ b/cpp/test/sparse/dist_coo_spmv.cu @@ -16,10 +16,10 @@ #include -#include -#include +#include #include #include +#include #include #include diff --git a/cpp/test/sparse/distance.cu b/cpp/test/sparse/distance.cu index d211a2a0c8..4ce2f4cbde 100644 --- a/cpp/test/sparse/distance.cu +++ b/cpp/test/sparse/distance.cu @@ -18,9 +18,9 @@ #include -#include -#include +#include #include 
+#include #include diff --git a/cpp/test/sparse/filter.cu b/cpp/test/sparse/filter.cu index c22fe09134..ba80c84fd5 100644 --- a/cpp/test/sparse/filter.cu +++ b/cpp/test/sparse/filter.cu @@ -16,7 +16,7 @@ #include "../test_utils.h" #include -#include +#include #include #include diff --git a/cpp/test/sparse/knn.cu b/cpp/test/sparse/knn.cu index 7ced61fa9c..6717ba411d 100644 --- a/cpp/test/sparse/knn.cu +++ b/cpp/test/sparse/knn.cu @@ -18,10 +18,10 @@ #include #include "../test_utils.h" -#include -#include +#include +#include -#include +#include namespace raft { namespace sparse { @@ -79,25 +79,25 @@ class SparseKNNTest : public ::testing::TestWithParam(indptr.data(), - indices.data(), - data.data(), - nnz, - n_rows, - params.n_cols, - indptr.data(), - indices.data(), - data.data(), - nnz, - n_rows, - params.n_cols, - out_indices.data(), - out_dists.data(), - k, - handle, - params.batch_size_index, - params.batch_size_query, - params.metric); + raft::sparse::spatial::brute_force_knn(indptr.data(), + indices.data(), + data.data(), + nnz, + n_rows, + params.n_cols, + indptr.data(), + indices.data(), + data.data(), + nnz, + n_rows, + params.n_cols, + out_indices.data(), + out_dists.data(), + k, + handle, + params.batch_size_index, + params.batch_size_query, + params.metric); RAFT_CUDA_TRY(cudaStreamSynchronize(handle.get_stream())); } diff --git a/cpp/test/sparse/knn_graph.cu b/cpp/test/sparse/knn_graph.cu index 41863a8557..47c1819e79 100644 --- a/cpp/test/sparse/knn_graph.cu +++ b/cpp/test/sparse/knn_graph.cu @@ -16,12 +16,12 @@ #include "../test_utils.h" #include -#include +#include #include #include #include -#include +#include #if defined RAFT_NN_COMPILED #include #endif @@ -77,7 +77,7 @@ class KNNGraphTest : public ::testing::TestWithParam sum(stream); diff --git a/cpp/test/sparse/linkage.cu b/cpp/test/sparse/linkage.cu index 35501c661a..045647f23e 100644 --- a/cpp/test/sparse/linkage.cu +++ b/cpp/test/sparse/linkage.cu @@ -16,11 +16,11 @@ #include "../test_utils.h" 
-#include -#include +#include #include #include #include +#include #include diff --git a/cpp/test/sparse/norm.cu b/cpp/test/sparse/norm.cu index 9077b6467d..8e54edd6c9 100644 --- a/cpp/test/sparse/norm.cu +++ b/cpp/test/sparse/norm.cu @@ -18,10 +18,10 @@ #include "../test_utils.h" -#include -#include +#include #include #include +#include #include #include diff --git a/cpp/test/sparse/reduce.cu b/cpp/test/sparse/reduce.cu index c605943cb4..4280192723 100644 --- a/cpp/test/sparse/reduce.cu +++ b/cpp/test/sparse/reduce.cu @@ -19,10 +19,10 @@ #include "../test_utils.h" #include #include -#include -#include +#include #include #include +#include #include namespace raft { diff --git a/cpp/test/sparse/row_op.cu b/cpp/test/sparse/row_op.cu index a53cbe560f..732bd06103 100644 --- a/cpp/test/sparse/row_op.cu +++ b/cpp/test/sparse/row_op.cu @@ -20,7 +20,7 @@ #include #include "../test_utils.h" -#include +#include #include #include diff --git a/cpp/test/sparse/sort.cu b/cpp/test/sparse/sort.cu index 645e01052a..9b75965498 100644 --- a/cpp/test/sparse/sort.cu +++ b/cpp/test/sparse/sort.cu @@ -16,8 +16,8 @@ #include "../test_utils.h" #include -#include #include +#include #include diff --git a/cpp/test/sparse/symmetrize.cu b/cpp/test/sparse/symmetrize.cu index 299c2d10e3..7cf1a1e07d 100644 --- a/cpp/test/sparse/symmetrize.cu +++ b/cpp/test/sparse/symmetrize.cu @@ -15,10 +15,10 @@ */ #include -#include #include #include #include +#include #include #include diff --git a/cpp/test/spatial/ann_base_kernel.cuh b/cpp/test/spatial/ann_base_kernel.cuh index 4462875de2..8af3ebe4f3 100644 --- a/cpp/test/spatial/ann_base_kernel.cuh +++ b/cpp/test/spatial/ann_base_kernel.cuh @@ -14,9 +14,9 @@ * limitations under the License. 
*/ -#include -#include +#include #include +#include #include diff --git a/cpp/test/spatial/ann_ivf_flat.cu b/cpp/test/spatial/ann_ivf_flat.cu index 75b39d7046..a049c3f428 100644 --- a/cpp/test/spatial/ann_ivf_flat.cu +++ b/cpp/test/spatial/ann_ivf_flat.cu @@ -18,7 +18,7 @@ #include "./ann_base_kernel.cuh" #include -#include +#include #include #include #include diff --git a/cpp/test/spatial/ball_cover.cu b/cpp/test/spatial/ball_cover.cu index a23262fc8e..46867f0fa7 100644 --- a/cpp/test/spatial/ball_cover.cu +++ b/cpp/test/spatial/ball_cover.cu @@ -16,11 +16,11 @@ #include "../test_utils.h" #include "spatial_data.h" -#include -#include +#include #include #include #include +#include #if defined RAFT_NN_COMPILED #include #endif diff --git a/cpp/test/spatial/epsilon_neighborhood.cu b/cpp/test/spatial/epsilon_neighborhood.cu index c005549b04..515636ad8c 100644 --- a/cpp/test/spatial/epsilon_neighborhood.cu +++ b/cpp/test/spatial/epsilon_neighborhood.cu @@ -17,9 +17,9 @@ #include "../test_utils.h" #include #include -#include #include #include +#include #include namespace raft { diff --git a/cpp/test/spatial/faiss_mr.cu b/cpp/test/spatial/faiss_mr.cu index eee221cffa..91ba1cc94c 100644 --- a/cpp/test/spatial/faiss_mr.cu +++ b/cpp/test/spatial/faiss_mr.cu @@ -17,7 +17,7 @@ #include "../test_utils.h" #include -#include +#include #include #include diff --git a/cpp/test/spatial/fused_l2_knn.cu b/cpp/test/spatial/fused_l2_knn.cu index bb0b3a63d7..ef032ed442 100644 --- a/cpp/test/spatial/fused_l2_knn.cu +++ b/cpp/test/spatial/fused_l2_knn.cu @@ -19,7 +19,7 @@ #include #include -#include +#include #include #include #include diff --git a/cpp/test/spatial/haversine.cu b/cpp/test/spatial/haversine.cu index 473d1e31da..78bd377156 100644 --- a/cpp/test/spatial/haversine.cu +++ b/cpp/test/spatial/haversine.cu @@ -17,7 +17,7 @@ #include "../test_utils.h" #include #include -#include +#include #include #include #include diff --git a/cpp/test/spatial/knn.cu b/cpp/test/spatial/knn.cu index 
37e0edb6ab..3f91242930 100644 --- a/cpp/test/spatial/knn.cu +++ b/cpp/test/spatial/knn.cu @@ -17,7 +17,7 @@ #include "../test_utils.h" #include -#include +#include #include #if defined RAFT_NN_COMPILED #include diff --git a/cpp/test/spatial/selection.cu b/cpp/test/spatial/selection.cu index b669ba39d1..7b1f92f182 100644 --- a/cpp/test/spatial/selection.cu +++ b/cpp/test/spatial/selection.cu @@ -17,8 +17,8 @@ #include #include #include -#include #include +#include #include "../test_utils.h" diff --git a/cpp/test/spectral_matrix.cu b/cpp/test/spectral_matrix.cu index 2e2d918016..867b1e9daf 100644 --- a/cpp/test/spectral_matrix.cu +++ b/cpp/test/spectral_matrix.cu @@ -17,7 +17,7 @@ #include #include #include -#include +#include #include diff --git a/cpp/test/stats/adjusted_rand_index.cu b/cpp/test/stats/adjusted_rand_index.cu index 4bacbadbf7..473972ace4 100644 --- a/cpp/test/stats/adjusted_rand_index.cu +++ b/cpp/test/stats/adjusted_rand_index.cu @@ -18,9 +18,9 @@ #include #include #include -#include #include #include +#include #include namespace raft { diff --git a/cpp/test/stats/completeness_score.cu b/cpp/test/stats/completeness_score.cu index f0f06614e3..6f6b5a8afb 100644 --- a/cpp/test/stats/completeness_score.cu +++ b/cpp/test/stats/completeness_score.cu @@ -17,10 +17,10 @@ #include #include #include -#include #include #include #include +#include #include namespace raft { diff --git a/cpp/test/stats/contingencyMatrix.cu b/cpp/test/stats/contingencyMatrix.cu index 5c8d6da566..4785c739ed 100644 --- a/cpp/test/stats/contingencyMatrix.cu +++ b/cpp/test/stats/contingencyMatrix.cu @@ -18,9 +18,9 @@ #include #include #include -#include #include #include +#include #include #include diff --git a/cpp/test/stats/cov.cu b/cpp/test/stats/cov.cu index d9cc3ec8be..4ed2215d91 100644 --- a/cpp/test/stats/cov.cu +++ b/cpp/test/stats/cov.cu @@ -16,10 +16,10 @@ #include "../test_utils.h" #include -#include #include #include #include +#include #include namespace raft { diff --git 
a/cpp/test/stats/dispersion.cu b/cpp/test/stats/dispersion.cu index b8fd9dfe80..afad286e98 100644 --- a/cpp/test/stats/dispersion.cu +++ b/cpp/test/stats/dispersion.cu @@ -16,10 +16,10 @@ #include "../test_utils.h" #include -#include #include #include #include +#include #include #include #include diff --git a/cpp/test/stats/entropy.cu b/cpp/test/stats/entropy.cu index fb9e82058e..a3703bdb14 100644 --- a/cpp/test/stats/entropy.cu +++ b/cpp/test/stats/entropy.cu @@ -17,9 +17,9 @@ #include #include #include -#include #include #include +#include #include #include diff --git a/cpp/test/stats/histogram.cu b/cpp/test/stats/histogram.cu index f09c01c84a..58a3f5eaeb 100644 --- a/cpp/test/stats/histogram.cu +++ b/cpp/test/stats/histogram.cu @@ -16,11 +16,11 @@ #include "../test_utils.h" #include -#include -#include #include #include #include +#include +#include namespace raft { namespace stats { diff --git a/cpp/test/stats/homogeneity_score.cu b/cpp/test/stats/homogeneity_score.cu index 697cea55ad..729863003d 100644 --- a/cpp/test/stats/homogeneity_score.cu +++ b/cpp/test/stats/homogeneity_score.cu @@ -17,9 +17,9 @@ #include #include #include -#include #include #include +#include #include namespace raft { diff --git a/cpp/test/stats/information_criterion.cu b/cpp/test/stats/information_criterion.cu index d61f8591a5..5900730ede 100644 --- a/cpp/test/stats/information_criterion.cu +++ b/cpp/test/stats/information_criterion.cu @@ -18,8 +18,8 @@ #include -#include -#include +#include +#include #include #include diff --git a/cpp/test/stats/kl_divergence.cu b/cpp/test/stats/kl_divergence.cu index d66a832e30..e25f1c3bc5 100644 --- a/cpp/test/stats/kl_divergence.cu +++ b/cpp/test/stats/kl_divergence.cu @@ -17,8 +17,8 @@ #include #include #include -#include #include +#include #include namespace raft { diff --git a/cpp/test/stats/mean.cu b/cpp/test/stats/mean.cu index b7f24d5642..bec7a3adce 100644 --- a/cpp/test/stats/mean.cu +++ b/cpp/test/stats/mean.cu @@ -16,10 +16,10 @@ #include 
"../test_utils.h" #include -#include -#include #include #include +#include +#include #include #include diff --git a/cpp/test/stats/mean_center.cu b/cpp/test/stats/mean_center.cu index 3d92a52fb4..c4f979d82e 100644 --- a/cpp/test/stats/mean_center.cu +++ b/cpp/test/stats/mean_center.cu @@ -17,10 +17,10 @@ #include "../linalg/matrix_vector_op.cuh" #include "../test_utils.h" #include -#include #include #include #include +#include namespace raft { namespace stats { diff --git a/cpp/test/stats/meanvar.cu b/cpp/test/stats/meanvar.cu index 65d33e331c..74e52e670d 100644 --- a/cpp/test/stats/meanvar.cu +++ b/cpp/test/stats/meanvar.cu @@ -16,10 +16,10 @@ #include "../test_utils.h" #include -#include #include #include #include +#include #include diff --git a/cpp/test/stats/minmax.cu b/cpp/test/stats/minmax.cu index 532932b6ba..0468ebb177 100644 --- a/cpp/test/stats/minmax.cu +++ b/cpp/test/stats/minmax.cu @@ -17,10 +17,10 @@ #include "../test_utils.h" #include #include -#include -#include #include #include +#include +#include #include #include diff --git a/cpp/test/stats/mutual_info_score.cu b/cpp/test/stats/mutual_info_score.cu index ad4ec900c9..6bf3e6623f 100644 --- a/cpp/test/stats/mutual_info_score.cu +++ b/cpp/test/stats/mutual_info_score.cu @@ -17,9 +17,9 @@ #include #include #include -#include #include #include +#include #include namespace raft { diff --git a/cpp/test/stats/rand_index.cu b/cpp/test/stats/rand_index.cu index f1ec58d944..ca1c4dd5e8 100644 --- a/cpp/test/stats/rand_index.cu +++ b/cpp/test/stats/rand_index.cu @@ -16,7 +16,7 @@ #include "../test_utils.h" -#include +#include #include diff --git a/cpp/test/stats/silhouette_score.cu b/cpp/test/stats/silhouette_score.cu index 8542276bd7..f885c1034f 100644 --- a/cpp/test/stats/silhouette_score.cu +++ b/cpp/test/stats/silhouette_score.cu @@ -17,8 +17,8 @@ #include #include #include -#include -#include +#include +#include #if defined RAFT_DISTANCE_COMPILED && defined RAFT_NN_COMPILED #include diff --git 
a/cpp/test/stats/stddev.cu b/cpp/test/stats/stddev.cu index 0521209e98..70d99c2aeb 100644 --- a/cpp/test/stats/stddev.cu +++ b/cpp/test/stats/stddev.cu @@ -16,11 +16,11 @@ #include "../test_utils.h" #include -#include #include #include #include #include +#include namespace raft { namespace stats { diff --git a/cpp/test/stats/sum.cu b/cpp/test/stats/sum.cu index b80c66831d..7a16dbde4a 100644 --- a/cpp/test/stats/sum.cu +++ b/cpp/test/stats/sum.cu @@ -16,10 +16,10 @@ #include "../test_utils.h" -#include -#include +#include #include #include +#include #include diff --git a/cpp/test/stats/trustworthiness.cu b/cpp/test/stats/trustworthiness.cu index a963957d32..ae596d0535 100644 --- a/cpp/test/stats/trustworthiness.cu +++ b/cpp/test/stats/trustworthiness.cu @@ -17,8 +17,8 @@ #include "../test_utils.h" #include #include -#include #include +#include #if defined RAFT_DISTANCE_COMPILED && defined RAFT_NN_COMPILED #include diff --git a/cpp/test/stats/v_measure.cu b/cpp/test/stats/v_measure.cu index 65a875c5e0..22dcefba0c 100644 --- a/cpp/test/stats/v_measure.cu +++ b/cpp/test/stats/v_measure.cu @@ -17,9 +17,9 @@ #include #include #include -#include #include #include +#include #include namespace raft { diff --git a/cpp/test/stats/weighted_mean.cu b/cpp/test/stats/weighted_mean.cu index 9f3e6a79f6..5ff8454490 100644 --- a/cpp/test/stats/weighted_mean.cu +++ b/cpp/test/stats/weighted_mean.cu @@ -16,9 +16,9 @@ #include "../test_utils.h" #include -#include #include #include +#include #include #include diff --git a/cpp/test/test_utils.h b/cpp/test/test_utils.h index 196b0cd0a8..14319b85e1 100644 --- a/cpp/test/test_utils.h +++ b/cpp/test/test_utils.h @@ -18,8 +18,8 @@ #include #include #include -#include -#include +#include +#include #include #include diff --git a/docs/source/conf.py b/docs/source/conf.py index a96f86c68d..9aa3a19310 100644 --- a/docs/source/conf.py +++ b/docs/source/conf.py @@ -214,9 +214,9 @@ def setup(app): # The following is used by sphinx.ext.linkcode to 
provide links to github
 linkcode_resolve = make_linkcode_resolve(
-    "raft",
+    "pylibraft",
     "https://github.com/rapidsai/raft"
-    "raft/blob/{revision}/python/"
+    "raft/blob/{revision}/python/pylibraft"
     "{package}/{path}#L{lineno}",
 )
diff --git a/docs/source/cpp_api/clustering.rst b/docs/source/cpp_api/clustering.rst
index 715275b59a..90ca786cc1 100644
--- a/docs/source/cpp_api/clustering.rst
+++ b/docs/source/cpp_api/clustering.rst
@@ -3,23 +3,10 @@ Clustering

 This page provides C++ class references for the publicly-exposed elements of the clustering package.

-K-Means
-#######
-
 .. doxygennamespace:: raft::cluster
     :project: RAFT
     :members:

-Spectral
-########
-
 .. doxygennamespace:: raft::spectral
     :project: RAFT
-    :members:
-
-Hierarchical
-############
-
-.. doxygennamespace:: raft::hierarchy
-    :project: RAFT
-    :members:
+    :members:
\ No newline at end of file
diff --git a/docs/source/cpp_api/random.rst b/docs/source/cpp_api/random.rst
index 8635855484..be2c188617 100644
--- a/docs/source/cpp_api/random.rst
+++ b/docs/source/cpp_api/random.rst
@@ -3,10 +3,30 @@ Random

 This page provides C++ class references for the publicly-exposed elements of the random package.

-.. doxygennamespace:: raft::random
+Data Generation
+###############
+
+.. doxygenfunction:: raft::random::make_blobs
     :project: RAFT
-    :members:
+
+.. doxygenfunction:: raft::random::make_regression
+    :project: RAFT
+
+.. doxygenfunction:: raft::random::rmat_rectangular_gen
+    :project: RAFT
+
+
+Random Number Generation
+########################

 .. doxygenclass:: raft::random::Rng
     :project: RAFT
     :members:
+
+Useful Operations
+#################
+
+.. doxygenfunction:: raft::random::permute
+    :project: RAFT
+
+
diff --git a/docs/source/cpp_api/spatial.rst b/docs/source/cpp_api/spatial.rst
index 243bf19bf7..9bda00dab7 100644
--- a/docs/source/cpp_api/spatial.rst
+++ b/docs/source/cpp_api/spatial.rst
@@ -13,10 +13,14 @@ Distance
 Nearest Neighbors
 #################

-.. doxygennamespace:: raft::spatial::knn
+.. doxygenfunction:: raft::spatial::knn::brute_force_knn
+    :project: RAFT
+
+.. doxygenfunction:: raft::spatial::knn::select_k
     :project: RAFT
-    :members:

+.. doxygenfunction:: raft::spatial::knn::knn_merge_parts
+    :project: RAFT

 IVF-Flat