Add dispatch based on compute architecture #1295

ahendriksen · 2023-02-22T15:35:49Z

This PR improves the ability to do dispatch based on compute architecture. It is a follow up to #1142.

It has two goals:

1. Make it easier to specify which compute architectures a kernel is compatible with / should be compiled for.
2. Make it easier to compile a kernel only for the architectures for which it is used (if it is unused, the kernel should be empty).

We have a specific use case in RAFT for this feature. For the L2 pairwise distance kernel we have a CUTLASS-based implementation that works on SM80+ and a fallback kernel. Preferably, each kernel is only compiled for the architectures on which it is actually used.

The previous approach can be described as follows (using pseudocode):

template <typename DistanceOpT>
__global__ void generic_kernel(DistanceOpT op, other arguments .. ) { .. implementation .. }

__global__ void cutlass_kernel_l2(args.. ) {.. implementation }

template <typename DistanceOpT>
__global__ void generic_kernel_prior_to_sm80(DistanceOpT op, args..) {
#if __CUDA_ARCH__ < 800
  .. copy and paste generic_kernel implementation
#endif
}

template <typename DistanceOpT>
void dispatch(DistanceOpT op, args..) {
  if (op != l2_distance_op) {
    // run normal generic kernel
    generic_kernel(op, args..);
  } else {
    if (device_capability() >= 800) {
      cutlass_kernel_l2(args...);
    } else {
      generic_kernel_prior_to_sm80(op, args...);
    }
  }
}

The main issue is that the generic kernel is copy-pasted. In the new approach, the set of compute architectures for which the generic kernel has to be compiled (the "compatibility range") is passed as an argument, using a compile-time tag type. The kernel can then compare the architecture it is currently being compiled for against the compatibility range and exit early if that architecture is not supported. Therefore, we no longer need to copy and paste the generic kernel.

template <typename DistanceOpT, typename SM_compat_t>
__global__ void generic_kernel(DistanceOpT op, SM_compat_t sm_compat_range, args .. ) {
  // Early exit to minimize the size of the kernel when it is not supposed to be compiled.
  if constexpr(! sm_compat_range.contains(raft::arch::SM_compute_arch())) {
    assert(false);
    return;
  }
  .. rest of implementation ..
}

__global__ void cutlass_kernel_l2(args.. ) {.. implementation }

template <typename DistanceOpT>
void dispatch(DistanceOpT op, args..) {
  if (op != l2_distance_op) {
    // run normal generic kernel and compile for all architectures:
    auto full_range = raft::arch::SM_range(raft::arch::SM_min(), raft::arch::SM_future());
    generic_kernel(op, full_range, args..);
  } else {
    // Get current architecture of device at runtime
    auto runtime_arch = raft::arch::kernel_runtime_arch();
    // Define compatibility ranges for the cutlass and generic kernel
    auto cutlass_range = raft::arch::SM_range(raft::arch::SM_80(), raft::arch::SM_future());
    auto legacy_range = raft::arch::SM_range(raft::arch::SM_min(), raft::arch::SM_80());

    if (cutlass_range.contains(runtime_arch)) {
      // On SM80+: run cutlass kernel. (this might actually also compile a non-trivial kernel
      // for SM70 but that is unfortunately outside our control)
      cutlass_kernel_l2(args...);
    } else {
      // This will run on architectures < SM80. Also, the compiled kernels from
      // SM80 and higher will be empty (< 10 instructions).
      generic_kernel(op, legacy_range, args...);
    }
  }
}
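
The compile-time part of the pseudocode above can be tried out in plain host C++17. The following is only a minimal sketch under stated assumptions, not the code in `arch.cuh`: `sm_range`, `generic_body`, and the `CompileArch` template parameter are hypothetical names, `CompileArch` stands in for the value of `__CUDA_ARCH__` during one compilation pass, and the bounds `350` and `99999` are made-up placeholders for `SM_min` and `SM_future`. The range is treated as half-open, which matches the comment that `legacy_range` covers architectures below SM80.

```cpp
// Hypothetical stand-in for the PR's SM_range: a half-open interval [Min, Max)
// of architectures on the __CUDA_ARCH__ scale (SM 8.0 is 800), encoded in the type.
template <int Min, int Max>
struct sm_range {
  static constexpr bool contains(int arch) { return Min <= arch && arch < Max; }
};

constexpr int sm_min    = 350;    // assumption: lowest architecture of interest
constexpr int sm_80     = 800;
constexpr int sm_future = 99999;  // assumption: sentinel above every real architecture

using full_range    = sm_range<sm_min, sm_future>;
using cutlass_range = sm_range<sm_80, sm_future>;
using legacy_range  = sm_range<sm_min, sm_80>;

// Stand-in for a kernel body. CompileArch plays the role of __CUDA_ARCH__:
// each instantiation corresponds to one architecture the kernel is compiled for.
template <typename Range, int CompileArch>
int generic_body(int x)
{
  if constexpr (!Range::contains(CompileArch)) {
    // Out of range: the else-branch is discarded, so this instantiation stays tiny.
    return -1;
  } else {
    return 2 * x;  // placeholder for the real implementation
  }
}

// The two ranges used for the L2 dispatch split the architectures at SM80.
static_assert(legacy_range::contains(700));
static_assert(!legacy_range::contains(800));
static_assert(cutlass_range::contains(800));
static_assert(full_range::contains(900));
```

With these definitions, `generic_body<legacy_range, 700>(21)` evaluates the placeholder implementation, while `generic_body<legacy_range, 800>(21)` takes the early-exit branch. The sketch queries the range through its type (`Range::contains`) rather than through a function argument, which sidesteps any question about using a parameter in a constant expression.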

tfeher

Thanks Allard for this PR! It indeed provides a cleaner method to dispatch kernels based on GPU arch, and at the same time enables us to compile only for the intended architectures. Overall it looks good, see a few comments below.

cpp/include/raft/util/arch.cuh

ahendriksen · 2023-03-07T13:24:52Z

NOTE: this PR is a follow up to #1142. To keep the diff minimal, I have set the base branch to be the previous PR (instead of 23.04). This should be changed before merging :)

robertmaynard · 2023-03-07T13:47:44Z

cpp/include/raft/util/arch.cuh

+};
+
+// A dummy kernel that is used to determine the runtime architecture.
+__global__ inline void dummy_runtime_kernel() {}


This needs to be static so we don't run into the issue where multiple consumers of raft build with different arch values and we get incorrect kernel selection.

For more info see: NVIDIA/cub#545

ahendriksen · 2023-03-07

That's a good point. It looks like the dummy kernel approach requires making the kernel static to get a reliable solution, at the cost of littering the final binary with many empty kernels.

In kernel_runtime_arch, we are currently taking a pointer to the dummy_runtime_kernel. If instead we took a runtime argument that was a pointer to one of the candidate kernels that is going to be called, would that solve the problem? That is, I would remove the dummy_runtime_kernel and the kernel pointer would have to be provided by the user. I think it does solve the linking problem that you described above and it doesn't create spurious kernels, but I want to double check before I change the code.

robertmaynard · 2023-03-07

Requiring a kernel pointer would work as well, since we would now be querying based on a specific kernel that was only compiled once.

ahendriksen · 2023-03-07

Thanks a lot! I will go in that direction then.

jrhemstad · 2023-06-15

I'm a little late to the party, but I came up with an idea for an alternative way of doing this that I like better because it avoids the empty kernel. See https://github.com/NVIDIA/cub/issues/556

ahendriksen · 2023-08-08

Thanks for the pointer! I've been meaning to respond to this for a while, but never found the time to test my assertions.

We are currently (that is: in the PR that was merged) avoiding the empty kernel by forcing the caller to provide a pointer to one of the kernel versions. We then query the func attributes of that kernel.

The __CUDA_ARCH_LIST__ looks like a worthwhile approach. However, it may break when kernels are weakly linked (e.g. templated). You describe the issue very well in #1722. I had not considered outlawing weak linking completely. Let's see how that goes!
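
The host-side decision that follows such an attribute query can be sketched without a GPU. This is only an illustration under assumptions: it assumes the queried attribute is `cudaFuncAttributes::ptxVersion` (obtained via `cudaFuncGetAttributes` on the caller-provided kernel pointer), which is encoded as 10 * major + minor, whereas `__CUDA_ARCH__` uses 100 * major + 10 * minor. The names `to_cuda_arch`, `l2_kernel`, and `choose_l2_kernel` are hypothetical, and the query itself is left out so that the sketch compiles as plain C++.

```cpp
// Convert a ptxVersion-style value (10 * major + minor, e.g. 75 for SM 7.5)
// to the __CUDA_ARCH__ scale (100 * major + 10 * minor, e.g. 750).
constexpr int to_cuda_arch(int ptx_version) { return 10 * ptx_version; }

// Hypothetical labels for the two L2 kernel candidates discussed in this PR.
enum class l2_kernel { cutlass, generic_fallback };

// Decide which candidate to launch. In real code, ptx_version would come from
// querying the attributes of one of the candidate kernels (assumption, see lead-in).
constexpr l2_kernel choose_l2_kernel(int ptx_version)
{
  return to_cuda_arch(ptx_version) >= 800 ? l2_kernel::cutlass : l2_kernel::generic_fallback;
}

static_assert(to_cuda_arch(75) == 750);
static_assert(choose_l2_kernel(70) == l2_kernel::generic_fallback);
static_assert(choose_l2_kernel(80) == l2_kernel::cutlass);
```

Because the value is taken from a kernel that the caller is about to launch, the decision reflects how that specific kernel was compiled, which is the property the thread above is after.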

…ernel implementations (#1142)

The pairwise distance metrics are quite varied. The table below summarizes the differences, in terms of

- Epilog: whether the metric has a non-empty epilog operation.
- Uses norms: whether the metric requires precalculation of the norms of the vectors.
- Has params: whether the norm has additional parameters. The L2 metric, for instance, has the `sqrt` boolean parameter that determines whether to calculate the squared or actual distance.
- Pre- & post-processing: For some metrics, the norms have to be precalculated. For other metrics, the input matrices are transformed before the kernel launch, and "untransformed" after.
- Expensive inner loop: some metrics use `pow`, `log` or other expensive functions in the inner loop.
- Depends on row-major: the calculation of some metrics depend on whether the input is row-major.
- CUTLASS: some metrics have an implementation using CUTLASS and tensor cores.

| Metric | Epilog | Uses norms | Has params | Pre- & post-processing | Expensive inner loop | Depends on row-major | CUTLASS |
|---|---|---|---|---|---|---|---|
| Canberra | | | | | x | | |
| Chebyshev (Linf) | | | | | | | |
| Correlation | x | x (twice) | x (many) | compute norms | | x | |
| Cosine | x | x | | compute norms | | | x |
| Hamming | x | | x (k) | | | | |
| Hellinger | x | | | sqrt and square | | | |
| Jensen Shannon | x | | | | x | | |
| KL divergence | x | | x (row major, x == y) | yes | x | x | |
| L1 | | | | | | | |
| L2 expanded | x | x | x (sqrt) | compute norms | | | x |
| L2 unexpanded | x | | x (sqrt) | | | | |
| Minkowski (Lp) | x | | x (p) | | x | | |
| Russel-Rao | x | | x (k, 1/k) | | | | |

To keep the complexity that results from all these differences in check, there are several layers between the public API and the kernel launch, each with their own responsibility.

## Before

1. `raft::distance::pairwise_distance` takes distance type as a run-time argument and dispatches to `raft::distance::detail::pairwise_distance_impl`.
2. `raft::distance::detail::pairwise_distance_impl` allocates workspace as necessary and calls `raft::distance::detail::distance`
3. `raft::distance::detail::distance` defines a default final operation (the identity) and calls an overload of itself.
4. `raft::distance::detail::distance` (with `fin_op`) initializes a `DistanceImpl` zero-sized struct with the correct template arguments and runs the `.run()` method of the struct.
5. `raft::distance::detail::DistanceImpl<DistanceType>.run()` calls `raft::distance::detail::XX_Impl`.
6. `raft::distance::detail::XX_Impl` has the following responsibilities:
   - Pre-compute norms if necessary
   - Transform input if necessary
   - If metric supports a CUTLASS operation, dispatch if necessary.
   - Swap inputs if column-major.
   - Based on runtime parameter `row_major` dispatch to function template `raft::distance::detail::XX<bool row_major>`
7. `raft::distance::detail::XX` based on alignment of input data dispatch to function template `raft::distance::detail::XX_Impl<int veclen>` (different overload of previous `raft::distance::detail::XX_Impl`)
8. `raft::distance::detail::XX_Impl` has the following responsibilities:
   - Define `core_op` and `epilog_op`
   - Define `use_norms`
   - Launch kernel `pairwiseDistanceMatKernel` with correct launch parameters

**Observations**:

- Steps 6 and 7 both convert a runtime value to a compile time constant (row-major layout and alignment).
- Step 7 is repeated (copy pasted) for each metric.
- Steps 7 and 8 do a lot of different things and the steps in between do relatively little.
- Steps 1-5 do fairly little (but require a lot of boilerplate)

**Proposal**:

1. Collect as much of the runtime behavior of each metric in a `distance_op` that [contains](https://github.com/ahendriksen/raft/blob/wip-refactor-distance/cpp/include/raft/distance/detail/distance_ops/canberra.cuh):
   - The core_op
   - The epilog_op
   - The required shared memory
   - Whether the inner loop is expensive (and thus loop unrolling should be curtailed)
2. Collect the runtime -> compile-time dispatch in one location ([dispatch.cuh](https://github.com/ahendriksen/raft/blob/486393eff4e0cf1d45ab9d7990b64d607e835d70/cpp/include/raft/distance/detail/pairwise_matrix/dispatch.cuh#L70))
3. Collect kernel launching in one [location](https://github.com/ahendriksen/raft/blob/486393eff4e0cf1d45ab9d7990b64d607e835d70/cpp/include/raft/distance/detail/pairwise_matrix/kernel_sm60.cuh#L108)
4. Remove some of the boilerplate in steps 1-5.

## After

1. `raft::distance::pairwise_distance` takes distance type as a run-time argument, allocates workspace as necessary, and dispatches to `raft::distance::detail::distance`.
2. `raft::distance::detail::distance` defines a default final operation (the identity) and calls an overload of itself.
3. `raft::distance::detail::distance` (with `fin_op`) calls an overload of `raft::distance::detail::distance_impl` for the correct distance type.
4. `raft::distance::detail::distance_impl` has the following responsibilities:
   - Pre-compute norms if necessary
   - Initialize distance op with parameters as necessary, see below for more information.
   - Transform input if necessary
   - If metric supports a CUTLASS operation, dispatch if necessary.
   - Dispatch to `raft::distance::detail::distance_matrix_dispatch`
5. `raft::distance::detail::distance_matrix_dispatch` has the following responsibilities:
   - swap x, y matrices if column major
   - dispatch to correct kernel based on run-time parameters `row_major` and `vec_len`
   - Determine kernel policy based on parameters
   - Call `raft::distance::detail::pairwise_matrix`
6. `raft::distance::detail::pairwise_matrix` launches the `raft::distance::detail::pairwise_matrix_kernel` with the correct launch parameters.

**Distance_op**

`raft::distance::detail::ops::XX_distance_op` [[example]](https://github.com/ahendriksen/raft/blob/wip-refactor-distance/cpp/include/raft/distance/detail/distance_ops/canberra.cuh) has the following responsibilities:

- Take any parameters (sqrt, k, etc)
- Define `core_op` and `epilog_op`
- Define `use_norms`, `expensive_inner_loop`, and `shared_mem_size()`.

Still TODO:

- [x] Rename Minkowski and Chebyshev to Lp and Linf.
- [x] Do something with this note in the comments: "if workspace is passed as nullptr, this will return in worksize, the number of bytes of workspace required", which is wrong.
- [x] Add a mechanism to limit duplicate compilation when a CUTLASS kernel is available. This is done in follow up PR #1295.
- [x] Some distance_ops have additional template parameters. This must be cleared up.

Authors:

- Allard Hendriksen (https://github.com/ahendriksen)
- Corey J. Nolet (https://github.com/cjnolet)

Approvers:

- Tamas Bela Feher (https://github.com/tfeher)
- Corey J. Nolet (https://github.com/cjnolet)

URL: #1142
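
The idea of collecting the runtime to compile-time dispatch in one location can be illustrated with a small host-only C++ sketch. It is not the code in `dispatch.cuh`: `l1_distance_op`, `launch_record`, `launch`, `dispatch_vec_len`, and `dispatch` are hypothetical names, the supported vector lengths (4, 2, 1) are an assumption, and `launch` merely records which instantiation was selected instead of launching a kernel.

```cpp
// Hypothetical distance op in the spirit of the "Distance_op" section:
// it bundles the per-metric flags and the inner-loop operation.
struct l1_distance_op {
  static constexpr bool use_norms            = false;
  static constexpr bool expensive_inner_loop = false;

  // Accumulate |x - y| into acc.
  void core(float& acc, float x, float y) const { acc += (x > y ? x - y : y - x); }
};

// Records which template instantiation was selected.
struct launch_record {
  bool row_major;
  int vec_len;
};

// Stand-in for the kernel launch: RowMajor and VecLen are compile-time constants here.
template <typename OpT, bool RowMajor, int VecLen>
launch_record launch(OpT)
{
  return {RowMajor, VecLen};
}

// Convert the runtime vec_len into a template argument (assumed widths: 4, 2, 1).
template <typename OpT, bool RowMajor>
launch_record dispatch_vec_len(OpT op, int vec_len)
{
  switch (vec_len) {
    case 4: return launch<OpT, RowMajor, 4>(op);
    case 2: return launch<OpT, RowMajor, 2>(op);
    default: return launch<OpT, RowMajor, 1>(op);  // fall back to scalar access
  }
}

// Single entry point: both runtime values become compile-time constants in one place,
// independent of the metric that is passed in as `op`.
template <typename OpT>
launch_record dispatch(OpT op, bool row_major, int vec_len)
{
  return row_major ? dispatch_vec_len<OpT, true>(op, vec_len)
                   : dispatch_vec_len<OpT, false>(op, vec_len);
}
```

Calling `dispatch(l1_distance_op{}, true, 4)` selects the `<true, 4>` instantiation. Because the op is a template parameter, the same dispatch code serves every metric, which is the duplication that the observations above complain about.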

ahendriksen · 2023-03-10T09:44:23Z

rapids-bot closed this PR. I set the target of this PR to #1142 (so that the diff was reasonably small). Now, I cannot reopen this PR (because the target branch has been deleted) or retarget it (because this PR is closed).

I fear I have to open a new PR. I will get back to your reviews! Apologies for the inconvenience.

ahendriksen requested review from a team as code owners February 22, 2023 15:35

github-actions bot added CMake cpp labels Feb 22, 2023

ahendriksen added 2 commits February 22, 2023 16:37

Add dispatch based on compute architecture

749d000

Fix style

7262861

ahendriksen force-pushed the enh-arch-dispatch branch from 405b817 to 7262861 Compare February 22, 2023 15:39

ahendriksen added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change and removed CMake labels Feb 22, 2023

ahendriksen changed the base branch from branch-23.04 to pull-request/1142 February 22, 2023 15:40

ahendriksen requested review from tfeher and cjnolet February 22, 2023 15:41

ahendriksen mentioned this pull request Feb 22, 2023

Simplify distance/detail to make is easier to dispatch to different kernel implementations #1142

Merged

4 tasks

ahendriksen added the 3 - Ready for Review label Feb 23, 2023

Merge remote-tracking branch 'rapids/pull-request/1142' into enh-arch…

1ef8520

…-dispatch

ahendriksen mentioned this pull request Feb 28, 2023

Reduce compile times of distance specializations #1307

Merged

cjnolet assigned ahendriksen Feb 28, 2023

Fix linker error: multiple definition..

09a3050

tfeher requested changes Mar 7, 2023

ahendriksen added 2 commits March 7, 2023 12:39

Merge remote-tracking branch 'rapids/pull-request/1142' into enh-arch…

f8daf48

…-dispatch

Implement review feedback

1a6636f

robertmaynard requested changes Mar 7, 2023

ahendriksen added the 5 - Merge After Dependencies Depends on another PR: do not merge out of order label Mar 7, 2023

rapids-bot bot deleted the branch rapidsai:pull-request/1142 March 10, 2023 09:38

rapids-bot bot closed this Mar 10, 2023

ahendriksen mentioned this pull request Mar 13, 2023

Add dispatch based on compute architecture #1335

Merged

ahendriksen commented Feb 22, 2023

tfeher left a comment

ahendriksen commented Mar 7, 2023

robertmaynard Mar 7, 2023

ahendriksen Mar 7, 2023

robertmaynard Mar 7, 2023

ahendriksen Mar 7, 2023

jrhemstad Jun 15, 2023

ahendriksen Aug 8, 2023

ahendriksen commented Mar 10, 2023
