`matrix::detail::select::radix` contains two almost identical implementations of the radix MSD select algorithm: `radix_kernel` and `radix_topk_one_block_kernel`. The main difference is that the one-block kernel is tailored for somewhat larger batch sizes: it runs only one block per row, and thus does not require any inter-block communication. There is, however, a two-fold problem with the one-block kernel:
1. It uses the same function, `calc_chunk_size()`, as the normal kernel to select the CUDA grid size; as a result, very small grid sizes are sometimes selected, and the algorithm runs at very low occupancy in an inefficient loop over the input batch.
2. It allocates temporary buffers of size `input_row_length * gridDim * 2`, which can become extremely large and inefficient if we fix (1). In contrast, the normal kernel has an optimization that uses limited-size temporary buffers.
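
The two points above pull in opposite directions, which a rough host-side sketch can make concrete. It is an illustration only, not code from the library: the helper names `num_launches` and `one_block_temp_bytes`, the `elem_bytes` parameter, and the assumption that the batch is processed in sequential chunks of `chunk_size` rows are all hypothetical. Only the `input_row_length * gridDim * 2` term comes from the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: number of sequential passes over the batch when it is
// processed in chunks of `chunk_size` rows, with one block per row.
// A small chunk size means many passes, each with only a few blocks in flight.
std::size_t num_launches(std::size_t batch_size, std::size_t chunk_size)
{
  assert(chunk_size > 0);
  return (batch_size + chunk_size - 1) / chunk_size;  // ceiling division
}

// Temporary buffer footprint in bytes, using the element count from the issue:
// input_row_length * gridDim * 2. `elem_bytes` (bytes per buffered element)
// is an assumption of this sketch.
std::uint64_t one_block_temp_bytes(std::uint64_t input_row_length,
                                   std::uint64_t grid_dim,
                                   std::uint64_t elem_bytes)
{
  return input_row_length * grid_dim * 2 * elem_bytes;
}
```

For instance, under these assumptions a batch of 10000 rows with a chunk size of 16 takes 625 passes with only 16 blocks each. Raising the grid size to 1000 would address that, but with rows of 2^20 elements at 4 bytes each the formula then gives about 8.4 GB of temporary storage.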
By default, this problem is masked by the `matrix::select_k` heuristic, which simply selects a faster legacy implementation borrowed from FAISS for the problematic input sizes.