Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG][Performance] matrix::detail::select::radix: inefficient "one-block" kernel launch #1823

Closed
achirkin opened this issue Sep 13, 2023 · 0 comments · Fixed by #1878
Closed
Labels
bug Something isn't working

Comments

@achirkin
Copy link
Contributor

matrix::detail::select::radix contains two almost identical implementations of the radix MSD select algorithm. These are radix_kernel and radix_topk_one_block_kernel. The main difference is that the one-block kernel is tailored for somewhat larger batch sizes: it runs only one block per row, and thus does not require any inter-block communication. There is, however a two-fold problem with the one-block kernel:

  1. It uses the same function calc_chunk_size() to select the CUDA grid size as the normal kernel; as a result, sometimes very small grid sizes are selected and the algorithm runs at a very low occupancy in an inefficient loop over the input batch.
  2. It allocates the temporary buffers of the size input_row_length * gridDim * 2, which can become extremely large and inefficient if we fix (1). In contrast, the normal kernel has an optimization to use a limited-size temporary buffers.

By default, this problem is masked by the matrix::select_k heuristic. It just selects a faster legacy implementation borrowed from FAISS for the problematic input sizes.

@achirkin achirkin added the bug Something isn't working label Sep 13, 2023
@achirkin achirkin changed the title [Performance] matrix::detail::select::radix: inefficient "one-block" kernel launch [BUG][Performance] matrix::detail::select::radix: inefficient "one-block" kernel launch Sep 13, 2023
@rapids-bot rapids-bot bot closed this as completed in #1878 Nov 9, 2023
rapids-bot bot pushed a commit that referenced this issue Nov 9, 2023
- fix matrix::detail::select::radix::calc_chunk_size() for one-block kernel
- use `calc_buf_len()` rather than `len` as the buffer size of one-block kernel
- reduce register footprint of one-block kernel by reducing the number of buffer pointers
- reduce the buffer size by 1/8 for all radix select functions


Resolve #1823

Authors:
  - Yong Wang (https://github.com/yong-wang)
  - Corey J. Nolet (https://github.com/cjnolet)
  - Ben Frederickson (https://github.com/benfred)
  - Artem M. Chirkin (https://github.com/achirkin)

Approvers:
  - Tamas Bela Feher (https://github.com/tfeher)
  - Artem M. Chirkin (https://github.com/achirkin)
  - Ben Frederickson (https://github.com/benfred)

URL: #1878
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant