[CUDA] Fix SparseAttention Kernel #20716
Merged
Description
Currently, a single bool flag indicates whether the kernel has been loaded. However, there are v1 and v2 kernels, so the shared flag allows only one version to be loaded. We use the v1 kernel for the prompt phase and the v2 kernel for token generation, so the flag causes a failure when both are needed.
This bug was found in an integration test. The unit tests exercise only one kernel at a time, so the issue was not caught earlier.
Another possible workaround, without this fix, is to set an environment variable:
ORT_DISABLE_SPARSE_ATTENTION_V1=1
Motivation and Context