[CUDA] Fix SparseAttention Kernel #20716

Merged
merged 2 commits into from
May 18, 2024

Conversation

tianleiwu
Contributor

@tianleiwu tianleiwu commented May 17, 2024

Description

Currently, a single bool flag indicates whether the kernel is loaded. However, there are v1 and v2 kernels, so the flag allows only one version of the kernel to be loaded. We use the v1 kernel for the prompt and the v2 kernel for token generation, so the flag causes an issue when we want both prompt and token generation.

This bug was found in an integration test. The unit tests only test one kernel at a time, so the issue was not found before.

A possible workaround without this fix is to set the environment variable ORT_DISABLE_SPARSE_ATTENTION_V1=1.
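The failure mode described above can be illustrated with a minimal sketch. This is not the code from the PR: `LoadV1Kernels`, `LoadV2Kernels`, `SharedFlagLoader`, and `PerVersionLoader` are hypothetical names invented for illustration, and the real loaders are replaced by counters. The sketch only assumes what the description states: one flag guarded both kernel versions, so loading one version prevented the other from loading.

```cpp
#include <cassert>

// Hypothetical stand-ins for the real kernel loaders (assumption, not ORT code).
// Each one only counts how many times it was invoked.
static int v1_load_count = 0;
static int v2_load_count = 0;
inline void LoadV1Kernels() { ++v1_load_count; }
inline void LoadV2Kernels() { ++v2_load_count; }

// Buggy pattern: one flag shared by both kernel versions.
struct SharedFlagLoader {
  bool loaded = false;

  void EnsureLoaded(bool use_v2) {
    if (loaded) return;  // whichever version loads first blocks the other
    if (use_v2) {
      LoadV2Kernels();
    } else {
      LoadV1Kernels();
    }
    loaded = true;
  }
};

// Fixed pattern: one flag per kernel version.
// Thread safety is deliberately left out of this sketch.
struct PerVersionLoader {
  bool v1_loaded = false;
  bool v2_loaded = false;

  void EnsureLoaded(bool use_v2) {
    bool& flag = use_v2 ? v2_loaded : v1_loaded;
    if (flag) return;  // only skips the version that is already loaded
    if (use_v2) {
      LoadV2Kernels();
    } else {
      LoadV1Kernels();
    }
    flag = true;
  }
};

// Demonstrates a prompt step (v1) followed by a token generation step (v2).
inline void DemoPromptThenGeneration() {
  v1_load_count = 0;
  v2_load_count = 0;

  SharedFlagLoader buggy;
  buggy.EnsureLoaded(/*use_v2=*/false);  // prompt
  buggy.EnsureLoaded(/*use_v2=*/true);   // token generation
  assert(v2_load_count == 0);            // v2 was never loaded

  PerVersionLoader fixed;
  fixed.EnsureLoaded(/*use_v2=*/false);
  fixed.EnsureLoaded(/*use_v2=*/true);
  assert(v2_load_count == 1);            // v2 is loaded alongside v1
}
```

A unit test that exercises only one version per process would pass with either loader, which is consistent with the bug surfacing only in an integration test that runs prompt and token generation together.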

Motivation and Context

@hanbitmyths hanbitmyths merged commit 2e7de54 into main May 18, 2024
96 checks passed
@hanbitmyths hanbitmyths deleted the tlwu/fix_sparse_attention_kernel_load branch May 18, 2024 05:42
tianleiwu added a commit that referenced this pull request May 21, 2024
@sophies927 sophies927 added the triage:approved Approved for cherrypicks for release label Jun 11, 2024
baijumeswani pushed a commit that referenced this pull request Jun 20, 2024
Labels
release:1.18.1 triage:approved Approved for cherrypicks for release

4 participants