feat: non-contiguous query with paged kv cache #553

Merged

yzh119 merged 2 commits into flashinfer-ai:main from LinHeLurking:feature/non-contiguous-query

Oct 25, 2024

Contributor

LinHeLurking commented Oct 24, 2024

Motivation

Previously, only the ragged version of the prefill kernel supported a non-contiguous query tensor (#404). With a paged kv cache, the query tensor had to be contiguous, so libraries like vLLM or SGLang must make the query tensor contiguous before calling flashinfer kernels (vLLM call of flashinfer, SGLang call of flashinfer). This PR removes that restriction: the prefill and decode kernels with paged kv cache now support a non-contiguous query tensor.

Main Changes

Add strides of query tensor in BatchPrefillPagedParams and BatchDecodeParams.
Set stride parameters before calling those kernels.
Modify JIT compiling templates to support new kernel parameters.
Add some tests.

The Python interfaces remain the same. Nothing changes except it accepts non-contiguous query tensors now!
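To make the stride idea concrete, here is a minimal pure-Python sketch, not FlashInfer code. It assumes a fused QKV buffer laid out row by row, with the query occupying the first columns of each row, so the query view is non-contiguous. The names `q_stride_n`, `q_stride_h`, `make_fused_qkv` and `query_offset` are hypothetical and chosen for illustration only. The sketch shows that reading through explicit strides returns the same elements as first copying the query into a contiguous buffer.

```python
# Illustrative sketch only: these helpers are hypothetical, not FlashInfer APIs.

def make_fused_qkv(num_tokens, num_qo_heads, num_kv_heads, head_dim):
    """Build a flat row-major buffer shaped [num_tokens, (Hq + 2*Hkv) * head_dim]."""
    row_width = (num_qo_heads + 2 * num_kv_heads) * head_dim
    return [float(i) for i in range(num_tokens * row_width)], row_width


def query_offset(token, head, dim, q_stride_n, q_stride_h):
    """Element offset of q[token, head, dim]; the last dimension is assumed dense."""
    return token * q_stride_n + head * q_stride_h + dim


num_tokens, num_qo_heads, num_kv_heads, head_dim = 3, 4, 2, 8
buf, row_width = make_fused_qkv(num_tokens, num_qo_heads, num_kv_heads, head_dim)

# Strides of the query *view* inside the fused buffer (assumed layout):
# consecutive tokens are a full fused row apart, not num_qo_heads * head_dim apart.
q_stride_n = row_width
q_stride_h = head_dim

# What a caller had to do before: copy the query slice into a contiguous buffer.
q_width = num_qo_heads * head_dim
q_copy = [buf[t * row_width + c] for t in range(num_tokens) for c in range(q_width)]


def read_strided(token, head, dim):
    """Read the query element directly from the fused buffer via strides."""
    return buf[query_offset(token, head, dim, q_stride_n, q_stride_h)]


def read_contiguous(token, head, dim):
    """Read the same element from the contiguous copy (strides q_width, head_dim)."""
    return q_copy[query_offset(token, head, dim, q_width, head_dim)]


# Both access paths agree for every query element, so the copy is unnecessary.
all_match = all(
    read_strided(t, h, d) == read_contiguous(t, h, d)
    for t in range(num_tokens)
    for h in range(num_qo_heads)
    for d in range(head_dim)
)
print(all_match)
```

Passing the strides as kernel parameters, as the change list describes, lets the kernel take the first access path and skip the copy.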


          feat: non-contiguous query with paged kv cache

873c0aa

Signed-off-by: LinHeLurking <[email protected]>

reyoung reviewed

flashinfer-aot/csrc_aot/batch_prefill.cu (outdated)

reyoung reviewed

flashinfer-aot/csrc_aot/batch_prefill.cu

reyoung reviewed

flashinfer-aot/csrc_aot/batch_prefill.cu (outdated)

reyoung reviewed

include/flashinfer/attention/decode_params.cuh (outdated)

reyoung reviewed

include/flashinfer/attention/decode.cuh

reyoung reviewed

include/flashinfer/attention/prefill.cuh (outdated)


          code: clean up

c117ccd

Signed-off-by: LinHeLurking <[email protected]>

yzh119 approved these changes

Collaborator

yzh119 left a comment

Thanks for your contribution @LinHeLurking, and thanks to @reyoung for the review!

include/flashinfer/attention/decode.cuh

yzh119 merged commit 89f2c4a into flashinfer-ai:main

github-actions bot mentioned this pull request

chore(main): release 0.2.0 #476

Open

yzh119 mentioned this pull request

perf: remove unnecessary contiguous operation in block sparse attention #561

Merged

tsu-bin added a commit to tsu-bin/flashinfer_dev that referenced this pull request


          fix broken cpp integration caused by flashinfer-ai#553

e29b9ce

tsu-bin mentioned this pull request

fix broken cpp integration caused by #553 #570

Merged

yzh119 pushed a commit that referenced this pull request


          bugfix: fix broken cpp integration caused by #553 (#570)

e46d9a7

Hi, when I tried to rebase my current work, I found that the cpp integration
(benchmark and test) failed to build. This was introduced by the feature
#553.
Tests have passed.

Co-authored-by: tsu-bin <[email protected]>
