This PR reverts #4820 by adding back `flash-attn`. Previously, using `flash-attn` for decoding caused errors with small models (like the Llama 68M model in `lora/test_layer_variation.py`). This was because the index calculation for the paged KV cache was done in `int` instead of `int64_t`, leading to integer overflow when `num_blocks` is large. In `vllm-flash-attn==2.5.8.post2`, the overflow bug was fixed.
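For illustration only, a minimal standalone sketch of this failure mode; the dimension names and values below are hypothetical and chosen to make the arithmetic concrete, not taken from the actual `vllm-flash-attn` kernel:

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    // Hypothetical paged-KV-cache shape (illustrative values only).
    const int block_size = 16;   // tokens per KV cache block
    const int num_heads  = 32;
    const int head_size  = 128;
    const int elems_per_block = block_size * num_heads * head_size;  // 65536

    // A block index that is plausible once num_blocks grows large.
    const int block_idx = 40000;

    // Buggy: all operands are 32-bit, so the product exceeds INT_MAX
    // (40000 * 65536 = 2,621,440,000) before it can be widened.
    // Signed overflow is undefined behavior; in practice it wraps.
    int bad_offset = block_idx * elems_per_block;

    // Fixed: widen one operand to 64-bit so the multiply happens in int64_t.
    int64_t good_offset = static_cast<int64_t>(block_idx) * elems_per_block;

    printf("32-bit offset: %d\n", bad_offset);
    printf("64-bit offset: %lld\n", (long long)good_offset);
    return 0;
}
```

The fix pattern is the same in the kernel: widening the index computation to `int64_t` keeps the byte/element offset correct no matter how many cache blocks are allocated.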