Fix flash attention GQA bug to use the dynamic size of the key/value tensors - used for eval/inference #756

sashaDoubov · 2023-11-21T23:12:23Z

This issue shows up with flash attention during inference or when running ICL tasks: the .view() call raises a runtime error because it assumes the wrong size for key_unpad and value_unpad. Thanks to @ShashankMosaicML for flagging!
Example:
RuntimeError: shape '[2, 1024, 5, -1]' is invalid for input of size 192000
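
To make the arithmetic behind this error concrete, here is a minimal pure-Python sketch of how a view resolves a -1 dimension. The helper infer_view_shape is hypothetical (it is not part of llm-foundry or PyTorch), and head_dim = 64 is an assumed value chosen only for illustration. The sketch assumes the unpadded key tensor holds one row per real token, so its leading dimension need not equal batch_size * max_seqlen.

```python
from math import prod


def infer_view_shape(numel, shape):
    """Hypothetical helper: resolve a single -1 in `shape` the way a view would."""
    if shape.count(-1) != 1:
        raise ValueError("expected exactly one -1 in shape")
    known = prod(d for d in shape if d != -1)
    if known == 0 or numel % known != 0:
        raise RuntimeError(
            f"shape '{list(shape)}' is invalid for input of size {numel}")
    return tuple(numel // known if d == -1 else d for d in shape)


# Numbers taken from the error message above.
numel = 192000
batch_size, max_seqlen, kv_n_heads = 2, 1024, 5

# Static shape: assumes every sequence is padded out to max_seqlen.
# 2 * 1024 * 5 = 10240 does not divide 192000, so the view fails.
try:
    infer_view_shape(numel, (batch_size, max_seqlen, kv_n_heads, -1))
    static_ok = True
    static_error = None
except RuntimeError as err:
    static_ok = False
    static_error = str(err)

# Dynamic shape: take the token count from the tensor itself.
head_dim = 64  # assumption for illustration only
total_tokens = numel // (kv_n_heads * head_dim)  # stands in for key_unpad.size(0)
dynamic_shape = infer_view_shape(numel, (total_tokens, kv_n_heads, -1))

print(static_ok, static_error)
print(dynamic_shape)
```

Under these assumptions the static view fails with the same message as above, while the view built from the tensor's own leading dimension succeeds regardless of how many padding tokens were removed.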

Tested with 3B models; the loss is the same with and without the fix.

Eval only works with the fix.


sashaDoubov added 7 commits November 14, 2023 22:15

fix attn impl not being set

60cdf36

change d_model and increase tolerance

f603b46

add special case

51a43b1

undo typo

d915934

fix unpacking

045b473

Merge branch 'main' of github.com:mosaicml/llm-foundry into sasha/fix…

bc7500d

…-flash-attn-seq-len

Merge branch 'main' of github.com:mosaicml/llm-foundry into sasha/fix…

05e0062

…-flash-attn-seq-len

sashaDoubov requested review from dakinggg and ShashankMosaicML November 21, 2023 23:12

add padding param to attention test

129ab2a

ShashankMosaicML approved these changes Nov 21, 2023


sashaDoubov merged commit 1793c36 into mosaicml:main Nov 21, 2023
10 checks passed
