Differences between Triton and Cuda implementations #4

Open
gabeweisz opened this issue Aug 6, 2024 · 3 comments

Comments

@gabeweisz

The Triton version of the code uses the block maximum when renormalizing after all blocks have been processed, while the CUDA version does not. My guess is that this is because the CUDA flash attention library doesn't return the block maximum, and it would be complicated to update the code to do so.

Can you comment on the numerical effects of this?
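
For context, here is a minimal NumPy sketch (not the repository's actual code; all names and shapes are illustrative) of the two merge strategies being compared. It assumes each KV block k has produced an un-normalized accumulator acc[k] shifted by its own score maximum, the row-wise score maximum m[k], and the shifted softmax denominator l[k]:

```python
import numpy as np

# Illustrative per-block state from a flash-attention pass over block k:
#   acc[k] = sum_j exp(s_j - m[k]) * v_j   (un-normalized output)
#   m[k]   = row-wise maximum of the scores in block k
#   l[k]   = sum_j exp(s_j - m[k])         (shifted softmax denominator)

def merge_with_block_max(acc, m, l):
    """Merge partial results using the block maxima (the Triton-style path)."""
    m_global = m.max(axis=0)                    # (rows,)
    scale = np.exp(m - m_global)                # (blocks, rows), all <= 1
    out = np.einsum("kr,krd->rd", scale, acc)   # rescale and sum accumulators
    denom = np.sum(scale * l, axis=0)           # rescale and sum denominators
    return out / denom[:, None]

def merge_with_lse_only(acc, m, l):
    """Merge using only lse, as when the kernel returns lse but not the block max."""
    o = acc / l[..., None]                      # per-block normalized outputs
    lse = m + np.log(l)                         # per-block log-sum-exp
    lse_global = np.logaddexp.reduce(lse, axis=0)
    w = np.exp(lse - lse_global)                # per-row weights, sum to 1
    return np.einsum("kr,krd->rd", w, o)
```

In exact arithmetic the two merges give the same answer; the lse-only path just puts all of the numerical burden on how precisely lse was stored, which is what the exchange below is about.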

@MayDomine
Owner

Yes, you are right.
We believe that if you use fp32 to store all of these scale variables, such as m_i or lse, it works well.
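
A back-of-the-envelope illustration of that point (the numbers below are made up, not measured from this repository): if lse is stored in fp16 instead of fp32, the rounding error alone perturbs the merge weight exp(lse - lse_global) by a few percent.

```python
import numpy as np

# Hypothetical lse magnitude; fp16 spacing near 87 is 0.0625, so storing it
# in fp16 loses up to ~0.03 of absolute precision.
lse_fp32 = np.float32(87.4)
lse_fp16 = np.float16(lse_fp32)                  # rounds to 87.375

# Relative error factor this introduces into exp(lse - lse_global):
print(np.exp(np.float32(lse_fp16) - lse_fp32))   # ~0.975, i.e. ~2.5% off
```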

@gabeweisz
Author

So you are saying that with the CUDA version you don't need m_i if you use fp32?

@MayDomine
Owner

MayDomine commented Aug 6, 2024 via email
