The Triton version of the code uses the block maximum when renormalizing after all blocks have been processed, while the CUDA version does not. My guess is that this is because the CUDA flash attention library doesn't return the block maximum, and it would be complicated to update the code to do so.
Can you comment on the numerical effects of this?
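For concreteness, here is a minimal toy sketch of the two strategies as I understand them (my own example, not the repository's actual kernels; the names `m`, `l`, and `lse` are my notation). The Triton-style pass carries the running block maximum `m` and sum `l` and rescales the accumulator as it goes; the CUDA-style combine only sees per-block log-sum-exp values, since `lse = m + log(l)` folds the maximum in. In exact arithmetic the two agree:

```python
# Toy comparison of the two renormalization strategies (assumed setup,
# not the repository's actual kernels).
import torch

torch.manual_seed(0)
d = 16
q = torch.randn(1, d)          # a single query row
k = torch.randn(2, 8, d)       # two key blocks of 8 keys each
v = torch.randn(2, 8, d)

# Reference: softmax over all keys at once.
scores = q @ k.reshape(-1, d).T / d ** 0.5
ref = torch.softmax(scores, dim=-1) @ v.reshape(-1, d)

# Triton-style: carry the running block maximum m and sum l, rescaling the
# accumulator whenever a later block raises the maximum.
m = torch.full((1,), float("-inf"))
l = torch.zeros(1)
acc = torch.zeros(1, d)
for blk in range(2):
    s = q @ k[blk].T / d ** 0.5
    m_new = torch.maximum(m, s.max(dim=-1).values)
    p = torch.exp(s - m_new)       # exponent <= 0, so no overflow
    scale = torch.exp(m - m_new)   # rescale what was accumulated so far
    l = l * scale + p.sum(dim=-1)
    acc = acc * scale.unsqueeze(-1) + p @ v[blk]
    m = m_new
out_triton_style = acc / l.unsqueeze(-1)

# CUDA-style: each block returns a normalized partial output plus its
# log-sum-exp; the combine step never sees the raw block maximum.
outs, lses = [], []
for blk in range(2):
    s = q @ k[blk].T / d ** 0.5
    lses.append(torch.logsumexp(s, dim=-1))
    outs.append(torch.softmax(s, dim=-1) @ v[blk])
lse_all = torch.logsumexp(torch.stack(lses), dim=0)
out_cuda_style = sum(torch.exp(lse - lse_all).unsqueeze(-1) * o
                     for lse, o in zip(lses, outs))

print(torch.allclose(ref, out_triton_style, atol=1e-5))  # True
print(torch.allclose(ref, out_cuda_style, atol=1e-5))    # True
```

Note that in the CUDA-style combine the exponents `lse - lse_all` are all non-positive, so the weights lie in (0, 1] and cannot overflow; the numerical question is only how much precision the stored `lse` values carry.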
So you are saying that with the CUDA version you don't need m_i if you use fp32?
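A quick way to see the precision side of that question (again a toy setup I made up, not the repo's code): store the per-block log-sum-exp in fp16 versus fp32 before combining, and compare against a full softmax. Since the combine weights are at most 1 there is no overflow either way, but fp16 `lse` values only keep about three decimal digits:

```python
# Assumed toy check: how much accuracy the combine step loses if the
# per-block log-sum-exp is stored in fp16 instead of fp32.
import torch

torch.manual_seed(0)
s = torch.randn(2, 1, 64) * 4 + 40.0   # two blocks of large-ish scores
v = torch.randn(2, 64, 8)

# Reference: one softmax over both blocks.
ref = torch.softmax(s.reshape(1, -1), dim=-1) @ v.reshape(-1, 8)

def combine(lse_dtype):
    lses = torch.logsumexp(s, dim=-1).to(lse_dtype)  # per-block lse
    outs = torch.softmax(s, dim=-1) @ v              # normalized block outputs
    lse_all = torch.logsumexp(lses.float(), dim=0)
    w = torch.exp(lses.float() - lse_all)            # weights in (0, 1]
    return (w.unsqueeze(-1) * outs).sum(dim=0)

for dt in (torch.float16, torch.float32):
    err = (combine(dt) - ref).abs().max().item()
    print(dt, err)  # the fp16 lse gives a visibly larger error than fp32
```

The rounding of `lse` to fp16 perturbs the combine weights by up to roughly the fp16 ulp at that magnitude, which is why keeping the log-sum-exp in fp32 matters even though nothing overflows.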