Differences between Triton and Cuda implementations #4

Open
gabeweisz opened this issue Aug 6, 2024 · 3 comments

Comments

@gabeweisz

The Triton version of the code uses the block maximum when renormalizing after all blocks have been processed, while the CUDA version does not. My guess is that this is because the CUDA flash attention library doesn't return the block maximum, and it would be complicated to update the code to do so.

Can you comment on the numerical effects of this?
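
For context, here is a minimal NumPy sketch (not the repository's actual code; all names and shapes are illustrative) of the two merge strategies being compared. It assumes each KV block k has produced an un-normalized accumulator acc[k] shifted by its own score maximum, the row-wise score maximum m[k], and the shifted softmax denominator l[k]:

```python
import numpy as np

# Illustrative per-block state from a flash-attention pass over block k:
#   acc[k] = sum_j exp(s_j - m[k]) * v_j   (un-normalized output)
#   m[k]   = row-wise maximum of the scores in block k
#   l[k]   = sum_j exp(s_j - m[k])         (shifted softmax denominator)

def merge_with_block_max(acc, m, l):
    """Merge partial results using the block maxima (the Triton-style path)."""
    m_global = m.max(axis=0)                    # (rows,)
    scale = np.exp(m - m_global)                # (blocks, rows), all <= 1
    out = np.einsum("kr,krd->rd", scale, acc)   # rescale and sum accumulators
    denom = np.sum(scale * l, axis=0)           # rescale and sum denominators
    return out / denom[:, None]

def merge_with_lse_only(acc, m, l):
    """Merge using only lse, as when the kernel returns lse but not the block max."""
    o = acc / l[..., None]                      # per-block normalized outputs
    lse = m + np.log(l)                         # per-block log-sum-exp
    lse_global = np.logaddexp.reduce(lse, axis=0)
    w = np.exp(lse - lse_global)                # per-row weights, sum to 1
    return np.einsum("kr,krd->rd", w, o)
```

In exact arithmetic the two merges give the same answer; the lse-only path just puts all of the numerical burden on how precisely lse was stored, which is what the exchange below is about.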

@MayDomine
Owner

Yes, you are right.
We believe that if you use fp32 to store all of these scale variables, such as m_i or lse, it works well.
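
A back-of-the-envelope illustration of that point (the numbers below are made up, not measured from this repository): if lse is stored in fp16 instead of fp32, the rounding error alone perturbs the merge weight exp(lse - lse_global) by a few percent.

```python
import numpy as np

# Hypothetical lse magnitude; fp16 spacing near 87 is 0.0625, so storing it
# in fp16 loses up to ~0.03 of absolute precision.
lse_fp32 = np.float32(87.4)
lse_fp16 = np.float16(lse_fp32)                  # rounds to 87.375

# Relative error factor this introduces into exp(lse - lse_global):
print(np.exp(np.float32(lse_fp16) - lse_fp32))   # ~0.975, i.e. ~2.5% off
```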

@gabeweisz
Author

So you are saying that with the CUDA version you don't need m_i if you use fp32?

@MayDomine
Owner

MayDomine commented Aug 6, 2024 via email
