[QST] Question on customize epilogue reduction #1301
Comments
Which hardware do you use? What is the data type? Do you want to use tensor cores?
Thanks! I am using an A100, and both CUTLASS 2.x and 3.x are suitable for me. The data type of A and B are My custom kernel is basically similar to pooling, which takes Specifically, if there is any interface, I could easily implement it:
I am concerned about memory consumption, so I don't want to store the 2m-by-2n matrix in global memory and launch another kernel to perform this operation.
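To make the memory concern concrete, here is a minimal host-side C++ sketch of the fused computation: each 2-by-2 tile of the product is held in four local accumulators and reduced before anything is written, so only the m-by-n result is stored. This is plain CPU code, not CUTLASS; the function name `fused_tile_sum_gemm`, the row-major layout, and the `float` data type are assumptions made for illustration, and it covers only the plain-sum case from the question.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference routine (not a CUTLASS API).
// A is (2m x k), B is (k x 2n), C and D are (m x n), all row-major.
// D[i][j] = C[i][j] + sum of the 2x2 tile of A*B at rows 2i..2i+1, cols 2j..2j+1.
std::vector<float> fused_tile_sum_gemm(const std::vector<float>& A,
                                       const std::vector<float>& B,
                                       const std::vector<float>& C,
                                       std::size_t m, std::size_t n, std::size_t k) {
    assert(A.size() == 2 * m * k);
    assert(B.size() == k * 2 * n);
    assert(C.size() == m * n);

    std::vector<float> D(m * n);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            // Four accumulators play the role of the per-tile registers.
            float acc00 = 0.0f, acc01 = 0.0f, acc10 = 0.0f, acc11 = 0.0f;
            for (std::size_t kk = 0; kk < k; ++kk) {
                const float a0 = A[(2 * i) * k + kk];
                const float a1 = A[(2 * i + 1) * k + kk];
                const float b0 = B[kk * (2 * n) + 2 * j];
                const float b1 = B[kk * (2 * n) + 2 * j + 1];
                acc00 += a0 * b0;
                acc01 += a0 * b1;
                acc10 += a1 * b0;
                acc11 += a1 * b1;
            }
            // "Epilogue": reduce the tile, add C, and store a single element.
            D[i * n + j] = C[i * n + j] + (acc00 + acc01) + (acc10 + acc11);
        }
    }
    return D;
}
```

The 2m-by-2n intermediate never exists in memory here; whether the same structure can be expressed in a CUTLASS epilogue depends on how accumulator elements are distributed across threads.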
In 2.x, you can get row coordinate from every threads own several fragments. Every fragment owns you can first dump row coordinates and check the mapping between thread ID and row coordinate. All the mapping information you need is actually in Don't forget to change the memory pointer at last (
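The idea of recovering coordinates for each accumulator element can be illustrated with a small host-side C++ sketch. It assumes each element is already tagged with its (row, col) position in the 2m-by-2n product, and folds it into output position (row / 2, col / 2). The `AccumElement` struct and `reduce_fragment_into_tiles` function are hypothetical names for this illustration, not CUTLASS types, and the sketch does not model the real thread-to-coordinate mapping.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical: one accumulator element with its coordinates in the
// (2m x 2n) product. In a real kernel these coordinates would come from
// the epilogue's own thread mapping, which is not modeled here.
struct AccumElement {
    int row;
    int col;
    float value;
};

// Adds every element of a fragment into the row-major (m x n) output D,
// where all four elements of a 2x2 tile land on the same output entry.
void reduce_fragment_into_tiles(const std::vector<AccumElement>& fragment,
                                std::vector<float>& D, std::size_t n) {
    for (const AccumElement& e : fragment) {
        assert(e.row >= 0 && e.col >= 0);
        const std::size_t out_row = static_cast<std::size_t>(e.row) / 2;
        const std::size_t out_col = static_cast<std::size_t>(e.col) / 2;
        assert(out_col < n);
        assert(out_row * n + out_col < D.size());
        D[out_row * n + out_col] += e.value;
    }
}
```

On a GPU, the four elements of one tile may be held by different threads, in which case the accumulation would need some form of cross-thread communication; dumping the coordinates first, as suggested above, is how one would find out.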
Thank you for the detailed reply. I'll try it later.
@zejia-lin have you resolved your issue?
I am sorry for the late response. I found that I was not able to resolve it with reasonable effort. I am closing this issue.
What is your question?
Hello, I found that many epilogues are element-wise. I wonder whether an epilogue could be customized to sum up a 2*2 tile instead of performing an element-wise operation. That is, for D = AB + C, where A is a (m*2, k) matrix, B is a (k, n*2) matrix, and C and D are (m, n) matrices: while AB produces a (m*2, n*2) matrix, is it possible to sum up every 2*2 tile of the output matrix and produce a (m, n) matrix?
Many thanks for any advice.
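For reference, the requested semantics can be written as a short two-step C++ routine: compute the full product, then sum each 2*2 tile and add C. This is only a CPU specification of the expected result, useful for checking a fused kernel against; the `Matrix` helper and the function name `tile_sum_gemm_reference` are assumptions made for this sketch, and `float` stands in for whatever data type is actually used.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical row-major dense matrix helper for this sketch.
struct Matrix {
    std::size_t rows, cols;
    std::vector<float> data;
    Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0f) {}
    float& at(std::size_t i, std::size_t j) { return data[i * cols + j]; }
    float at(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// A is (2m x k), B is (k x 2n), C is (m x n). Returns D (m x n) with
// D(i, j) = C(i, j) + sum of the 2x2 tile of A*B at (2i, 2j).
Matrix tile_sum_gemm_reference(const Matrix& A, const Matrix& B, const Matrix& C) {
    assert(A.cols == B.rows);
    assert(A.rows == 2 * C.rows && B.cols == 2 * C.cols);

    // Step 1: the full (2m x 2n) product, materialized on purpose here.
    Matrix P(A.rows, B.cols);
    for (std::size_t i = 0; i < A.rows; ++i)
        for (std::size_t j = 0; j < B.cols; ++j)
            for (std::size_t k = 0; k < A.cols; ++k)
                P.at(i, j) += A.at(i, k) * B.at(k, j);

    // Step 2: sum every 2x2 tile and add the matching element of C.
    Matrix D(C.rows, C.cols);
    for (std::size_t i = 0; i < C.rows; ++i)
        for (std::size_t j = 0; j < C.cols; ++j)
            D.at(i, j) = C.at(i, j)
                       + P.at(2 * i, 2 * j) + P.at(2 * i, 2 * j + 1)
                       + P.at(2 * i + 1, 2 * j) + P.at(2 * i + 1, 2 * j + 1);
    return D;
}
```

For example, with A = [1; 2], B = [3, 4], and C = [10], the product is [[3, 4], [6, 8]], so the single output element is 10 + 21 = 31.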