Start work on wave 64 optimization #11495
Conversation
As a reminder, AMD GCN and CDNA GPUs are wave/warp 64, and RDNA GPUs are wave 32 or 64, selectable at runtime.
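For context, a minimal sketch of how the wave/warp size can be queried at runtime from the device properties. This is illustrative only, not code from this PR; it assumes the usual CUDA runtime API, which ggml's HIP build maps to the hip* equivalents, so `prop.warpSize` reports 64 on GCN/CDNA and 32 or 64 on RDNA:

    // Illustrative sketch only: query the per-device wave/warp size at runtime.
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int device_count = 0;
        cudaGetDeviceCount(&device_count);
        for (int id = 0; id < device_count; ++id) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, id);
            // warpSize is the hardware wave size (32 on NVIDIA, 32 or 64 on AMD).
            printf("device %d (%s): warp/wave size = %d\n", id, prop.name, prop.warpSize);
        }
        return 0;
    }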
Generally speaking I welcome support for warp sizes != 32. I will say though that I would not be willing to maintain variable warp sizes for MMQ and FlashAttention which are already very time-consuming as-is.
ggml/src/ggml-cuda/ggml-cuda.cu
Outdated
info.devices[id].nsm = prop.multiProcessorCount;
info.devices[id].smpb = prop.sharedMemPerBlock;
info.devices[id].warp_size = prop.warpSize;
Suggested change:
-    info.devices[id].nsm = prop.multiProcessorCount;
-    info.devices[id].smpb = prop.sharedMemPerBlock;
-    info.devices[id].warp_size = prop.warpSize;
+    info.devices[id].nsm       = prop.multiProcessorCount;
+    info.devices[id].smpb      = prop.sharedMemPerBlock;
+    info.devices[id].warp_size = prop.warpSize;
Done, but I hate this; this kind of alignment makes merges harder, blame less useful, and diffs harder to read.
ggml/src/ggml-cuda/common.cuh
Outdated
-for (int offset = 16; offset > 0; offset >>= 1) {
-    const half2 a_other = __shfl_xor_sync(0xffffffff, a, offset, 32);
-    reinterpret_cast<half&>(a.x) += __low2half(a_other);
-    reinterpret_cast<half&>(a.y) += __high2half(a_other);
+for (int offset = width/2; offset > 0; offset >>= 1) {
+    a = __hadd2(a, __shfl_xor_sync(0xffffffff, a, offset, width));
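For readers following the diff above, a self-contained sketch of what the width-parameterized reduction looks like as a whole. The template name, default `WARP_SIZE`, and `half2` alias are assumptions modeled on ggml's CUDA headers, not necessarily the PR's exact code:

    // Hedged sketch: warp reduction over packed half2, parameterized by width.
    #include <cuda_fp16.h>

    #define WARP_SIZE 32        // assumed default; the point of the PR is that 64 must also work
    typedef __half2 half2;      // assumed alias, matching ggml-cuda conventions

    template <int width = WARP_SIZE>
    static __device__ __forceinline__ half2 warp_reduce_sum(half2 a) {
    #pragma unroll
        for (int offset = width/2; offset > 0; offset >>= 1) {
            // XOR shuffle pairs lanes `offset` apart; after log2(width) steps
            // every lane in the (sub)warp holds the full sum.
            a = __hadd2(a, __shfl_xor_sync(0xffffffff, a, offset, width));
        }
        return a;
    }

Compared to the old code, this drops the fixed offset of 16 and shuffle width of 32 and lets the caller pick the width, which is what allows a wave-64 instantiation.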
I don't remember why there was a float conversion here. I don't think it was for performance reasons though.
I can't think of a reason, and the fattn operators still check out on gfx908 and gfx1030.
No idea why the macOS-latest-cmake-arm64 CI is failing its operator test; that doesn't seem like it could be the fault of this PR.
@IMbackK Note to squash-merge PRs in the future.
@ggerganov Understood, will do. I merged these as-is since af71052 and a151674 don't directly depend on each other and can be independently reverted if issues arise.
No problem. This is generally fine to do. The important thing is for commit titles to have the
ggml assumes that the wave size is 32 all over the place.
This PR adds some additional code to begin supporting kernels optimized for wave 64; this includes: