Start work on wave 64 optimization #11495

Merged (3 commits, Jan 30, 2025)

Conversation

@IMbackK (Collaborator) commented Jan 29, 2025

ggml assumes that the wave size is 32 all over the place.

This adds some additional code to begin supporting wave 64 kernel optimizations. This includes:

  1. Currently WARP_SIZE simply defines the assumed warp size to be 32 at compile time, even in host code.
    • This PR adds warp_size to the cuda_device_info struct, since we cannot know the warp size of the device in host code at compile time.
    • For now this is not used anywhere, but slowly migrating the host-side code to it is necessary.
  2. Adds the width to the reduction operators as a template parameter (a rough sketch of points 1 and 2 follows this list).
    • For now this also affects nothing, as the template parameter is always 32.
    • Privately I have converted gemv and softmax to a selectable wave size, but as WAVE_SIZE is set globally and other kernels assume it to be 32, this is too hacky to PR in at this time.
  3. Updates the builtin usage in the reduction operators on HIP and adds support for warp_reduce_max and ggml_cuda_hmax2.
    • This raises the minimum supported ROCm version to 5.5. I don't believe this is an issue, however, as I am unaware of any distro that packages <5.7, and no GPU that worked in ROCm 5.5 fails to also work in 6.0+ (this includes gfx900 and gfx803).
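
A rough sketch of what points 1 and 2 amount to, using simplified stand-in names (cuda_device_info_sketch, query_device) rather than the exact ggml declarations; the reduction width becomes a template parameter that defaults to 32, and the host side records the device's real warp size from the runtime instead of assuming it at compile time:

```cpp
// Sketch only: simplified stand-ins for the ggml structures, not the committed code.
#include <cuda_runtime.h>

// Reduction width as a template parameter instead of a hard-coded 32.
template <int width = 32>
static __device__ __forceinline__ float warp_reduce_sum(float x) {
#pragma unroll
    for (int offset = width/2; offset > 0; offset >>= 1) {
        x += __shfl_xor_sync(0xffffffff, x, offset, width);
    }
    return x;
}

// Host side: the warp size is a runtime property of the device, so it is
// stored per device instead of being assumed to be 32 at compile time.
struct cuda_device_info_sketch {
    int nsm;        // number of multiprocessors
    int warp_size;  // 32 on NVIDIA and RDNA (by default), 64 on GCN/CDNA
};

static cuda_device_info_sketch query_device(int id) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, id);
    cuda_device_info_sketch info;
    info.nsm       = prop.multiProcessorCount;
    info.warp_size = prop.warpSize;
    return info;
}
```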

@IMbackK (Collaborator, Author) commented Jan 29, 2025

As a reminder, AMD GCN and CDNA GPUs are wave/warp 64, while RDNA GPUs are wave 32 or 64, selectable at runtime.
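
For illustration (not part of this PR), a short HIP snippet using only hipGetDeviceProperties shows how host code can see which wave size the runtime reports per device; GCN/CDNA devices report 64, RDNA devices typically report 32:

```cpp
// Illustration only: print the wave size HIP reports for each device.
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    int count = 0;
    if (hipGetDeviceCount(&count) != hipSuccess) {
        return 1;
    }
    for (int id = 0; id < count; ++id) {
        hipDeviceProp_t prop;
        hipGetDeviceProperties(&prop, id);
        std::printf("device %d (%s): warpSize = %d\n", id, prop.name, prop.warpSize);
    }
    return 0;
}
```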

@github-actions bot added the labels Nvidia GPU (issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) on Jan 29, 2025.
@JohannesGaessler (Collaborator) left a comment:

Generally speaking, I welcome support for warp sizes != 32. I will say, though, that I would not be willing to maintain variable warp sizes for MMQ and FlashAttention, which are already very time-consuming as-is.

Two further review threads on ggml/src/ggml-cuda/common.cuh are marked outdated and resolved.
@JohannesGaessler (Collaborator) commented on lines 243 to 245:

```cpp
info.devices[id].nsm  = prop.multiProcessorCount;
info.devices[id].smpb = prop.sharedMemPerBlock;
info.devices[id].warp_size = prop.warpSize;
```

Suggested change (whitespace only, aligning the assignments):

```cpp
info.devices[id].nsm       = prop.multiProcessorCount;
info.devices[id].smpb      = prop.sharedMemPerBlock;
info.devices[id].warp_size = prop.warpSize;
```

@IMbackK (Collaborator, Author) replied:

Done, but I hate this: this kind of alignment makes merges harder, blame less useful, and diffs harder to read.

Comment on lines 227 to 230:

```diff
-    for (int offset = 16; offset > 0; offset >>= 1) {
-        const half2 a_other = __shfl_xor_sync(0xffffffff, a, offset, 32);
-        reinterpret_cast<half&>(a.x) += __low2half(a_other);
-        reinterpret_cast<half&>(a.y) += __high2half(a_other);
+    for (int offset = width/2; offset > 0; offset >>= 1) {
+        a = __hadd2(a, __shfl_xor_sync(0xffffffff, a, offset, width));
```
@JohannesGaessler (Collaborator) commented:

I don't remember why there was a float conversion here. I don't think it was for performance reasons though.

@IMbackK (Collaborator, Author) replied:

I can't think of a reason, and the fattn operators still check out on gfx908 and gfx1030.
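
For reference, this is roughly the shape the half2 reduction helper takes after the hunk above, sketched with the width template parameter from the PR description; names mirror, but are not guaranteed to match, the committed code:

```cpp
// Sketch of the width-templated half2 warp reduction after this change.
#include <cuda_fp16.h>

template <int width = 32>
static __device__ __forceinline__ half2 warp_reduce_sum(half2 a) {
#pragma unroll
    for (int offset = width/2; offset > 0; offset >>= 1) {
        // Butterfly reduction: each step folds in the value from the lane whose
        // index differs by `offset`, adding both half lanes at once with __hadd2.
        a = __hadd2(a, __shfl_xor_sync(0xffffffff, a, offset, width));
    }
    return a;
}
```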

@IMbackK (Collaborator, Author) commented Jan 29, 2025

No idea why the macOS-latest-cmake-arm64 CI is failing its operator test; that doesn't seem like it could be the fault of this PR.

@JohannesGaessler (Collaborator) commented:

test-backend-ops uses a random seed for generating the tensors that are used for checking the operators against each other. As a consequence there is a small chance for the tests to fail due to unlucky RNG causing numerical issues. Just re-run the tests.

@IMbackK merged commit 27d135c into ggerganov:master on Jan 30, 2025, with 45 checks passed.
@ggerganov (Owner) commented:

@IMbackK Note to squash-merge PRs in the future.

@IMbackK (Collaborator, Author) commented Feb 3, 2025

@ggerganov Understood, will do.

I merged these as-is because af71052 and a151674 don't directly depend on each other and can be independently reverted if issues arise.

@ggerganov (Owner) commented:

No problem, this is generally fine to do. The important thing is for commit titles to have the (#number) at the end, because this is currently used by the sync scripts, and without the PR number they get confused. When we squash-merge, this number gets appended automatically by the GitHub UI.

Labels: ggml (changes relating to the ggml tensor library for machine learning), Nvidia GPU (issues specific to Nvidia GPUs)
3 participants