Start work on wave 64 optimization #11495

Merged (3 commits, Jan 30, 2025)

Conversation

@IMbackK (Collaborator) commented Jan 29, 2025

ggml assumes that the wave size is 32 all over the place.

This adds some additional code to begin supporting wave 64 kernel optimizations. This includes:

  1. Currently WARP_SIZE simply defines the assumed warp size to be 32 at compile time, even in host code.
    • This PR adds warp_size to the cuda_device_info struct, since we cannot know the warp size of the device in host code at compile time.
    • For now this is not used anywhere, but slowly migrating the host-side code to it is necessary.
  2. Adds the width to the reduction operators as a template parameter (a rough sketch of points 1 and 2 follows this list).
    • For now this also affects nothing, as the template parameter is always 32.
    • Privately I have converted gemv and softmax to a selectable wave size, but as WAVE_SIZE is set globally and other kernels assume it to be 32, this is too hacky to PR in at this time.
  3. Updates the builtin usage in the reduction operators on HIP and adds support for warp_reduce_max and ggml_cuda_hmax2.
    • This raises the minimum supported ROCm version to 5.5. I don't believe this is an issue, however, as I am unaware of any distro that packages <5.7, and no GPU that worked in ROCm 5.5 fails to also work in 6.0+ (this includes gfx900 and gfx803).
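
A rough sketch of what points 1 and 2 amount to, using simplified stand-in names (cuda_device_info_sketch, query_device) rather than the exact ggml declarations; the reduction width becomes a template parameter that defaults to 32, and the host side records the device's real warp size from the runtime instead of assuming it at compile time:

```cpp
// Sketch only: simplified stand-ins for the ggml structures, not the committed code.
#include <cuda_runtime.h>

// Reduction width as a template parameter instead of a hard-coded 32.
template <int width = 32>
static __device__ __forceinline__ float warp_reduce_sum(float x) {
#pragma unroll
    for (int offset = width/2; offset > 0; offset >>= 1) {
        x += __shfl_xor_sync(0xffffffff, x, offset, width);
    }
    return x;
}

// Host side: the warp size is a runtime property of the device, so it is
// stored per device instead of being assumed to be 32 at compile time.
struct cuda_device_info_sketch {
    int nsm;        // number of multiprocessors
    int warp_size;  // 32 on NVIDIA and RDNA (by default), 64 on GCN/CDNA
};

static cuda_device_info_sketch query_device(int id) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, id);
    cuda_device_info_sketch info;
    info.nsm       = prop.multiProcessorCount;
    info.warp_size = prop.warpSize;
    return info;
}
```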

@IMbackK (Collaborator, Author) commented Jan 29, 2025

As a reminder, AMD GCN and CDNA GPUs are wave/warp 64, while RDNA GPUs are wave 32 or 64, selectable at runtime.
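
For illustration (not part of this PR), a short HIP snippet using only hipGetDeviceProperties shows how host code can see which wave size the runtime reports per device; GCN/CDNA devices report 64, RDNA devices typically report 32:

```cpp
// Illustration only: print the wave size HIP reports for each device.
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    int count = 0;
    if (hipGetDeviceCount(&count) != hipSuccess) {
        return 1;
    }
    for (int id = 0; id < count; ++id) {
        hipDeviceProp_t prop;
        hipGetDeviceProperties(&prop, id);
        std::printf("device %d (%s): warpSize = %d\n", id, prop.name, prop.warpSize);
    }
    return 0;
}
```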

@github-actions bot added the labels Nvidia GPU (issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) on Jan 29, 2025.
@JohannesGaessler (Collaborator) left a comment:

Generally speaking, I welcome support for warp sizes != 32. I will say, though, that I would not be willing to maintain variable warp sizes for MMQ and FlashAttention, which are already very time-consuming as-is.

Two further review threads on ggml/src/ggml-cuda/common.cuh are marked outdated and resolved.
@JohannesGaessler (Collaborator) commented on lines 243 to 245:

```cpp
info.devices[id].nsm  = prop.multiProcessorCount;
info.devices[id].smpb = prop.sharedMemPerBlock;
info.devices[id].warp_size = prop.warpSize;
```

Suggested change (whitespace only, aligning the assignments):

```cpp
info.devices[id].nsm       = prop.multiProcessorCount;
info.devices[id].smpb      = prop.sharedMemPerBlock;
info.devices[id].warp_size = prop.warpSize;
```

@IMbackK (Collaborator, Author) replied:

Done, but I hate this: this kind of alignment makes merges harder, blame less useful, and diffs harder to read.

Comment on lines 227 to 230:

```diff
-    for (int offset = 16; offset > 0; offset >>= 1) {
-        const half2 a_other = __shfl_xor_sync(0xffffffff, a, offset, 32);
-        reinterpret_cast<half&>(a.x) += __low2half(a_other);
-        reinterpret_cast<half&>(a.y) += __high2half(a_other);
+    for (int offset = width/2; offset > 0; offset >>= 1) {
+        a = __hadd2(a, __shfl_xor_sync(0xffffffff, a, offset, width));
```
@JohannesGaessler (Collaborator) commented:

I don't remember why there was a float conversion here. I don't think it was for performance reasons though.

@IMbackK (Collaborator, Author) replied:

I can't think of a reason, and the fattn operators still check out on gfx908 and gfx1030.
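
For reference, this is roughly the shape the half2 reduction helper takes after the hunk above, sketched with the width template parameter from the PR description; names mirror, but are not guaranteed to match, the committed code:

```cpp
// Sketch of the width-templated half2 warp reduction after this change.
#include <cuda_fp16.h>

template <int width = 32>
static __device__ __forceinline__ half2 warp_reduce_sum(half2 a) {
#pragma unroll
    for (int offset = width/2; offset > 0; offset >>= 1) {
        // Butterfly reduction: each step folds in the value from the lane whose
        // index differs by `offset`, adding both half lanes at once with __hadd2.
        a = __hadd2(a, __shfl_xor_sync(0xffffffff, a, offset, width));
    }
    return a;
}
```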

@IMbackK (Collaborator, Author) commented Jan 29, 2025

No idea why the macOS-latest-cmake-arm64 CI is failing its operator test; that doesn't seem like it could be the fault of this PR.

@JohannesGaessler (Collaborator) commented:

test-backend-ops uses a random seed for generating the tensors that are used for checking the operators against each other. As a consequence there is a small chance for the tests to fail due to unlucky RNG causing numerical issues. Just re-run the tests.

@IMbackK merged commit 27d135c into ggerganov:master on Jan 30, 2025, with 45 checks passed.
@ggerganov (Owner) commented:

@IMbackK Note to squash-merge PRs in the future.

@IMbackK (Collaborator, Author) commented Feb 3, 2025

@ggerganov Understood, will do.

I merged these as-is because af71052 and a151674 don't directly depend on each other and can be independently reverted if issues arise.

@ggerganov (Owner) commented:

No problem, this is generally fine to do. The important thing is for commit titles to have the (#number) at the end, because this is currently used by the sync scripts, and without the PR number they get confused. When we squash-merge, this number gets appended automatically by the GitHub UI.

Labels: ggml (changes relating to the ggml tensor library for machine learning), Nvidia GPU (issues specific to Nvidia GPUs)
3 participants