Move WGMMA documentation

triton-lang · Oct 29, 2024 · 3009866
1 parent b56997c
commit 3009866

Showing 2 changed files with 17 additions and 21 deletions.
diff --git a/include/triton/Dialect/TritonGPU/IR/TritonGPUAttrDefs.td b/include/triton/Dialect/TritonGPU/IR/TritonGPUAttrDefs.td
@@ -1313,9 +1313,26 @@ kWidth defines number of consecutive elements stored by one thread along k dimen
 Some layouts do not use this parameter, either because they have a fixed number of
 elements along the K dim, or they use all elements of the tensor along the K dim.
 
+# WGMMA Notes
 We require kWidth to be provided for Hopper because the dtype at loading might be
 different from the dtype at WGMMA, due to casting. The kWidth is determined by the
 dtype at WGMMA.
+
+The encoded tensor consists of operand A for possibly multiple wgmma instructions.
+For each wgmma, each warp in a warp group feeds a single "warp matrix".
+Each warp matrix consists of 2x2 "quads".
+Each thread holds several elements in each quad. Right before a wgmma,
+the bitwidths of the elements that a thread holds in each quad
+should sum to 32.
+
+These values are stored unrolled in `elements`.
+The ordering of dimensions is as follows by convention:
+batch (only 1 batch for Hopper currently)
+matM (m-index of the "warp matrix")
+matK (k-index of the "warp matrix")
+quadK (k-index of the "quad" in the core matrix)
+quadM (m-index of the "quad" in the core matrix)
+vecIdx (index of the element in the quad; this is always along the k-dim)
   }];
 
   let parameters = (

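The dimension ordering documented in the hunk above can be read as a flattening rule for `elements`. The following is a minimal sketch, not code from Triton: `OperandAShape`, `OperandAIndex`, `elemsPerQuad`, and `flatIndex` are hypothetical names, and the sketch assumes that the first listed dimension (batch) varies slowest and the last (vecIdx) varies fastest. It also assumes the per-thread element count in a quad is `32 / bitwidth`, following the note that the bitwidths sum to 32.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper types for illustration only; not part of Triton.
struct OperandAShape {
  std::size_t numBatch;  // 1 on Hopper, per the documentation above
  std::size_t numMatM;   // number of warp matrices along M
  std::size_t numMatK;   // number of warp matrices along K
  unsigned elemBitWidth; // bitwidth of the dtype at WGMMA (e.g. 16 for fp16)
};

struct OperandAIndex {
  std::size_t batch, matM, matK, quadK, quadM, vecIdx;
};

// Each warp matrix is 2x2 quads.
constexpr std::size_t kQuadsPerDim = 2;

// Elements one thread holds per quad: their bitwidths should sum to 32.
inline std::size_t elemsPerQuad(unsigned elemBitWidth) {
  assert(elemBitWidth > 0 && 32 % elemBitWidth == 0);
  return 32 / elemBitWidth;
}

// Flatten (batch, matM, matK, quadK, quadM, vecIdx) into an offset into
// `elements`, ASSUMING the listed order runs from slowest- to fastest-varying.
inline std::size_t flatIndex(const OperandAShape &shape,
                             const OperandAIndex &idx) {
  const std::size_t vec = elemsPerQuad(shape.elemBitWidth);
  assert(idx.batch < shape.numBatch && idx.matM < shape.numMatM &&
         idx.matK < shape.numMatK && idx.quadK < kQuadsPerDim &&
         idx.quadM < kQuadsPerDim && idx.vecIdx < vec);
  std::size_t flat = idx.batch;
  flat = flat * shape.numMatM + idx.matM;
  flat = flat * shape.numMatK + idx.matK;
  flat = flat * kQuadsPerDim + idx.quadK;
  flat = flat * kQuadsPerDim + idx.quadM;
  flat = flat * vec + idx.vecIdx;
  return flat;
}
```

Under these assumptions, an fp16 operand holds two elements per quad per thread, so stepping `quadM` advances the offset by 2, `quadK` by 4, and `matK` by 8. Note that `quadK` is listed before `quadM`, so within a warp matrix the M index of the quad changes faster than its K index.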
diff --git a/third_party/nvidia/lib/TritonNVIDIAGPUToLLVM/DotOpToLLVM/WGMMA.cpp b/third_party/nvidia/lib/TritonNVIDIAGPUToLLVM/DotOpToLLVM/WGMMA.cpp
@@ -264,27 +264,6 @@ DotOpMmaV3SmemLoader loadB(const LLVMTypeConverter *typeConverter,
 
 // Return a vector of Value of the accumulator start at startIndex and pack the
 // values into 32bits in case the accumulator is fp16.
-//
-// `elements` contains all loaded register values for operand A.
-// This consists of operand A for possibly multiple wgmma instructions.
-// For each wgmma, each warp in a warp group feeds a single "warp matrix"
-// Each warp matrix consists of 2x2 "quads".
-// Each thread holds several elements in each quad. Right before a wgmma,
-// the sum of bitwidth of
-// the elements in each quad should add up to 32.
-//
-// These values are stored unrolled in `elements`.
-// The ordering of dimensions is as follows:
-// batch (only 1 batch for Hopper currently)
-// matM (m-index of the "warp matrix")
-// matK (k-index of the "warp matrix")
-// quadK (k-index of the "quad" in the core matrix)
-// quadM (m-index of the "quad" in the core matrix)
-// vecIdx (index of the element in the quad; this is always along the k-dim)
-//
-// This ordering is decided when a tensor in DotOpEnc is lowered into llvm.
-// For WGMMA this happens in both SharedToDotOperand and MMAToDotOperand.
-// Thus, both lowerings must obey this above ordering for the below code to be correct.
 llvm::SmallVector<Value> loadReg(ConversionPatternRewriter &rewriter,
                                  Location loc,
                                  const SmallVector<Value> &elements,