[MLIR][NVGPU] Introduce nvgpu.mbarrier.group for multiple mbarrier use #65951
Conversation
@llvm/pr-subscribers-mlir

Changes

A common practice involves the creation of multiple mbarrier objects for use within loops. This PR improves the `nvgpu.mbarrier.barrier` type into `nvgpu.mbarrier.group`, so that all mbarrier-related Ops can manage multiple barriers.

We will have mbarrier usages like below:
-- Patch is 43.07 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/65951.diff 5 Files Affected:
diff --git a/mlir/include/mlir/Conversion/NVGPUToNVVM/NVGPUToNVVM.h b/mlir/include/mlir/Conversion/NVGPUToNVVM/NVGPUToNVVM.h index 8c5667cd417f0d4..4b8d5c5fe2a893d 100644 --- a/mlir/include/mlir/Conversion/NVGPUToNVVM/NVGPUToNVVM.h +++ b/mlir/include/mlir/Conversion/NVGPUToNVVM/NVGPUToNVVM.h @@ -23,15 +23,15 @@ class Pass; #include "mlir/Conversion/Passes.h.inc" namespace nvgpu { -class MBarrierType; +class MBarrierGroupType; /// Returns the memory space attribute of the mbarrier object. Attribute getMbarrierMemorySpace(MLIRContext *context, - MBarrierType barrierType); + MBarrierGroupType barrierType); /// Return the memref type that can be used to represent an mbarrier object. MemRefType getMBarrierMemrefType(MLIRContext *context, - MBarrierType barrierType); + MBarrierGroupType barrierType); } // namespace nvgpu void populateNVGPUToNVVMConversionPatterns(LLVMTypeConverter &converter, diff --git a/mlir/include/mlir/Dialect/NVGPU/IR/NVGPU.td b/mlir/include/mlir/Dialect/NVGPU/IR/NVGPU.td index a3245bf9196eed1..cc09945b477d8fa 100644 --- a/mlir/include/mlir/Dialect/NVGPU/IR/NVGPU.td +++ b/mlir/include/mlir/Dialect/NVGPU/IR/NVGPU.td @@ -135,20 +135,26 @@ def NVGPU_DeviceAsyncToken : NVGPU_Type<"DeviceAsyncToken", }]; } -def NVGPU_MBarrier : NVGPU_Type<"MBarrier", "mbarrier.barrier", []> { +def NVGPU_MBarrierGroup : NVGPU_Type<"MBarrierGroup", "mbarrier.group", []> { let summary = "mbarrier barrier type"; let description = [{ - This is the type for a mbarrier object in shared memory that is used - to synchronize a variable number of threads. + This is the type for one or more mbarrier object in shared memory that is + used to synchronize a variable number of threads. - The mbarrier object is 64 bit with 8 byte alignment. The mbarrier object - can be initiated and invalidated. + If `num_barriers` is not set, the number of mbarrier objects is 1. 
- See for more details: - https://docs.nvidia.com/cuda/parallel-thread-execution/#size-and-alignment-of-mbarrier-object + A mbarrier object is 64 bit with 8 byte alignment. The mbarrier object + can be initiated and invalidated. + + [See for more details in PTX ISA](https://docs.nvidia.com/cuda/parallel-thread-execution/#size-and-alignment-of-mbarrier-object) }]; - let parameters = (ins "Attribute":$memorySpace); + let parameters = (ins "Attribute":$memorySpace, DefaultValuedParameter<"unsigned", "1">:$num_barriers); let assemblyFormat = "`<` struct(params) `>`"; + let builders = [ + TypeBuilder<(ins "Attribute":$memorySpace), [{ + return $_get($_ctxt, memorySpace, 1); + }]> + ]; } def NVGPU_MBarrierToken : NVGPU_Type<"MBarrierToken", "mbarrier.token", []> { } @@ -473,7 +479,7 @@ def NVGPU_DeviceAsyncWaitOp : NVGPU_Op<"device_async_wait", []> { def NVGPU_MBarrierCreateOp : NVGPU_Op<"mbarrier.create", []> { let summary = "Creates a `nvgpu.mbarrier` object."; let description = [{ - The Op generates an `mbarrier` object, which is a barrier created in + The Op generates one or more `mbarrier` object, which is a barrier created in shared memory and supports various synchronization behaviors for threads. 
The `mbarrier` object has the following type and alignment requirements: @@ -485,9 +491,9 @@ def NVGPU_MBarrierCreateOp : NVGPU_Op<"mbarrier.create", []> { ``` }]; let arguments = (ins); - let results = (outs NVGPU_MBarrier:$barrier); + let results = (outs NVGPU_MBarrierGroup:$barriers); let assemblyFormat = [{ - attr-dict `->` type($barrier) + attr-dict `->` type($barriers) }]; } @@ -503,8 +509,8 @@ def NVGPU_MBarrierInitOp : NVGPU_Op<"mbarrier.init", []> { nvgpu.mbarrier.init %barrier, %num_threads : !nvgpu.mbarrier.barrier> ``` }]; - let arguments = (ins NVGPU_MBarrier:$barrier, Index:$count); - let assemblyFormat = "$barrier `,` $count attr-dict `:` type($barrier)"; + let arguments = (ins NVGPU_MBarrierGroup:$barriers, Index:$count, Index:$mbarId); + let assemblyFormat = "$barriers `[` $mbarId `]` `,` $count attr-dict `:` type($barriers)"; } def NVGPU_MBarrierTestWaitOp : NVGPU_Op<"mbarrier.test.wait", []> { @@ -518,9 +524,9 @@ def NVGPU_MBarrierTestWaitOp : NVGPU_Op<"mbarrier.test.wait", []> { %isComplete = nvgpu.mbarrier.test.wait %barrier, %token : !nvgpu.mbarrier.barrier>, !nvgpu.mbarrier.token ``` }]; - let arguments = (ins NVGPU_MBarrier:$barrier, NVGPU_MBarrierToken:$token); + let arguments = (ins NVGPU_MBarrierGroup:$barriers, NVGPU_MBarrierToken:$token, Index:$mbarId); let results = (outs I1:$waitComplete); - let assemblyFormat = "$barrier `,` $token attr-dict `:` type($barrier) `,` type($token)"; + let assemblyFormat = "$barriers `[` $mbarId `]` `,` $token attr-dict `:` type($barriers) `,` type($token)"; } def NVGPU_MBarrierArriveOp : NVGPU_Op<"mbarrier.arrive", []> { @@ -537,9 +543,9 @@ def NVGPU_MBarrierArriveOp : NVGPU_Op<"mbarrier.arrive", []> { %token = nvgpu.mbarrier.arrive %barrier : !nvgpu.mbarrier.barrier> -> !nvgpu.mbarrier.token ``` }]; - let arguments = (ins NVGPU_MBarrier:$barrier); + let arguments = (ins NVGPU_MBarrierGroup:$barriers, Index:$mbarId); let results = (outs NVGPU_MBarrierToken:$token); -let assemblyFormat = "$barrier 
attr-dict `:` type($barrier) `->` type($token)"; +let assemblyFormat = "$barriers `[` $mbarId `]` attr-dict `:` type($barriers) `->` type($token)"; } def NVGPU_MBarrierArriveNoCompleteOp : NVGPU_Op<"mbarrier.arrive.nocomplete", []> { @@ -555,10 +561,10 @@ def NVGPU_MBarrierArriveNoCompleteOp : NVGPU_Op<"mbarrier.arrive.nocomplete", [] %token = nvgpu.mbarrier.arrive.noComplete %barrier, %count : !nvgpu.mbarrier.barrier> -> !nvgpu.mbarrier.token ``` }]; - let arguments = (ins NVGPU_MBarrier:$barrier, + let arguments = (ins NVGPU_MBarrierGroup:$barriers, Index:$mbarId, Index:$count); let results = (outs NVGPU_MBarrierToken:$token); - let assemblyFormat = "$barrier `,` $count attr-dict `:` type($barrier) `->` type($token)"; + let assemblyFormat = "$barriers `[` $mbarId `]` `,` $count attr-dict `:` type($barriers) `->` type($token)"; } def NVGPU_MBarrierArriveExpectTxOp : NVGPU_Op<"mbarrier.arrive.expect_tx", []> { @@ -578,9 +584,8 @@ def NVGPU_MBarrierArriveExpectTxOp : NVGPU_Op<"mbarrier.arrive.expect_tx", []> { nvgpu.mbarrier.arrive.expect_tx %barrier, %ic0 : !nvgpu.mbarrier.barrier> ``` }]; - let arguments = (ins NVGPU_MBarrier:$barrier, - Index:$txcount); - let assemblyFormat = "$barrier `,` $txcount attr-dict `:` type($barrier)"; + let arguments = (ins NVGPU_MBarrierGroup:$barriers, Index:$txcount, Index:$mbarId); + let assemblyFormat = "$barriers `[` $mbarId `]` `,` $txcount attr-dict `:` type($barriers)"; } def NVGPU_MBarrierTryWaitParityOp : NVGPU_Op<"mbarrier.try_wait.parity", []> { @@ -597,8 +602,8 @@ def NVGPU_MBarrierTryWaitParityOp : NVGPU_Op<"mbarrier.try_wait.parity", []> { ``` }]; - let arguments = (ins NVGPU_MBarrier:$barrier, Index:$phase, Index:$ticks); - let assemblyFormat = "$barrier `,` $phase `,` $ticks attr-dict `:` type($barrier)"; + let arguments = (ins NVGPU_MBarrierGroup:$barriers, Index:$phase, Index:$ticks, Index:$mbarId); + let assemblyFormat = "$barriers `[` $mbarId `]` `,` $phase `,` $ticks attr-dict `:` type($barriers)"; } def 
NVGPU_TmaAsyncLoadOp : NVGPU_Op<"tma.async.load", []> { @@ -613,12 +618,13 @@ def NVGPU_TmaAsyncLoadOp : NVGPU_Op<"tma.async.load", []> { The Op uses `$barrier` mbarrier based completion mechanism. }]; let arguments = (ins Arg:$dst, - NVGPU_MBarrier:$barrier, + NVGPU_MBarrierGroup:$barriers, NVGPU_TensorMapDescriptor:$tensorMapDescriptor, - Variadic:$coordinates); + Variadic:$coordinates, + Index:$mbarId); let assemblyFormat = [{ - $tensorMapDescriptor `[` $coordinates `]` `,` $barrier `to` $dst - attr-dict `:` type($tensorMapDescriptor) `,` type($barrier) `->` type($dst) + $tensorMapDescriptor `[` $coordinates `]` `,` $barriers `[` $mbarId `]` `to` $dst + attr-dict `:` type($tensorMapDescriptor) `,` type($barriers) `->` type($dst) }]; let hasVerifier = 1; diff --git a/mlir/lib/Conversion/NVGPUToNVVM/NVGPUToNVVM.cpp b/mlir/lib/Conversion/NVGPUToNVVM/NVGPUToNVVM.cpp index b045089244ff1a7..b008572eb443b18 100644 --- a/mlir/lib/Conversion/NVGPUToNVVM/NVGPUToNVVM.cpp +++ b/mlir/lib/Conversion/NVGPUToNVVM/NVGPUToNVVM.cpp @@ -17,8 +17,10 @@ #include "mlir/Dialect/LLVMIR/NVVMDialect.h" #include "mlir/Dialect/MemRef/IR/MemRef.h" #include "mlir/Dialect/NVGPU/IR/NVGPUDialect.h" +#include "mlir/IR/BuiltinTypes.h" #include "mlir/IR/PatternMatch.h" #include "mlir/IR/TypeUtilities.h" +#include "mlir/IR/Value.h" #include "mlir/Pass/Pass.h" #include "llvm/Support/Debug.h" #include "llvm/Support/raw_ostream.h" @@ -212,14 +214,14 @@ static SmallVector unpackOperandVector(RewriterBase &rewriter, } /// Returns whether mbarrier object has shared memory address space. -static bool isMbarrierShared(nvgpu::MBarrierType barrierType) { +static bool isMbarrierShared(nvgpu::MBarrierGroupType barrierType) { return (mlir::nvgpu::NVGPUDialect::isSharedMemoryAddressSpace( barrierType.getMemorySpace())); } /// Returns the memory space attribute of the mbarrier object. 
Attribute nvgpu::getMbarrierMemorySpace(MLIRContext *context, - nvgpu::MBarrierType barrierType) { + nvgpu::MBarrierGroupType barrierType) { Attribute memorySpace = {}; if (isMbarrierShared(barrierType)) { memorySpace = @@ -230,25 +232,13 @@ Attribute nvgpu::getMbarrierMemorySpace(MLIRContext *context, } /// Returns memref type of the mbarrier object. The type is defined in the -/// MBarrierType. +/// MBarrierGroupType. MemRefType nvgpu::getMBarrierMemrefType(MLIRContext *context, - nvgpu::MBarrierType barrierType) { + nvgpu::MBarrierGroupType barrierType) { Attribute memorySpace = nvgpu::getMbarrierMemorySpace(context, barrierType); MemRefLayoutAttrInterface layout; - return MemRefType::get({1}, IntegerType::get(context, 64), layout, - memorySpace); -} - -/// Returns the base pointer of the mbarrier object. -static Value getMbarrierPtr(ConversionPatternRewriter &rewriter, - const LLVMTypeConverter &typeConverter, - TypedValue barrier, - Value barrierMemref) { - MemRefType memrefType = - nvgpu::getMBarrierMemrefType(rewriter.getContext(), barrier.getType()); - MemRefDescriptor memRefDescriptor(barrierMemref); - return memRefDescriptor.bufferPtr(rewriter, barrier.getLoc(), typeConverter, - memrefType); + return MemRefType::get({barrierType.getNumBarriers()}, + IntegerType::get(context, 64), layout, memorySpace); } namespace { @@ -426,7 +416,7 @@ struct ConvertNVGPUToNVVMPass [&](nvgpu::WarpgroupMatrixDescriptorType type) -> Type { return converter.convertType(IntegerType::get(type.getContext(), 64)); }); - converter.addConversion([&](nvgpu::MBarrierType type) -> Type { + converter.addConversion([&](nvgpu::MBarrierGroupType type) -> Type { return converter.convertType( nvgpu::getMBarrierMemrefType(rewriter.getContext(), type)); }); @@ -762,7 +752,7 @@ struct NVGPUMBarrierCreateLowering ConversionPatternRewriter &rewriter) const override { Operation *funcOp = op->getParentOp(); MemRefType barrierType = nvgpu::getMBarrierMemrefType( - rewriter.getContext(), 
op.getBarrier().getType()); + rewriter.getContext(), op.getBarriers().getType()); memref::GlobalOp global; if (auto moduleOp = funcOp->getParentOfType()) @@ -777,21 +767,37 @@ struct NVGPUMBarrierCreateLowering } }; +/// Base class for lowering mbarrier operations to nvvm intrinsics. +template +struct MBarrierBasePattern : public ConvertOpToLLVMPattern { +public: + using ConvertOpToLLVMPattern::ConvertOpToLLVMPattern; + /// Returns the base pointer of the mbarrier object. + Value getMbarrierPtr(Operation *op, nvgpu::MBarrierGroupType mbarType, + Value memrefDesc, Value mbarId, + ConversionPatternRewriter &rewriter) const { + MemRefType mbarrierMemrefType = + nvgpu::getMBarrierMemrefType(rewriter.getContext(), mbarType); + return ConvertToLLVMPattern::getStridedElementPtr( + op->getLoc(), mbarrierMemrefType, memrefDesc, {mbarId}, rewriter); + return memrefDesc; + } +}; + /// Lowers `nvgpu.mbarrier.init` to `nvvm.mbarrier.init` struct NVGPUMBarrierInitLowering - : public ConvertOpToLLVMPattern { - using ConvertOpToLLVMPattern::ConvertOpToLLVMPattern; + : public MBarrierBasePattern { + using MBarrierBasePattern::MBarrierBasePattern; LogicalResult matchAndRewrite(nvgpu::MBarrierInitOp op, OpAdaptor adaptor, ConversionPatternRewriter &rewriter) const override { + nvgpu::MBarrierGroupType mbarrierType = op.getBarriers().getType(); rewriter.setInsertionPoint(op); - Value barrier = getMbarrierPtr(rewriter, *getTypeConverter(), - op.getBarrier(), adaptor.getBarrier()); - + Value barrier = getMbarrierPtr(op, mbarrierType, adaptor.getBarriers(), + adaptor.getMbarId(), rewriter); Value count = truncToI32(rewriter, op->getLoc(), adaptor.getCount()); - - if (isMbarrierShared(op.getBarrier().getType())) { + if (isMbarrierShared(mbarrierType)) { rewriter.replaceOpWithNewOp(op, barrier, count); } else { @@ -803,16 +809,17 @@ struct NVGPUMBarrierInitLowering /// Lowers `nvgpu.mbarrier.arrive` to `nvvm.mbarrier.arrive` struct NVGPUMBarrierArriveLowering - : public 
ConvertOpToLLVMPattern { - using ConvertOpToLLVMPattern::ConvertOpToLLVMPattern; + : public MBarrierBasePattern { + using MBarrierBasePattern::MBarrierBasePattern; LogicalResult matchAndRewrite(nvgpu::MBarrierArriveOp op, OpAdaptor adaptor, ConversionPatternRewriter &rewriter) const override { - Value barrier = getMbarrierPtr(rewriter, *getTypeConverter(), - op.getBarrier(), adaptor.getBarrier()); + Value barrier = + getMbarrierPtr(op, op.getBarriers().getType(), adaptor.getBarriers(), + adaptor.getMbarId(), rewriter); Type tokenType = getTypeConverter()->convertType( nvgpu::MBarrierTokenType::get(op->getContext())); - if (isMbarrierShared(op.getBarrier().getType())) { + if (isMbarrierShared(op.getBarriers().getType())) { rewriter.replaceOpWithNewOp(op, tokenType, barrier); } else { @@ -826,19 +833,19 @@ struct NVGPUMBarrierArriveLowering /// Lowers `nvgpu.mbarrier.arrive.nocomplete` to /// `nvvm.mbarrier.arrive.nocomplete` struct NVGPUMBarrierArriveNoCompleteLowering - : public ConvertOpToLLVMPattern { - using ConvertOpToLLVMPattern< - nvgpu::MBarrierArriveNoCompleteOp>::ConvertOpToLLVMPattern; - + : public MBarrierBasePattern { + using MBarrierBasePattern< + nvgpu::MBarrierArriveNoCompleteOp>::MBarrierBasePattern; LogicalResult matchAndRewrite(nvgpu::MBarrierArriveNoCompleteOp op, OpAdaptor adaptor, ConversionPatternRewriter &rewriter) const override { - Value barrier = getMbarrierPtr(rewriter, *getTypeConverter(), - op.getBarrier(), adaptor.getBarrier()); + Value barrier = + getMbarrierPtr(op, op.getBarriers().getType(), adaptor.getBarriers(), + adaptor.getMbarId(), rewriter); Type tokenType = getTypeConverter()->convertType( nvgpu::MBarrierTokenType::get(op->getContext())); Value count = truncToI32(rewriter, op->getLoc(), adaptor.getCount()); - if (isMbarrierShared(op.getBarrier().getType())) { + if (isMbarrierShared(op.getBarriers().getType())) { rewriter.replaceOpWithNewOp( op, tokenType, barrier, count); } else { @@ -851,17 +858,16 @@ struct 
NVGPUMBarrierArriveNoCompleteLowering /// Lowers `nvgpu.mbarrier.test.wait` to `nvvm.mbarrier.test.wait` struct NVGPUMBarrierTestWaitLowering - : public ConvertOpToLLVMPattern { - using ConvertOpToLLVMPattern< - nvgpu::MBarrierTestWaitOp>::ConvertOpToLLVMPattern; - + : public MBarrierBasePattern { + using MBarrierBasePattern::MBarrierBasePattern; LogicalResult matchAndRewrite(nvgpu::MBarrierTestWaitOp op, OpAdaptor adaptor, ConversionPatternRewriter &rewriter) const override { - Value barrier = getMbarrierPtr(rewriter, *getTypeConverter(), - op.getBarrier(), adaptor.getBarrier()); + Value barrier = + getMbarrierPtr(op, op.getBarriers().getType(), adaptor.getBarriers(), + adaptor.getMbarId(), rewriter); Type retType = rewriter.getI1Type(); - if (isMbarrierShared(op.getBarrier().getType())) { + if (isMbarrierShared(op.getBarriers().getType())) { rewriter.replaceOpWithNewOp( op, retType, barrier, adaptor.getToken()); } else { @@ -873,18 +879,18 @@ struct NVGPUMBarrierTestWaitLowering }; struct NVGPUMBarrierArriveExpectTxLowering - : public ConvertOpToLLVMPattern { - using ConvertOpToLLVMPattern< - nvgpu::MBarrierArriveExpectTxOp>::ConvertOpToLLVMPattern; - + : public MBarrierBasePattern { + using MBarrierBasePattern< + nvgpu::MBarrierArriveExpectTxOp>::MBarrierBasePattern; LogicalResult matchAndRewrite(nvgpu::MBarrierArriveExpectTxOp op, OpAdaptor adaptor, ConversionPatternRewriter &rewriter) const override { - Value barrier = getMbarrierPtr(rewriter, *getTypeConverter(), - op.getBarrier(), adaptor.getBarrier()); + Value barrier = + getMbarrierPtr(op, op.getBarriers().getType(), adaptor.getBarriers(), + adaptor.getMbarId(), rewriter); Value txcount = truncToI32(rewriter, op->getLoc(), adaptor.getTxcount()); - if (isMbarrierShared(op.getBarrier().getType())) { + if (isMbarrierShared(op.getBarriers().getType())) { rewriter.replaceOpWithNewOp( op, barrier, txcount); return success(); @@ -897,19 +903,19 @@ struct NVGPUMBarrierArriveExpectTxLowering }; struct 
NVGPUMBarrierTryWaitParityLowering - : public ConvertOpToLLVMPattern { - using ConvertOpToLLVMPattern< - nvgpu::MBarrierTryWaitP... |
LGTM
A common practice is to create multiple MBarrier objects for use within loops; see the example below. This is particularly valuable in scenarios like software pipelining during matmul code generation, where we need to generate and employ five barriers dynamically within a loop.

This work improves the `nvgpu.mbarrier.barrier` type into the `nvgpu.mbarrier.group` type. All MBarrier-related operations now use this type. Consequently, these operations are now capable of managing multiple barriers seamlessly.

```
%barriers = nvgpu.mbarrier.create -> !nvgpu.mbarrier.group<memorySpace = #gpu.address_space<workgroup>, num_barriers = 3>

nvgpu.mbarrier.init %barriers[%c0], %num_threads : !nvgpu.mbarrier.group<memorySpace = #gpu.address_space<workgroup>, num_barriers = 3>
nvgpu.mbarrier.init %barriers[%c1], %num_threads : !nvgpu.mbarrier.group<memorySpace = #gpu.address_space<workgroup>, num_barriers = 3>
nvgpu.mbarrier.init %barriers[%c2], %num_threads : !nvgpu.mbarrier.group<memorySpace = #gpu.address_space<workgroup>, num_barriers = 3>
...
scf.for %i = %c0 to %n step %c1 {
  %mbarId = arith.remui %i, 3 : index
  %isDone = nvgpu.mbarrier.test.wait %barriers[%mbarId], %token : !nvgpu.mbarrier.group<memorySpace = #gpu.address_space<workgroup>, num_barriers = 3>, !tokenType
}
```

Differential Revision: https://reviews.llvm.org/D159433
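The round-robin barrier selection in the loop above (`arith.remui %i, 3`) can be sketched outside MLIR. The following is a minimal Python model of the schedule only — function and variable names are illustrative, not part of the PR:

```python
# Model of round-robin mbarrier indexing for software pipelining:
# iteration i uses barrier i % num_barriers, so a fixed-size group of
# barriers is reused across an arbitrarily long loop.

NUM_BARRIERS = 3  # mirrors num_barriers = 3 in the mbarrier.group type above

def barrier_schedule(num_iterations: int, num_barriers: int = NUM_BARRIERS):
    """Return the barrier index selected at each loop iteration."""
    return [i % num_barriers for i in range(num_iterations)]

if __name__ == "__main__":
    # With 7 iterations and 3 barriers, indices cycle 0, 1, 2, 0, 1, 2, 0.
    print(barrier_schedule(7))
```

This is why a small, statically known `num_barriers` suffices even for long pipelined loops: the indices simply wrap around.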
llvm#65951 improved mbarrier support. This PR adapts the integration test to the new usage.
A common practice is to create multiple `mbarrier` objects; see the example above. This is particularly valuable in scenarios like software pipelining for GEMM, where we need to generate multiple barriers dynamically, then use and wait on them in a loop. This PR improves the `nvgpu.mbarrier.barrier` type into `nvgpu.mbarrier.group`. All `mbarrier`-related Ops now use this type. Consequently, these Ops are now capable of managing multiple barriers seamlessly.

Having a static `num_barriers = 4` lets us place the mbarrier object(s) in static shared memory. We could make the value dynamic, but that would require dynamic shared memory and complicate the codegen.

We will have mbarrier usages like the example shown above.