-
Notifications
You must be signed in to change notification settings - Fork 12.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP][flang][OpenMP] Experimental pass to map do concurrent
to OMP
#77285
base: main
Are you sure you want to change the base?
Conversation
@llvm/pr-subscribers-flang-fir-hlfir Author: Kareem Ergawy (ergawy) ChangesAdds a pass to map Full diff: https://github.com/llvm/llvm-project/pull/77285.diff 6 Files Affected:
diff --git a/flang/include/flang/Optimizer/HLFIR/HLFIROps.td b/flang/include/flang/Optimizer/HLFIR/HLFIROps.td
index 254771ca1780cf..718026e4fdfbe7 100644
--- a/flang/include/flang/Optimizer/HLFIR/HLFIROps.td
+++ b/flang/include/flang/Optimizer/HLFIR/HLFIROps.td
@@ -75,7 +75,7 @@ def hlfir_DeclareOp : hlfir_Op<"declare", [AttrSizedOperandSegments,
func.func @foo(%arg0: !fir.ref<!fir.array<?x?x!fir.char<1,?>>>, %arg1: !fir.ref<i64>) {
%c10 = arith.constant 10 : index
%c20 = arith.constant 20 : index
- %1 = fir.load %ag1 : fir.ref<i64>
+ %1 = fir.load %arg1 : fir.ref<i64>
%2 = fir.shape_shift %c10, %1, %c20, %1 : (index, index, index, index) -> !fir.shapeshift<2>
%3 = hfir.declare %arg0(%2) typeparams %1 {uniq_name = "c"} (fir.ref<!fir.array<?x?x!fir.char<1,?>>>, fir.shapeshift<2>, index) -> (fir.box<!fir.array<?x?x!fir.char<1,?>>>, fir.ref<!fir.array<?x?x!fir.char<1,?>>>)
// ... uses %3#0 as "c"
diff --git a/flang/include/flang/Optimizer/Transforms/Passes.h b/flang/include/flang/Optimizer/Transforms/Passes.h
index 6970da8698ae84..1f389a1f571b75 100644
--- a/flang/include/flang/Optimizer/Transforms/Passes.h
+++ b/flang/include/flang/Optimizer/Transforms/Passes.h
@@ -93,6 +93,8 @@ std::unique_ptr<mlir::Pass> createFunctionAttrPass();
std::unique_ptr<mlir::Pass>
createFunctionAttrPass(FunctionAttrTypes &functionAttr);
+std::unique_ptr<mlir::Pass> createDoConcurrentConversionPass();
+
// declarative passes
#define GEN_PASS_REGISTRATION
#include "flang/Optimizer/Transforms/Passes.h.inc"
diff --git a/flang/include/flang/Optimizer/Transforms/Passes.td b/flang/include/flang/Optimizer/Transforms/Passes.td
index e3c45d41f04cc7..41d9be4249b5e2 100644
--- a/flang/include/flang/Optimizer/Transforms/Passes.td
+++ b/flang/include/flang/Optimizer/Transforms/Passes.td
@@ -370,4 +370,24 @@ def FunctionAttr : Pass<"function-attr", "mlir::func::FuncOp"> {
let constructor = "::fir::createFunctionAttrPass()";
}
+def DoConcurrentConversionPass : Pass<"fopenmp-do-concurrent-conversion", "mlir::func::FuncOp"> {
+ let summary = "Map `DO CONCURRENT` loops to OpenMP worksharing loops.";
+
+ let description = [{ This is an experimental pass to map `DO CONCURRENR` loops
+ to their correspnding equivalent OpenMP worksharing constructs.
+
+ For now the following is supported:
+ - Mapping simple loops to `parallel do`.
+
+ Still to TODO:
+ - More extensive testing.
+ - Mapping to `target teams distribute parallel do`.
+ - Allowing the user to control mapping behavior: either to the host or
+ target.
+ }];
+
+ let constructor = "::fir::createDoConcurrentConversionPass()";
+ let dependentDialects = ["mlir::omp::OpenMPDialect"];
+}
+
#endif // FLANG_OPTIMIZER_TRANSFORMS_PASSES
diff --git a/flang/lib/Optimizer/Transforms/CMakeLists.txt b/flang/lib/Optimizer/Transforms/CMakeLists.txt
index fc067ad3585395..13a46ef7422530 100644
--- a/flang/lib/Optimizer/Transforms/CMakeLists.txt
+++ b/flang/lib/Optimizer/Transforms/CMakeLists.txt
@@ -21,6 +21,7 @@ add_flang_library(FIRTransforms
OMPMarkDeclareTarget.cpp
VScaleAttr.cpp
FunctionAttr.cpp
+ DoConcurrentConversion.cpp
DEPENDS
FIRDialect
diff --git a/flang/lib/Optimizer/Transforms/DoConcurrentConversion.cpp b/flang/lib/Optimizer/Transforms/DoConcurrentConversion.cpp
new file mode 100644
index 00000000000000..180c0bdf672af9
--- /dev/null
+++ b/flang/lib/Optimizer/Transforms/DoConcurrentConversion.cpp
@@ -0,0 +1,162 @@
+//===- DoConcurrentConversion.cpp -- map `DO CONCURRENT` to OpenMP loops --===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#include "flang/Optimizer/Dialect/FIRDialect.h"
+#include "flang/Optimizer/Dialect/FIROps.h"
+#include "flang/Optimizer/Dialect/FIRType.h"
+#include "flang/Optimizer/Dialect/Support/FIRContext.h"
+#include "flang/Optimizer/HLFIR/HLFIRDialect.h"
+#include "flang/Optimizer/Transforms/Passes.h"
+#include "mlir/Dialect/Func/IR/FuncOps.h"
+#include "mlir/Dialect/OpenMP/OpenMPDialect.h"
+#include "mlir/IR/Diagnostics.h"
+#include "mlir/IR/IRMapping.h"
+#include "mlir/Pass/Pass.h"
+#include "mlir/Transforms/DialectConversion.h"
+
+#include <memory>
+
+namespace fir {
+#define GEN_PASS_DEF_DOCONCURRENTCONVERSIONPASS
+#include "flang/Optimizer/Transforms/Passes.h.inc"
+} // namespace fir
+
+#define DEBUG_TYPE "fopenmp-do-concurrent-conversion"
+
+namespace {
+class DoConcurrentConversion : public mlir::OpConversionPattern<fir::DoLoopOp> {
+public:
+ using mlir::OpConversionPattern<fir::DoLoopOp>::OpConversionPattern;
+
+ mlir::LogicalResult
+ matchAndRewrite(fir::DoLoopOp doLoop, OpAdaptor adaptor,
+ mlir::ConversionPatternRewriter &rewriter) const override {
+ mlir::OpPrintingFlags flags;
+ flags.printGenericOpForm();
+
+ mlir::omp::ParallelOp parallelOp =
+ rewriter.create<mlir::omp::ParallelOp>(doLoop.getLoc());
+
+ rewriter.createBlock(¶llelOp.getRegion());
+ mlir::Block &block = parallelOp.getRegion().back();
+
+ rewriter.setInsertionPointToEnd(&block);
+ rewriter.create<mlir::omp::TerminatorOp>(doLoop.getLoc());
+
+ rewriter.setInsertionPointToStart(&block);
+
+ // Clone the LB, UB, step defining ops inside the parallel region.
+ llvm::SmallVector<mlir::Value> lowerBound, upperBound, step;
+ lowerBound.push_back(
+ rewriter.clone(*doLoop.getLowerBound().getDefiningOp())->getResult(0));
+ upperBound.push_back(
+ rewriter.clone(*doLoop.getUpperBound().getDefiningOp())->getResult(0));
+ step.push_back(
+ rewriter.clone(*doLoop.getStep().getDefiningOp())->getResult(0));
+
+ auto wsLoopOp = rewriter.create<mlir::omp::WsLoopOp>(
+ doLoop.getLoc(), lowerBound, upperBound, step);
+ wsLoopOp.setInclusive(true);
+
+ auto outlineableOp =
+ mlir::dyn_cast<mlir::omp::OutlineableOpenMPOpInterface>(*parallelOp);
+ assert(outlineableOp);
+ rewriter.setInsertionPointToStart(outlineableOp.getAllocaBlock());
+
+ // For the induction variable, we need to privative its allocation and
+ // binding inside the parallel region.
+ llvm::SmallSetVector<mlir::Operation *, 2> workList;
+ // Therefore, we first discover the induction variable by discovering
+ // `fir.store`s where the source is the loop's block argument.
+ workList.insert(doLoop.getInductionVar().getUsers().begin(),
+ doLoop.getInductionVar().getUsers().end());
+ llvm::SmallSetVector<fir::StoreOp, 2> inductionVarTargetStores;
+
+ // Walk the def-chain of the loop's block argument until we hit `fir.store`.
+ while (!workList.empty()) {
+ mlir::Operation *item = workList.front();
+
+ if (auto storeOp = mlir::dyn_cast<fir::StoreOp>(item)) {
+ inductionVarTargetStores.insert(storeOp);
+ } else {
+ workList.insert(item->getUsers().begin(), item->getUsers().end());
+ }
+
+ workList.remove(item);
+ }
+
+ // For each collected `fir.sotre`, find the target memref's alloca's and
+ // declare ops.
+ llvm::SmallSetVector<mlir::Operation *, 4> declareAndAllocasToClone;
+ for (auto storeOp : inductionVarTargetStores) {
+ mlir::Operation *storeTarget = storeOp.getMemref().getDefiningOp();
+
+ for (auto operand : storeTarget->getOperands()) {
+ declareAndAllocasToClone.insert(operand.getDefiningOp());
+ }
+ declareAndAllocasToClone.insert(storeTarget);
+ }
+
+ mlir::IRMapping mapper;
+
+ // Collect the memref defining ops in the parallel region.
+ for (mlir::Operation *opToClone : declareAndAllocasToClone) {
+ rewriter.clone(*opToClone, mapper);
+ }
+
+ // Clone the loop's body inside the worksharing construct using the mapped
+ // memref values.
+ rewriter.cloneRegionBefore(doLoop.getRegion(), wsLoopOp.getRegion(),
+ wsLoopOp.getRegion().begin(), mapper);
+
+ mlir::Operation *terminator = wsLoopOp.getRegion().back().getTerminator();
+ rewriter.setInsertionPointToEnd(&wsLoopOp.getRegion().back());
+ rewriter.create<mlir::omp::YieldOp>(terminator->getLoc());
+ rewriter.eraseOp(terminator);
+
+ rewriter.eraseOp(doLoop);
+
+ return mlir::success();
+ }
+};
+
+class DoConcurrentConversionPass
+ : public fir::impl::DoConcurrentConversionPassBase<
+ DoConcurrentConversionPass> {
+public:
+ void runOnOperation() override {
+ mlir::func::FuncOp func = getOperation();
+
+ if (func.isDeclaration()) {
+ return;
+ }
+
+ auto *context = &getContext();
+ mlir::RewritePatternSet patterns(context);
+ patterns.insert<DoConcurrentConversion>(context);
+ mlir::ConversionTarget target(*context);
+ target.addLegalDialect<fir::FIROpsDialect, hlfir::hlfirDialect,
+ mlir::arith::ArithDialect, mlir::func::FuncDialect,
+ mlir::omp::OpenMPDialect>();
+
+ target.addDynamicallyLegalOp<fir::DoLoopOp>(
+ [](fir::DoLoopOp op) { return !op.getUnordered(); });
+
+ if (mlir::failed(mlir::applyFullConversion(getOperation(), target,
+ std::move(patterns)))) {
+ mlir::emitError(mlir::UnknownLoc::get(context),
+ "error in converting do-concurrent op");
+ signalPassFailure();
+ }
+ }
+};
+} // namespace
+
+std::unique_ptr<mlir::Pass> fir::createDoConcurrentConversionPass() {
+ return std::make_unique<DoConcurrentConversionPass>();
+}
diff --git a/flang/test/Transforms/DoConcurrent/basic.mlir b/flang/test/Transforms/DoConcurrent/basic.mlir
new file mode 100644
index 00000000000000..b4a2ced45950e5
--- /dev/null
+++ b/flang/test/Transforms/DoConcurrent/basic.mlir
@@ -0,0 +1,60 @@
+// Tests mapping of a basic `do concurrent` loop to `!$omp parallel do`.
+
+// RUN: fir-opt --fopenmp-do-concurrent-conversion %s | FileCheck %s
+
+// CHECK-LABEL: func.func @do_concurrent_basic
+func.func @do_concurrent_basic() attributes {fir.bindc_name = "do_concurrent_basic"} {
+ // CHECK: %[[ARR:.*]]:2 = hlfir.declare %{{.*}}(%{{.*}}) {uniq_name = "_QFEa"} : (!fir.ref<!fir.array<10xi32>>, !fir.shape<1>) -> (!fir.ref<!fir.array<10xi32>>, !fir.ref<!fir.array<10xi32>>)
+ // CHECK: %[[C1:.*]] = arith.constant 1 : i32
+ // CHECK: %[[C10:.*]] = arith.constant 10 : i32
+
+ %0 = fir.alloca i32 {bindc_name = "i"}
+ %1:2 = hlfir.declare %0 {uniq_name = "_QFEi"} : (!fir.ref<i32>) -> (!fir.ref<i32>, !fir.ref<i32>)
+ %2 = fir.address_of(@_QFEa) : !fir.ref<!fir.array<10xi32>>
+ %c10 = arith.constant 10 : index
+ %3 = fir.shape %c10 : (index) -> !fir.shape<1>
+ %4:2 = hlfir.declare %2(%3) {uniq_name = "_QFEa"} : (!fir.ref<!fir.array<10xi32>>, !fir.shape<1>) -> (!fir.ref<!fir.array<10xi32>>, !fir.ref<!fir.array<10xi32>>)
+ %c1_i32 = arith.constant 1 : i32
+ %7 = fir.convert %c1_i32 : (i32) -> index
+ %c10_i32 = arith.constant 10 : i32
+ %8 = fir.convert %c10_i32 : (i32) -> index
+ %c1 = arith.constant 1 : index
+
+ // CHECK-NOT: fir.do_loop
+
+ // CHECK: omp.parallel {
+
+ // CHECK-NEXT: %[[ITER_VAR:.*]] = fir.alloca i32 {bindc_name = "i"}
+ // CHECK-NEXT: %[[BINDING:.*]]:2 = hlfir.declare %[[ITER_VAR]] {uniq_name = "_QFEi"} : (!fir.ref<i32>) -> (!fir.ref<i32>, !fir.ref<i32>)
+
+ // CHECK: %[[LB:.*]] = fir.convert %[[C1]] : (i32) -> index
+ // CHECK: %[[UB:.*]] = fir.convert %[[C10]] : (i32) -> index
+ // CHECK: %[[STEP:.*]] = arith.constant 1 : index
+
+ // CHECK: omp.wsloop for (%[[ARG0:.*]]) : index = (%[[LB]]) to (%[[UB]]) inclusive step (%[[STEP]]) {
+ // CHECK-NEXT: %[[IV_IDX:.*]] = fir.convert %[[ARG0]] : (index) -> i32
+ // CHECK-NEXT: fir.store %[[IV_IDX]] to %[[BINDING]]#1 : !fir.ref<i32>
+ // CHECK-NEXT: %[[IV_VAL1:.*]] = fir.load %[[BINDING]]#0 : !fir.ref<i32>
+ // CHECK-NEXT: %[[IV_VAL2:.*]] = fir.load %[[BINDING]]#0 : !fir.ref<i32>
+ // CHECK-NEXT: %[[IV_VAL_I64:.*]] = fir.convert %[[IV_VAL2]] : (i32) -> i64
+ // CHECK-NEXT: %[[ARR_ACCESS:.*]] = hlfir.designate %[[ARR]]#0 (%[[IV_VAL_I64]]) : (!fir.ref<!fir.array<10xi32>>, i64) -> !fir.ref<i32>
+ // CHECK-NEXT: hlfir.assign %[[IV_VAL1]] to %[[ARR_ACCESS]] : i32, !fir.ref<i32>
+ // CHECK-NEXT: omp.yield
+ // CHECK-NEXT: }
+
+ // CHECK-NEXT: omp.terminator
+ // CHECK-NEXT: }
+ fir.do_loop %arg0 = %7 to %8 step %c1 unordered {
+ %13 = fir.convert %arg0 : (index) -> i32
+ fir.store %13 to %1#1 : !fir.ref<i32>
+ %14 = fir.load %1#0 : !fir.ref<i32>
+ %15 = fir.load %1#0 : !fir.ref<i32>
+ %16 = fir.convert %15 : (i32) -> i64
+ %17 = hlfir.designate %4#0 (%16) : (!fir.ref<!fir.array<10xi32>>, i64) -> !fir.ref<i32>
+ hlfir.assign %14 to %17 : i32, !fir.ref<i32>
+ }
+
+ // CHECK-NOT: fir.do_loop
+
+ return
+ }
|
Nice start. In case you have missed it, there are some recommendations for I guess the things to consider are:
|
Thanks @kiranchandramohan for looking at this!
I also have the feeling that some analyses will be needed but I am not sure what exactly yet. Maybe a starting point would be to verify that input loops respect the constraints listed by the spec, for example: C1121-C1137. I know some of these constraints are enforced by the front-end already before lowering so we will have to consider them one by one and check the best place to check for these constraints. There is also section 11.1.7.5 Additional semantics for DO CONCURRENT constructs in the spec which I think can investigated for further analyses before we apply the mapping to OMP. Do you have any particular analyses in mind outside of this?
Indeed this is currently handled in lowering. Below is an example without Without
|
Thanks @ergawy for the reply. Since there are some strong opinions expressed about |
Sure thing! I will start working on a design doc and open a separate (parent) PR for it. |
Do not run |
Thanks for your reply. Yup, I am looking into understanding the inherent issues in |
It is not complicated. There are two common assumptions about DO CONCURRENT that are not true. First, due to the name, many people assume that DO CONCURRENT is an assertion by the programmer that the loop is safe to parallelize and/or a request to parallelize the loop. Second, that the restrictions imposed by the standard on DO CONCURRENT loops make any conformant program safe to parallelize. The actual definition of DO CONCURRENT is such that it is safe to execute the iterations of the loop in any serial order. They are not sufficient to execute the iterations of the loop in parallel. You can write a DO CONCURRENT loop that is perfectly conformant with the standard but still unsafe to parallelize. The standard could be fixed with a small change, but the committee won't do it. If you have permission from the user via the command line to parallelize, fine. But by default, you still need to analyze the data flow in the loop to guarantee safe parallelization. |
Adds a pass to map `do concurrent` to OpenMP worksharing consturcts. For now, only maps basic loops to `omp parallel do`. This is still a WIP with more work needed for testing and mapping more advanced loops.
This does not detail everything yet, still working through this.
4be76ec
to
07d6c48
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @ergawy, very late review on my side. Thanks for working on this, it is a great feature. On my side, as long as this pass is run behind a flag (given the points made by Peter), the direction looks good to me (I appreciate the fact that this is done in a pass so that do concurrent can also easily be mapped to other concepts).
let summary = "Map `DO CONCURRENT` loops to OpenMP worksharing loops."; | ||
|
||
let description = [{ This is an experimental pass to map `DO CONCURRENR` loops | ||
to their correspnding equivalent OpenMP worksharing constructs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to their correspnding equivalent OpenMP worksharing constructs. | |
to their corresponding equivalent OpenMP worksharing constructs. |
Adds a pass to map
do concurrent
to OpenMP worksharing consturcts. For now, only maps basic loops toomp parallel do
. This is still a WIP with more work needed for testing and mapping more advanced loops.So far, the pass is not added to any
flang
pipelines; it only lives as a standalone pass tested byfir-opt
. Once we hopefully agree on the approach and the pass matures, I will add it toflang
.