GPU Attribute Improvements (#25826)

Closes #22822.

This PR seeks to improve the experience of using various GPU attributes in several cases. These changes are motivated by some of the difficulties I've observed @Iainmon run into while working on his GPU-enabled machine learning code.

## Case 1: Writing CPU and GPU code using the GPU locale model

The way to ensure that a loop is GPU-eligible in user code (and to fail compilation if it is not) is to use `@assertOnGpu`. However, one cannot do this when writing code that is expected to support both GPUs and CPUs, because the attribute also carries a runtime check that fails when the loop runs on the CPU. I've observed Iain's code to have something like the following:

```Chapel
if onGpu {
  @assertOnGpu
  foreach ... { }
} else {
  foreach ... { /* the same loop as above */ }
}
```

This way, the code can be used on both the GPU and the CPU, and the compiler ensures that the GPU version is eligible. However, this introduces a maintenance burden and makes the code rather verbose.

To work around this problem, I introduce a new GPU primitive + attribute: `@gpu.assertEligible`. This attribute has the same behavior as `@assertOnGpu` at compile time, but it has no runtime effect. Thus, the code above can be flattened and continue to support both CPU and GPU runs:

```Chapel
@gpu.assertEligible
foreach ... { }
```

In my opinion, we should phase out the use of `@assertOnGpu` in favor of `@gpu.assertEligible`. It's unclear to me that the runtime assertion is worth keeping two similar attributes around. Personally, I think that the compile-time assertion can be handled by `@gpu.assertEligible`, and various utilities from `GpuDiagnostics` for tracking kernel launches etc. can be used to ensure that GPU execution occurs at runtime. This PR doesn't make this (potentially more controversial) change.
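As a rough illustration of the `GpuDiagnostics` approach mentioned above, the sketch below pairs the compile-time `@gpu.assertEligible` check with kernel-launch counters collected at runtime. It is not part of this PR. It assumes compilation with the GPU locale model and at least one GPU on the current locale; the config constant `n` and the array `A` are hypothetical names used only for illustration.

```Chapel
use GpuDiagnostics;

config const n = 128;  // hypothetical problem size, for illustration only

startGpuDiagnostics();
on here.gpus[0] {
  var A: [1..n] int;

  // Compile-time check only: compilation fails if this loop is not
  // GPU-eligible, but nothing is asserted at runtime.
  @gpu.assertEligible
  foreach i in 1..n do
    A[i] = 2 * i;
}
stopGpuDiagnostics();

// Prints the per-locale counters, including 'kernel_launch'.
writeln(getGpuDiagnostics());
```

Note that other operations inside the `on` block (such as initializing `A`) may also launch kernels, so the counter is best read as "some GPU execution happened" rather than as an exact count for this one loop.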

## Case 2: Disabling GPU support and compiling with `CHPL_LOCALE_MODEL=flat`

When I told Iain to run his performance experiments in the flat locale model (to get started with initial performance results via the CPU), he immediately ran into internal errors. This is an instance of #22822.

My chosen solution to this problem is to make `@assertOnGpu` a compile-time error under `CHPL_LOCALE_MODEL=flat`. This follows from the semantics of `@assertOnGpu`: the attribute has a runtime check, and without a GPU that check is guaranteed to fail, causing a "certain" failure. This error is now user-facing, and tells the user to switch to `@gpu.assertEligible` if all they want is a compile-time check.

On the other hand, the `@gpu.assertEligible` attribute, which does not have any runtime semantics, does not cause a compilation error with the `flat` locale model. Instead, the attribute is simply ignored (we don't perform any GPU logic with the flat locale model, and it doesn't seem worth it to actually perform GPU transformations / analysis for the sole purpose of validating GPU eligibility). The same is true for `@gpu.blockSize` and for the non-user-facing "GPU primitive block" primitive, which is used to group GPU primitives created via attributes. Thus, the following code compiles and runs just fine in the `flat` locale model:

```Chapel
@gpu.assertEligible
@gpu.blockSize(128)
foreach i in 1..128 { /* ... */ }
```

Reviewed by @e-kayrakli -- thanks!

## Testing

- [x] new `flat` tests for GPU primitives, including a new user-facing error.
- [x] GPU tests, including new tests for `@gpu.assertEligible`
- [x] paratest

chapel-lang · Aug 30, 2024 · d301f0a
2 parents b136ba4 + 594dee7
commit d301f0a
Showing 39 changed files with 332 additions and 32 deletions.
diff --git a/compiler/AST/primitive.cpp b/compiler/AST/primitive.cpp
@@ -972,6 +972,7 @@ initPrimitive() {
 
   // Generates call that produces runtime error when not run by a GPU
   prim_def(PRIM_ASSERT_ON_GPU, "chpl_assert_on_gpu", returnInfoVoid, true, true);
+  prim_def(PRIM_ASSERT_GPU_ELIGIBLE, "assert gpu eligible", returnInfoVoid, true, true);
   prim_def(PRIM_GPU_ELIGIBLE, "gpu eligible", returnInfoVoid, true, true);
   prim_def(PRIM_GPU_REDUCE_WRAPPER, "gpu reduce wrapper", returnInfoVoid, true);
 

diff --git a/compiler/optimizations/gpuTransforms.cpp b/compiler/optimizations/gpuTransforms.cpp
@@ -410,11 +410,17 @@ class GpuAssertionReporter {
 
   void printNonGpuizableError(CallExpr* assertion, Expr* loc) const {
     debuggerBreakHere();
-    const char* reason = "contains assertOnGpu()";
-    auto isAttributeSym = toSymExpr(assertion->get(1));
-    INT_ASSERT(isAttributeSym);
-    if (isAttributeSym->symbol() == gTrue) {
-      reason = "is marked with @assertOnGpu";
+    const char* reason = nullptr;
+    if (assertion->isPrimitive(PRIM_ASSERT_GPU_ELIGIBLE)) {
+      reason = "is marked with @gpu.assertEligible";
+    } else {
+      INT_ASSERT(assertion->isPrimitive(PRIM_ASSERT_ON_GPU));
+      reason = "contains assertOnGpu()";
+      auto isAttributeSym = toSymExpr(assertion->get(1));
+      INT_ASSERT(isAttributeSym);
+      if (isAttributeSym->symbol() == gTrue) {
+        reason = "is marked with @assertOnGpu";
+      }
     }
     USR_FATAL_CONT(loc, "Loop %s but is not eligible for execution on a GPU", reason);
   }
@@ -617,6 +623,16 @@ bool GpuizableLoop::isReportWorthy() {
   return true;
 }
 
+static CallExpr* toCallToGpuEligibilityPrimitive(Expr* expr) {
+  CallExpr *call = toCallExpr(expr);
+  if (call &&
+      (call->isPrimitive(PRIM_ASSERT_ON_GPU) ||
+       call->isPrimitive(PRIM_ASSERT_GPU_ELIGIBLE))) {
+    return call;
+  }
+  return nullptr;
+}
+
 CallExpr* GpuizableLoop::findCompileTimeGpuAssertions() {
   CForLoop *cfl = this->loop_;
   INT_ASSERT(cfl);
@@ -629,8 +645,7 @@ CallExpr* GpuizableLoop::findCompileTimeGpuAssertions() {
   // assign to the loop iteration variable if we're iterating
   // over values rather than indices)
   for_alist(expr, cfl->body) {
-    CallExpr *call = toCallExpr(expr);
-    if (call && call->isPrimitive(PRIM_ASSERT_ON_GPU)) {
+    if (auto call = toCallToGpuEligibilityPrimitive(expr)) {
       return call;
     }
 
@@ -639,8 +654,7 @@ CallExpr* GpuizableLoop::findCompileTimeGpuAssertions() {
     BlockStmt *blk = toBlockStmt(expr);
     if (blk && blk->isGpuPrimitivesBlock()) {
       for_alist(expr, blk->body) {
-        CallExpr *call = toCallExpr(expr);
-        if (call && call->isPrimitive(PRIM_ASSERT_ON_GPU)) {
+        if (auto call = toCallToGpuEligibilityPrimitive(expr)) {
           return call;
         }
       }
@@ -1576,6 +1590,7 @@ bool isCallToPrimitiveWeShouldNotCopyIntoKernel(CallExpr *call) {
   if (!call) return false;
 
   return call->isPrimitive(PRIM_ASSERT_ON_GPU) ||
+         call->isPrimitive(PRIM_ASSERT_GPU_ELIGIBLE) ||
          call->isPrimitive(PRIM_GPU_SET_BLOCKSIZE) ||
          call->isPrimitive(PRIM_GPU_PRIMITIVE_BLOCK);
 }
@@ -2144,8 +2159,13 @@ static void cleanupPrimitives() {
       // uses of the primitive, which we process by removing the primitive but keeping
       // the copy.
       cleanupTaskIndependentCapturePrimitive(callExpr);
-    }
-    else if(callExpr->isPrimitive(PRIM_GPU_SET_BLOCKSIZE)) {
+    } else if (callExpr->isPrimitive(PRIM_GPU_SET_BLOCKSIZE) ||
+               callExpr->isPrimitive(PRIM_ASSERT_GPU_ELIGIBLE)) {
+      callExpr->remove();
+    } else if(callExpr->isPrimitive(PRIM_GPU_PRIMITIVE_BLOCK)) {
+      auto parentBlock = toBlockStmt(callExpr->parentExpr);
+      INT_ASSERT(parentBlock);
+      parentBlock->flattenAndRemove();
       callExpr->remove();
     }
   }

diff --git a/compiler/passes/convert-uast.cpp b/compiler/passes/convert-uast.cpp
@@ -126,6 +126,9 @@ struct LoopAttributeInfo {
   LLVMMetadataList llvmMetadata;
   // The @assertOnGpu attribute, if one is provided by the user.
   const uast::Attribute* assertOnGpuAttr = nullptr;
+  // The @gpu.assertEligible attribute, which asserts GPU eligibility,
+  // if one is provided by the user.
+  const uast::Attribute* assertEligibleAttr = nullptr;
   // The @gpu.blockSize attribute, if one is provided by the user.
   const uast::Attribute* blockSizeAttr = nullptr;
 
@@ -207,6 +210,7 @@ struct LoopAttributeInfo {
 
   void readNativeGpuAttributes(const uast::AttributeGroup* attrs) {
     this->assertOnGpuAttr = attrs->getAttributeNamed(USTR("assertOnGpu"));
+    this->assertEligibleAttr = attrs->getAttributeNamed(USTR("gpu.assertEligible"));
     this->blockSizeAttr = attrs->getAttributeNamed(USTR("gpu.blockSize"));
   }
 
@@ -238,6 +242,7 @@ struct LoopAttributeInfo {
   bool empty() const {
     return llvmMetadata.size() == 0 &&
            assertOnGpuAttr == nullptr &&
+           assertEligibleAttr == nullptr &&
            blockSizeAttr == nullptr;
   }
 
@@ -462,12 +467,6 @@ struct Converter {
     return nullptr;
   }
 
-  void readNativeGpuAttributes(LoopAttributeInfo& into,
-                               const uast::AttributeGroup* attrs) {
-    into.assertOnGpuAttr = attrs->getAttributeNamed(USTR("assertOnGpu"));
-    into.blockSizeAttr = attrs->getAttributeNamed(USTR("gpu.blockSize"));
-  }
-
   Expr* visit(const uast::AttributeGroup* node) {
     INT_FATAL("Should not be called directly!");
     return nullptr;
@@ -1760,6 +1759,9 @@ struct Converter {
     if (loopAttributes.assertOnGpuAttr != nullptr) {
       CHPL_REPORT(context, InvalidGpuAssertion, node,
                   loopAttributes.assertOnGpuAttr);
+    } else if (loopAttributes.assertEligibleAttr != nullptr) {
+      CHPL_REPORT(context, InvalidGpuAssertion, node,
+                  loopAttributes.assertEligibleAttr);
     }
     return std::move(loopAttributes.llvmMetadata);
   }
@@ -4366,12 +4368,16 @@ struct Converter {
 };
 
 bool LoopAttributeInfo::insertGpuEligibilityAssertion(BlockStmt* body) {
+  bool inserted = false;
   if (assertOnGpuAttr) {
-    body->insertAtTail(new CallExpr(PRIM_ASSERT_ON_GPU,
-                                    new SymExpr(gTrue)));
-    return true;
+    body->insertAtTail(new CallExpr("chpl__assertOnGpuAttr"));
+    inserted = true;
   }
-  return false;
+  if (assertEligibleAttr) {
+    body->insertAtTail(new CallExpr("chpl__gpuAssertEligibleAttr"));
+    inserted = true;
+  }
+  return inserted;
 }
 
 bool LoopAttributeInfo::insertBlockSizeCall(Converter& converter, BlockStmt* body) {
@@ -4384,16 +4390,11 @@ bool LoopAttributeInfo::insertBlockSizeCall(Converter& converter, BlockStmt* bod
   static int counter = 0;
 
   if (blockSizeAttr) {
-    if (blockSizeAttr->numActuals() != 1) {
-      USR_FATAL(blockSizeAttr->id(),
-                "'@gpu.blockSize' attribute must have exactly one argument: "
-                "the block size");
+    auto newCall = new CallExpr("chpl__gpuBlockSizeAttr", new_IntSymbol(counter++));
+    for (auto actual : blockSizeAttr->actuals()) {
+      newCall->insertAtTail(converter.convertAST(actual));
     }
-
-    Expr* blockSize = converter.convertAST(blockSizeAttr->actual(0));
-    body->insertAtTail(new CallExpr(PRIM_GPU_SET_BLOCKSIZE,
-                                    blockSize,
-                                    new_IntSymbol(counter++)));
+    body->insertAtTail(newCall);
     return true;
   }
   return false;

diff --git a/frontend/include/chpl/framework/all-global-strings.h b/frontend/include/chpl/framework/all-global-strings.h
@@ -57,6 +57,7 @@ X(forall              , "forall")
 X(foreach             , "foreach")
 X(functionStatic      , "functionStatic")
 X(generate            , "generate")
+X(gpuAssertEligible   , "gpu.assertEligible")
 X(gpuBlockSize        , "gpu.blockSize")
 X(hash_               , "hash")
 X(imag_               , "imag")

diff --git a/frontend/include/chpl/uast/prim-ops-list.h b/frontend/include/chpl/uast/prim-ops-list.h
@@ -173,6 +173,7 @@ PRIMITIVE_G(GPU_ALLOC_SHARED, "gpu allocShared")
 PRIMITIVE_G(GPU_SYNC_THREADS, "gpu syncThreads")
 PRIMITIVE_R(GPU_SET_BLOCKSIZE, "gpu set blockSize")
 PRIMITIVE_G(ASSERT_ON_GPU, "chpl_assert_on_gpu")
+PRIMITIVE_R(ASSERT_GPU_ELIGIBLE, "assert gpu eligible")
 PRIMITIVE_R(GPU_ELIGIBLE, "gpu eligible")
 PRIMITIVE_G(GPU_INIT_KERNEL_CFG, "gpu init kernel cfg")
 PRIMITIVE_G(GPU_INIT_KERNEL_CFG_3D, "gpu init kernel cfg 3d")

diff --git a/frontend/lib/resolution/prims.cpp b/frontend/lib/resolution/prims.cpp
@@ -1636,6 +1636,10 @@ CallResolutionResult resolvePrimCall(Context* context,
       type = primAssertOnGpu(context, ci);
       break;
 
+    case PRIM_ASSERT_GPU_ELIGIBLE:
+      type = QualifiedType(QualifiedType::CONST_VAR, VoidType::get(context));
+      break;
+
     case PRIM_GPU_INIT_KERNEL_CFG:
     case PRIM_GPU_INIT_KERNEL_CFG_3D:
       type = QualifiedType(QualifiedType::CONST_VAR, CPtrType::getCVoidPtrType(context));

diff --git a/frontend/lib/uast/post-parse-checks.cpp b/frontend/lib/uast/post-parse-checks.cpp
@@ -1620,6 +1620,7 @@ void Visitor::checkAttributeNameRecognizedOrToolSpaced(const Attribute* node) {
              node->name() == USTR("stable") ||
              node->name() == USTR("functionStatic") ||
              node->name() == USTR("assertOnGpu") ||
+             node->name() == USTR("gpu.assertEligible") ||
              node->name() == USTR("gpu.blockSize") ||
              node->name().startsWith(USTR("chpldoc.")) ||
              node->name().startsWith(USTR("chplcheck.")) ||
@@ -1653,13 +1654,15 @@ void Visitor::checkAttributeAppliedToCorrectNode(const Attribute* attr) {
   auto attributeGroup = parents_[parents_.size() - 1];
   CHPL_ASSERT(attributeGroup->isAttributeGroup());
   auto node = parents_[parents_.size() - 2];
-  if (attr->name() == USTR("assertOnGpu") || attr->name() == USTR("gpu.blockSize")) {
+  if (attr->name() == USTR("assertOnGpu") ||
+      attr->name() == USTR("gpu.blockSize") ||
+      attr->name() == USTR("gpu.assertEligible")) {
     if (node->isForall() || node->isForeach()) return;
     if (auto var = node->toVariable()) {
        if (!var->isField()) return;
     }
 
-    if (attr->name() == USTR("assertOnGpu")) {
+    if (attr->name() == USTR("assertOnGpu") || attr->name() == USTR("gpu.assertEligible")) {
       CHPL_REPORT(context_, InvalidGpuAssertion, node, attr);
     } else {
       CHPL_ASSERT(attr->name() == USTR("gpu.blockSize"));

diff --git a/modules/internal/ChapelStandard.chpl b/modules/internal/ChapelStandard.chpl
@@ -81,6 +81,7 @@ module ChapelStandard {
   // Standard modules.
   public use Types as Types;
   public use AutoMath as AutoMath;
+  public use AutoGpu as AutoGpu;
 
   use stopInitCommDiags;  // Internal, but uses standard/CommDiagnostics
 }
diff --git a/modules/standard/AutoGpu.chpl b/modules/standard/AutoGpu.chpl
@@ -0,0 +1,64 @@
+/*
+ * Copyright 2024 Hewlett Packard Enterprise Development LP
+ * Other additional copyright holders may be indicated within.
+ *
+ * The entirety of this work is licensed under the Apache License,
+ * Version 2.0 (the "License"); you may not use this file except
+ * in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+pragma "module included by default"
+@unstable("The module name 'AutoGpu' is unstable.")
+module AutoGpu {
+  // This module supports GPU-specific attributes like @gpu.assertEligible
+  // @assertOnGpu. These attributes are translated into calls to procedures
+  // in this module as part of the loop body, which insert various GPU
+  // primitives. The primitives are used to configure the GPU execution.
+
+  use ChplConfig;
+  use Errors;
+
+  inline proc chpl__gpuAssertEligibleAttr() {
+    if CHPL_LOCALE_MODEL == "gpu" then
+      __primitive("assert gpu eligible");
+  }
+
+  config param silenceAssertOnGpuWarning = false;
+
+  inline proc chpl__assertOnGpuAttr() {
+    if CHPL_LOCALE_MODEL != "gpu" && !silenceAssertOnGpuWarning {
+      compilerWarning("@assertOnGpu encountered in non-GPU compilation");
+      compilerWarning("this attribute has a runtime component, and will ",
+                      "always halt execution in a non-GPU context.");
+      compilerWarning("consider using '@gpu.assertEligible' to ensure ",
+                      "that the code can be executed on the GPU without ",
+                      "runtime checks.");
+    }
+    __primitive("chpl_assert_on_gpu", true);
+  }
+
+  inline proc chpl__gpuBlockSizeAttr(param counter: int, arg: integral) {
+    if CHPL_LOCALE_MODEL == "gpu" then
+      __primitive("gpu set blockSize", arg, counter);
+  }
+
+  pragma "last resort"
+  inline proc chpl__gpuBlockSizeAttr(param counter: int, rest ...) {
+    compilerError("'@gpu.blockSize' attribute must have exactly one argument: an integral value for the block size");
+  }
+
+  pragma "last resort"
+  inline proc chpl__gpuBlockSizeAttr(param counter: int) {
+    compilerError("'@gpu.blockSize' attribute must have exactly one argument: an integral value for the block size");
+  }
+}
diff --git a/runtime/include/chpl-gpu.h b/runtime/include/chpl-gpu.h
@@ -201,6 +201,15 @@ GPU_CUB_WRAP(DECL_ONE_SORT, keys);
 
 #undef DECL_ONE_SORT
 
+#else // HAS_GPU_LOCALE
+
+// Provide a fallback for the chpl_assert_on_gpu function for non-GPU locales.
+// This works exactly the same as the standard one.
+
+static inline void chpl_assert_on_gpu(int32_t ln, int32_t fn) {
+  chpl_error("assertOnGpu() failed", ln, fn);
+}
+
 #endif // HAS_GPU_LOCALE
 
 #ifdef __cplusplus

diff --git a/test/compflags/ferguson/print-module-resolution.good b/test/compflags/ferguson/print-module-resolution.good
@@ -210,6 +210,8 @@ ChapelStaticVars
   from print-module-resolution.ChapelStandard.ChapelStaticVars
 ChapelRemoteVars
   from print-module-resolution.ChapelStandard.ChapelRemoteVars
+AutoGpu
+  from print-module-resolution.ChapelStandard.AutoGpu
 stopInitCommDiags
   from print-module-resolution.ChapelStandard.stopInitCommDiags
 ChapelStandard

diff --git a/test/gpu/native/assertEligibleNoRuntime.chpl b/test/gpu/native/assertEligibleNoRuntime.chpl
@@ -0,0 +1,4 @@
+@gpu.assertEligible
+var A = foreach i in 1..100 do i;
+
+writeln("all is good; '@gpu.assertEligible' doesn't require GPU execution.");
diff --git a/test/gpu/native/assertEligibleNoRuntime.good b/test/gpu/native/assertEligibleNoRuntime.good
@@ -0,0 +1 @@
+all is good; '@gpu.assertEligible' doesn't require GPU execution.
diff --git a/test/gpu/native/assertOnNotGpuEligible.1.good b/test/gpu/native/assertOnNotGpuEligible.1.good
@@ -0,0 +1,3 @@
+assertOnNotGpuEligible.chpl:15: In function 'funcMarkedNotGpuizableThatTriesToGpuize':
+assertOnNotGpuEligible.chpl:17: error: Loop is marked with @gpu.assertEligible but is not eligible for execution on a GPU
+assertOnNotGpuEligible.chpl:15: note: parent function disallows execution on a GPU
diff --git a/test/gpu/native/assertOnNotGpuEligible.2.good b/test/gpu/native/assertOnNotGpuEligible.2.good
@@ -0,0 +1,3 @@
+assertOnNotGpuEligible.chpl:32: error: Loop is marked with @gpu.assertEligible but is not eligible for execution on a GPU
+assertOnNotGpuEligible.chpl:23: note: function is marked as not eligible for GPU execution
+assertOnNotGpuEligible.chpl:33: note:   reached via call to 'funcMarkedNotGpuizable' in loop body here
diff --git a/test/gpu/native/assertOnNotGpuEligible.3.good b/test/gpu/native/assertOnNotGpuEligible.3.good
@@ -0,0 +1,3 @@
+assertOnNotGpuEligible.chpl:39: error: Loop is marked with @gpu.assertEligible but is not eligible for execution on a GPU
+assertOnNotGpuEligible.chpl:12: note: called function has outer var access
+assertOnNotGpuEligible.chpl:40: note:   reached via call to 'usesOutsideVar' in loop body here
diff --git a/test/gpu/native/assertOnNotGpuEligible.4.good b/test/gpu/native/assertOnNotGpuEligible.4.good
@@ -0,0 +1 @@
+all is good; '@gpu.assertEligible' doesn't require GPU execution.