GPU Attribute Improvements (#25826)
Closes #22822.

This PR seeks to improve the experience of using various GPU attributes
in several cases. These changes are motivated by some of the
difficulties I've observed @Iainmon to have had while working on his
GPU-enabled machine learning code.

## Case 1: Writing CPU and GPU code using the GPU locale model
The way to ensure that a loop is GPU-eligible in user code (and to fail
compilation if the loop is not GPU-eligible) is to use `@assertOnGpu`.
However, one cannot do this when writing code that is expected to
support both GPUs and CPUs. I've observed Iain's code to have something
like the following:

```Chapel
if onGpu {
  @assertOnGpu
  foreach ... { }
} else {
  foreach ... { /* the same loop as above */ }
}
```

This way, the code could be used on both the GPU and the CPU, and the
compiler will ensure that the GPU version is eligible. However, this
introduces a maintenance burden, and makes the code rather verbose. To
work around this problem, I introduce a new GPU primitive + attribute:
`@gpu.assertEligible`. This attribute has the same behavior as
`@assertOnGpu` at compile-time, but it does not have a runtime effect.
Thus, the code above can be flattened and continue to support both CPU
and GPU runs:

```Chapel
@gpu.assertEligible
foreach ... { }
```
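
A minimal sketch of what the flattened form might look like in a complete program. The `fillSquares` helper, the array contents, and the use of `here.gpus` to pick a target are assumptions made for illustration; they are not taken from Iain's code or from this PR:

```Chapel
// Hypothetical helper: a single loop that is checked for GPU eligibility at
// compile time, but carries no runtime requirement to execute on a GPU.
proc fillSquares(n: int) {
  var A: [1..n] int;
  @gpu.assertEligible
  foreach i in 1..n do A[i] = i * i;
  return A;
}

// Assumption: use the first GPU sublocale if one exists, else stay on the CPU.
const target = if here.gpus.size > 0 then here.gpus[0] else here;
on target {
  const A = fillSquares(8);
  writeln(A);
}
```

The same `fillSquares` body serves both targets, so there is no second copy of the loop to keep in sync.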

In my opinion, we should phase out the use of `@assertOnGpu` in favor of
`@gpu.assertEligible`. It's unclear to me that having a runtime
assertion is worth keeping two similar attributes around. Personally, I
think that the compile-time assertion can be handled by
`@gpu.assertEligible`, and that various utilities from `GpuDiagnostics`
(for tracking kernel launches, etc.) can be used to ensure that GPU
execution occurs at runtime. This PR doesn't make this (potentially more
controversial) change.
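
As a rough sketch of that alternative, the runtime side could be checked with `GpuDiagnostics` instead of `@assertOnGpu`. The `loopLaunchedKernel` helper is hypothetical, and the sketch assumes the diagnostics record exposes a `kernel_launch` counter; it only checks that some kernel was launched rather than an exact count:

```Chapel
use GpuDiagnostics;

// Hypothetical helper: reports whether at least one GPU kernel was launched
// on this locale while the loop below ran.
proc loopLaunchedKernel(n: int): bool {
  // Without a GPU sublocale there is nothing to launch on.
  if here.gpus.size == 0 then return false;

  resetGpuDiagnostics();
  startGpuDiagnostics();
  on here.gpus[0] {
    var A: [1..n] int;
    @gpu.assertEligible   // compile-time check only
    foreach i in 1..n do A[i] = i;
  }
  stopGpuDiagnostics();

  // getGpuDiagnostics() returns one record per locale.
  return getGpuDiagnostics()[here.id].kernel_launch > 0;
}

writeln("kernel launched: ", loopLaunchedKernel(16));
```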

## Case 2: Disabling GPU support and compiling with `CHPL_LOCALE_MODEL=flat`
When I told Iain to run his performance experiments in the flat locale
model (to get started with initial performance results via the CPU), he
immediately ran into internal errors. This is an instance of
#22822.

My chosen solution to this problem is to make `@assertOnGpu` a
compile-time error under `CHPL_LOCALE_MODEL=flat`. This follows from
the semantics of `@assertOnGpu`: the attribute has a runtime check, and
without a GPU that check is guaranteed to fail. This error is now
user-facing, and tells the user to switch to `@gpu.assertEligible` if
all they want is a compile-time check.

On the other hand, the `@gpu.assertEligible` attribute, which does not
have any runtime semantics, does not cause a compilation error with the
`flat` locale model. Instead, the attribute is simply ignored (we don't
perform any GPU logic with the flat locale model, and it doesn't seem
worth it to actually perform GPU transformations / analysis for the sole
purpose of validating GPU eligibility). The same is true for
`@gpu.blockSize`, and the non-user-facing "GPU primitive block"
primitive which is used to group GPU primitives created via attributes.
Thus, the following code compiles and runs just fine in the `flat`
locale model:

```Chapel
@gpu.assertEligible
@gpu.blockSize(128)
foreach i in 1..128 { /* ... */ }
```
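
Ignoring the attributes is cheap because `CHPL_LOCALE_MODEL` is a `param` in `ChplConfig`, so a conditional on it is resolved at compile time. A minimal sketch of the same pattern in user code (the `gpuSupportCompiledIn` and `describeBuild` names are hypothetical, for illustration only):

```Chapel
use ChplConfig;

// Hypothetical helper: a param function, so callers can branch on it at
// compile time.
proc gpuSupportCompiledIn() param: bool {
  return CHPL_LOCALE_MODEL == "gpu";
}

proc describeBuild(): string {
  // Because the condition is a param, only the selected branch is resolved.
  if gpuSupportCompiledIn() then
    return "GPU locale model: GPU attributes are lowered to primitives";
  else
    return "non-GPU locale model: '@gpu.assertEligible' and '@gpu.blockSize' are ignored";
}

writeln(describeBuild());
```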

Reviewed by @e-kayrakli -- thanks!

## Testing
- [x] new `flat` tests for GPU primitives, including a new user-facing
error.
- [x] GPU tests, including new tests for `@gpu.assertEligible`
- [x] paratest
DanilaFe authored Aug 30, 2024
2 parents b136ba4 + 594dee7 commit d301f0a
Showing 39 changed files with 332 additions and 32 deletions.
1 change: 1 addition & 0 deletions compiler/AST/primitive.cpp
@@ -972,6 +972,7 @@ initPrimitive() {

// Generates call that produces runtime error when not run by a GPU
prim_def(PRIM_ASSERT_ON_GPU, "chpl_assert_on_gpu", returnInfoVoid, true, true);
prim_def(PRIM_ASSERT_GPU_ELIGIBLE, "assert gpu eligible", returnInfoVoid, true, true);
prim_def(PRIM_GPU_ELIGIBLE, "gpu eligible", returnInfoVoid, true, true);
prim_def(PRIM_GPU_REDUCE_WRAPPER, "gpu reduce wrapper", returnInfoVoid, true);

42 changes: 31 additions & 11 deletions compiler/optimizations/gpuTransforms.cpp
@@ -410,11 +410,17 @@ class GpuAssertionReporter {

void printNonGpuizableError(CallExpr* assertion, Expr* loc) const {
debuggerBreakHere();
const char* reason = "contains assertOnGpu()";
auto isAttributeSym = toSymExpr(assertion->get(1));
INT_ASSERT(isAttributeSym);
if (isAttributeSym->symbol() == gTrue) {
reason = "is marked with @assertOnGpu";
const char* reason = nullptr;
if (assertion->isPrimitive(PRIM_ASSERT_GPU_ELIGIBLE)) {
reason = "is marked with @gpu.assertEligible";
} else {
INT_ASSERT(assertion->isPrimitive(PRIM_ASSERT_ON_GPU));
reason = "contains assertOnGpu()";
auto isAttributeSym = toSymExpr(assertion->get(1));
INT_ASSERT(isAttributeSym);
if (isAttributeSym->symbol() == gTrue) {
reason = "is marked with @assertOnGpu";
}
}
USR_FATAL_CONT(loc, "Loop %s but is not eligible for execution on a GPU", reason);
}
@@ -617,6 +623,16 @@ bool GpuizableLoop::isReportWorthy() {
return true;
}

static CallExpr* toCallToGpuEligibilityPrimitive(Expr* expr) {
CallExpr *call = toCallExpr(expr);
if (call &&
(call->isPrimitive(PRIM_ASSERT_ON_GPU) ||
call->isPrimitive(PRIM_ASSERT_GPU_ELIGIBLE))) {
return call;
}
return nullptr;
}

CallExpr* GpuizableLoop::findCompileTimeGpuAssertions() {
CForLoop *cfl = this->loop_;
INT_ASSERT(cfl);
@@ -629,8 +645,7 @@ CallExpr* GpuizableLoop::findCompileTimeGpuAssertions() {
// assign to the loop iteration variable if we're iterating
// over values rather than indices)
for_alist(expr, cfl->body) {
CallExpr *call = toCallExpr(expr);
if (call && call->isPrimitive(PRIM_ASSERT_ON_GPU)) {
if (auto call = toCallToGpuEligibilityPrimitive(expr)) {
return call;
}

@@ -639,8 +654,7 @@ CallExpr* GpuizableLoop::findCompileTimeGpuAssertions() {
BlockStmt *blk = toBlockStmt(expr);
if (blk && blk->isGpuPrimitivesBlock()) {
for_alist(expr, blk->body) {
CallExpr *call = toCallExpr(expr);
if (call && call->isPrimitive(PRIM_ASSERT_ON_GPU)) {
if (auto call = toCallToGpuEligibilityPrimitive(expr)) {
return call;
}
}
@@ -1576,6 +1590,7 @@ bool isCallToPrimitiveWeShouldNotCopyIntoKernel(CallExpr *call) {
if (!call) return false;

return call->isPrimitive(PRIM_ASSERT_ON_GPU) ||
call->isPrimitive(PRIM_ASSERT_GPU_ELIGIBLE) ||
call->isPrimitive(PRIM_GPU_SET_BLOCKSIZE) ||
call->isPrimitive(PRIM_GPU_PRIMITIVE_BLOCK);
}
@@ -2144,8 +2159,13 @@ static void cleanupPrimitives() {
// uses of the primitive, which we process by removing the primitive but keeping
// the copy.
cleanupTaskIndependentCapturePrimitive(callExpr);
}
else if(callExpr->isPrimitive(PRIM_GPU_SET_BLOCKSIZE)) {
} else if (callExpr->isPrimitive(PRIM_GPU_SET_BLOCKSIZE) ||
callExpr->isPrimitive(PRIM_ASSERT_GPU_ELIGIBLE)) {
callExpr->remove();
} else if(callExpr->isPrimitive(PRIM_GPU_PRIMITIVE_BLOCK)) {
auto parentBlock = toBlockStmt(callExpr->parentExpr);
INT_ASSERT(parentBlock);
parentBlock->flattenAndRemove();
callExpr->remove();
}
}
39 changes: 20 additions & 19 deletions compiler/passes/convert-uast.cpp
@@ -126,6 +126,9 @@ struct LoopAttributeInfo {
LLVMMetadataList llvmMetadata;
// The @assertOnGpu attribute, if one is provided by the user.
const uast::Attribute* assertOnGpuAttr = nullptr;
// The @gpu.assertEligible attribute, which asserts GPU eligibility,
// if one is provided by the user.
const uast::Attribute* assertEligibleAttr = nullptr;
// The @gpu.blockSize attribute, if one is provided by the user.
const uast::Attribute* blockSizeAttr = nullptr;

@@ -207,6 +210,7 @@ struct LoopAttributeInfo {

void readNativeGpuAttributes(const uast::AttributeGroup* attrs) {
this->assertOnGpuAttr = attrs->getAttributeNamed(USTR("assertOnGpu"));
this->assertEligibleAttr = attrs->getAttributeNamed(USTR("gpu.assertEligible"));
this->blockSizeAttr = attrs->getAttributeNamed(USTR("gpu.blockSize"));
}

@@ -238,6 +242,7 @@ struct LoopAttributeInfo {
bool empty() const {
return llvmMetadata.size() == 0 &&
assertOnGpuAttr == nullptr &&
assertEligibleAttr == nullptr &&
blockSizeAttr == nullptr;
}

@@ -462,12 +467,6 @@ struct Converter {
return nullptr;
}

void readNativeGpuAttributes(LoopAttributeInfo& into,
const uast::AttributeGroup* attrs) {
into.assertOnGpuAttr = attrs->getAttributeNamed(USTR("assertOnGpu"));
into.blockSizeAttr = attrs->getAttributeNamed(USTR("gpu.blockSize"));
}

Expr* visit(const uast::AttributeGroup* node) {
INT_FATAL("Should not be called directly!");
return nullptr;
@@ -1760,6 +1759,9 @@ struct Converter {
if (loopAttributes.assertOnGpuAttr != nullptr) {
CHPL_REPORT(context, InvalidGpuAssertion, node,
loopAttributes.assertOnGpuAttr);
} else if (loopAttributes.assertEligibleAttr != nullptr) {
CHPL_REPORT(context, InvalidGpuAssertion, node,
loopAttributes.assertEligibleAttr);
}
return std::move(loopAttributes.llvmMetadata);
}
@@ -4366,12 +4368,16 @@ struct Converter {
};

bool LoopAttributeInfo::insertGpuEligibilityAssertion(BlockStmt* body) {
bool inserted = false;
if (assertOnGpuAttr) {
body->insertAtTail(new CallExpr(PRIM_ASSERT_ON_GPU,
new SymExpr(gTrue)));
return true;
body->insertAtTail(new CallExpr("chpl__assertOnGpuAttr"));
inserted = true;
}
return false;
if (assertEligibleAttr) {
body->insertAtTail(new CallExpr("chpl__gpuAssertEligibleAttr"));
inserted = true;
}
return inserted;
}

bool LoopAttributeInfo::insertBlockSizeCall(Converter& converter, BlockStmt* body) {
@@ -4384,16 +4390,11 @@ bool LoopAttributeInfo::insertBlockSizeCall(Converter& converter, BlockStmt* bod
static int counter = 0;

if (blockSizeAttr) {
if (blockSizeAttr->numActuals() != 1) {
USR_FATAL(blockSizeAttr->id(),
"'@gpu.blockSize' attribute must have exactly one argument: "
"the block size");
auto newCall = new CallExpr("chpl__gpuBlockSizeAttr", new_IntSymbol(counter++));
for (auto actual : blockSizeAttr->actuals()) {
newCall->insertAtTail(converter.convertAST(actual));
}

Expr* blockSize = converter.convertAST(blockSizeAttr->actual(0));
body->insertAtTail(new CallExpr(PRIM_GPU_SET_BLOCKSIZE,
blockSize,
new_IntSymbol(counter++)));
body->insertAtTail(newCall);
return true;
}
return false;
1 change: 1 addition & 0 deletions frontend/include/chpl/framework/all-global-strings.h
@@ -57,6 +57,7 @@ X(forall , "forall")
X(foreach , "foreach")
X(functionStatic , "functionStatic")
X(generate , "generate")
X(gpuAssertEligible , "gpu.assertEligible")
X(gpuBlockSize , "gpu.blockSize")
X(hash_ , "hash")
X(imag_ , "imag")
1 change: 1 addition & 0 deletions frontend/include/chpl/uast/prim-ops-list.h
@@ -173,6 +173,7 @@ PRIMITIVE_G(GPU_ALLOC_SHARED, "gpu allocShared")
PRIMITIVE_G(GPU_SYNC_THREADS, "gpu syncThreads")
PRIMITIVE_R(GPU_SET_BLOCKSIZE, "gpu set blockSize")
PRIMITIVE_G(ASSERT_ON_GPU, "chpl_assert_on_gpu")
PRIMITIVE_R(ASSERT_GPU_ELIGIBLE, "assert gpu eligible")
PRIMITIVE_R(GPU_ELIGIBLE, "gpu eligible")
PRIMITIVE_G(GPU_INIT_KERNEL_CFG, "gpu init kernel cfg")
PRIMITIVE_G(GPU_INIT_KERNEL_CFG_3D, "gpu init kernel cfg 3d")
4 changes: 4 additions & 0 deletions frontend/lib/resolution/prims.cpp
@@ -1636,6 +1636,10 @@ CallResolutionResult resolvePrimCall(Context* context,
type = primAssertOnGpu(context, ci);
break;

case PRIM_ASSERT_GPU_ELIGIBLE:
type = QualifiedType(QualifiedType::CONST_VAR, VoidType::get(context));
break;

case PRIM_GPU_INIT_KERNEL_CFG:
case PRIM_GPU_INIT_KERNEL_CFG_3D:
type = QualifiedType(QualifiedType::CONST_VAR, CPtrType::getCVoidPtrType(context));
7 changes: 5 additions & 2 deletions frontend/lib/uast/post-parse-checks.cpp
@@ -1620,6 +1620,7 @@ void Visitor::checkAttributeNameRecognizedOrToolSpaced(const Attribute* node) {
node->name() == USTR("stable") ||
node->name() == USTR("functionStatic") ||
node->name() == USTR("assertOnGpu") ||
node->name() == USTR("gpu.assertEligible") ||
node->name() == USTR("gpu.blockSize") ||
node->name().startsWith(USTR("chpldoc.")) ||
node->name().startsWith(USTR("chplcheck.")) ||
@@ -1653,13 +1654,15 @@ void Visitor::checkAttributeAppliedToCorrectNode(const Attribute* attr) {
auto attributeGroup = parents_[parents_.size() - 1];
CHPL_ASSERT(attributeGroup->isAttributeGroup());
auto node = parents_[parents_.size() - 2];
if (attr->name() == USTR("assertOnGpu") || attr->name() == USTR("gpu.blockSize")) {
if (attr->name() == USTR("assertOnGpu") ||
attr->name() == USTR("gpu.blockSize") ||
attr->name() == USTR("gpu.assertEligible")) {
if (node->isForall() || node->isForeach()) return;
if (auto var = node->toVariable()) {
if (!var->isField()) return;
}

if (attr->name() == USTR("assertOnGpu")) {
if (attr->name() == USTR("assertOnGpu") || attr->name() == USTR("gpu.assertEligible")) {
CHPL_REPORT(context_, InvalidGpuAssertion, node, attr);
} else {
CHPL_ASSERT(attr->name() == USTR("gpu.blockSize"));
1 change: 1 addition & 0 deletions modules/internal/ChapelStandard.chpl
@@ -81,6 +81,7 @@ module ChapelStandard {
// Standard modules.
public use Types as Types;
public use AutoMath as AutoMath;
public use AutoGpu as AutoGpu;

use stopInitCommDiags; // Internal, but uses standard/CommDiagnostics
}
64 changes: 64 additions & 0 deletions modules/standard/AutoGpu.chpl
@@ -0,0 +1,64 @@
/*
* Copyright 2024 Hewlett Packard Enterprise Development LP
* Other additional copyright holders may be indicated within.
*
* The entirety of this work is licensed under the Apache License,
* Version 2.0 (the "License"); you may not use this file except
* in compliance with the License.
*
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

pragma "module included by default"
@unstable("The module name 'AutoGpu' is unstable.")
module AutoGpu {
// This module supports GPU-specific attributes like @gpu.assertEligible and
// @assertOnGpu. These attributes are translated into calls to procedures
// in this module as part of the loop body, which insert various GPU
// primitives. The primitives are used to configure the GPU execution.

use ChplConfig;
use Errors;

inline proc chpl__gpuAssertEligibleAttr() {
if CHPL_LOCALE_MODEL == "gpu" then
__primitive("assert gpu eligible");
}

config param silenceAssertOnGpuWarning = false;

inline proc chpl__assertOnGpuAttr() {
if CHPL_LOCALE_MODEL != "gpu" && !silenceAssertOnGpuWarning {
compilerWarning("@assertOnGpu encountered in non-GPU compilation");
compilerWarning("this attribute has a runtime component, and will ",
"always halt execution in a non-GPU context.");
compilerWarning("consider using '@gpu.assertEligible' to ensure ",
"that the code can be executed on the GPU without ",
"runtime checks.");
}
__primitive("chpl_assert_on_gpu", true);
}

inline proc chpl__gpuBlockSizeAttr(param counter: int, arg: integral) {
if CHPL_LOCALE_MODEL == "gpu" then
__primitive("gpu set blockSize", arg, counter);
}

pragma "last resort"
inline proc chpl__gpuBlockSizeAttr(param counter: int, rest ...) {
compilerError("'@gpu.blockSize' attribute must have exactly one argument: an integral value for the block size");
}

pragma "last resort"
inline proc chpl__gpuBlockSizeAttr(param counter: int) {
compilerError("'@gpu.blockSize' attribute must have exactly one argument: an integral value for the block size");
}
}
9 changes: 9 additions & 0 deletions runtime/include/chpl-gpu.h
@@ -201,6 +201,15 @@ GPU_CUB_WRAP(DECL_ONE_SORT, keys);

#undef DECL_ONE_SORT

#else // HAS_GPU_LOCALE

// Provide a fallback for the chpl_assert_on_gpu function for non-GPU locales.
// This works exactly the same as the standard one.

static inline void chpl_assert_on_gpu(int32_t ln, int32_t fn) {
chpl_error("assertOnGpu() failed", ln, fn);
}

#endif // HAS_GPU_LOCALE

#ifdef __cplusplus
2 changes: 2 additions & 0 deletions test/compflags/ferguson/print-module-resolution.good
@@ -210,6 +210,8 @@ ChapelStaticVars
from print-module-resolution.ChapelStandard.ChapelStaticVars
ChapelRemoteVars
from print-module-resolution.ChapelStandard.ChapelRemoteVars
AutoGpu
from print-module-resolution.ChapelStandard.AutoGpu
stopInitCommDiags
from print-module-resolution.ChapelStandard.stopInitCommDiags
ChapelStandard
4 changes: 4 additions & 0 deletions test/gpu/native/assertEligibleNoRuntime.chpl
@@ -0,0 +1,4 @@
@gpu.assertEligible
var A = foreach i in 1..100 do i;

writeln("all is good; '@gpu.assertEligible' doesn't require GPU execution.");
1 change: 1 addition & 0 deletions test/gpu/native/assertEligibleNoRuntime.good
@@ -0,0 +1 @@
all is good; '@gpu.assertEligible' doesn't require GPU execution.
3 changes: 3 additions & 0 deletions test/gpu/native/assertOnNotGpuEligible.1.good
@@ -0,0 +1,3 @@
assertOnNotGpuEligible.chpl:15: In function 'funcMarkedNotGpuizableThatTriesToGpuize':
assertOnNotGpuEligible.chpl:17: error: Loop is marked with @gpu.assertEligible but is not eligible for execution on a GPU
assertOnNotGpuEligible.chpl:15: note: parent function disallows execution on a GPU
3 changes: 3 additions & 0 deletions test/gpu/native/assertOnNotGpuEligible.2.good
@@ -0,0 +1,3 @@
assertOnNotGpuEligible.chpl:32: error: Loop is marked with @gpu.assertEligible but is not eligible for execution on a GPU
assertOnNotGpuEligible.chpl:23: note: function is marked as not eligible for GPU execution
assertOnNotGpuEligible.chpl:33: note: reached via call to 'funcMarkedNotGpuizable' in loop body here
3 changes: 3 additions & 0 deletions test/gpu/native/assertOnNotGpuEligible.3.good
@@ -0,0 +1,3 @@
assertOnNotGpuEligible.chpl:39: error: Loop is marked with @gpu.assertEligible but is not eligible for execution on a GPU
assertOnNotGpuEligible.chpl:12: note: called function has outer var access
assertOnNotGpuEligible.chpl:40: note: reached via call to 'usesOutsideVar' in loop body here