# Support GPU-based reductions #23324
Addresses Step 1 in #23324
Resolves Cray/chapel-private#5513

This PR adds several `gpu*Reduce` functions to perform whole-array reductions for arrays that are allocated in GPU memory. The functions added cover Chapel's default reductions and are named as follows (a short usage sketch appears at the end of this description):

- `gpuSumReduce`
- `gpuMinReduce`
- `gpuMaxReduce`
- `gpuMinLocReduce`
- `gpuMaxLocReduce`

### NVIDIA implementation

This is done by wrapping CUB: https://nvlabs.github.io/cub/. CUB is a C++ header-only template library. We wrap it with some macro magic in the runtime. This currently increases runtime build time by quite a bit. We might consider wrapping the functions from the library in non-inline helpers, which could help a bit.

### AMD implementation

AMD has hipCUB: https://rocm.docs.amd.com/projects/hipCUB/en/latest/ and a slightly lower-level, AMD-only rocPRIM: https://rocm.docs.amd.com/projects/rocPRIM/en/latest/. I couldn't get either to work. I can't run a simple HIP reproducer based off of one of their tests. I might be doing something wrong in compilation, but what I am getting is a segfault in the launched kernel (or `hipLaunchKernel`). I filed ROCm/hipCUB#304, but haven't received a response quickly enough to address the problem in this PR. This is really unfortunate, but maybe we'll have better/native reduction support soon and we can cover AMD there, too.

### Implementation details

- `chpl_gpu_X_reduce_Y` functions are added to the main runtime interface via macros, where X is the reduction kind and Y is the data type. Each prints debugging output, finds the stream to run the reduction on, and calls its `impl` cousin.
- `chpl_gpu_impl_X_reduce_Y` functions are added to the implementation layer in a similar fashion.
- These functions are added to `gpu/Z/gpu-Z-reduce.cc` in the runtime, where Z is either `nvidia` or `amd`.
- AMD versions are mostly "implemented" the way I think they should work, but because of the segfaults I was getting, they are in the `else` branch of an `#if 1` at the moment.
- The module code has a private `doGpuReduce` that calls the appropriate runtime function for any reduction type. This function has some similarities to how atomics are implemented. Unfortunately, the interfaces are different enough that I can't come up with a good way to refactor some of the helpers. All the reduction helpers are nested in `doGpuReduce` to avoid confusion.
- To work around a CUB limitation that prevents reducing arrays whose size is close to `max(int(32))`, the implementation runs the underlying CUB implementation on at most 2 billion elements at a time and stitches the partial results together on the host if it ends up calling the implementation multiple times (see the sketch at the end of this description). The underlying issue is captured in:
  - https://github.com/NVIDIA/thrust/issues/1271
  - NVIDIA/cccl#49

### Future work

- Keep an eye on the AMD bug report.
- Implement a fallback in-house reduction if the bug remains unresolved.

[Reviewed by @stonea]

### Test Status

- [x] nvidia
- [x] amd
- [x] flat `make check`
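As a quick illustration of the new API, here is a hedged usage sketch. It assumes these functions are exposed through the `GPU` module (not stated above) and relies on arrays declared on a GPU sublocale being allocated in GPU memory; the `*LocReduce` variants, which also return the index of the extreme element, are omitted.

```chapel
use GPU;  // assumed home of the gpu*Reduce functions

if here.gpus.size > 0 {
  on here.gpus[0] {
    // Declared on a GPU sublocale, so A lives in GPU memory.
    var A: [0..<1000] int;
    foreach i in 0..<1000 do A[i] = i;  // initialized by a GPU kernel

    writeln(gpuSumReduce(A));  // 499500
    writeln(gpuMinReduce(A));  // 0
    writeln(gpuMaxReduce(A));  // 999
  }
}
```

The chunk-and-stitch workaround from the last implementation bullet can also be sketched. This is not the actual module code: `partialReduce` merely stands in for a single CUB-backed runtime invocation, and only the host-side stitching logic is the point.

```chapel
// Stand-in for one CUB-backed runtime reduction over a slice.
proc partialReduce(const ref A: [] int, r: range): int {
  return + reduce A[r];
}

// Reduce at most `chunkSize` elements per underlying call and stitch
// the partial results together on the host.
proc chunkedSum(const ref A: [?D] int): int {
  const chunkSize = 2_000_000_000;  // stay below CUB's ~max(int(32)) limit
  var total = 0;
  var lo = D.low;
  while lo <= D.high {
    const hi = min(lo + chunkSize - 1, D.high);
    total += partialReduce(A, lo..hi);
    lo = hi + 1;
  }
  return total;
}
```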
I understand from this issue that reductions on GPUs are not supported in Chapel 1.32. As this isn't well documented in the GPU programming technote, I wanted to document an interesting/weird behaviour I noticed with:

```chapel
proc computeSum(loc) {
  on loc {
    var sum: int = 0;
    forall a in 0..#10 with (+ reduce sum) {
      sum += a;
      writeln(sum);
    }
    writeln(sum);
  }
}

if here.gpus.size > 0 then computeSum(here.gpus[0]);
else computeSum(here);
```

On the CPU, as expected, the final number printed is 45. It was surprising to me that simply printing the intermediate results would change the final result of the computation. I assume this behaviour will disappear in Chapel 1.33, but I wanted to record it here in case other users hit the same issue.
@e-kayrakli can probably say much more authoritatively, but my guess is that the writeln() causes the loop not to be executed on the GPU, which is why it works. Meanwhile, when it is executed on the GPU, I'm guessing we're not implementing the reduce intent properly (which wouldn't be completely surprising, since GPUs are driven by foreach loops and foreach loops don't have good with-clause support yet... though we're working on that). In any case, it'd be nice if cases like this generated a compiler error rather than the wrong answer, at least until they're working properly. I suspect you've just poked into a corner that others haven't yet; thanks for pointing it out, Josh. Would you be up for creating a bug issue for this?
I also note that if I add
That's with the writeln() in the loop? That definitely seems surprising/confusing...
Brad's assessment sounded right to me. I thought we had an issue capturing this wrong-result behavior, but it is from way back and it is in our private tracker.
I agree, I'll take a quick look.
I can't reproduce this:

```chapel
proc computeSum(loc) {
  on loc {
    var sum: int = 0;
    @assertOnGpu
    forall a in 0..#10 with (+ reduce sum) {
      sum += a;
      writeln(sum);
    }
    writeln(sum);
  }
}

if here.gpus.size > 0 then computeSum(here.gpus[0]);
else computeSum(here);
```

Produces:
Sorry - you're right - I tried it on a version of the code where I'd already removed the `writeln`.
I am filing this on behalf of a user. We have lacked reduction support on GPUs so far, mainly because it was de-prioritized in favor of other user-driven requests. But I think we can start working on reductions soon. Here's how I think we should tackle this:

### Step 1: Provide reduction functions for simple cases to move forward

We can think of `gpuSumReduce(x: [])` et al. that can be used as a drop-in replacement for a simple `+ reduce x` where `x` is on the GPU. We can make `gpuSumReduce(x: [])` portable between the host and the device. I imagine the device side of the implementation can use CUB for NVIDIA and its analogue (hoping that there's one) for AMD. Essentially, `gpuSumReduce` will be a wrapper for CUB's reduction support. That's probably as good as we can get in terms of performance. Internal note for this: https://github.com/Cray/chapel-private/issues/5330

### Optional Step 2: Lower `+ reduce x` into `gpuSumReduce`

We can imagine lowering such reductions into our GPU support function. Otherwise, I believe they will lower into a forall with a reduce intent. I think as long as the non-GPU branch in `gpuSumReduce` is still `+ reduce x`, we can probably pull this off with no performance impact. We can also do this with GPU specialization (`--gpu-specialization`). A sketch of the host/device split behind Steps 1 and 2 appears after Step 3 below.

### Step 3: Get reduce intents to work

This is our ultimate goal. It depends on #18500, which is a big lift in its own right, but it is already in the works.
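To make the host/device split behind Steps 1 and 2 concrete, here is a minimal Chapel sketch. It is illustrative only: `runningOnGpu` and `deviceSumReduce` are hypothetical placeholders for the real sublocale check and the CUB-backed runtime entry point.

```chapel
// Hypothetical sketch of a host/device-portable gpuSumReduce.
// `runningOnGpu` and `deviceSumReduce` are placeholders, not real APIs.
proc runningOnGpu(): bool {
  return false;  // placeholder for a real "are we on a GPU sublocale?" check
}

proc deviceSumReduce(const ref x: [] int): int {
  return + reduce x;  // placeholder for a CUB-backed device reduction
}

proc gpuSumReduce(const ref x: [] int): int {
  if runningOnGpu() then
    return deviceSumReduce(x);  // device path: one call into the runtime
  else
    return + reduce x;          // host path: unchanged Chapel reduction
}
```

Because the host branch is literally `+ reduce x`, lowering `+ reduce x` into `gpuSumReduce` (Step 2) should cost nothing on the CPU path.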