
Support GPU-based reductions #23324

Closed · e-kayrakli opened this issue Sep 12, 2023 · 8 comments

@e-kayrakli (Contributor) commented Sep 12, 2023

I am filing this on behalf of a user. We have lacked reduction support on GPUs so far, mainly because it was de-prioritized in favor of other user-driven requests. But I think we can start working on reductions soon. Here's how I think we should tackle this:

Step 1: Provide reduction functions for simple cases to move forward.

We can think of gpuSumReduce(x: []) et al. that can be used as a drop-in replacement for a simple + reduce x where x is on the GPU. We can make gpuSumReduce(x: []) portable between the host and the device. I imagine the device side of the implementation can use CUB for NVIDIA and its analogue (hoping that there is one) for AMD. Essentially, gpuSumReduce will be a wrapper for CUB's reduction support. That's probably about as good as we can get in terms of performance. Internal note for this: https://github.com/Cray/chapel-private/issues/5330
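
As a rough illustration, here is a minimal sketch of how such a call might look, assuming `gpuSumReduce` lands in the `GPU` module and accepts a GPU-resident array (the module name and exact signature are assumptions at this stage):

```chapel
use GPU;

on here.gpus[0] {
  var x: [0..<1_000_000] int;
  x = 1; // fill with ones on the device

  // Intended drop-in replacement for `+ reduce x` on a GPU-resident array.
  const total = gpuSumReduce(x);
  writeln(total); // expected: 1000000
}
```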

Optional Step 2: Lower + reduce x into gpuSumReduce

We can imagine lowering such reductions into our GPU support function; otherwise, I believe such a reduction will lower into a forall with a reduce intent. I think as long as the non-GPU branch in gpuSumReduce is still + reduce x, we can probably pull this off with no performance impact.

We can also do this with GPU specialization (--gpu-specialization).
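
To make the proposed lowering concrete, here is a hedged before/after sketch (illustrative only, not the actual transformation the compiler performs today):

```chapel
use GPU;

on here.gpus[0] {
  var x: [0..<100] int;
  x = 2;

  // What the user writes today:
  var s1 = + reduce x;

  // Roughly what the Step 2 lowering would emit instead (hypothetical;
  // `gpuSumReduce` would keep `+ reduce x` on its host path, so CPU
  // performance is unchanged):
  var s2 = gpuSumReduce(x);

  writeln(s1 == s2); // both should be 200
}
```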

Step 3: Get reduce intents to work

This is our ultimate goal. It depends on #18500, which is a big lift in its own right, but it is already in the works.

e-kayrakli self-assigned this Oct 21, 2023
e-kayrakli added a commit that referenced this issue Nov 6, 2023
Addresses Step 1 in #23324
Resolves Cray/chapel-private#5513

This PR adds several `gpu*Reduce` functions to perform whole-array
reductions for arrays that are allocated in GPU memory. The functions
added cover Chapel's default reductions and are named as follows (a usage
sketch follows the list):

- `gpuSumReduce`
- `gpuMinReduce`
- `gpuMaxReduce`
- `gpuMinLocReduce`
- `gpuMaxLocReduce`
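
For illustration, here is a hedged usage sketch of these functions; the exact return shape of the `*LocReduce` variants is an assumption here, not something this PR spells out:

```chapel
use GPU;

on here.gpus[0] {
  var A: [0..<8] int = [3, 1, 4, 1, 5, 9, 2, 6];
  writeln(gpuSumReduce(A)); // 31
  writeln(gpuMinReduce(A)); // 1
  writeln(gpuMaxReduce(A)); // 9
  // Assumed to pair the extreme value with its index, mirroring
  // `minloc`/`maxloc` reductions; the tuple order is an assumption.
  writeln(gpuMinLocReduce(A));
  writeln(gpuMaxLocReduce(A));
}
```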

### NVIDIA implementation

This is done by wrapping CUB: https://nvlabs.github.io/cub/. CUB is a
C++ header-only template library. We wrap it with some macro magic in
the runtime. This currently increases runtime build time by quite a bit.
We might consider wrapping the library's functions in non-inline
helpers, which could reduce that cost a bit.

### AMD implementation

AMD has hipCUB: https://rocm.docs.amd.com/projects/hipCUB/en/latest/ and
a slightly lower-level, AMD-only rocPRIM:
https://rocm.docs.amd.com/projects/rocPRIM/en/latest/. I couldn't get
either to work; I can't even run a simple HIP reproducer based on one of
their tests. I might be doing something wrong in compilation, but what I
am getting is a segfault in the launched kernel (or `hipLaunchKernel`).
I filed ROCm/hipCUB#304, but haven't received a response quickly enough
to address it in this PR.

This is really unfortunate, but maybe we'll have better/native
reduction support soon, and we can cover AMD there, too.

### Implementation details:

- `chpl_gpu_X_reduce_Y` functions are added to the main runtime
interface via macros, where X is the reduction kind and Y is the data
type. These functions print debugging output, find the stream to run the
reduction on, and call their `impl` cousins.
- `chpl_gpu_impl_X_reduce_Y` functions are added to the implementation
layer in a similar fashion.
- These functions are added to `gpu/Z/gpu-Z-reduce.cc` in the runtime,
where Z is either `nvidia` or `amd`.
- AMD versions are mostly "implemented" the way I think they should
work, but because of the segfaults I was getting, they are in the
`else` branch of an `#if 1` at the moment.
- The module code has a private `doGpuReduce` that calls the appropriate
runtime function for any reduction type. This function has some
similarities to how atomics are implemented. Unfortunately, the
interfaces are different enough that I can't come up with a good way to
refactor some of the helpers. All the reduction helpers are nested in
`doGpuReduce` to avoid confusion.
- To work around a CUB limitation that prevents reducing arrays whose
size is close to `max(int(32))`, the implementation runs the underlying
CUB implementation on at most 2B elements at a time and stitches the
results together on the host if it ends up calling the implementation
multiple times (see the sketch after this list). The underlying issue
is captured in:
  - https://github.com/NVIDIA/thrust/issues/1271
  - NVIDIA/cccl#49
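
The chunk-and-stitch idea in the last bullet can be sketched roughly as follows (illustrative pseudologic in Chapel, not the actual runtime code; `launchCubSum` is a hypothetical stand-in for the underlying CUB call):

```chapel
// Hypothetical sketch: reduce an oversized array in chunks that stay under
// CUB's ~max(int(32)) element limit, combining partial sums on the host.
proc chunkedSum(const ref A: [] int): int {
  const chunkSize = max(int(32)): int; // per-call element limit
  var total = 0;
  var lo = A.domain.low;
  while lo <= A.domain.high {
    const hi = min(lo + chunkSize - 1, A.domain.high);
    total += launchCubSum(A[lo..hi]); // hypothetical CUB wrapper
    lo = hi + 1;
  }
  return total;
}
```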

### Future work:
- Keep an eye on the AMD bug report.
- Once we're ready to run an in-house reduction, implement it as a
fallback if the bug remains unresolved.

[Reviewed by @stonea]

### Test Status
- [x] nvidia
- [x] amd
- [x] flat `make check`

@milthorpe (Contributor) commented

I understand from this issue that reductions on GPUs are not supported in Chapel 1.32. Since this isn't well documented in the GPU programming technote, I wanted to record an interesting/weird behaviour I noticed with a forall with a reduce intent on the GPU. My test program is as follows:

```chapel
proc computeSum(loc) {
    on loc {
        var sum: int = 0;
        forall a in 0..#10 with (+ reduce sum) {
            sum += a;
            writeln(sum);
        }
        writeln(sum);
    }
}

if here.gpus.size > 0 then computeSum(here.gpus[0]);
else computeSum(here);
```

On the CPU, as expected, the final number printed is 45. (As forall intents accumulate into per-task reduction variables, I believe the output of the writeln statement inside the forall loop is undefined; on my system, it prints the numbers 0..9 in random order.)
On the GPU, the final number printed by the code above is also 45; however, if I remove the first writeln statement from inside the loop, the final number printed is 0.

It was surprising to me that simply printing the intermediate results would change the final result of the computation. I assume this behaviour will disappear in Chapel 1.33, but I wanted to record it here in case other users hit the same issue.

@bradcray (Member) commented Nov 9, 2023

@e-kayrakli can probably say much more authoritatively, but my guess is that the writeln() causes the loop not to be executed on the GPU, which is why it works. Meanwhile, when it is executed on the GPU, I'm guessing we're not implementing the reduce intent properly (which wouldn't be completely surprising, since GPUs are driven by foreach loops and foreach loops don't have good with-clause support yet... though we're working on that).

In any case, it'd be nice if cases like this generated a compiler error rather than the wrong answer, at least until they're working properly. I suspect you've just poked into a corner that others haven't yet; thanks for pointing it out, Josh. Would you be up for creating a bug issue for this?

@milthorpe (Contributor) commented

I also note that if I add `@assertOnGpu` to the forall loop, I do not get a compile error with CHPL_LOCALE_MODEL=gpu; i.e., it looks to me, as a user, as if the loop compiled correctly for the GPU.
I'm happy to create a bug issue along the lines of "forall reductions fail silently on GPUs"; as you suggest, it would be nicer if they noisily failed to compile.

@bradcray (Member) commented Nov 9, 2023

> I also note that if I add `@assertOnGpu` to the forall loop, I do not get a compile error with CHPL_LOCALE_MODEL=gpu

That's with the writeln() in the loop? That definitely seems surprising/confusing...

@e-kayrakli (Contributor, Author) commented

Brad's assessment sounded right to me. I thought we had an issue capturing this wrong-result behavior, but it's from way back and it's in our private tracker.

> That's with the writeln() in the loop? That definitely seems surprising/confusing...

I agree, I'll take a quick look.

@e-kayrakli (Contributor, Author) commented

> I also note that if I add `@assertOnGpu` to the forall loop, I do not get a compile error with CHPL_LOCALE_MODEL=gpu; i.e., it looks to me, as a user, as if the loop compiled correctly for the GPU.

I can't reproduce this on main.

```chapel
proc computeSum(loc) {
  on loc {
    var sum: int = 0;
    @assertOnGpu
    forall a in 0..#10 with (+ reduce sum) {
      sum += a;
      writeln(sum);
    }
    writeln(sum);
  }
}

if here.gpus.size > 0 then computeSum(here.gpus[0]);
else computeSum(here);
```

Produces:

```
warning: The prototype GPU support implies --no-checks. This may impact debuggability. To suppress this warning, compile with --no-checks explicitly
$CHPL_HOME/forallReduce.chpl:1: In function 'computeSum':
$CHPL_HOME/forallReduce.chpl:5: error: Loop is marked with @assertOnGpu but is not eligible for execution on a GPU
$CHPL_HOME/forallReduce.chpl:7: note: call has outer var access
  $CHPL_HOME/forallReduce.chpl:13: called as computeSum(loc: locale)
note: generic instantiations are underlined in the above callstack
```

@milthorpe (Contributor) commented

Sorry, you're right: I tried it on a version of the code where I'd already removed the writeln. With the writeln, I get the same error Engin observed above.

@e-kayrakli (Contributor, Author) commented

This is done via #24787.

Follow-ups are in #24932.
