
Support GPU-based reductions #23324

Closed · e-kayrakli opened this issue Sep 12, 2023 · 8 comments

@e-kayrakli (Contributor) commented Sep 12, 2023

I am filing this on behalf of a user. We have lacked reduction support on GPUs so far, mainly because it was de-prioritized in favor of other user-driven requests. But I think we can start working on reductions soon. Here's how I think we should tackle this:

Step 1: Provide reduction functions for simple cases to move forward.

We can think of gpuSumReduce(x: []) et al. that can be used as a drop-in replacement for a simple + reduce x where x is on the GPU. We can make gpuSumReduce(x: []) portable between the host and the device. I imagine the device side of the implementation can use CUB for NVIDIA and its analogue (hoping that there is one) for AMD. Essentially, gpuSumReduce will be a wrapper for CUB's reduction support. That's probably about as good as we can get in terms of performance. Internal note for this: https://github.com/Cray/chapel-private/issues/5330
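
As a rough illustration, here is a minimal sketch of how such a call might look, assuming `gpuSumReduce` lands in the `GPU` module and accepts a GPU-resident array (the module name and exact signature are assumptions at this stage):

```chapel
use GPU;

on here.gpus[0] {
  var x: [0..<1_000_000] int;
  x = 1; // fill with ones on the device

  // Intended drop-in replacement for `+ reduce x` on a GPU-resident array.
  const total = gpuSumReduce(x);
  writeln(total); // expected: 1000000
}
```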

Optional Step 2: Lower + reduce x into gpuSumReduce

We can imagine lowering such reductions into our GPU support function; otherwise, I believe such a reduction will lower into a forall with a reduce intent. I think as long as the non-GPU branch in gpuSumReduce is still + reduce x, we can probably pull this off with no performance impact.

We can also do this with GPU specialization (--gpu-specialization).
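
To make the proposed lowering concrete, here is a hedged before/after sketch (illustrative only, not the actual transformation the compiler performs today):

```chapel
use GPU;

on here.gpus[0] {
  var x: [0..<100] int;
  x = 2;

  // What the user writes today:
  var s1 = + reduce x;

  // Roughly what the Step 2 lowering would emit instead (hypothetical;
  // `gpuSumReduce` would keep `+ reduce x` on its host path, so CPU
  // performance is unchanged):
  var s2 = gpuSumReduce(x);

  writeln(s1 == s2); // both should be 200
}
```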

Step 3: Get reduce intents to work

This is our ultimate goal. It depends on #18500, which is a big lift in its own right, but it is already in the works.

e-kayrakli self-assigned this Oct 21, 2023
e-kayrakli added a commit that referenced this issue Nov 6, 2023
Addresses Step 1 in #23324
Resolves Cray/chapel-private#5513

This PR adds several `gpu*Reduce` functions to perform whole-array
reductions for arrays that are allocated in GPU memory. The functions
added cover Chapel's default reductions and are named as follows (a usage
sketch follows the list):

- `gpuSumReduce`
- `gpuMinReduce`
- `gpuMaxReduce`
- `gpuMinLocReduce`
- `gpuMaxLocReduce`
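
For illustration, here is a hedged usage sketch of these functions; the exact return shape of the `*LocReduce` variants is an assumption here, not something this PR spells out:

```chapel
use GPU;

on here.gpus[0] {
  var A: [0..<8] int = [3, 1, 4, 1, 5, 9, 2, 6];
  writeln(gpuSumReduce(A)); // 31
  writeln(gpuMinReduce(A)); // 1
  writeln(gpuMaxReduce(A)); // 9
  // Assumed to pair the extreme value with its index, mirroring
  // `minloc`/`maxloc` reductions; the tuple order is an assumption.
  writeln(gpuMinLocReduce(A));
  writeln(gpuMaxLocReduce(A));
}
```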

### NVIDIA implementation

This is done by wrapping CUB: https://nvlabs.github.io/cub/. CUB is a
C++ header-only template library. We wrap it with some macro magic in
the runtime. This currently increases runtime build time by quite a bit.
We might consider wrapping the library's functions in non-inline
helpers, which could reduce that cost a bit.

### AMD implementation

AMD has hipCUB: https://rocm.docs.amd.com/projects/hipCUB/en/latest/ and
a slightly lower-level, AMD-only rocPRIM:
https://rocm.docs.amd.com/projects/rocPRIM/en/latest/. I couldn't get
either to work; I can't even run a simple HIP reproducer based on one of
their tests. I might be doing something wrong in compilation, but what I
am getting is a segfault in the launched kernel (or `hipLaunchKernel`).
I filed ROCm/hipCUB#304, but haven't received a response quickly enough
to address it in this PR.

This is really unfortunate, but maybe we'll have better/native
reduction support soon, and we can cover AMD there, too.

### Implementation details:

- `chpl_gpu_X_reduce_Y` functions are added to the main runtime
interface via macros, where X is the reduction kind and Y is the data
type. These functions print debugging output, find the stream to run the
reduction on, and call their `impl` cousins.
- `chpl_gpu_impl_X_reduce_Y` functions are added to the implementation
layer in a similar fashion.
- These functions are added to `gpu/Z/gpu-Z-reduce.cc` in the runtime,
where Z is either `nvidia` or `amd`.
- AMD versions are mostly "implemented" the way I think they should
work, but because of the segfaults I was getting, they are in the
`else` branch of an `#if 1` at the moment.
- The module code has a private `doGpuReduce` that calls the appropriate
runtime function for any reduction type. This function has some
similarities to how atomics are implemented. Unfortunately, the
interfaces are different enough that I can't come up with a good way to
refactor some of the helpers. All the reduction helpers are nested in
`doGpuReduce` to avoid confusion.
- To work around a CUB limitation that prevents reducing arrays whose
size is close to `max(int(32))`, the implementation runs the underlying
CUB implementation on at most 2B elements at a time and stitches the
results together on the host if it ends up calling the implementation
multiple times (see the sketch after this list). The underlying issue
is captured in:
  - https://github.com/NVIDIA/thrust/issues/1271
  - NVIDIA/cccl#49
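
The chunk-and-stitch idea in the last bullet can be sketched roughly as follows (illustrative pseudologic in Chapel, not the actual runtime code; `launchCubSum` is a hypothetical stand-in for the underlying CUB call):

```chapel
// Hypothetical sketch: reduce an oversized array in chunks that stay under
// CUB's ~max(int(32)) element limit, combining partial sums on the host.
proc chunkedSum(const ref A: [] int): int {
  const chunkSize = max(int(32)): int; // per-call element limit
  var total = 0;
  var lo = A.domain.low;
  while lo <= A.domain.high {
    const hi = min(lo + chunkSize - 1, A.domain.high);
    total += launchCubSum(A[lo..hi]); // hypothetical CUB wrapper
    lo = hi + 1;
  }
  return total;
}
```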

### Future work:
- Keep an eye on the AMD bug report.
- Once we're ready to run an in-house reduction, implement it as a
fallback if the bug remains unresolved.

[Reviewed by @stonea]

### Test Status
- [x] nvidia
- [x] amd
- [x] flat `make check`

@milthorpe (Contributor) commented

I understand from this issue that reductions on GPUs are not supported in Chapel 1.32. Since this isn't well documented in the GPU programming technote, I wanted to record an interesting/weird behaviour I noticed with a forall with a reduce intent on the GPU. My test program is as follows:

```chapel
proc computeSum(loc) {
    on loc {
        var sum: int = 0;
        forall a in 0..#10 with (+ reduce sum) {
            sum += a;
            writeln(sum);
        }
        writeln(sum);
    }
}

if here.gpus.size > 0 then computeSum(here.gpus[0]);
else computeSum(here);
```

On the CPU, as expected, the final number printed is 45. (As forall intents accumulate into per-task reduction variables, I believe the output of the writeln statement inside the forall loop is undefined; on my system, it prints the numbers 0..9 in random order.)
On the GPU, the final number printed by the code above is also 45; however, if I remove the first writeln statement from inside the loop, the final number printed is 0.

It was surprising to me that simply printing the intermediate results would change the final result of the computation. I assume this behaviour will disappear in Chapel 1.33, but I wanted to record it here in case other users hit the same issue.

@bradcray (Member) commented Nov 9, 2023

@e-kayrakli can probably say much more authoritatively, but my guess is that the writeln() causes the loop not to be executed on the GPU, which is why it works. Meanwhile, when it is executed on the GPU, I'm guessing we're not implementing the reduce intent properly (which wouldn't be completely surprising, since GPUs are driven by foreach loops and foreach loops don't have good with-clause support yet... though we're working on that).

In any case, it'd be nice if cases like this generated a compiler error rather than the wrong answer, at least until they're working properly. I suspect you've just poked into a corner that others haven't yet; thanks for pointing it out, Josh. Would you be up for creating a bug issue for this?

@milthorpe (Contributor) commented

I also note that if I add `@assertOnGpu` to the forall loop, I do not get a compile error with CHPL_LOCALE_MODEL=gpu; i.e., it looks to me, as a user, as if the loop compiled correctly for the GPU.
I'm happy to create a bug issue along the lines of "forall reductions fail silently on GPUs"; as you suggest, it would be nicer if they noisily failed to compile.

@bradcray (Member) commented Nov 9, 2023

> I also note that if I add `@assertOnGpu` to the forall loop, I do not get a compile error with CHPL_LOCALE_MODEL=gpu

That's with the writeln() in the loop? That definitely seems surprising/confusing...

@e-kayrakli (Contributor, Author) commented

Brad's assessment sounded right to me. I thought we had an issue capturing this wrong-result behavior, but it's from way back and it's in our private tracker.

> That's with the writeln() in the loop? That definitely seems surprising/confusing...

I agree, I'll take a quick look.

@e-kayrakli (Contributor, Author) commented

> I also note that if I add `@assertOnGpu` to the forall loop, I do not get a compile error with CHPL_LOCALE_MODEL=gpu; i.e., it looks to me, as a user, as if the loop compiled correctly for the GPU.

I can't reproduce this on main.

```chapel
proc computeSum(loc) {
  on loc {
    var sum: int = 0;
    @assertOnGpu
    forall a in 0..#10 with (+ reduce sum) {
      sum += a;
      writeln(sum);
    }
    writeln(sum);
  }
}

if here.gpus.size > 0 then computeSum(here.gpus[0]);
else computeSum(here);
```

Produces:

```
warning: The prototype GPU support implies --no-checks. This may impact debuggability. To suppress this warning, compile with --no-checks explicitly
$CHPL_HOME/forallReduce.chpl:1: In function 'computeSum':
$CHPL_HOME/forallReduce.chpl:5: error: Loop is marked with @assertOnGpu but is not eligible for execution on a GPU
$CHPL_HOME/forallReduce.chpl:7: note: call has outer var access
  $CHPL_HOME/forallReduce.chpl:13: called as computeSum(loc: locale)
note: generic instantiations are underlined in the above callstack
```

@milthorpe (Contributor) commented

Sorry, you're right: I tried it on a version of the code where I'd already removed the writeln. With the writeln, I get the same error Engin observed above.

@e-kayrakli (Contributor, Author) commented

This is done via #24787.

Follow-ups are in #24932.
