[NVBug 2818198] Thrust and CUB have ongoing issues with large indices (problem size > INT32_MAX) #744

Open
alliepiper opened this issue Sep 4, 2020 · 4 comments
Labels: nvbug (Has an associated internal NVIDIA NVBug.) · thrust (For all items related to Thrust.)

Comments

@alliepiper (Collaborator)

Thrust algorithms appear to use 32-bit signed integers internally for index calculations. UPDATE: This has been fixed for most algorithms, but not for sort, and possibly others.

NVBug 2818198
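
For context, a minimal reproducer sketch of the failure mode (my illustration, not taken from the bug report; it needs a GPU with a few GB of free memory):

```cpp
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <cstddef>
#include <cstdint>

int main() {
  // Just past INT32_MAX elements; ~2.1 GB for the keys alone, plus the
  // sort's scratch space.
  const std::size_t n = (std::size_t{1} << 31) + 1;
  thrust::device_vector<std::uint8_t> keys(n, 0);
  // On affected releases, the 32-bit internal index math can overflow at
  // this size, producing wrong results or crashing.
  thrust::sort(keys.begin(), keys.end());
  return 0;
}
```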

@alliepiper alliepiper added the nvbug Has an associated internal NVIDIA NVBug. label Sep 23, 2020
@alliepiper (Collaborator, Author)

@griwes Can you fill me in on the current status of this problem in thrust/cub?

  • Which algorithms have been fixed?
  • Which algorithms are known to still be broken?
  • Is there a preferred approach to fixing these (e.g., existing helpers)? Is there a good example that shows how the previous fixes were made?

@JohnZed commented Oct 13, 2020

@allisonvacanti or @griwes, were you able to find a list of which algorithms still have this issue outstanding and which have been fixed in public releases?

@alliepiper (Collaborator, Author) commented Oct 13, 2020

@JohnZed I don't have a comprehensive list, and I'm not sure whether Michal has one. I'm using this issue as a place to collect the various reported issues so we can start working through them systematically.

I also opened a related issue in CUB, NVIDIA/cub#212, since I'd like to fix this at the source of the problem (proactively making the CUB device algorithms more robust with respect to problem size) instead of reactively implementing workarounds in Thrust.

@alliepiper (Collaborator, Author)

My current plan forward for this:

  1. Fix the CUB device algorithms by addressing "Transparent support for 64-bit indexing in device algorithms" (cub#212); a sketch of the dispatch idea follows this list.
  2. Move all Thrust internal cuda_cub algorithms into CUB, taking care to handle the problem size and choose appropriate index types.
  3. Add tests for all Thrust and CUB algorithms to check that they work with large problem sizes.
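
To make step 1 concrete, here is a minimal sketch of what the offset-type dispatch could look like (my assumption of the shape, not CUB's actual internals):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <limits>

// Hypothetical stand-in for a device algorithm templated on its offset type.
template <typename OffsetT>
void run_with_offset_type(std::size_t num_items) {
  std::printf("dispatching %zu items with %zu-byte offsets\n",
              num_items, sizeof(OffsetT));
}

// Keep cheap 32-bit index arithmetic for common sizes; switch to 64-bit
// offsets only when the element count exceeds INT32_MAX.
void run_algorithm(std::size_t num_items) {
  constexpr auto kInt32Max =
      static_cast<std::size_t>(std::numeric_limits<std::int32_t>::max());
  if (num_items <= kInt32Max)
    run_with_offset_type<std::int32_t>(num_items);  // fast path
  else
    run_with_offset_type<std::int64_t>(num_items);  // large-index path
}

int main() {
  run_algorithm(1000);                        // 32-bit offsets
  run_algorithm((std::size_t{1} << 31) + 1);  // 64-bit offsets
}
```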

@alliepiper alliepiper changed the title [NV 2818198] 32-bit signed integers are used internally for index calculations, which causes failures with large problem sizes for sorting (and possibly other things) [NVBug 2818198] Thrust and CUB have ongoing issues with large indices (problem size > INT32_MAX) Oct 19, 2020
@jrhemstad jrhemstad added the thrust For all items related to Thrust. label Feb 22, 2023
@alliepiper alliepiper removed their assignment Feb 23, 2023
e-kayrakli referenced this issue in chapel-lang/chapel Nov 6, 2023
Addresses Step 1 in #23324
Resolves Cray/chapel-private#5513

This PR adds several `gpu*Reduce` functions to perform whole-array
reductions for arrays that are allocated in GPU memory. The functions
added cover Chapel's default reductions and they are named:

- `gpuSumReduce`
- `gpuMinReduce`
- `gpuMaxReduce`
- `gpuMinLocReduce`
- `gpuMaxLocReduce`

### NVIDIA implementation

This is done by wrapping CUB: https://nvlabs.github.io/cub/. CUB is a header-only C++ template library; we wrap it with some macro magic in the runtime. This currently increases the runtime build time by quite a bit. We might consider wrapping the library functions in non-inline helpers to mitigate that.
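
A hedged sketch of what one such wrapper can look like (the function name and signature here are illustrative, not the Chapel runtime's actual symbols):

```cpp
#include <cub/device/device_reduce.cuh>
#include <cuda_runtime.h>
#include <cstdint>

// Illustrative extern "C" entry point for one (reduction, type) pair, so the
// C runtime can call into the header-only C++ CUB library.
extern "C" void chpl_gpu_impl_sum_reduce_int32(const std::int32_t* d_in, int n,
                                               std::int32_t* d_out,
                                               cudaStream_t stream) {
  void* d_temp = nullptr;
  std::size_t temp_bytes = 0;
  // First call only computes the required scratch-buffer size...
  cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, n, stream);
  cudaMalloc(&d_temp, temp_bytes);
  // ...the second call performs the reduction on the given stream.
  cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, n, stream);
  cudaStreamSynchronize(stream);
  cudaFree(d_temp);
}
```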

### AMD implementation

AMD has hipCUB (https://rocm.docs.amd.com/projects/hipCUB/en/latest/) and a slightly lower-level, AMD-only rocPRIM (https://rocm.docs.amd.com/projects/rocPRIM/en/latest/). I couldn't get either to work; even a simple HIP reproducer based on one of their tests fails for me. I might be doing something wrong in compilation, but what I am getting is a segfault in the launched kernel (or in `hipLaunchKernel`). I filed ROCm/hipCUB#304, but haven't received a response quickly enough to address the problem in this PR.

This is really unfortunate, but maybe we'll have better/native reduction support soon and can cover AMD there, too.

### Implementation details

- `chpl_gpu_X_reduce_Y` functions are added to the main runtime interface via macros, where X is the reduction kind and Y is the data type. Each one prints debugging output, finds the stream to run the reduction on, and calls its `impl` cousin.
- `chpl_gpu_impl_X_reduce_Y` functions are added to the implementation layer in a similar fashion.
- These functions are added to `gpu/Z/gpu-Z-reduce.cc` in the runtime, where Z is either `nvidia` or `amd`.
- The AMD versions are mostly "implemented" the way I think they should work, but because of the segfaults I was getting, they sit in the `else` branch of an `#if 1` at the moment.
- The module code has a private `doGpuReduce` that calls the appropriate runtime function for any reduction type. This function has some similarities to how atomics are implemented, but unfortunately the interfaces are different enough that I can't come up with a good way to refactor some of the helpers. All the reduction helpers are nested in `doGpuReduce` to avoid confusion.
- To work around a CUB limitation that prevents reducing arrays whose size is close to `max(int(32))`, the implementation runs the underlying CUB reduction on at most 2B elements at a time and stitches the results together on the host if multiple calls are needed (see the sketch after this list). The underlying issue is captured in:
  - https://github.com/NVIDIA/thrust/issues/1271
  - NVIDIA/cccl#49
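
An illustrative sketch of that chunk-and-stitch workaround (the names and the `double` sum here are mine, not the actual runtime code):

```cpp
#include <cub/device/device_reduce.cuh>
#include <cuda_runtime.h>
#include <algorithm>
#include <cstddef>

// Reduce at most INT32_MAX elements per CUB call and combine the per-chunk
// partial sums on the host.
double chunked_sum(const double* d_data, std::size_t n) {
  const std::size_t kChunk = 0x7fffffff;  // stay within CUB's 32-bit count
  double* d_partial = nullptr;
  cudaMalloc(&d_partial, sizeof(double));
  double total = 0.0;
  for (std::size_t off = 0; off < n; off += kChunk) {
    int count = static_cast<int>(std::min(kChunk, n - off));
    void* d_temp = nullptr;
    std::size_t temp_bytes = 0;
    // Standard CUB two-call pattern: size the scratch buffer, then reduce.
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_data + off, d_partial, count);
    cudaMalloc(&d_temp, temp_bytes);
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_data + off, d_partial, count);
    cudaFree(d_temp);
    double partial = 0.0;
    cudaMemcpy(&partial, d_partial, sizeof(double), cudaMemcpyDeviceToHost);
    total += partial;  // stitch on the host
  }
  cudaFree(d_partial);
  return total;
}
```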

### Future work
- Keep an eye on the AMD bug report.
- Implement a fallback once we're ready to run an in-house reduction, if the bug remains unresolved.

[Reviewed by @stonea]

### Test Status
- [x] nvidia
- [x] amd
- [x] flat `make check`
@github-project-automation github-project-automation bot moved this to Todo in CCCL Nov 8, 2023
@jarmak-nv jarmak-nv transferred this issue from NVIDIA/thrust Nov 8, 2023