[NVBug 2818198] Thrust and CUB have ongoing issues with large indices (problem size > INT32_MAX) #744

alliepiper · 2020-09-04T16:52:28Z

Thrust algorithms seem to use 32bit signed integers internally for index calculations. UPDATE: This has been fixed for most algorithms, but not for sort and possibly others.

NVBug 2818198

alliepiper · 2020-09-23T18:04:30Z

@griwes Can you fill me in on the current status of this problem in thrust/cub?

Which algorithms have been fixed?
Which algorithms are known to still broken?
Is there a preferred approach to fixing these, e.g. existing helpers, etc? Is there a good example that shows how the previous fixes were made?

JohnZed · 2020-10-13T21:04:07Z

@allisonvacanti or @griwes were you able to find a list of which algorithms have this issue outstanding and which have been fixed in public releases?

alliepiper · 2020-10-13T21:51:17Z

@JohnZed I don't have a comprehensive list, and I'm not sure whether Michal has one. I'm using this issue as a place to collect the various reported issues so we can start working through them systematically.

I also opened a related issue in CUB, NVIDIA/cub#212 since I'd like to just fix this at the source of the problem (proactively make the cub device algorithms more robust wrt problem size) instead of reactively implementing workarounds in Thrust.

alliepiper · 2020-10-19T16:49:22Z

My current plan forward for this:

Fix the CUB device algorithms by addressing Transparent support for 64-bit indexing in device algorithms cub#212.
Move all Thrust internal cuda_cub algorithms into CUB, taking care to handle the problem size and choose appropriate indicies.
Add tests for all Thrust and CUB algorithms to check that they work with large problem sizes.

@stonea

Addresses Step 1 in #23324 Resolves Cray/chapel-private#5513 This PR adds several `gpu*Reduce` functions to perform whole-array reductions for arrays that are allocated in GPU memory. The functions added cover Chapel's default reductions and they are named: - `gpuSumReduce` - `gpuMinReduce` - `gpuMaxReduce` - `gpuMinLocReduce` - `gpuMaxLocReduce` ### NVIDIA implementation This is done by wrapping CUB: https://nvlabs.github.io/cub/. CUB is a C++ header-only template library. We wrap it with some macro magic in the runtime. This currently increases runtime build time by quite a bit. We might consider wrapping the functions from the library in non-inline helpers that can help a bit. ### AMD implementation AMD has hipCUB: https://rocm.docs.amd.com/projects/hipCUB/en/latest/ and a slightly lower-level, AMD-only rocPRIM: https://rocm.docs.amd.com/projects/rocPRIM/en/latest/. I couldn't get either to work. I can't run a simple HIP reproducer based off of one of their tests. I might be doing something wrong in compilation, but what I am getting is a segfault in the launched kernel (or `hipLaunchKernel`). I filed ROCm/hipCUB#304, but haven't received a response quick enough to address in this PR. This is really unfortunate, but maybe we'll have a better/native reduction support soon and we can cover AMD there, too. ### Implementation details: - `chpl_gpu_X_reduce_Y` functions are added to the main runtime interface via macros. Here, X is the reduction kind, Y is the data type. This one prints debugging output, finds the stream to run the reduction on and calls its `impl` cousin. - `chpl_gpu_impl_X_reduce_Y` are added to the implementation layer in a similar fashion. - These functions are added to `gpu/Z/gpu-Z-reduce.cc` in the runtime where Z is either `nvidia` or `amd` - AMD versions are mostly "implemented" the way I think they should work, but because of the segfaults that I was getting, they are in the `else` branch of an `#if 1` at the moment. - The module code has a private `doGpuReduce` that calls the appropriate runtime function for any reduction type. This function has some similarities to how atomics are implemented. Unfortunately the interfaces are different enough that I can't come up with a good way to refactor some of the helpers. All the reduction helpers are nested in `doGpuReduce` to avoid confusion. - To workaround a CUB limitation that prevents reducing arrays whose size is close to `max(int(32))`, the implementation runs the underlying CUB implementation with at most 2B elements at a time and stitches the result on the host, if it ends up calling the implementation multiple times. The underlying issue is captured in: - https://github.com/NVIDIA/thrust/issues/1271 - NVIDIA/cccl#49 ### Future work: - Keep an eye on the AMD bug report - Implement a fall back when we're ready to run an in-house reduction if the bug remains unresolved. [Reviewed by @stonea] ### Test Status - [x] nvidia - [x] amd - [x] flat `make check`

alliepiper added the nvbug Has an associated internal NVIDIA NVBug. label Sep 23, 2020

alliepiper assigned griwes Sep 23, 2020

alliepiper mentioned this issue Oct 19, 2020

NVBug: 3075796 [PyTorch] temp_storage_bytes overflows in InclusiveScan for size_cub value close to int32 max NVIDIA/cub#221

Closed

alliepiper assigned alliepiper and unassigned griwes Oct 19, 2020

leofang mentioned this issue Nov 11, 2020

CUB device_reduce and size_t cupy/cupy#3309

Open

lilohuang mentioned this issue Feb 24, 2022

thrust partition fails for size_t{3} * 1024 * 1024 * 1024 elements NVIDIA/thrust#1618

Closed

jrhemstad added the thrust For all items related to Thrust. label Feb 22, 2023

alliepiper removed their assignment Feb 23, 2023

This was referenced Feb 24, 2023

thrust parallel for kernel failed on num_items > uint32_max NVIDIA/thrust#967

Closed

thrust::reduce does not handle large input ranges correctly NVIDIA/thrust#655

Closed

This was referenced Mar 7, 2023

thrust::merge fails for > 2.1B keys NVIDIA/thrust#1454

Closed

thrust::sort fails for > 2.1B keys NVIDIA/thrust#1453

Closed

GregoryKimball mentioned this issue Apr 17, 2023

[FEA] Add 64-bit size type option at build-time for libcudf rapidsai/cudf#13159

Open

10 tasks

jrhemstad mentioned this issue Jun 28, 2023

[EPIC] Thrust large input support #49

Open

3 tasks

pca006132 mentioned this issue Aug 28, 2023

allow double precision elalish/manifold#542

Closed

e-kayrakli mentioned this issue Oct 23, 2023

Add initial support for whole-array reduction on NVIDIA GPUs chapel-lang/chapel#23689

Merged

3 tasks

github-project-automation bot added this to CCCL Nov 8, 2023

github-project-automation bot moved this to Todo in CCCL Nov 8, 2023

jarmak-nv transferred this issue from NVIDIA/thrust Nov 8, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[NVBug 2818198] Thrust and CUB have ongoing issues with large indices (problem size > INT32_MAX) #744

[NVBug 2818198] Thrust and CUB have ongoing issues with large indices (problem size > INT32_MAX) #744

alliepiper commented Sep 4, 2020

alliepiper commented Sep 23, 2020

JohnZed commented Oct 13, 2020

alliepiper commented Oct 13, 2020 •

edited

Loading

alliepiper commented Oct 19, 2020

[NVBug 2818198] Thrust and CUB have ongoing issues with large indices (problem size > INT32_MAX) #744

[NVBug 2818198] Thrust and CUB have ongoing issues with large indices (problem size > INT32_MAX) #744

Comments

alliepiper commented Sep 4, 2020

alliepiper commented Sep 23, 2020

JohnZed commented Oct 13, 2020

alliepiper commented Oct 13, 2020 • edited Loading

alliepiper commented Oct 19, 2020

alliepiper commented Oct 13, 2020 •

edited

Loading