This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

CUDA 10 :: Thrust sort is throwing exception for device vector of 23330 float elements for gpu architecture 'compute_61' #936

Closed
itczar opened this issue Oct 19, 2018 · 16 comments

itczar commented Oct 19, 2018

For type float, if the number of elements is, say, 2330, there is no issue. But if the number is 23330, then thrust::sort throws an exception saying "radix sort failed on 2nd step invalid argument".
Please help.
The graphics card in use is a Quadro P2000 and the CUDA version is 10.

@brycelelbach (Collaborator)

Hi,

This report does not have enough information to be actionable. Please read the guidelines here and provide an updated report.


itczar commented Oct 21, 2018

Initially I thought it was a problem specifically related to memory size, but that does not seem to be the case.

The following code does not behave correctly with the compiler option -gencode arch=compute_61,code=compute_61:

    #include <thrust/sort.h>
    #include <thrust/execution_policy.h>
    #include <cstdio>

    cudaStream_t gpuStream;
    cudaStreamCreate(&gpuStream);

    int count = 23330;
    float *d_vec;
    cudaMalloc(&d_vec, sizeof(float) * count);   // not populating; just for testing
    cudaStreamSynchronize(gpuStream);
    checkCudaErrors(cudaGetLastError());         // checkCudaErrors helper from the CUDA samples
    try {
        thrust::sort(thrust::system::cuda::par.on(gpuStream), d_vec, d_vec + count);
        cudaStreamSynchronize(gpuStream);
    }
    catch (const std::runtime_error& re)         // thrust::system_error derives from std::runtime_error
    {
        printf("EXCEPTION: %s\n", re.what());
    }
    checkCudaErrors(cudaGetLastError());
    cudaFree(d_vec);

When it is run with Thrust 1.9.3 (shipped with CUDA 10), giving -gencode arch=compute_50,code=compute_50 -gencode arch=compute_61,code=compute_61 as compiler options, it throws the exception "radix sort failed on 2nd step invalid argument".

However, when the option -gencode arch=compute_61,code=compute_61 is removed and only -gencode arch=compute_50,code=compute_50 is used, it works fine.
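For concreteness, the two build configurations described above might be invoked like this (the file name repro.cu and output name are assumptions for illustration; note that code=compute_50 embeds JIT-able PTX rather than SASS, which is why the 6.1 GPU can still run the second build):

```shell
# Reported to fail at runtime on a compute-capability-6.1 GPU with
# "radix sort failed on 2nd step invalid argument":
nvcc -gencode arch=compute_50,code=compute_50 \
     -gencode arch=compute_61,code=compute_61 \
     repro.cu -o repro

# Reported to work: drop the compute_61 target and let the driver
# JIT-compile the embedded compute_50 PTX for the 6.1 GPU:
nvcc -gencode arch=compute_50,code=compute_50 repro.cu -o repro
```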

It seems the problem is related to the option compute_61 for compute capability 6.1.
This problem is 100% reproducible.
System Details-
OS - Linux redhat based (2.6.32-696.30.1.el6.x86_64)
GPU - Quadro P2000 (compute capability 6.1)
Language - C++
Compiler - g++,ICC

Please provide a workaround, or a previous stable Thrust version compatible with CUDA 10 if possible, so that related work can be unblocked.


itczar commented Oct 21, 2018

I also ran it with Thrust 1.9.2 (shipped with CUDA 9.2), i.e. CUDA 10 + Thrust 1.9.2, and it works fine without any issue for compute_61.

@itczar itczar changed the title CUDA 10 :: Thrust sort is throwing exception for 23330 float elements in device vector CUDA 10 :: Thrust sort is throwing exception for device vector of 23330 float elements for gpu architecture 'compute_61' Oct 21, 2018

nebojsaandjelkovic commented Jan 24, 2019

I have the same problem with exclusive scan. Here is the code:

    thrust::device_vector<int> vec(3);
    vec[0] = 10; vec[1] = 11; vec[2] = 12;
    thrust::exclusive_scan(thrust::device, vec.begin(), vec.end(), vec.begin());

and the error I am getting is:

C++ exception with description "scan failed on 2nd step: invalid argument" thrown in the test.

The thing is that this code fails randomly and is sometimes successful. I tried CUDA 9.0, 9.2, and 10.0, all with arch 6.1 on a Titan X GPU.

Does anyone have a solution? @itczar @brycelelbach

@Cartoonman

Also having this issue: a 6.1 compute-capability GPU with CUDA 10.2. As others have said, this was first noticed after CUDA 9.0.


Cartoonman commented Dec 7, 2019

Update: After digging into the source of the thrown exception, it is happening inside dispatch_radix_sort.cuh. I set the Thrust debug variable to true to see the invocation of the kernels.

In my particular case, I am running sort_by_key which, given the inputs, selects radix sort as the function. For small input sizes, the dispatch selects the single_tile_kernel, which runs without issue.

For larger input sizes, the dispatch selects the 'Normal problem size invocation' which uses a 3-step pass: an upsweep, a scan, and then a downsweep. In this call, the very first attempt to run upsweep_kernel throws the error.

The debug output for a ~1000k input size is:

Invoking upsweep_kernel<<<140, 96, 0, 163304944>>>(), 39 items per thread, 10 SM occupancy, current bit 0, bit_grain 6

It seems that the thread count of 96 is causing the problem. When I set it to the next lower multiple of 32 (64), the kernel executes without any issue (although the overall function eventually fails with cudaErrorIllegalAddress, likely because the expected memory addresses no longer match).

This is running on a P4000 GPU on CUDA 10.2. I believe there is an incorrect calculation in determining the number of threads per block, which is causing this error. Any insights from the NVIDIA team @brycelelbach? In the meantime I'm going to hunt for where this calculation is done to see if I can correct it manually.
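For anyone wanting the same kernel-invocation logging: in the Thrust 1.9.x CUDA backend this output is enabled by a debug-sync compile definition (the macro name and file name app.cu below are assumptions to verify against your Thrust version):

```shell
# Rebuild with Thrust's debug-sync logging enabled; each backend kernel
# launch (grid size, block size, shared memory, stream) is printed, e.g.
# "Invoking upsweep_kernel<<<140, 96, 0, ...>>>(), 39 items per thread, ..."
nvcc -DTHRUST_DEBUG_SYNC -gencode arch=compute_61,code=compute_61 app.cu -o app
./app
```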

@Cartoonman

Update 2: Following what @itczar mentioned above, I compiled for different sm_XX targets (my code defaults to the GPU's maximum capability, which is 6.1).

When I forced it to compile for sm_60 instead of sm_61, sort_by_key ran fine with the same kernel invocation inputs. My guess is that something is wrong with how the calculations are done for different arch compiles.

@NaikoniuM

Is there any news on this topic? I ran into the same issue (thrust::sort_by_key), but compiling for different architectures did not result in working code. I can confirm @Cartoonman's finding that something is going on in dispatch_radix_sort.cuh.

If this cannot be fixed, is there any workaround in Thrust, or a different CUDA-based library I can use?

Windows 10, CUDA 10.2
RTX 6000 with compute capability 7.5

The call to the thrust function looks something like this:

    thrust::sort_by_key(thrust::cuda::par.on(myStream),
                        thrust::raw_pointer_cast(d_allDistances),
                        thrust::raw_pointer_cast(d_allDistances) + numberOfTargetCoordinates,
                        thrust::raw_pointer_cast(d_allIndexes),
                        thrust::less<real32_T>());

Thank you very much, any help is very much appreciated!

@brycelelbach (Collaborator)

We really need a minimal test case that reproduces the problem. Please see these guidelines to understand what we're looking for. I haven't been able to reproduce this myself yet.

@FabianSchuetze

I have the same issue as mentioned above. When I use

    thrust::sort_by_key(thrust::device, pbegin, pend, ibegin);

on a Quadro T1000, the program reports the error mentioned above depending on the input size. If I sort 102,499 elements, the program runs fine. If I instead sort 102,500 elements, the program fails with the message:

terminate called after throwing an instance of 'thrust::system::system_error'
  what():  radix_sort: failed on 2nd step: cudaErrorInvalidConfiguration: invalid configuration argument

I am working on Ubuntu 20.04 and nvcc --version is: release 10.1, V10.1.243. Does somebody know what to do in this case?

@alliepiper (Collaborator)

We've been trying to find a repro for #936 for a while, but haven't been able to replicate / debug it to figure out what's going on. If anyone can find a thrust-only C++ minimal reproduction please share it here so we can take a look.

I suspect that this may have been fixed in CTK 11.4 (Thrust/CUB 1.12) by NVIDIA/cub@63e2ad4, which fixed a lot of overflows that may result in InvalidConfiguration errors.


yitao-li commented Oct 20, 2021

I don't have a minimal (i.e., thrust-only) or a reliable repro of this, but I did see this error somewhat frequently while working with cuML (specifically, the UMAP algorithm).

Based on what little I know, I would doubt it was because of the overflow issues mentioned in NVIDIA/cub@63e2ad4, mainly because:

  • The error happens randomly, about 1 time out of 20, but the cuML UMAP algorithm was run with a fixed PRNG seed of 0 each time (so in theory the algorithm itself should not introduce non-determinism into the process).
  • The same error either does not happen at all or happens much less frequently than 1 out of 20 times when running with cuda-gdb attached, which is interesting. Could it be one of those really annoying heisenbugs? : / Would cuda-gdb change overflow behavior or cause some overflow to not happen?
  • I only have a trivial number of data points as input to the algorithm, and I'm sure the total number of bytes in the input or the output of the algorithm will not overflow an int32_t.

Other detail: this happened to both merge_sort and radix_sort in Thrust about equally frequently while I was playing around with the UMAP algorithm from cuML, i.e.,

merge_sort: failed on 2nd step: cudaErrorInvalidValue: invalid argument

radix_sort: failed on 2nd step: cudaErrorInvalidValue: invalid argument

I don't know whether this might be helpful for tracking down the issue.

@alliepiper (Collaborator)

If it's happening intermittently, you can try running your application through compute-sanitizer (or cuda-memcheck on older versions of CUDA) to check for various runtime issues. There may be a hidden race condition or bad memory access that causes the inconsistent failures.
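Typical invocations might look like the following (the binary name ./my_app is a placeholder):

```shell
# CUDA 11+: compute-sanitizer; memcheck is the default tool.
compute-sanitizer --tool memcheck ./my_app

# Racecheck can surface shared-memory races behind intermittent failures:
compute-sanitizer --tool racecheck ./my_app

# Older CUDA toolkits ship cuda-memcheck instead:
cuda-memcheck ./my_app
```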

@elstehle (Collaborator)

@yitao-li, does it only happen when working with cuML, or are you also able to create a stand-alone reproducer? Which CUDA version are you on? From what I'm reading, your case at least reminds me a bit of #1400 (comment).

@yitao-li

@elstehle Hey thanks for your reply! I'm on CUDA 11.2. I haven't managed to find a stand-alone (i.e., thrust-only) repro of this yet.

@yitao-li

@elstehle I think setting the correct cuda architecture (as suggested in #1400) fixed the issue for me. Thanks a lot for your help!! 👍
