CUDA 10 :: Thrust sort is throwing exception for device vector of 23330 float elements for gpu architecture 'compute_61' #936
Comments
Hi, This report does not have enough information to be actionable. Please read the guidelines here and provide an updated report. |
Initially I thought this was a problem related specifically to memory size, but that does not seem to be the case. The following code does not behave correctly with the compiler option -gencode arch=compute_61,code=compute_61
**When it is run with Thrust 1.9.3 (shipped with CUDA 10) and built with the compiler options -gencode arch=compute_50,code=compute_50 -gencode arch=compute_61,code=compute_61, it throws the exception "radix sort failed on 2nd step invalid argument". When the option -gencode arch=compute_61,code=compute_61 is removed and only -gencode arch=compute_50,code=compute_50 is used, it works fine.** The problem seems to be related to the compute_61 option for compute capability 6.1. Please provide a workaround, or a previous stable Thrust version compatible with CUDA 10 if possible, so that related work can be unblocked. |
I also ran it with Thrust 1.9.2 (shipped with CUDA 9.2), i.e. CUDA 10 + Thrust 1.9.2, and it works fine without any issue for compute_61. |
I have the same problem with exclusive scan. Here is the code:
and the error I am getting is:
The thing is that this code fails randomly and sometimes succeeds. I tried CUDA 9.0, 9.2 and 10.0, all with arch 6.1 on a Titan X GPU. Does anyone have a solution? @itczar @brycelelbach |
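For anyone trying to reproduce the intermittent scan failure, a minimal sketch along these lines may help. It is not the code from the comment above, which is not shown here. The helper name `count_scan_failures`, the all-ones input, and the repeat count are assumptions chosen for illustration; only `thrust::device_vector`, `thrust::host_vector`, `thrust::exclusive_scan`, and `thrust::system_error` are actual Thrust names.

```cpp
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/scan.h>
#include <thrust/system_error.h>

#include <cstddef>
#include <cstdio>

// Hypothetical helper (not part of Thrust): runs thrust::exclusive_scan on
// n ones `repeats` times and counts the runs that either throw or produce
// a wrong prefix sum. For an input of all ones, out[i] must equal i.
int count_scan_failures(std::size_t n, int repeats) {
    int failures = 0;
    for (int r = 0; r < repeats; ++r) {
        try {
            thrust::device_vector<int> in(n, 1);
            thrust::device_vector<int> out(n);
            thrust::exclusive_scan(in.begin(), in.end(), out.begin());

            // Copy back and verify on the host.
            thrust::host_vector<int> h = out;
            bool ok = true;
            for (std::size_t i = 0; i < n; ++i) {
                if (h[i] != static_cast<int>(i)) { ok = false; break; }
            }
            if (!ok) ++failures;
        } catch (const thrust::system_error& e) {
            std::fprintf(stderr, "run %d, n=%zu: %s\n", r, n, e.what());
            ++failures;
        }
    }
    return failures;
}
```

Calling `count_scan_failures(1000000, 100)` from `main` and building with `-gencode arch=compute_61,code=compute_61` would show whether the failure rate is nonzero on a given setup.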
Also having this issue, 6.1 CC GPU w/ CUDA 10.2. As others have said, this was noticed past CUDA 9.0. |
Update: After digging into the source of the thrown exception, it is happening inside In my particular case, I am running For larger input sizes, the dispatch selects the 'Normal problem size invocation' which uses a 3-step pass: an upsweep, a scan, and then a downsweep. In this call, the very first attempt to run The debug output for a ~1000k input size is:
It seems that the thread count of This is running on a P4000 GPU on CUDA 10.2. I believe there must be an incorrect calculation being done in determining the correct number of threads per block, which is causing this error. Any insights from the nvidia team @brycelelbach? In the meantime I'm going to try to hunt for where this calculation is done to see if I can correct it manually. |
Update 2: Following what @itczar mentioned above, I compiled to different SM_XX (my code defaults to the GPU's max capable, which is 6.1) When I forced it to compile at sm_60 instead of sm_61, |
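The three-pass structure mentioned in the update above (an upsweep, a scan, and a downsweep) can be illustrated with a host-only sketch. This is not the CUB or Thrust implementation; the function name `three_pass_exclusive_scan` and the `tile_size` parameter, which stands in for the amount of input one thread block would handle, are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Host-only illustration of a three-pass exclusive prefix sum.
// Each "tile" plays the role of the work assigned to one thread block.
std::vector<int> three_pass_exclusive_scan(const std::vector<int>& in,
                                           std::size_t tile_size) {
    assert(tile_size > 0);
    const std::size_t n = in.size();
    const std::size_t tiles = (n + tile_size - 1) / tile_size;

    // Pass 1 (upsweep): reduce every tile to a single total.
    std::vector<int> tile_offsets(tiles, 0);
    for (std::size_t t = 0; t < tiles; ++t) {
        const std::size_t end = std::min(n, (t + 1) * tile_size);
        for (std::size_t i = t * tile_size; i < end; ++i) tile_offsets[t] += in[i];
    }

    // Pass 2 (scan): exclusive scan over the tile totals, so that
    // tile_offsets[t] becomes the sum of everything before tile t.
    int running = 0;
    for (std::size_t t = 0; t < tiles; ++t) {
        const int total = tile_offsets[t];
        tile_offsets[t] = running;
        running += total;
    }

    // Pass 3 (downsweep): scan inside each tile, seeded with its offset.
    std::vector<int> out(n);
    for (std::size_t t = 0; t < tiles; ++t) {
        const std::size_t end = std::min(n, (t + 1) * tile_size);
        int acc = tile_offsets[t];
        for (std::size_t i = t * tile_size; i < end; ++i) {
            out[i] = acc;
            acc += in[i];
        }
    }
    return out;
}
```

On a GPU, passes 1 and 3 would each be a kernel launch with one block per tile, so a launch configuration that the device rejects would already surface at the first pass.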
Is there any news on this topic? I ran into the same issue (thrust::sort_by_key), but compiling for different architectures did not result in working code. I can confirm @Cartoonman that something is going on in If this cannot be fixed, is there any workaround in Thrust, or a different CUDA-based library I can use? Windows 10, CUDA 10.2 The call of the thrust function looks something like this: Thank you very much, any help is very much appreciated! |
We really need a minimal test case that reproduces the problem. Please see these guidelines to understand what we're looking for. I haven't been able to reproduce this myself yet. |
I have the same issue as mentioned above. When I use
on a Quadro T1000, the program reports the error mentioned above depending on the input size. If I sort 102,499 elements, the program runs fine. If I instead sort 102,500 elements, the program fails with the message:
I am working on Ubuntu 20.04 and |
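When the failure depends on input size like this, a small search for the first failing size can help narrow a reproducer. The sketch below assumes the call in question is `thrust::sort` on floats (the actual call is not shown in the comment above) and that failures are monotonic in the size, meaning every size above the first failing one also fails. The helper names `sort_succeeds` and `first_failing_size` are hypothetical.

```cpp
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sort.h>
#include <thrust/system_error.h>

#include <cstddef>
#include <cstdio>

// Hypothetical helper: returns true if thrust::sort on n floats completes
// without throwing thrust::system_error.
bool sort_succeeds(std::size_t n) {
    thrust::host_vector<float> h(n);
    for (std::size_t i = 0; i < n; ++i) h[i] = static_cast<float>(n - i);  // descending input
    try {
        thrust::device_vector<float> d = h;
        thrust::sort(d.begin(), d.end());
        return true;
    } catch (const thrust::system_error& e) {
        std::fprintf(stderr, "n=%zu: %s\n", n, e.what());
        return false;
    }
}

// Hypothetical helper: bisects [lo, hi] (with 1 <= lo <= hi) for the
// smallest failing size. Returns 0 if the sort succeeds at hi.
// Only meaningful under the monotonicity assumption stated above.
std::size_t first_failing_size(std::size_t lo, std::size_t hi) {
    if (sort_succeeds(hi)) return 0;   // nothing fails at the upper bound
    if (!sort_succeeds(lo)) return lo; // already failing at the lower bound
    // Invariant: the sort succeeds at lo and fails at hi.
    while (hi - lo > 1) {
        const std::size_t mid = lo + (hi - lo) / 2;
        if (sort_succeeds(mid)) lo = mid; else hi = mid;
    }
    return hi;
}
```

For the sizes reported above, `first_failing_size(102499, 102500)` would return 102500 on an affected setup and 0 on a setup where both sizes sort correctly.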
We've been trying to find a repro for #936 for a while, but haven't been able to replicate / debug it to figure out what's going on. If anyone can find a thrust-only C++ minimal reproduction please share it here so we can take a look. I suspect that this may have been fixed in CTK 11.4 (Thrust/CUB 1.12) by NVIDIA/cub@63e2ad4, which fixed a lot of overflows that may result in InvalidConfiguration errors. |
I don't have a minimal (i.e., thrust-only) or a reliable repro of this, but I did see this error somewhat frequently while working with Based on what little I know, I would doubt it was because of the overflow issues mentioned in NVIDIA/cub@63e2ad4, mainly because:
Other detail: this happens to both merge_sort and radix_sort in thrust about equally frequently while I was playing around with the UMAP algorithm from
I don't know whether this might be helpful for tracking down the issue. |
If it's happening intermittently, you can try running your application through |
@yitao-li, does it only happen when working with cuml, or are you also able to create a stand-alone reproducer? Which CUDA version are you on? What I'm reading, at least in your case, does remind me a bit of #1400 (comment) |
@elstehle Hey thanks for your reply! I'm on CUDA 11.2. I haven't managed to find a stand-alone (i.e., thrust-only) repro of this yet. |
For type float, if the number of elements is, say, 2330, there is no issue. But if the number is 23330, thrust::sort throws an exception saying "radix sort failed on 2nd step invalid argument".
Please help.
The graphics card in use is a P2000 and the CUDA version is 10.
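A minimal, Thrust-only sketch of the reported case might look like the following. It is an assumption about what a reproducer could look like, not the reporter's code: the helper name `try_sort`, the fixed random seed, and the value range are all chosen for illustration. The build flags in the comment are the ones quoted in this thread.

```cpp
// Assumed build command, using the flags quoted in this thread:
//   nvcc -gencode arch=compute_61,code=compute_61 repro.cu -o repro
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sort.h>
#include <thrust/system_error.h>

#include <cstddef>
#include <cstdio>
#include <random>

// Hypothetical helper (not part of Thrust): sorts n pseudo-random floats on
// the device and returns true only if no exception was thrown and the
// result, copied back to the host, is in non-decreasing order.
bool try_sort(std::size_t n) {
    std::mt19937 rng(12345u);  // fixed seed so runs are comparable
    std::uniform_real_distribution<float> dist(0.0f, 1.0f);
    thrust::host_vector<float> h(n);
    for (std::size_t i = 0; i < n; ++i) h[i] = dist(rng);

    try {
        thrust::device_vector<float> d = h;
        thrust::sort(d.begin(), d.end());
        thrust::host_vector<float> out = d;
        for (std::size_t i = 1; i < n; ++i) {
            if (out[i - 1] > out[i]) return false;
        }
        return true;
    } catch (const thrust::system_error& e) {
        // On an affected setup this is where a message such as
        // "radix sort failed on 2nd step" would be expected to surface.
        std::fprintf(stderr, "n=%zu: %s\n", n, e.what());
        return false;
    }
}
```

Calling `try_sort(2330)` and `try_sort(23330)` from `main` covers the two sizes from the report; on an unaffected setup both should return true.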