Uninitialized __global__ memory in thrust::sort (cub::RadixSort) - incorrect results/segfaults in thrust::sort, thrust::remove_if, etc. #1400
Thanks -- this is very helpful! I can reproduce the issue (just compiled …). From a quick triage, it does seem to be related to the sort and not the vector initialization / device references. We'll look into this and let you know.
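The triage code itself is not preserved above, but a minimal sketch of how one might separate the sort from the vector initialization / device references could look like this (assuming raw device memory and the explicit `thrust::device` execution policy; this is an illustration, not the actual triage snippet):

```cpp
// Sketch: sort raw device memory with thrust::sort, bypassing
// thrust::device_vector construction and device_reference access entirely.
#include <thrust/sort.h>
#include <thrust/execution_policy.h>
#include <cuda_runtime.h>
#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
  const int n = 1 << 20;
  std::vector<int> h(n);
  for (int i = 0; i < n; ++i) h[i] = (n - i) % 1024;  // arbitrary test data

  int* d = nullptr;
  cudaMalloc(&d, n * sizeof(int));
  cudaMemcpy(d, h.data(), n * sizeof(int), cudaMemcpyHostToDevice);

  thrust::sort(thrust::device, d, d + n);  // sort directly on the raw pointer

  std::vector<int> r(n);
  cudaMemcpy(r.data(), d, n * sizeof(int), cudaMemcpyDeviceToHost);
  std::sort(h.begin(), h.end());  // host reference result
  std::printf("device sort %s host sort\n",
              std::equal(h.begin(), h.end(), r.begin()) ? "matches" : "DIFFERS from");
  cudaFree(d);
  return 0;
}
```

If this variant also misbehaves, the vector machinery can be ruled out and the problem is in the sort path itself.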
Hi, here is a table of all tested OS/CUDA variants of the #1341 (comment) code and their outputs (✓ finished ok, X finished with an error, / not tested).
The issue with radix sort seems to disappear when initializing cub's temp_storage to zero. For example, I tried to sort an integer array of length 48768, all zeros, on sm_61. However, the subsequent scan kernel then processes the full d_spine array, which causes uninitialized accesses. Maybe one could precompute the number of elements written by the upsweep kernel and pass this number as the length to the scan kernel?
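For illustration, here is a hedged sketch of that zero-initialization workaround expressed with the explicit two-phase `cub::DeviceRadixSort::SortKeys` API (`thrust::sort` allocates its temporary storage internally, so this only shows the idea at the CUB level rather than a patch to thrust itself):

```cpp
// Sketch: query the required temp storage size, zero it before sorting.
#include <cub/cub.cuh>
#include <cuda_runtime.h>

void sort_keys_with_zeroed_temp(const int* d_in, int* d_out, int num_items) {
  void*  d_temp_storage     = nullptr;
  size_t temp_storage_bytes = 0;

  // First call only computes the required temporary storage size.
  cub::DeviceRadixSort::SortKeys(d_temp_storage, temp_storage_bytes,
                                 d_in, d_out, num_items);

  cudaMalloc(&d_temp_storage, temp_storage_bytes);
  // Workaround described above: clear the temp storage so the scan over the
  // spine never sees garbage bin counters.
  cudaMemset(d_temp_storage, 0, temp_storage_bytes);

  cub::DeviceRadixSort::SortKeys(d_temp_storage, temp_storage_bytes,
                                 d_in, d_out, num_items);
  cudaFree(d_temp_storage);
}
```

Correct results should of course not depend on the `cudaMemset`, which is what makes the uninitialized spine read suspicious.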
So far, I've been able to reproduce the sanitizer warnings in upsweep-downsweep sort (you need to set …). However, I don't think this is the cause of the underlying incorrect sorting results. Though ScanBinsKernel indeed reads uninitialized data and then writes there, this shouldn't lead to problems: ScanBinsKernel never reads or writes beyond the address range allocated to d_spine.

I also haven't been able to reproduce incorrect sorting results (in either upsweep-downsweep or onesweep sorting), nor the sanitizer warnings in the onesweep sorting. @soCzech Could you provide me with parameters (GPU, OS, CUDA version, CUB/thrust version, compiler, sorting array size, etc.) that reliably produce (or at least have a good chance of producing) either incorrect sorting results or the sanitizer warnings with your example code above (main.cu)?
@canonizer Here is a repo https://github.com/soCzech/thrust-bug with a Dockerfile and all build instructions. Let me know if it works for you ;) It works (i.e. produces wrong results) on a GP107M [GeForce GTX 1050 Ti Mobile] with driver 450.102.04, but it should work on more devices / drivers.
I tried the repository you linked. However, I compiled it for compute capability 6.0, without the Pytorch libraries and without Docker. I ran the executable 10 times on a GP100 and haven't been able to get an incorrect sorting result. Is anything in the Dockerfile, linking with the Pytorch library, or compiling for a particular architecture required to reproduce the bug? I'll try to run it on the particular GPU you mentioned. Do you know any other device/driver combinations on which it produces wrong results?
@canonizer If Pytorch is not linked, the error does not occur. We observed that any small change could alter the result of our internal code (i.e. it sometimes runs correctly, but sometimes produces a different error in a different thrust function call). Therefore I do not think the issue is Pytorch itself; rather, it changes the binary or the relative location of the code / data in a way that results in the error. Maybe just try running the code in Docker to see if you can reproduce the issue, and then try to investigate further without Docker? The code also produces the error on a GeForce RTX 2070 Mobile with driver 460.56 and a GeForce RTX 2080 Ti with driver 450.102.04.
I'll try to run it in Docker. Have you been able to reproduce the issue by linking a library with GPU code other than Pytorch?
Not that I am aware of, because we were stripping down our code to create this minimal example and our code does not contain many third-party libraries. I think adding some other libraries "fixed" the wrong-results issue. Also, when we encountered the …
I managed to reproduce the problem on GP100 when running under Docker and linking with Pytorch. However, I also found out that in this case the executable links dynamically against 2 different versions of libcudart: one from the CUDA toolkit, and one from PyTorch. When one of them is removed, e.g. by not linking against …, the problem no longer occurs.

Have you been able to reproduce the problem with only one version of libcudart linked dynamically?
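As a quick diagnostic for this situation (my own suggestion, not something from the thread), one can compare the compile-time `CUDART_VERSION` against the version reported by whichever `libcudart` the process actually loaded:

```cpp
// Sketch: detect a libcudart other than the one this TU was compiled against.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
  int runtime = 0, driver = 0;
  cudaRuntimeGetVersion(&runtime);  // version of the libcudart resolved at load time
  cudaDriverGetVersion(&driver);    // highest version supported by the installed driver
  std::printf("compiled against CUDART %d, loaded runtime %d, driver supports %d\n",
              CUDART_VERSION, runtime, driver);
  if (runtime != CUDART_VERSION)
    std::printf("warning: running against a different libcudart than compiled against\n");
  return 0;
}
```

If the reported runtime version differs from `CUDART_VERSION`, this translation unit's CUDA calls are being served by a different libcudart than the one it was built against.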
@canonizer Sorry, my bad, I should have checked what is being linked. Indeed, in our code, removing the CUDA toolkit libcudart.so and linking only the Pytorch one seems to resolve the issue. But when I tried to build Pytorch from source (so it uses the only available libcudart.so, from the CUDA toolkit), the issue occurred again, and this time I checked that no library is being linked twice. Can you investigate the issue anyway, or is the issue in this particular example caused by the two versions of libcudart.so clashing?
I can take a look. Could you prepare a reproducer for that case, and also check that the error occurs when libcudart is linked only once (either statically or dynamically)?
It is difficult to create a reproducer, as many of the errors are deep in our internal code and appear/disappear when slightly changing the code or the version of a linked library. But does your fix for … also fix …, or is this a completely different problem I am getting?
So far, I haven't been able to reproduce the uninitialized read in …

My fix is for the uninitialized accesses in a different pair of kernels, upsweep/downsweep. That's the only place where we were able to get uninitialized accesses so far.
Yes, https://github.com/soCzech/thrust-bug reproduces the uninitialized read in …

Actually, I wasn't aware of the issue in …
@canonizer Ok, now I used Pytorch built from source in the https://github.com/soCzech/thrust-bug reproducer and:

So it seems the twice-linked …
Hi, any update?
Hey @soCzech, we've been busy with GTC stuff lately and things are generally pretty hectic right now. I may not have time to look into this until mid-May at the earliest, unfortunately.
I'm investigating. I think this relates to an issue I had encountered before. I will follow up with a more detailed analysis and write-up shortly.
I'm still analyzing, but here's where we stand:

For now, if you're urgently in need of a workaround, I'd suggest making sure that all libs you have control over are compiled for the same set of architectures, if that is feasible? As for your example, the …

Best is to verify that things get compiled for the right archs when building with … [1]

[1] Thanks to @robertmaynard for helping with the correct …
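As a complement to checking the build flags, a runtime self-check is possible; the sketch below (with `probe_kernel` as a placeholder kernel introduced only for this check) asks the CUDA runtime whether it can find device code for the current GPU. If the fat binary contains neither SASS for the device's architecture nor PTX that can be JIT-compiled, `cudaFuncGetAttributes` fails:

```cpp
// Sketch: verify that this binary actually carries usable device code
// for the GPU it is running on.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void probe_kernel() {}  // placeholder kernel for the check

int main() {
  cudaDeviceProp prop{};
  cudaGetDeviceProperties(&prop, 0);
  std::printf("device 0: %s, compute capability %d.%d\n",
              prop.name, prop.major, prop.minor);

  cudaFuncAttributes attr{};
  cudaError_t err = cudaFuncGetAttributes(&attr, probe_kernel);
  if (err != cudaSuccess)
    std::printf("no usable device code for this GPU: %s\n", cudaGetErrorString(err));
  else
    std::printf("device code found (binaryVersion %d, ptxVersion %d)\n",
                attr.binaryVersion, attr.ptxVersion);
  return 0;
}
```

Note that this only covers the translation unit defining the probe kernel; each library shipping its own device code would need its own check.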
Yes, compiling multiple times with different arch flags will definitely cause problems. Thanks for the write-up! Since the dispatch mechanism came up -- I'm currently rewriting it for other reasons, so don't spend too much time digging into the current implementation. The new version will also require that arch flags match across CUB translation units.
Thank you so much! I thought that running code compiled for the wrong compute capability raised an error, and I am pretty sure I have seen something like "not compiled for your compute capability" before, but maybe that is a different story :D I have all libs, especially Pytorch, custom compiled for my architecture, but the main program alone I unknowingly compiled for the wrong architecture. I can confirm that when compiled with the right compute capability, both the wrong-sort and the Uninitialized __global__ memory errors disappear. We were not investigating in this direction because we had observed that the code produced correct results when Pytorch was not linked :)

To sum up for future readers, here is the cmake/nvcc version/argument breakdown:

- CMake 3.18 with …
- CMake 3.18 or CMake 3.17 without setting …
- CMake 3.18 without setting …

Thank you so much again!
Awesome, sounds like we can close this once NVIDIA/cub#277 is in. We should be able to get that done for the next release 👍
We have been getting weird errors in the thrust functions `sort_by_key`, `sort` and `remove_if` in our custom code and in third-party code such as flann (kdtree on cuda) and MinkowskiEngine (pytorch custom lib). After a thorough investigation, we discovered that the mentioned functions sometimes randomly produce wrong results (sorted vectors contain values that were not in the original vectors, `remove_if` does not remove elements matching a condition, etc.). At first we thought the issues were related to pytorch, as they occurred when we linked the pytorch lib, but afterward we were able to produce a minimal example with errors even without any pytorch stuff. The errors also seem to randomly appear or disappear when a line of code is added/removed or a library (e.g. pytorch) is linked (but not used). I suppose this suggests there is some problem related to the physical address of the code/data.

We tested our binaries with `compute-sanitizer --tool initcheck`, and in cases when `thrust::sort` or `thrust::remove_if` returned corrupted results we got e.g. `Uninitialized __global__ memory read of size 4 bytes...` errors. As mentioned above, when we removed/added some code/library that did not affect the actual computation, the results were miraculously fixed, but `compute-sanitizer --tool initcheck` still returned the error. Therefore it seems that sometimes the uninitialized memory happens to contain the value it should have been initialized with, and everything runs okay-ish.

We tested many versions of the example (below) as well as many versions of our internal code on at least the `devel` `ubuntu18.04` and `ubuntu20.04` docker images with cuda `10.1`, `10.2`, `11.0`, `11.1` and `11.2`.
The issues were present in every setup with slight variations - e.g. changing cuda seemed to fix the issue but adding an independent line of code broke the code again.
We also tested this particular example on Windows, and it seems to be the only setup where the code runs without the `Uninitialized __global__ memory` warning. But due to compilation difficulties, we have not yet been able to compile and test our other programs that show the same issue.

To reproduce one of the issues, create `main.cu`, `Dockerfile` and `CMakeLists.txt` (file contents below) and run the following commands:

You should get the following output:
When the pytorch libs and a specific version of thrust are linked, we also get `Host and device vector doesn't match!` aside from the `Uninitialized __global__ memory` warning. Sometimes, in different setups, we got `Uninitialized __global__ memory read of size 1 bytes ...` or `Floating point exception (core dumped)`.

We also got the uninitialized memory warning when calling `thrust::remove_if` in one place of our code. Similarly to the `thrust::sort` case, the warning occurred when the outcome of the function was incorrect, but it also occurred when the outcome was (probably by chance) correct:

A similar example of probably the same problem was also mentioned by us in thrust issue #1341 (comment) and pytorch issue pytorch/pytorch#52663.
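For context, a minimal sketch of the kind of host-vs-device comparison described above might look like the following (this is not the actual `main.cu` from the reproducer, just an illustration with arbitrary random keys):

```cpp
// Sketch: sort on the device, sort on the host, compare the results.
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sort.h>
#include <algorithm>
#include <cstdio>
#include <cstdlib>

int main() {
  const int n = 100000;
  thrust::host_vector<int> h(n);
  for (int i = 0; i < n; ++i) h[i] = std::rand();

  thrust::device_vector<int> d = h;      // copy the keys to the device
  thrust::sort(d.begin(), d.end());      // device sort (radix sort under the hood)
  std::sort(h.begin(), h.end());         // reference sort on the host

  thrust::host_vector<int> r = d;        // copy the device result back
  bool ok = std::equal(h.begin(), h.end(), r.begin());
  std::printf("%s\n", ok ? "OK" : "Host and device vector doesn't match!");
  return 0;
}
```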
The files:

- `main.cu`
- `Dockerfile`
- `CMakeLists.txt`
I'll gladly provide other examples if necessary. @allisonvacanti