This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

Uninitialized __global__ memory in thrust::sort (cub::RadixSort) - incorrect results/segfaults in thrust::sort, thrust::remove_if, etc. #1400

Closed
soCzech opened this issue Mar 12, 2021 · 23 comments
Labels: P1: should have (Necessary, but not critical.) · type: bug: functional (Does not work as intended.)

Comments

@soCzech

soCzech commented Mar 12, 2021

We have been getting strange errors from the thrust functions sort_by_key, sort, and remove_if, both in our own code and in third-party code such as flann (kd-tree on CUDA) and MinkowskiEngine (a PyTorch extension library). After a thorough investigation, we discovered that these functions sometimes randomly produce wrong results (sorted vectors contain values that were not in the original vectors, remove_if does not remove elements matching a condition, etc.). At first we thought the issues were related to PyTorch, as they occurred when we linked the PyTorch libraries, but we were later able to produce a minimal example that fails without any PyTorch code at all. The errors also seem to randomly appear or disappear when a line of code is added or removed, or when a library (e.g. PyTorch) is linked but not used. This suggests the problem is somehow related to the physical placement of code or data.

We tested our binaries with compute-sanitizer --tool initcheck, and in the cases where thrust::sort or thrust::remove_if returned corrupted results we got errors such as Uninitialized __global__ memory read of size 4 bytes...
As mentioned above, when we removed or added code or a library that did not affect the actual computation, the results were miraculously fixed, yet compute-sanitizer --tool initcheck still reported the error. It therefore seems that the uninitialized memory sometimes happens to contain the value it should have been initialized with, and everything runs okay-ish.

We tested many versions of the example (below), as well as many versions of our internal code, on at least:

  • NVIDIA devel Ubuntu 18.04 and Ubuntu 20.04 Docker images with CUDA 10.1, 10.2, 11.0, 11.1, 11.2
  • the Ubuntu 20.04 and Arch Linux distributions
  • the default thrust/cub and the latest thrust/cub (Thrust 1.12)

The issues were present in every setup with slight variations - e.g. changing the CUDA version seemed to fix the issue, but adding an unrelated line of code broke it again.

We also tested this particular example on Windows, and it seems to be the only setup where the code runs without the Uninitialized __global__ memory warning. Due to compilation difficulties, however, we have not yet been able to build and test our other affected programs there.


To reproduce one of the issues, create main.cu, Dockerfile and CMakeLists.txt (file contents below) and run the following commands:

docker build -t test-docker-image .
docker run -it --gpus 1 -v $(pwd):/xxx -w /xxx test-docker-image bash
mkdir build
cd build
cmake ..
make
compute-sanitizer --tool initcheck bug_test

You should get the following output:

RUN 0, NUM 128, dev_ptr 0x7fe1c5800000: OK! 
RUN 0, NUM 256, dev_ptr 0x7fe1c5800000: OK!
...
========= Uninitialized __global__ memory read of size 4 bytes
=========     at 0x5b8 in void cub::DeviceRadixSortOnesweepKernel<cub::DeviceRadixSortPolicy<int,cub::NullType,int>::Policy800,bool=0,int,cub::NullType,int,int>(int*,int,bool=0*,int* const *,int*,int* const * const *,cub::DeviceRadixSortPolicy<int,cub::NullType,int>::Policy800*,int* const * const * const *,int*,int,int)
=========     by thread (214,0,0) in block (0,0,0)
=========     Address 0x7f13cdc09dd8
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:cuLaunchKernel [0x7f1402c1ba6e]
=========                in /usr/lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame: [0x7f1415e1862b]
=========                in /usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudart.so.11.0
=========     Host Frame:cudaLaunchKernel [0x7f1415e585b1]
=========                in /usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudart.so.11.0
...

When the PyTorch libs and a specific version of Thrust are linked, we also get Host and device vector doesn't match! in addition to the Uninitialized __global__ memory warning. In other setups we sometimes got Uninitialized __global__ memory read of size 1 bytes ... or Floating point exception (core dumped).

We also got the uninitialized memory warning when calling thrust::remove_if in one place in our code. Similarly to thrust::sort, the warning occurred when the outcome of the function was incorrect, but it also occurred when the outcome was (probably by chance) correct:

========= Uninitialized __global__ memory read of size 4 bytes
=========     at 0x1d68 in void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__copy_if::CopyIfAgent<thrust::zip_iterator<thrust::tuple<unsigned int*,unsigned int*,unsigned int*,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type>>,thrust::cuda_cub::__copy_if::no_stencil_tag_*,thrust::zip_iterator<thrust::tuple<unsigned int*,unsigned int*,unsigned int*,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type>>,thrust::detail::unary_negate<minkowski::detail::is_first<unsigned int>>,int,int*>,thrust::zip_iterator<thrust::tuple<unsigned int*,unsigned int*,unsigned int*,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type>>,thrust::cuda_cub::__copy_if::no_stencil_tag_*,thrust::zip_iterator<thrust::tuple<unsigned int*,unsigned int*,unsigned int*,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type>>,thrust::detail::unary_negate<minkowski::detail::is_first<unsigned int>>,int,int*,cub::ScanTileState<int,bool=1>,unsigned long>(unsigned int*,unsigned int*,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type)
=========     by thread (30,0,0) in block (0,0,0)
=========     Address 0x7fa75a017170
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:cuLaunchKernel [0x7fa78effea6e]
=========                in /usr/lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame: [0x7fa7fb2ec62b]
=========                in /usr/local/cuda/lib64/libcudart.so.11.0
=========     Host Frame:cudaLaunchKernel [0x7fa7fb32c5b1]
=========                in /usr/local/cuda/lib64/libcudart.so.11.0

A similar example of probably the same problem was mentioned by us in thrust issue #1341 (comment) and in pytorch issue pytorch/pytorch#52663.


The files:

main.cu

#include <cstdlib>
#include <iostream>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/equal.h>
#include <thrust/functional.h>

int main() 
{
    for (size_t NUM = 128; NUM < 32768; NUM+=128) 
    {
        for (int run = 0; run < 1; run++) {
            thrust::host_vector<int> h(NUM);
            thrust::device_vector<int> d(NUM);
            for (int i = 0; i < NUM; i++) {
                int random_number = rand() % 1000;
                h[i] = random_number;
                d[i] = random_number;
            }
            thrust::sort(h.begin(), h.end());
            thrust::sort(d.begin(), d.end());
    
            thrust::host_vector<int> d_host(d.begin(), d.end());
            bool sort_ok = thrust::equal(
                d_host.begin(), d_host.end() - 1, d_host.begin() + 1,	
                thrust::less_equal<int>());
            bool match = thrust::equal(d_host.begin(), d_host.end(), h.begin());

            std::cout << "RUN " << run << ", NUM " << NUM;
            std::cout << ", dev_ptr " << static_cast<void*>(thrust::raw_pointer_cast(d.data())) << ": ";
            if (sort_ok && match) { std::cout << "OK! "; }
            if (!sort_ok) { std::cout << "Wrong sort! "; }
            if (!sort_ok) { std::cout << "Host and device vector doesn't match! "; }
            std::cout << std::endl;
        }
    }

    return 0;
}

Dockerfile

FROM nvidia/cuda:11.0-devel-ubuntu20.04
RUN apt-get update && apt-get install -y wget
RUN wget -qO- "https://cmake.org/files/v3.17/cmake-3.17.5-Linux-x86_64.tar.gz" | tar --strip-components=1 -xz -C /usr/local

CMakeLists.txt

cmake_minimum_required(VERSION 3.17.5)
project(bug_test CUDA CXX)
add_executable(bug_test main.cu)
target_compile_options(bug_test PUBLIC $<$<COMPILE_LANGUAGE:CUDA>:-arch compute_XX>)
target_compile_options(bug_test PUBLIC $<$<COMPILE_LANGUAGE:CUDA>:-code sm_XX>)
target_compile_features(bug_test PRIVATE cuda_std_14)

I'll gladly provide other examples if necessary. @allisonvacanti

@alliepiper
Collaborator

Thanks -- this is very helpful!

I can reproduce the issue (Just compiled main.cu with NVCC 11.2 + GCC 7.5 on Ubuntu 18.04).

From a quick triage, it does seem to be related to the sort and not the vector initialization / device references:

  • Replacing the copies with thrust::transform(d.begin(), d.end(), d.begin(), thrust::negate<>{}) does not cause uninitialized access, so the data is getting initialized properly.
  • Adding an explicit cudaDeviceSynchronize() before the device sort doesn't fix the memory issue, so it's not a race. (A standalone sketch of both checks follows below.)
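
A minimal standalone sketch of the two triage checks above (a hypothetical reconstruction for illustration, not the exact code used; the original main.cu initializes the vectors element by element through device references):

#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/sort.h>
#include <thrust/functional.h>
#include <cuda_runtime.h>

int main()
{
    thrust::device_vector<int> d(4096, 1);

    // Check 1: touch every element with a device-side transform instead of
    // host-side per-element writes. This variant raises no uninitialized-access
    // warning, so the input data itself is being initialized properly.
    thrust::transform(d.begin(), d.end(), d.begin(), thrust::negate<>{});

    // Check 2: force all preceding work to finish before the sort. The warning
    // persists even with this barrier, so the problem is not a host/device race.
    cudaDeviceSynchronize();
    thrust::sort(d.begin(), d.end());

    return 0;
}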

We'll look into this and let you know.

@alliepiper alliepiper added the type: bug: functional Does not work as intended. label Mar 13, 2021
@alliepiper alliepiper added this to the 1.13.0 milestone Mar 13, 2021
@soCzech
Author

soCzech commented Mar 21, 2021

Hi,
We have more bad news: it seems this issue is not limited to Linux but occurs on Windows as well.
Even though we were able to compile the code mentioned above without any errors on Windows, a modified version with thrust::sort_by_key and PyTorch, mentioned in #1341 (comment) and linked against libtorch-win-shared-with-deps-1.8.0 / libtorch-cxx11-abi-shared-with-deps-1.8.0, produces an error on Windows as well.

Here is a table of all tested OS/CUDA variants of #1341 (comment) code and their outputs.

| OS | RUN | MEMCHECK | INITCHECK | CUDA | COMPILER | NVCC | ARCH | GPU | DRIVER | OTHER | ERRORS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Windows 10 | ✓ | ✓ | ✓ | 11.1.1 (Oct 2020) | MSVC 19.28.29913.0 | 11.1.105 | 61 | GeForce GTX 1060 with Max-Q Design | 461.92 Game Ready | WinSDK 10.0.19042 | - |
| Windows 10 | X | X | / | 11.2.1 (Feb 2021) | MSVC 19.28.29913.0 | 11.2.142 | 50 | GeForce 940M | 461.92 Game Ready | WinSDK 10.0.19042 | RUN: exception: thrust::system::system_error |
| Windows 10 | X | X | / | 11.2.2 (March 2021) | MSVC 19.28.29913.0 | 11.2.152 | 50 | GeForce 940M | 461.92 Game Ready | WinSDK 10.0.19042 | RUN: exception: thrust::system::system_error |
| Ubuntu 20.04 | X | / | / | ?11.0 (May 2020)? | GNU 9.3.0 | 11.0.221 | 75 | GeForce RTX 2070 | 460.56 | - | RUN: 'thrust::system::system_error' what(): radix_sort: failed on 1st step: cudaErrorInvalidDevice: invalid device ordinal |
| Ubuntu 20.04 | X | / | / | 11.1.1 (Oct 2020) | GNU 9.3.0 | 11.1.105 | 75 | GeForce RTX 2070 | 460.56 | - | RUN: 'thrust::system::system_error' what(): radix_sort: failed on 1st step: cudaErrorInvalidDevice: invalid device ordinal |
| Ubuntu 20.04 | ✓ | X | ✓ | 11.2.1 (Feb 2021) | GNU 9.3.0 | 11.2.142 | 75 | GeForce RTX 2070 | 460.56 | - | MEMCHECK: Program hit out of memory (error 2) on CUDA API call to cudaMalloc. |
| Ubuntu 20.04 | ✓ | X | ✓ | 11.2.2 (March 2021) | GNU 9.3.0 | 11.2.152 | 75 | GeForce RTX 2070 | 460.56 | - | MEMCHECK: Program hit out of memory (error 2) on CUDA API call to cudaMalloc. |

(✓ finished ok, X finished with an error, / not tested)

@fkallen
Contributor

fkallen commented Mar 25, 2021

The issue with radix sort seems to disappear when initializing cub's temp_storage to zero.
The issue occurs because the d_spine array is not initialized prior to the upsweep kernel, and this kernel does not (or rather, might not) write every element of d_spine. https://github.com/NVIDIA/cub/blob/main/cub/device/dispatch/dispatch_radix_sort.cuh#L1120

For example, I tried to sort an integer array of length 48768, all zeros, on sm_61.
Invoking upsweep_kernel<<<6, 128, 0, 0>>>(), 16 items per thread, 12 SM occupancy, current bit 0, bit_grain 6
Then, spine_length is 30976, but only the first 384 elements are computed by the upsweep kernel. The remaining elements are left with undetermined values.

However, the subsequent scan kernel then processes the full d_spine array, which causes uninitialized accesses.

Maybe one could precompute the number of elements that are written by the upsweep kernel and pass that number as the length to the scan kernel?
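
For anyone who needs a stopgap, here is a minimal sketch of the zero-initialization workaround described above, using the standard two-phase cub::DeviceRadixSort::SortKeys API (the cudaMemset of the temporary storage is the only deviation from the usual pattern, and it is a workaround rather than a fix of the kernel itself):

#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    const int num_items = 48768;

    int *d_keys_in, *d_keys_out;
    cudaMalloc(&d_keys_in,  num_items * sizeof(int));
    cudaMalloc(&d_keys_out, num_items * sizeof(int));
    cudaMemset(d_keys_in, 0, num_items * sizeof(int));   // all zeros, as in the test above

    // Phase 1: query how much temporary storage the sort needs.
    void  *d_temp_storage     = nullptr;
    size_t temp_storage_bytes = 0;
    cub::DeviceRadixSort::SortKeys(d_temp_storage, temp_storage_bytes,
                                   d_keys_in, d_keys_out, num_items);

    cudaMalloc(&d_temp_storage, temp_storage_bytes);
    // Workaround: zero the scratch space so that any d_spine element the
    // upsweep kernel skips is read back as 0 by the scan kernel.
    cudaMemset(d_temp_storage, 0, temp_storage_bytes);

    // Phase 2: run the actual sort.
    cub::DeviceRadixSort::SortKeys(d_temp_storage, temp_storage_bytes,
                                   d_keys_in, d_keys_out, num_items);
    cudaDeviceSynchronize();
    printf("sort finished\n");

    cudaFree(d_temp_storage);
    cudaFree(d_keys_in);
    cudaFree(d_keys_out);
    return 0;
}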

@canonizer

So far, I've been able to reproduce the sanitizer warnings in upsweep-downsweep sort (you need to set ONESWEEP = false in the sorting policy). The warnings occur in RadixSortScanBinsKernel, as it indeed can read a memory address that hasn't been previously written to. NVIDIA/cub#277 should fix that particular problem.

However, I don't think this is the cause of the underlying incorrect sorting results. Though ScanBinsKernel indeed reads uninitialized data and then writes there, this shouldn't lead to problems. ScanBinsKernel never reads or writes beyond the address range allocated to d_spine, and the subsequent downsweep kernel never reads beyond the range that is computed from the properly initialized portion of d_spine.

I also haven't been able to reproduce incorrect sorting results (in either upsweep-downsweep or onesweep sorting) or the sanitizer warnings in the onesweep sorting.

@soCzech Could you provide me with parameters (GPU, OS, CUDA version, CUB/thrust version, compiler, sorting array size, etc.) that reliably produce (or at least have a good chance of producing), with your example code above (main.cu), either of the following:

  • incorrect sorting results (with either upsweep-downsweep or onesweep kernels)
  • sanitizer warnings in the DeviceRadixSortOnesweepKernel (like in the first message of this issue)

@soCzech
Author

soCzech commented Mar 29, 2021

@canonizer Here is a repo https://github.com/soCzech/thrust-bug with a dockerfile and all build instructions. Let me know if it works for you ;)

It reproduces (i.e. produces wrong results) on GP107M [GeForce GTX 1050 Ti Mobile] with driver 450.102.04, but it should reproduce on other devices/drivers as well.

@brycelelbach brycelelbach added the P1: should have Necessary, but not critical. label Mar 29, 2021
@canonizer

I tried the repository you had linked. However, I compiled it for compute capability 6.0, without the Pytorch libraries and without docker. I ran the executable 10 times on GP100. I haven't been able to get an incorrect sorting result.

Is anything in the docker file, linking with the Pytorch library, or compiling for a particular architecture required to reproduce the bug?

I'll try to run it on the particular GPU you mentioned. Do you know any other device/driver combinations on which it produces wrong results?

@soCzech
Author

soCzech commented Mar 30, 2021

@canonizer If PyTorch is not linked, the error does not occur. We observed that any small change could alter the result of our internal code (i.e. it sometimes runs correctly, but also sometimes produces a different error in a different thrust function call). Therefore I do not think the issue is PyTorch itself; rather, linking it changes the binary or the relative location of the code/data, and that results in the error.

Maybe just try running the code in docker to see if you can reproduce the issue, and then try to investigate further without docker?

The code also produces the error on GeForce RTX 2070 Mobile with driver 460.56 or GeForce RTX 2080 Ti with driver 450.102.04.

@canonizer

I'll try to run it in docker.

Have you been able to reproduce the issue by linking a library with GPU code other than Pytorch?

@soCzech
Author

soCzech commented Mar 30, 2021

Have you been able to reproduce the issue by linking a library with GPU code other than Pytorch?

Not that I am aware of, because we were stripping down our code to create this minimal example and our code does not use many third-party libraries. I think adding some other libraries "fixed" the wrong-results issue. Also, when we encountered the Uninitialized __global__ memory warning, we assumed it had the same cause as the wrong results.

@canonizer

I managed to reproduce the problem on GP100 when running under Docker and linking with Pytorch.

However, I also found out that in this case, the executable links dynamically against 2 different versions of libcudart: one from the CUDA toolkit, and one from PyTorch.

When one of them is removed, e.g. by not linking against libtorch.so and libtorch_cuda.so, or by not linking against /usr/local/cuda/lib64/libcudart.so, the problem does not occur. When none of these libraries are linked, the problem does not occur either.

Have you been able to reproduce the problem with only one version of libcudart linked dynamically?

@soCzech
Author

soCzech commented Apr 1, 2021

@canonizer Sorry, my bad, I should have checked what was being linked. Indeed, in our code, removing the CUDA toolkit libcudart.so and linking only the PyTorch one seems to resolve the issue. But when I built PyTorch from source (so that it uses the only available libcudart.so, the one from the CUDA toolkit), the issue occurred again, and this time I checked that no library is linked twice.

Can you investigate the issue anyway or is the issue in this particular example caused by the two versions of libcudart.so clashing?

@canonizer

I can take a look. Could you prepare a reproducer for that case, and also check that the error occurs when libcudart is linked only once (either statically or dynamically)?

@soCzech
Author

soCzech commented Apr 1, 2021

It is difficult to create a reproducer, as many of the errors are deep in our internal code and appear/disappear when the code or the version of a linked library is changed slightly. But does your fix for

========= Uninitialized __global__ memory read of size 4 bytes
=========     at 0x5b8 in void cub::DeviceRadixSortOnesweepKernel<cub::DeviceRadixSortPolicy<int,cub::NullType,int>::Policy800,bool=0,int,cub::NullType,int,int>(int*,int,bool=0*,int* const *,int*,int* const * const *,cub::DeviceRadixSortPolicy<int,cub::NullType,int>::Policy800*,int* const * const * const *,int*,int,int)

also fixes

========= Uninitialized __global__ memory read of size 4 bytes
=========     at 0x1d68 in void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__copy_if::CopyIfAgent<thrust::zip_iterator<thrust::tuple<unsigned int*,unsigned int*,unsigned int*,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type>>,thrust::cuda_cub::__copy_if::no_stencil_tag_*,thrust::zip_iterator<thrust::tuple<unsigned int*,unsigned int*,unsigned int*,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type>>,thrust::detail::unary_negate<minkowski::detail::is_first<unsigned int>>,int,int*>,thrust::zip_iterator<thrust::tuple<unsigned int*,unsigned int*,unsigned int*,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type>>,thrust::cuda_cub::__copy_if::no_stencil_tag_*,thrust::zip_iterator<thrust::tuple<unsigned int*,unsigned int*,unsigned int*,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type>>,thrust::detail::unary_negate<minkowski::detail::is_first<unsigned int>>,int,int*,cub::ScanTileState<int,bool=1>,unsigned long>(unsigned int*,unsigned int*,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type)

or is this a completely different problem I am getting?

@canonizer

So far, I haven't been able to reproduce the uninitialized read in DeviceRadixSortOnesweepKernel. Does https://github.com/soCzech/thrust-bug also reproduce it (even if it requires compilation with 2 instances of cudart), or is a different reproducer required?

My fix is for the uninitialized accesses in a different pair of kernels, upsweep/downsweep. That's the only place where we were able to get uninitialized accesses so far.

@soCzech
Author

soCzech commented Apr 1, 2021

Yes, https://github.com/soCzech/thrust-bug reproduces the uninitialized read in DeviceRadixSortOnesweepKernel.

Actually, I wasn't aware of the issue in RadixSortScanBinsKernel that you discovered in the reproducer; the reproducer was meant from the beginning to show the issue with DeviceRadixSortOnesweepKernel.

@soCzech
Author

soCzech commented Apr 1, 2021

@canonizer Ok, now I used PyTorch built from source in the https://github.com/soCzech/thrust-bug reproducer, and:

  1. Indeed, no library is linked twice:
$ ldd build/bug_test 
    linux-vdso.so.1 (0x00007fffa1378000)
    libtorch_cuda.so => /tmp/torch/pytorch-install/lib/libtorch_cuda.so (0x00007fe98f62a000)
    libblas.so.3 => /lib/x86_64-linux-gnu/libblas.so.3 (0x00007fe98f5b8000)
    libcudart.so.11.0 => /usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudart.so.11.0 (0x00007fe98f33a000)
    librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fe98f32f000)
    libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fe98f30c000)
    libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fe98f306000)
    libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007fe98f123000)
    libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fe98efd4000)
    libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fe98efb9000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fe98edc7000)
    libc10_cuda.so => /tmp/torch/pytorch-install/lib/libc10_cuda.so (0x00007fe98ed97000)
    libcusparse.so.11 => /usr/local/cuda/lib64/libcusparse.so.11 (0x00007fe985252000)
    libcurand.so.10 => /usr/local/cuda/lib64/libcurand.so.10 (0x00007fe9806e4000)
    libcusolver.so.10 => /usr/local/cuda/lib64/libcusolver.so.10 (0x00007fe95f8c5000)
    libcudnn.so.8 => /lib/x86_64-linux-gnu/libcudnn.so.8 (0x00007fe95f69c000)
    libnvToolsExt.so.1 => /usr/local/cuda/lib64/libnvToolsExt.so.1 (0x00007fe95f493000)
    libnccl.so.2 => /usr/local/cuda/lib64/libnccl.so.2 (0x00007fe958a53000)
    libc10.so => /tmp/torch/pytorch-install/lib/libc10.so (0x00007fe9589e7000)
    libtorch_cpu.so => /tmp/torch/pytorch-install/lib/libtorch_cpu.so (0x00007fe95135e000)
    libcufft.so.10 => /usr/local/cuda/lib64/libcufft.so.10 (0x00007fe94749a000)
    libcublas.so.11 => /usr/local/cuda/lib64/libcublas.so.11 (0x00007fe94164a000)
    /lib64/ld-linux-x86-64.so.2 (0x00007fe999fa0000)
    libgomp.so.1 => /lib/x86_64-linux-gnu/libgomp.so.1 (0x00007fe941608000)
    libcublasLt.so.11 => /usr/local/cuda/lib64/libcublasLt.so.11 (0x00007fe936477000)
  2. The uninitialized read in DeviceRadixSortOnesweepKernel is still there:
========= Uninitialized __global__ memory read of size 4 bytes
=========     at 0x558 in void cub::DeviceRadixSortOnesweepKernel<cub::DeviceRadixSortPolicy<int,cub::NullType,int>::Policy800,bool=0,int,cub::NullType,int,int>(int*,int,bool=0*,int* const *,int*,int* const * const *,cub::DeviceRadixSortPolicy<int,cub::NullType,int>::Policy800*,int* const * const * const *,int*,int,int)
  3. I am even getting wrong sort results: RUN 0, NUM 5504, dev_ptr 0x7ff8b3a00000: Wrong sort! Host and device vector doesn't match!

So it seems that the doubly linked cudart is not the cause of the problem. If you wish, I can update the reproducer to build PyTorch from source (but the build itself takes a lot of time).

@soCzech
Author

soCzech commented Apr 13, 2021

Hi, any update?
I have tested linking PyTorch statically, and I am still getting the Uninitialized __global__ memory error in cub::DeviceRadixSortOnesweepKernel.

@alliepiper
Collaborator

Hey @soCzech, we've been busy with GTC stuff lately and things are generally pretty hectic right now. I may not have time to look into this until mid-May at the earliest, unfortunately.

@elstehle
Collaborator

I'm investigating. I think this relates to an issue I had encountered before. I will follow up with a more detailed analysis and write up shortly.

@elstehle
Collaborator

I'm still analyzing, but here's where we stand:
There's a more fundamental issue with the dispatch mechanism (tuning policies) when linking objects that have been compiled for different sets of architectures and that define the same symbols / kernels. This is the case here, as both PyTorch and your source file have some Thrust / CUB kernels in common (especially the EmptyKernel), and your source file, by mistake, gets compiled for 5.2 only.

For now, if you urgently need a workaround, I'd suggest making sure that all libraries you have control over are compiled for the same set of architectures, if that is feasible.

As for your example, the CMakeLists.txt needs to be changed to:

# I think CUDA_ARCHITECTURES was only introduced with 3.18:
cmake_minimum_required(VERSION 3.18)
...
# this line also needs to be changed[1] to:
set_target_properties(bug_test PROPERTIES
        CUDA_ARCHITECTURES "50;61;72;75;80"
)

It's best to verify that things get compiled for the right archs by building with --verbose, e.g. cmake --build . --verbose --parallel 4.

[1] Thanks to @robertmaynard for helping with the correct set_target_properties signature.

@alliepiper
Collaborator

Yes, compiling multiple times with different arch flags will definitely cause problems. Thanks for the write up!

Since the dispatch mechanism came up -- I'm currently rewriting it for other reasons, so don't spend too much time digging into the current implementation. The new version will also require that arch flags match across CUB translation units.

@soCzech
Author

soCzech commented Apr 20, 2021

Thank you so much! I thought that running code built for the wrong compute capability raised an error, and I am pretty sure I have seen something like "not compiled for your compute capability" before, but maybe that is a different story :D

I have all libs, especially PyTorch, custom-compiled for my architecture; only the main program was unknowingly compiled for the wrong architecture. I can confirm that when it is compiled with the right compute capability, both the Wrong sort and the Uninitialized __global__ memory errors disappear. We did not investigate in this direction because we had observed that the code produced correct results when PyTorch was not linked :) To sum up for future readers, here is the CMake/nvcc version/argument breakdown (a consolidated CMakeLists.txt sketch follows the list):

CMake 3.18 with set_target_properties(bug_test PROPERTIES CUDA_ARCHITECTURES "61") runs nvcc with argument --generate-code=arch=compute_61,code=[compute_61,sm_61], everything works ✔️

CMake 3.18 or CMake 3.17 without setting CUDA_ARCHITECTURES but setting target_compile_options(bug_test PUBLIC $<$<COMPILE_LANGUAGE:CUDA>:--generate-code=arch=compute_61,code=[compute_61,sm_61]>) runs nvcc with argument --generate-code=arch=compute_61,code=[compute_61,sm_61], everything works ✔️

CMake 3.18 without setting CUDA_ARCHITECTURES, or CMake 3.17 without the explicit flag --generate-code=arch=compute_61,code=[compute_61,sm_61], runs nvcc with its default compute capability, which is 5.2 for CUDA 11. This produces the Wrong sort and Uninitialized __global__ memory errors ❌
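
For future readers, here is a consolidated CMakeLists.txt sketch of the working configuration, assuming CMake >= 3.18 and a compute-capability-6.1 GPU as in the breakdown above (substitute your own architecture list):

cmake_minimum_required(VERSION 3.18)
project(bug_test CUDA CXX)

add_executable(bug_test main.cu)

# Build device code for the GPU(s) the binary will actually run on;
# "61" matches the compute-capability-6.1 cards used in this thread.
set_target_properties(bug_test PROPERTIES CUDA_ARCHITECTURES "61")

target_compile_features(bug_test PRIVATE cuda_std_14)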

Thank you so much again!
Best,
Tomas

@alliepiper
Collaborator

Awesome, sounds like we can close this once NVIDIA/cub#277 is in. We should be able to get that done for the next release 👍
