This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

warning: calling a __host__ function [...] from a __host__ __device__ function [...] #949

Closed
95A31 opened this issue Jan 6, 2019 · 5 comments


95A31 commented Jan 6, 2019

Hello,

A simple use of thrust::sort causes compile-time warnings and a runtime error.

I get the following warnings:

/usr/local/cuda/bin/../targets/x86_64-linux/include/thrust/system/cuda/detail/sort.h(1507): warning: calling a __host__ function("thrust::detail::aligned_reinterpret_cast<int *, unsigned char *> ") from a __host__ __device__ function("thrust::cuda_cub::__radix_sort::radix_sort< ::thrust::detail::integral_constant<bool, (bool)0> ,  ::thrust::cuda_cub::par_t, int, int, long,  ::thrust::less<int> > ") is not allowed

/usr/local/cuda/bin/../targets/x86_64-linux/include/thrust/system/cuda/detail/sort.h(1510): warning: calling a __host__ function("thrust::detail::aligned_reinterpret_cast<int *, unsigned char *> ") from a __host__ __device__ function("thrust::cuda_cub::__radix_sort::radix_sort< ::thrust::detail::integral_constant<bool, (bool)0> ,  ::thrust::cuda_cub::par_t, int, int, long,  ::thrust::less<int> > ") is not allowed

At runtime the program crashes:

Bus error (core dumped)

This is a MWE:

#include <iostream>
#include <thrust/sort.h>
#include <thrust/execution_policy.h>  // for thrust::device

#define ARRAY_SIZE 10

__global__
void sortingKernel(int* arrayToSort, int arraySize)
{
    int* begin = arrayToSort;
    int* end = arrayToSort + arraySize;

    thrust::sort(thrust::device, begin, end);
}

int main(void)
{
    int arraySize = ARRAY_SIZE;

    int* arrayToSort;
    cudaMallocManaged(&arrayToSort, sizeof(int) * arraySize);
    for(int i = 0; i < arraySize; i += 1)
    {
        arrayToSort[i] = arraySize - i;
    }

    std::cout << "Before: ";
    for(int i = 0; i < arraySize; i += 1)
    {
        std::cout << arrayToSort[i] << " ";
    }
    std::cout << std::endl;

    sortingKernel<<<1,1>>>(arrayToSort, arraySize);
    cudaDeviceSynchronize();

    std::cout << "After: ";
    for(int i = 0; i < arraySize; i += 1)
    {
        std::cout << arrayToSort[i] << " ";
    }
    std::cout << std::endl;

    cudaFree(arrayToSort);
    return 0;
}

I compile it with the following command:

nvcc -arch=sm_35 -rdc=true -g Sort.cu -lcudadevrt -o Sort

The running system is:

OS: Ubuntu Server 18.04 LTS
Kernel version: 4.15.0-43
CUDA version: 10.0
Thrust version: 1.9
Driver version: 410.79
GPU: Tesla K40c

nferenc13 commented Mar 7, 2019

I have tried the same (dynamic parallelism) with thrust::sort and it doesn't work either.

#include <stdio.h>
#include <thrust/reduce.h>
#include <thrust/execution_policy.h>

#define N 10

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true) {
   if (code != cudaSuccess) {
    fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
    if (abort) exit(code);
   }
}

__global__ void test_kernel( int* data, int size, int * sum ) {
    *sum = thrust::reduce( thrust::seq, data , data + size, (int)0, thrust::plus<int>() );
//    *sum = thrust::reduce( thrust::device, data , data + size, (int)0, thrust::plus<int>() );
}

int main() {
    int *h_data = (int *)malloc( N * sizeof(int) );
    int sum  ;
    for ( int i = 0 ; i < N ; i++ )
        h_data[i] = 1 ;
    int *d_data; gpuErrchk(cudaMalloc((void**)&d_data,     N * sizeof(int) ));
    int *d_sum;  gpuErrchk(cudaMalloc((void**)&d_sum,     sizeof(int) ));
    gpuErrchk(cudaMemcpy(d_data, h_data, N * sizeof(int), cudaMemcpyHostToDevice));
    test_kernel<<< 1, 1 >>>( d_data, N, d_sum );
    gpuErrchk(cudaDeviceSynchronize());
    gpuErrchk(cudaMemcpy(&sum, d_sum, sizeof(int), cudaMemcpyDeviceToHost));
    printf("Sum: %d\n", sum);
    free(h_data) ;
    cudaFree(d_data) ;
    cudaFree(d_sum) ;
    return 0;
}

My system:
Windows 10 64bit
CUDA version: 10.1
Thrust version: 1.9.4
Driver version: 419.35
GPU: GeForce GTX 1050 Ti

@codecircuit

@nferenc13 What do you mean by

(dynamic parallelism)

I think the thrust::device backend is currently not meant to be used within device code (any function declared with the __global__ or __device__ specifier). However, thrust::sort and thrust::reduce with the thrust::seq backend work for me in device code. Moreover, I can compile the code snippet you posted.
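The workaround described above can be sketched as follows. This is a minimal illustration, not an official fix: replacing thrust::device with thrust::seq runs the algorithm serially in the calling thread, which avoids the host-only code paths that trigger the warnings (the kernel and array names mirror the MWE above).

```
#include <thrust/sort.h>
#include <thrust/execution_policy.h>

// Sketch: sorting from device code with the sequential backend.
// thrust::seq executes serially in the calling GPU thread, so no
// dynamic parallelism or host-only helpers are involved.
__global__ void sortingKernel(int* arrayToSort, int arraySize)
{
    thrust::sort(thrust::seq, arrayToSort, arrayToSort + arraySize);
}
```

Note that a serial sort in a single thread can be slow for large arrays; this only sidesteps the compilation issue.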

Collaborator

griwes commented Apr 15, 2019

This should be fixed on master now; please reopen if that's not the case. I don't yet know what CUDA Toolkit version the fix will land in.

@griwes griwes closed this as completed Apr 15, 2019
@WilliamKF

Same issue with thrust::count() on CUDA V9.2.88.

brycelelbach pushed a commit that referenced this issue May 16, 2020
Bug 2422333
Bug 2522259
Bug 2528822
Github #949
Github #973
@samuelpmish

#include <vector>

#include <thrust/sort.h>
#include <thrust/binary_search.h>
#include <thrust/execution_policy.h>
#include <thrust/iterator/counting_iterator.h>

int main() {

  std::vector< uint32_t > keys = {3,1,6,3,2,7,3,2,4,7,1,3,2};
  std::vector< uint32_t > values = {1,7,8,3,2,3,1,3,2,1,4,2,4};
  thrust::stable_sort_by_key(thrust::host, keys.begin(), keys.end(), values.begin());

  std::vector< uint32_t > offsets(10, 0);
  thrust::lower_bound(thrust::host,
    keys.begin(),
    keys.end(),
    thrust::counting_iterator< uint32_t >(0),
    thrust::counting_iterator< uint32_t >(10),
    offsets.begin()
  );

  return 0;

}

These two Thrust calls print roughly 400 lines of warnings:

...
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\include\thrust/system/detail/generic/binary_search.inl(67): warning : calling a __host__ function from a __host__ __device__ function is not allowed
...

MSVC 19.24.28314.0 + CUDA 11.0.16 on Windows 10

6 participants