Skip to content
This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

CUDA 9.2 does not support -dc #924

Closed
henryiii opened this issue Jul 23, 2018 · 10 comments
Closed

CUDA 9.2 does not support -dc #924

henryiii opened this issue Jul 23, 2018 · 10 comments
Assignees
Labels
nvbug Has an associated internal NVIDIA NVBug. type: bug: functional Does not work as intended.

Comments

@henryiii
Copy link

henryiii commented Jul 23, 2018

CUDA 9.2's version of thrust does device side kernel launches, which breaks the following MWE:

#include <thrust/functional.h>
// Only needed for older Thrust:
#include <thrust/transform_reduce.h>

#include <iostream>

class ReturnOne : public thrust::unary_function<thrust::tuple<int>, double> {
  public:
    __device__ double operator()(thrust::tuple<int> t) {
        return 1.0;
    }
};


int main() {
    thrust::counting_iterator<int> eventIndex(0);

    double ret = thrust::transform_reduce(
            thrust::make_zip_iterator(thrust::make_tuple(eventIndex)),
            thrust::make_zip_iterator(thrust::make_tuple(eventIndex + 1000)),
            ReturnOne(),
            0.0,
            thrust::plus<double>());

    std::cout << ret << std::endl;
    return 0;
}

Now if you try to compile and link using relocatable device code -dc:

[me@pc cuda9]$ nvcc simplethrust.cu --generate-code=arch=compute_60,code=sm_60 -dc
/usr/local/cuda-9.2/bin/../targets/x86_64-linux/include/thrust/system/cuda/detail/cub/device/dispatch/dispatch_reduce.cuh(446): error: cannot pass an argument with a user-provided copy-constructor to a device-side kernel launch
          detected during:
            instantiation of "cudaError_t thrust::cuda_cub::cub::DispatchReduce<InputIteratorT, OutputIteratorT, OffsetT, ReductionOpT, OutputT>::InvokeSingleTile<ActivePolicyT,SingleTileKernelT>(SingleTileKernelT) [with InputIteratorT=thrust::cuda_cub::transform_input_iterator_t<double, thrust::zip_iterator<thrust::tuple<thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>>, ReturnOne>, OutputIteratorT=double *, OffsetT=int, ReductionOpT=thrust::plus<double>, OutputT=double, ActivePolicyT=thrust::cuda_cub::cub::DeviceReducePolicy<double, int, thrust::plus<double>>::Policy130, SingleTileKernelT=void (*)(thrust::cuda_cub::transform_input_iterator_t<double, thrust::zip_iterator<thrust::tuple<thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>>, ReturnOne>, double *, int, thrust::plus<double>, double)]"
(599): here
            instantiation of "cudaError_t thrust::cuda_cub::cub::DispatchReduce<InputIteratorT, OutputIteratorT, OffsetT, ReductionOpT, OutputT>::Invoke<ActivePolicyT>() [with InputIteratorT=thrust::cuda_cub::transform_input_iterator_t<double, thrust::zip_iterator<thrust::tuple<thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>>, ReturnOne>, OutputIteratorT=double *, OffsetT=int, ReductionOpT=thrust::plus<double>, OutputT=double, ActivePolicyT=thrust::cuda_cub::cub::DeviceReducePolicy<double, int, thrust::plus<double>>::Policy130]"
/usr/local/cuda-9.2/bin/../targets/x86_64-linux/include/thrust/system/cuda/detail/cub/block/../iterator/../util_device.cuh(332): here
            instantiation of "cudaError_t thrust::cuda_cub::cub::ChainedPolicy<PTX_VERSION, PolicyT, PolicyT>::Invoke(int, FunctorT &) [with PTX_VERSION=130, PolicyT=thrust::cuda_cub::cub::DeviceReducePolicy<double, int, thrust::plus<double>>::Policy130, FunctorT=thrust::cuda_cub::cub::DispatchReduce<thrust::cuda_cub::transform_input_iterator_t<double, thrust::zip_iterator<thrust::tuple<thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>>, ReturnOne>, double *, int, thrust::plus<double>, double>]"
/usr/local/cuda-9.2/bin/../targets/x86_64-linux/include/thrust/system/cuda/detail/cub/block/../iterator/../util_device.cuh(315): here
            instantiation of "cudaError_t thrust::cuda_cub::cub::ChainedPolicy<PTX_VERSION, PolicyT, PrevPolicyT>::Invoke(int, FunctorT &) [with PTX_VERSION=200, PolicyT=thrust::cuda_cub::cub::DeviceReducePolicy<double, int, thrust::plus<double>>::Policy200, PrevPolicyT=thrust::cuda_cub::cub::DeviceReducePolicy<double, int, thrust::plus<double>>::Policy130, FunctorT=thrust::cuda_cub::cub::DispatchReduce<thrust::cuda_cub::transform_input_iterator_t<double, thrust::zip_iterator<thrust::tuple<thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>>, ReturnOne>, double *, int, thrust::plus<double>, double>]"
/usr/local/cuda-9.2/bin/../targets/x86_64-linux/include/thrust/system/cuda/detail/cub/block/../iterator/../util_device.cuh(315): here
            instantiation of "cudaError_t thrust::cuda_cub::cub::ChainedPolicy<PTX_VERSION, PolicyT, PrevPolicyT>::Invoke(int, FunctorT &) [with PTX_VERSION=300, PolicyT=thrust::cuda_cub::cub::DeviceReducePolicy<double, int, thrust::plus<double>>::Policy300, PrevPolicyT=thrust::cuda_cub::cub::DeviceReducePolicy<double, int, thrust::plus<double>>::Policy200, FunctorT=thrust::cuda_cub::cub::DispatchReduce<thrust::cuda_cub::transform_input_iterator_t<double, thrust::zip_iterator<thrust::tuple<thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>>, ReturnOne>, double *, int, thrust::plus<double>, double>]"
/usr/local/cuda-9.2/bin/../targets/x86_64-linux/include/thrust/system/cuda/detail/cub/block/../iterator/../util_device.cuh(315): here
            [ 3 instantiation contexts not shown ]
            instantiation of "cudaError_t thrust::cuda_cub::cub::DeviceReduce::Reduce(void *, size_t &, InputIteratorT, OutputIteratorT, int, ReductionOpT, T, cudaStream_t, __nv_bool) [with InputIteratorT=thrust::cuda_cub::transform_input_iterator_t<double, thrust::zip_iterator<thrust::tuple<thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>>, ReturnOne>, OutputIteratorT=double *, ReductionOpT=thrust::plus<double>, T=double]"
/usr/local/cuda-9.2/bin/../targets/x86_64-linux/include/thrust/system/cuda/detail/reduce.h(950): here
            instantiation of "T thrust::cuda_cub::reduce_n(thrust::cuda_cub::execution_policy<Derived> &, InputIt, Size, T, BinaryOp) [with Derived=thrust::cuda_cub::tag, InputIt=thrust::cuda_cub::transform_input_iterator_t<double, thrust::zip_iterator<thrust::tuple<thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>>, ReturnOne>, Size=std::ptrdiff_t, T=double, BinaryOp=thrust::plus<double>]"
/usr/local/cuda-9.2/bin/../targets/x86_64-linux/include/thrust/system/cuda/detail/transform_reduce.h(62): here
            instantiation of "T thrust::cuda_cub::transform_reduce(thrust::cuda_cub::execution_policy<Derived> &, InputIt, InputIt, TransformOp, T, ReduceOp) [with Derived=thrust::cuda_cub::tag, InputIt=thrust::zip_iterator<thrust::tuple<thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>>, TransformOp=ReturnOne, T=double, ReduceOp=thrust::plus<double>]"
/usr/local/cuda-9.2/bin/../targets/x86_64-linux/include/thrust/detail/transform_reduce.inl(47): here
            instantiation of "OutputType thrust::transform_reduce(const thrust::detail::execution_policy_base<DerivedPolicy> &, InputIterator, InputIterator, UnaryFunction, OutputType, BinaryFunction) [with DerivedPolicy=thrust::cuda_cub::tag, InputIterator=thrust::zip_iterator<thrust::tuple<thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>>, UnaryFunction=ReturnOne, OutputType=double, BinaryFunction=thrust::plus<double>]"
/usr/local/cuda-9.2/bin/../targets/x86_64-linux/include/thrust/detail/transform_reduce.inl(67): here
            instantiation of "OutputType thrust::transform_reduce(InputIterator, InputIterator, UnaryFunction, OutputType, BinaryFunction) [with InputIterator=thrust::zip_iterator<thrust::tuple<thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>>, UnaryFunction=ReturnOne, OutputType=double, BinaryFunction=thrust::plus<double>]"
simplethrust.cu(18): here

1 error detected in the compilation of "/tmp/tmpxft_00009161_00000000-6_simplethrust.cpp1.ii".

Running the exact same thing in CUDA 8 (or CUDA 9.0) works:

[me@pc cuda8]$ nvcc simplethrust.cu --generate-code=arch=compute_60,code=sm_60 -dc
[me@pc cuda8]$ nvcc --generate-code=arch=compute_60,code=sm_60 simplethrust.o
[me@pc cuda8]$ ./a.out
1000

I believe this is the reason behind GooFit's failure to run on CUDA 9+, since the entire framework is built on these zipped iterators.

@3gx
Copy link
Contributor

3gx commented Jul 23, 2018

This repro also works fine with cuda 9.0

$ /usr/local/cuda-9.0/bin/nvcc r.cu --generate-code=arch=compute_60,code=sm_60 -rdc=true -run                                                                                      
1000

@henryiii henryiii changed the title CUDA 9 does not support -dc CUDA 9.2 does not support -dc Jul 23, 2018
@brycelelbach brycelelbach self-assigned this Aug 6, 2018
@brycelelbach brycelelbach added the type: bug: functional Does not work as intended. label Aug 6, 2018
@brycelelbach
Copy link
Collaborator

Seems to be specific to reduce.

@brycelelbach
Copy link
Collaborator

Ah, I found the problem.

Our normal trick is to do the following:

#if !defined(__CUDA_ARCH__) || (__CUDA_ARCH__>= 350 && defined(__CUDACC_RDC__))
  // Not compiling device code or the device-side runtime is available.
  #define THRUST_HAS_CUDART 1
  #define THRUST_RUNTIME_FUNCTION __host__ __device__
#elif
  #define THRUST_HAS_CUDART 0
  #define THRUST_RUNTIME_FUNCTION __host__
#endif

namespace detail {

THRUST_RUNTIME_FUNCTION
void foo_impl()
{
  launch_kernels<<</* ... */>>>();
}

} // namespace detail

// API function

#pragma nv_exec_check_disable
// Silence warnings about calling `__host__` only functions
// from the device. These warnings get triggered if the device
// side runtime is not available because we never #ifdef out 
// calls to functions that launch kernels, because if we do NVCC
// will discard the instantiations of those kernels. See also:
// https://github.com/NVlabs/cub/issues/30
__host__ __device__
void foo()
{
  if (THRUST_HAS_CUDART)
    foo_impl();

  // Otherwise:
  #if !THRUST_HAS_CUDART
    foo_fallback();
  #endif

In CUDA 9.2, reduce_impl was accidentally inlined into reduce, which causes the breakage.

@brycelelbach
Copy link
Collaborator

This is fixed now.

@brycelelbach
Copy link
Collaborator

Confirmed, this is the source of the GooFit issue. A fix will be in the "next next" CUDA release, but we'll have a fix out on GitHub before then.

This is NVBug 2096679 and 2315990.

@brycelelbach brycelelbach added the nvbug Has an associated internal NVIDIA NVBug. label Aug 6, 2018
@brycelelbach brycelelbach added this to the Next Next Release milestone Aug 6, 2018
@brycelelbach
Copy link
Collaborator

And I've also filed a bug to increase our test coverage to include -dc compilation.

@henryiii
Copy link
Author

henryiii commented Aug 7, 2018

Great, thanks! And more test coverage is always good, especially when it covers the way we use Thrust. :)

@robertmaynard
Copy link
Collaborator

@brycelelbach

So I ran into this problem as we move over to separable compilation and what are our options if we are required to use CUDA 9.2/10.X?

@henryiii
Copy link
Author

henryiii commented Oct 23, 2018

@robertmaynard In GooFit, both using CMake's CUDA as a language and FindCUDA1, I make sure this github submodule is higher in the path than the CUDA thrust, then it works fine. The built-in Thrust does not work with CUDA 9+ and separable compilation.

https://github.com/GooFit/GooFit/blob/14097644f8ad49fb0ea0aa62c7bd53c58db26a1a/CMakeLists.txt#L443-L447

Footnotes

  1. I don't really test CUDA 9+ and CMake < 3.12 that often, though, so I could be wrong about the FindCUDA being set up properly for this case.

@robertmaynard
Copy link
Collaborator

My concern is that as a library I can't force consumers ( or consumers other dependencies ) to use my version of thrust and we will either introduce ODR violations, or make these compiler errors header order dependent.

The more I think about this the more I am wondering if this is fixable by using a custom thrust execution policy and an explicit override for reduce/transform_reduce.

brycelelbach added a commit that referenced this issue Feb 28, 2019
- Fix dispatch for the CUDA backend's `reduce` to use two functions (one with the pragma for disabling exec checks, one with THRUST_RUNTIME_FUNCTION) instead of one. This fixes a regression with device compilation that started in CUDA 9.2
- Fully namespace qualify uses of things in the `thrust::detail` namespace to avoid ambiguities.
Review: Internal GitLab #888
Signed-off-by: Jared Hoberock <[email protected]>
Bug 2096679
Bug 2351990
GitHub #924
git-commit 412c623f939fd676ee619c93f2ca478a6046c611
git-author Bryce Adelstein Lelbach aka wash <[email protected]>
VDVS: http://ausdvs.nvidia.com/Build_Results?virtualId=1000216448&which_page=current_build

Jobs: 2096679-2006
[git-p4: depot-paths = "//sw/gpgpu/thrust/": change = 24706499]
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
nvbug Has an associated internal NVIDIA NVBug. type: bug: functional Does not work as intended.
Projects
None yet
Development

No branches or pull requests

4 participants