
CUDA_ERROR_NO_BINARY_FOR_GPU #289

Closed
khaotik opened this issue Nov 22, 2016 · 5 comments

Comments

@khaotik
Contributor

khaotik commented Nov 22, 2016

I'm trying to switch to the new GPU backend for Theano, but most of the libgpuarray tests fail.

Debugging through the code:

In file libgpuarray/src/gpuarray_buffer_cuda.c:

In function static gpukernel *cuda_newkernel(..):

the call to cuModuleLoadData returns code 209, which is CUDA_ERROR_NO_BINARY_FOR_GPU in my cuda.h.
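For reference, a raw CUresult like 209 can be mapped to its symbolic name with cuGetErrorName (driver API, CUDA 6.0+). A minimal sketch, not the actual libgpuarray code; `report` is a hypothetical helper:

```c
/* Sketch: map a raw CUresult (e.g. 209) to its symbolic name.
 * Hypothetical helper, not the actual libgpuarray code. */
#include <cuda.h>
#include <stdio.h>

static void report(CUresult err) {
    if (err != CUDA_SUCCESS) {
        const char *name = NULL;
        /* For err == 209 this yields "CUDA_ERROR_NO_BINARY_FOR_GPU". */
        cuGetErrorName(err, &name);
        fprintf(stderr, "CUDA error %d: %s\n", (int)err, name ? name : "unknown");
    }
}

/* usage: report(cuModuleLoadData(&mod, src)); */
```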

OS: Kali Linux Rolling
CPU Core i5-3210M (64 bit)
GPU GeForce GT650m compute capability 3.0
gcc 4.9.3
CUDA Toolkit 8.0
NVIDIA driver: 367.44

I'm unsure about the cause from this point on. Theano runs OK with the old backend.

UPDATE: This and this case are similar to mine.

UPDATE 2: This is the generated kernel source from running the check_elemwise test:

#define local_barrier() __syncthreads()
#define WITHIN_KERNEL extern "C" __device__
#define KERNEL extern "C" __global__
#define GLOBAL_MEM /* empty */
#define LOCAL_MEM __shared__
#define LOCAL_MEM_ARG /* empty */
#define REQD_WG_SIZE(X,Y,Z) __launch_bounds__(X*Y, Z)
#ifdef NAN
#undef NAN
#endif
#define NAN __int_as_float(0x7fffffff)
#define LID_0 threadIdx.x
#define LID_1 threadIdx.y
#define LID_2 threadIdx.z
#define LDIM_0 blockDim.x
#define LDIM_1 blockDim.y
#define LDIM_2 blockDim.z
#define GID_0 blockIdx.x
#define GID_1 blockIdx.y
#define GID_2 blockIdx.z
#define GDIM_0 gridDim.x
#define GDIM_1 gridDim.y
#define GDIM_2 gridDim.z
#define ga_bool unsigned char
#define ga_byte signed char
#define ga_ubyte unsigned char
#define ga_short short
#define ga_ushort unsigned short
#define ga_int int
#define ga_uint unsigned int
#define ga_long long long
#define ga_ulong unsigned long long
#define ga_float float
#define ga_double double
#define ga_half ga_ushort
#define ga_size size_t
#define ga_ssize ptrdiff_t
#define load_half(p) __half2float(*(p))
#define store_half(p, v) (*(p) = __float2half_rn(v))
#define GA_DECL_SHARED_PARAM(type, name)
#define GA_DECL_SHARED_BODY(type, name) extern __shared__ type name[];
#define GA_WARP_SIZE warpSize
#line 1

KERNEL void elem(const ga_size n, GLOBAL_MEM ga_uint *a_p,  const ga_size a_offset, GLOBAL_MEM ga_uint *b_p,  const ga_size b_offset, GLOBAL_MEM ga_uint *c_p,  const ga_size c_offset) {
const ga_size idx = LDIM_0 * GID_0 + LID_0;
const ga_size numThreads = LDIM_0 * GDIM_0;
ga_size i;
GLOBAL_MEM char *tmp;

tmp = (GLOBAL_MEM char *)a_p; tmp += a_offset; a_p = (GLOBAL_MEM ga_uint *)tmp;
tmp = (GLOBAL_MEM char *)b_p; tmp += b_offset; b_p = (GLOBAL_MEM ga_uint *)tmp;
tmp = (GLOBAL_MEM char *)c_p; tmp += c_offset; c_p = (GLOBAL_MEM ga_uint *)tmp;
for (i = idx; i < n; i += numThreads) {
ga_uint a;
a = a_p[i];
ga_uint b;
b = b_p[i];
ga_uint c;
c = a + b;
c_p[i] = c;
}
}

I'm able to compile the above kernel with nvcc from the command line.
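For the record, a plausible invocation for that (the file name and -arch flag are my guesses, assuming the source above is saved as elem.cu):

```shell
# GT 650M is compute capability 3.0, hence sm_30
nvcc -arch=sm_30 -cubin elem.cu
```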

UPDATE 3: This is the generated PTX assembly of the above kernel after calling call_compiler:

//
// Generated by NVIDIA NVVM Compiler
//
// Compiler Build ID: CL-21124049
// Cuda compilation tools, release 8.0, V8.0.44
// Based on LLVM 3.4svn
//

.version 5.0
.target sm_30
.address_size 64

	// .globl	elem

.visible .entry elem(
	.param .u64 elem_param_0,
	.param .u64 elem_param_1,
	.param .u64 elem_param_2,
	.param .u64 elem_param_3,
	.param .u64 elem_param_4,
	.param .u64 elem_param_5,
	.param .u64 elem_param_6
)
{
	.reg .pred 	%p<3>;
	.reg .b32 	%r<10>;
	.reg .b64 	%rd<23>;


	ld.param.u64 	%rd8, [elem_param_0];
	ld.param.u64 	%rd9, [elem_param_1];
	ld.param.u64 	%rd10, [elem_param_2];
	ld.param.u64 	%rd11, [elem_param_3];
	ld.param.u64 	%rd12, [elem_param_4];
	ld.param.u64 	%rd13, [elem_param_5];
	ld.param.u64 	%rd14, [elem_param_6];
	mov.u32 	%r1, %ntid.x;
	mov.u32 	%r2, %ctaid.x;
	mov.u32 	%r3, %tid.x;
	mad.lo.s32 	%r4, %r2, %r1, %r3;
	cvt.u64.u32	%rd22, %r4;
	mov.u32 	%r5, %nctaid.x;
	mul.lo.s32 	%r6, %r5, %r1;
	cvt.u64.u32	%rd2, %r6;
	cvta.to.global.u64 	%rd15, %rd9;
	add.s64 	%rd3, %rd15, %rd10;
	cvta.to.global.u64 	%rd16, %rd11;
	add.s64 	%rd4, %rd16, %rd12;
	cvta.to.global.u64 	%rd17, %rd13;
	add.s64 	%rd5, %rd17, %rd14;
	setp.ge.u64	%p1, %rd22, %rd8;
	@%p1 bra 	BB0_2;

BB0_1:
	shl.b64 	%rd18, %rd22, 2;
	add.s64 	%rd19, %rd3, %rd18;
	add.s64 	%rd20, %rd4, %rd18;
	ld.global.u32 	%r7, [%rd20];
	ld.global.u32 	%r8, [%rd19];
	add.s32 	%r9, %r7, %r8;
	add.s64 	%rd21, %rd5, %rd18;
	st.global.u32 	[%rd21], %r9;
	add.s64 	%rd22, %rd22, %rd2;
	setp.lt.u64	%p2, %rd22, %rd8;
	@%p2 bra 	BB0_1;

BB0_2:
	ret;
}

I saved the file as test.ptx. When trying to assemble with ptxas:

$ nvcc --cubin test.ptx
nvcc warning : The 'compute_20', 'sm_20', and 'sm_21' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
ptxas fatal   : SM version specified by .target is higher than default SM version assumed
$ ptxas test.ptx
ptxas fatal   : SM version specified by .target is higher than default SM version assumed
$ ptxas -arch=compute_30 test.ptx
# <no error>

This is quite weird. I wonder whether it's just me or whether someone else has this problem too.

@nouiz
Member

nouiz commented Nov 22, 2016

How did you start the tests? You must specify which device to run on:

DEVICE=cuda0 make test


@khaotik
Contributor Author

khaotik commented Nov 22, 2016

@nouiz I'm already doing that. My command line is:

env DEVICE=cuda optirun ./check_elemwise

optirun enables the GPU via bumblebee, which I think is unrelated, since the old backend works.

@abergeron
Member

Most often, NO_BINARY_FOR_GPU means that there is a syntax error or some other problem while compiling.

The C tests want a full device (as in DEVICE=cuda0, not DEVICE=cuda).

Also, if you use the distribution-packaged drivers, they tend not to work properly with CUDA.

To see what the real problem is, we should switch from cuModuleLoadData to cuModuleLoadDataEx and use the additional parameters to get the build logs.
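The driver API supports this via cuModuleLoadDataEx with the CU_JIT_*_LOG_BUFFER options. A minimal sketch of a hypothetical wrapper (load_module_with_log is my name, not libgpuarray's):

```c
/* Sketch: load a module and capture the PTX JIT's build logs so
 * failures like CUDA_ERROR_NO_BINARY_FOR_GPU come with an explanation.
 * Hypothetical wrapper, not the actual libgpuarray code. */
#include <cuda.h>
#include <stdio.h>

static CUresult load_module_with_log(CUmodule *mod, const void *image) {
    char error_log[4096] = "";
    char info_log[4096] = "";
    CUjit_option opts[] = {
        CU_JIT_ERROR_LOG_BUFFER,
        CU_JIT_ERROR_LOG_BUFFER_SIZE_BYTES,
        CU_JIT_INFO_LOG_BUFFER,
        CU_JIT_INFO_LOG_BUFFER_SIZE_BYTES,
    };
    /* Buffer sizes are passed by value, cast through size_t to void *. */
    void *vals[] = {
        error_log, (void *)(size_t)sizeof(error_log),
        info_log,  (void *)(size_t)sizeof(info_log),
    };
    CUresult err = cuModuleLoadDataEx(mod, image, 4, opts, vals);
    if (err != CUDA_SUCCESS)
        fprintf(stderr, "cuModuleLoadDataEx failed (%d): %s\n",
                (int)err, error_log);
    return err;
}
```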

@khaotik
Contributor Author

khaotik commented Nov 22, 2016

@abergeron I modified the code to use cuModuleLoadDataEx to get the error messages. It turns out I was missing libnvidia-ptxjitcompiler.so. Problem solved. Still, it would be much better if these error messages could be shown to the user.

@abergeron
Member

If you want to make a PR with your code to show those messages (when compiled with DEBUG), I would be interested.
