Skip to content
This repository has been archived by the owner on May 27, 2021. It is now read-only.

Shared memory + multiple function exits cause invalid results #4

Closed
maleadt opened this issue Sep 16, 2016 · 33 comments
Closed

Shared memory + multiple function exits cause invalid results #4

maleadt opened this issue Sep 16, 2016 · 33 comments
Assignees

Comments

@maleadt
Copy link
Member

maleadt commented Sep 16, 2016

Cause seems to be an added checkbounds, if that even makes sense.

Repro:

using CUDAdrv, CUDAnative

@target ptx function kernel(arr::Ptr{Int32})
    temp = @cuStaticSharedMem(Int32, (2, 1))
    tx = Int(threadIdx().x)

    if tx == 1
        for i = 1:2
            # THIS BREAKS STUFF: checkbounds(temp, i)
            Base.pointerset(temp.ptr, 1, i, 8)
        end
    end
    sync_threads()

    Base.pointerset(arr, Base.pointerref(temp.ptr, tx, 8), tx, 8)

    return nothing
end

dev = CuDevice(0)
ctx = CuContext(dev)

d_arr = CuArray(Int32, (2, 1))
@cuda (1,2) kernel(d_arr.ptr)
println(Array(d_arr))

destroy(ctx)

Result without checkbounds: [1; 1]. With: [1; 0].

cc @cfoket

@maleadt maleadt self-assigned this Sep 16, 2016
@maleadt maleadt added the bug label Sep 16, 2016
@maleadt
Copy link
Member Author

maleadt commented Oct 5, 2016

Not reproducible with this code anymore, but rodinia/lud.jl still fails with --check-bounds=yes probably still caused by the same underlying issue.

@maleadt
Copy link
Member Author

maleadt commented Oct 26, 2016

New repro, again using shared memory + bounds checking, but this time the invalid value is the result of __shfl_down (not touching shared memory at all):

using CUDAdrv, CUDAnative

function kernel(ptr::Ptr{Cint})
    shared = @cuStaticSharedMem(Cint, 4)

    lane = (threadIdx().x-1) % warpsize

    if lane == 0
        @boundscheck Base.checkbounds(shared, threadIdx().x)
        unsafe_store!(shared.ptr, 0, threadIdx().x)
    end

    sync_threads()

    val = shfl_down(Cint(32), 1, 4)
    if lane == 0
        unsafe_store!(ptr, val)
    end

    return
end

dev = CuDevice(0)
ctx = CuContext(dev)

gpu_val = CuArray(Cint, 1)
@cuda dev (1,4) kernel(gpu_val.ptr)
val = Array(gpu_val)[1]
println(val)

destroy(ctx)

Returns 0 with checkbounds, 32 without.

@maleadt maleadt changed the title Shared memory changes not visible for all threads Bounds checking + shared memory sometimes causes invalid results Oct 26, 2016
@maleadt
Copy link
Member Author

maleadt commented Oct 26, 2016

Managed to reduce to two sets of LLVM IR, executed using the following snippet:

using CUDAdrv, CUDAnative, LLVM

dev = CuDevice(0)
ctx = CuContext(dev)

for ir_fn in ["bug-working.ll", "bug-broken.ll"]
    gpu_val = CuArray(Cint[42])

    ir = readstring(ir_fn)
    mod = parse(LLVM.Module, ir)
    fn = "kernel"
    entry = get(functions(mod), "kernel")
    ptx = CUDAnative.mcgen(mod, entry, v"3.0")

    cuda_mod = CuModule(ptx)
    cuda_fun = CuFunction(cuda_mod, fn)

    cudacall(cuda_fun, 1, 4, (Ptr{Cint},), gpu_val.ptr)

    val = Array(gpu_val)[1]
    println(val)
end

destroy(ctx)

Working IR:

target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
target triple = "nvptx64-nvidia-cuda"

@shmem = internal addrspace(3) global [4 x i32] zeroinitializer, align 4

define void @kernel(i32*) {
top:
  %1 = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  %2 = and i32 %1, 31
  %3 = icmp eq i32 %2, 0
  br i1 %3, label %lane0_boundscheck, label %sync_shfl

lane0_boundscheck:
  %4 = icmp ugt i32 %1, 3
  br i1 %4, label %lane0_oob, label %lane0_shmem

lane0_oob:
  tail call void @llvm.trap()
  unreachable

sync_shfl:
  tail call void @llvm.nvvm.barrier0()
  %5 = tail call i32 @llvm.nvvm.shfl.down.i32(i32 32, i32 1, i32 7199)
  br i1 %3, label %lane0_writeback, label %end

lane0_shmem:
  %6 = getelementptr [4 x i32], [4 x i32] addrspace(3)* @shmem, i32 0, i32 %1
  store i32 0, i32 addrspace(3)* %6, align 8
  br label %sync_shfl

lane0_writeback:
  store i32 %5, i32* %0, align 8
  br label %end

end:
  ret void
}

declare void @llvm.trap()
declare i32 @llvm.nvvm.shfl.down.i32(i32, i32, i32)
declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()
declare void @llvm.nvvm.barrier0()

Broken IR:

target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
target triple = "nvptx64-nvidia-cuda"

@shmem = internal addrspace(3) global [4 x i32] zeroinitializer, align 4

define void @kernel(i32*) {
top:
  %1 = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  %2 = and i32 %1, 31
  %3 = icmp eq i32 %2, 0
  br i1 %3, label %lane0_boundscheck, label %sync_shfl

lane0_boundscheck:
  %4 = icmp ugt i32 %1, 3
  br i1 %4, label %lane0_oob, label %lane0_shmem

sync_shfl:
  tail call void @llvm.nvvm.barrier0()
  %5 = tail call i32 @llvm.nvvm.shfl.down.i32(i32 32, i32 1, i32 7199)
  br i1 %3, label %lane0_writeback, label %end

lane0_oob:
  tail call void @llvm.trap()
  unreachable

lane0_shmem:
  %6 = getelementptr [4 x i32], [4 x i32] addrspace(3)* @shmem, i32 0, i32 %1
  store i32 0, i32 addrspace(3)* %6, align 8
  br label %sync_shfl

lane0_writeback:
  store i32 %5, i32* %0, align 8
  br label %end

end:
  ret void
}

declare void @llvm.trap()
declare i32 @llvm.nvvm.shfl.down.i32(i32, i32, i32)
declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()
declare void @llvm.nvvm.barrier0()

That's right, the only difference between those two is the placement of the oob BB...
cc @cfoket

@maleadt
Copy link
Member Author

maleadt commented Oct 26, 2016

One layer deeper...

Working PTX:

.version 3.2
.target sm_30
.address_size 64

.visible .entry kernel(
        .param .u64 output  // single int output
)
{
        .reg .pred      %p<4>;
        .reg .b32       %r<6>;
        .reg .b64       %rd<6>;
        .shared .align 4 .b8 shmem[16];     // 4 integers
        ld.param.u64    %rd1, [output];

        // calculate lane, check if 0
        mov.u32 %r1, %tid.x;
        and.b32         %r2, %r1, 31;
        setp.ne.s32     %p1, %r2, 0;
        @%p1 bra        BB_SHFL;

        // bounds check for shmem access
        setp.lt.u32     %p2, %r1, 4;
        @%p2 bra        BB_SHMEM;
        bra.uni         BB_OOB;
BB_SHMEM:
        mul.wide.s32    %rd2, %r1, 4;
        mov.u64         %rd3, shmem;
        add.s64         %rd4, %rd3, %rd2;
        mov.u32         %r4, 0;
        st.shared.u32   [%rd4], %r4;
BB_SHFL:
        setp.eq.s32     %p3, %r2, 0;
        bar.sync        0;
        mov.u32         %r5, 32;
        shfl.down.b32 %r3, %r5, 1, 7199;
        @%p3 bra        BB_WRITEBACK;
        bra.uni         BB_END;
BB_WRITEBACK:
        cvta.to.global.u64      %rd5, %rd1;
        st.global.u32   [%rd5], %r3;
BB_END:
        ret;
BB_OOB:
        trap;
}

Broken PTX:

.version 3.2
.target sm_30
.address_size 64

.visible .entry kernel(
        .param .u64 output  // single int output
)
{
        .reg .pred      %p<4>;
        .reg .b32       %r<6>;
        .reg .b64       %rd<6>;
        .shared .align 4 .b8 shmem[16];     // 4 integers
        ld.param.u64    %rd1, [output];

        // calculate lane, check if 0
        mov.u32 %r1, %tid.x;
        and.b32         %r2, %r1, 31;
        setp.ne.s32     %p1, %r2, 0;
        @%p1 bra        BB_SHFL;

        // bounds check for shmem access
        setp.gt.u32     %p2, %r1, 3;
        @%p2 bra        BB_OOB;
        bra.uni         BB_SHMEM;
BB_SHMEM:
        mul.wide.s32    %rd2, %r1, 4;
        mov.u64         %rd3, shmem;
        add.s64         %rd4, %rd3, %rd2;
        mov.u32         %r4, 0;
        st.shared.u32   [%rd4], %r4;
BB_SHFL:
        setp.eq.s32     %p3, %r2, 0;
        bar.sync        0;
        mov.u32         %r5, 32;
        shfl.down.b32 %r3, %r5, 1, 7199;
        @%p3 bra        BB_WRITEBACK;
        bra.uni         BB_END;
BB_WRITEBACK:
        cvta.to.global.u64      %rd5, %rd1;
        st.global.u32   [%rd5], %r3;
BB_END:
        ret;
BB_OOB:
        trap;
}

Loader:

using CUDAdrv

dev = CuDevice(0)
ctx = CuContext(dev)

fn = "kernel"

for name in ["bug-working", "bug-broken"]
    gpu_val = CuArray(Cint[42])

    ptx = readstring("$name.ptx")

    cuda_mod = CuModule(ptx)
    cuda_fun = CuFunction(cuda_mod, fn)

    cudacall(cuda_fun, 1, 4, (Ptr{Cint},), gpu_val.ptr)

    val = Array(gpu_val)[1]
    println(val)
end

destroy(ctx)

Only difference: the bounds-check branch (>3 or <4):

$ diff bug-working.ptx bug-broken.ptx                                                                                                                                            *[master] 
22,24c22,24
<         setp.lt.u32     %p2, %r1, 4;
<         @%p2 bra        BB_SHMEM;
<         bra.uni         BB_OOB;
---
>         setp.gt.u32     %p2, %r1, 3;
>         @%p2 bra        BB_OOB;
>         bra.uni         BB_SHMEM;

Probably an assembler bug.

@maleadt
Copy link
Member Author

maleadt commented Oct 26, 2016

Alternative loader, using ptxas to generate a cubin (in order to play with ptxas optimization flags, but doesn't seem to matter):

using CUDAdrv

dev = CuDevice(0)
ctx = CuContext(dev)

fn = "kernel"

for name in ["kernel-working", "kernel-broken"]
    gpu_val = CuArray(Cint[42])

    run(`ptxas -arch=sm_61 -o $name.cubin $name.ptx`)

    cuda_mod = CuModule(read("$name.cubin"))
    cuda_fun = CuFunction(cuda_mod, fn)

    cudacall(cuda_fun, 1, 4, (Ptr{Cint},), gpu_val.ptr)

    val = Array(gpu_val)[1]
    println(val)
end

destroy(ctx)

@maleadt
Copy link
Member Author

maleadt commented Oct 26, 2016

Almost definitely looks like an assembler bug. See the following annotated & prettified Pascal SASS (sm_61):

Working version:

kernel:
.text.kernel:
        MOV R1, c[0x0][0x20];
        S2R R2, SR_TID.X;
        SSY `(BB_SHFL);         // push BB_SHFL on reconvergence stack

        // calculate lane, check if 0
        LOP32I.AND R0, R2, 0x1f;
        ISETP.NE.AND P0, PT, R0, RZ, PT;
    @P0 SYNC                    // not lane 0, pop BB_SHFL from reconvergence stack

        // bounds check for shmem access
        ISETP.LT.U32.AND P0, PT, R2, 0x4, PT;
    @P0 BRA `(BB_SHMEM);

//BB_OOB:
        BPT.TRAP 0x1;
        EXIT;
BB_SHMEM:
        SHL R2, R2, 0x2;
        STS [R2], RZ;
        SYNC                    // pop BB_SHFL from reconvergence stack
BB_SHFL:
        // check if lane 0
  {     ISETP.EQ.AND P0, PT, R0, RZ, PT;
        BAR.SYNC 0x0;        }
        // shuffle unconditionally
        MOV32I R0, 0x20;
        SHFL.DOWN PT, R0, R0, 0x1, 0x1c1f;
   @!P0 EXIT;                  // not lane 0, exit
//BB_WRITEBACK:
        MOV R2, c[0x0][0x140];
        MOV R3, c[0x0][0x144];
        STG.E [R2], R0;
        EXIT;
.BB_END:
        BRA `(.BB_END);

Broken version:

kernel:
.text.kernel:
        MOV R1, c[0x0][0x20];
        S2R R2, SR_TID.X;

        // calculate lane, check if 0
        LOP32I.AND R0, R2, 0x1f;
        ISETP.NE.AND P0, PT, R0, RZ, PT;
    @P0 BRA `(BB_SHFL);         // not lane 0, branch to BB_SHFL

        // bounds check for shmem access
        ISETP.GT.U32.AND P0, PT, R2, 0x3, PT;
    @P0 BRA `(BB_OOB);

//BB_SHMEM:
        SHL R2, R2, 0x2;
        STS [R2], RZ;
BB_SHFL:
        // check if lane 0
 {      ISETP.EQ.AND P0, PT, R0, RZ, PT;
        BAR.SYNC 0x0;        }
        // shuffle unconditionally
        MOV32I R0, 0x20;
        SHFL.DOWN PT, R0, R0, 0x1, 0x1c1f;
   @!P0 EXIT;                  // not lane 0, exit
//BB_WRITEBACK:
        MOV R2, c[0x0][0x140];
        MOV R3, c[0x0][0x144];
        STG.E [R2], R0;
        EXIT;
BB_OOB:
        BPT.TRAP 0x1;
        EXIT;
.L_3:
        BRA `(.L_3);
.L_18:

The broken version clearly messes up its reconvergence stack, not pushing anything on it despite multiple conditional branches (for some info on how this works, see this paper by Bialas and Strzelecki)...

@maleadt
Copy link
Member Author

maleadt commented Oct 26, 2016

And a C++ loader, for reporting purposes.

#include <stdio.h>

#include <cuda.h>

#define CHECK(err) __check(err, __FILE__, __LINE__)
inline void __check(CUresult err, const char *file, const int line) {
  if (CUDA_SUCCESS != err) {
    const char *name, *descr;
    cuGetErrorName(err, &name);
    cuGetErrorString(err, &name);
    fprintf(stderr, "CUDA error #%s: %s at %s:%i\n", name, descr, file, line);
    abort();
  }
}

int test(const char *path) {
  CUmodule mod;
  cuModuleLoad(&mod, path);

  CUfunction fun;
  CHECK(cuModuleGetFunction(&fun, mod, "kernel"));

  int *gpu_val;
  CHECK(cuMemAlloc((CUdeviceptr*) &gpu_val, sizeof(int)));

  void *args[1] = {&gpu_val};
  cuLaunchKernel(fun, 1, 1, 1, 4, 1, 1, 0, NULL, args, NULL);

  int val;
  CHECK(cuMemcpyDtoH(&val, (CUdeviceptr) gpu_val, sizeof(int)));

  CHECK(cuModuleUnload(mod));

  return val;
}

int main() {
  CHECK(cuInit(0));

  CUdevice dev;
  CHECK(cuDeviceGet(&dev, 0));

  CUcontext ctx;
  CHECK(cuCtxCreate(&ctx, 0, dev));

  printf("working: %d\n", test("kernel-working.ptx"));
  printf("broken: %d\n", test("kernel-broken.ptx"));

  CHECK(cuCtxDestroy(ctx));

  return 0;
}

Will probably submit this to NVIDIA soon, unless anybody still spots us doing something wrong.

@maleadt
Copy link
Member Author

maleadt commented Oct 27, 2016

Reported this repro to NVIDIA, bug #1833004. Will disable bounds checking for the time being.

maleadt added a commit that referenced this issue Oct 27, 2016
@vchuravy
Copy link
Member

Could we fix this on the LLVM side? Any bugfix to the assembler is going to be deployed slowly.

@maleadt
Copy link
Member Author

maleadt commented Oct 27, 2016

I haven't figured out what PTX pattern exactly triggers the SASS emission bug. Probably the branch to a trap BB. I've asked NVIDIA for some background on the bug, if they deem it a bug, so I'm going to wait for them to respond before sinking more time into this.

@maleadt
Copy link
Member Author

maleadt commented Nov 10, 2016

Status update from NVIDIA:

The following items have been modified for this bug:
 - Status changed from "Open - pending review" to "Open - Fix being tested"

... but I haven't got access to their bug tracker (I'm only on its CC list), so I can't look at or ask for more details 😕

@jmaebe
Copy link

jmaebe commented Nov 10, 2016

At least you know it is in fact their fault :)

@maleadt
Copy link
Member Author

maleadt commented Jan 6, 2017

The following items have been modified for this bug:

  • Status changed from "Open - Fix being tested" to "Closed - Fixed"

No idea how / starting which version / ... though (still don't allow me access to the bug tracker).

@maleadt
Copy link
Member Author

maleadt commented Apr 27, 2017

Revisited this issue. Seems like it's still there, at least on NVIDIA driver 375.39, but I found out that it only reproduces on sm_61 hardware or newer. I haven't heard back from NVIDIA, so I don't know which driver includes the fix, and the only system with sm_61 hardware I have is locked to driver 375.39...

Anyone with sm_61 hw on more recent drivers care to test this? I've updated the repro scripts too.

@vchuravy
Copy link
Member

I only have access to sm_60, but I could test it on that.

@maleadt
Copy link
Member Author

maleadt commented Apr 27, 2017

Great! Please send me the output (verify the bug is still there), SASS files generated by ptx.jl (remove existing ones first), and the driver version. No hurry though, it's not like we can do much about it. But given some extra data points, it might be possible to re-enable bounds checking...

@vchuravy
Copy link
Member

vchuravy commented Apr 28, 2017 via email

@vchuravy vchuravy self-assigned this May 8, 2017
@maleadt
Copy link
Member Author

maleadt commented Jul 3, 2017

Bug still there on 375.66 (current long-lived).

@jlebar
Copy link

jlebar commented Oct 27, 2017

It looks like you've discovered https://bugs.llvm.org/show_bug.cgi?id=27738, or something related. Unfortunately we've gotten zero movement from nvidia on this in the ~1.5 years since we discovered it ourselves and brought it to their attention. It's possible that CUDA 9's ptxas will be better, but I don't expect a proper fix except inasmuch as "buy a Volta card and use the new sync intrinsics" is a fix.

Yours is the cleanest reduction of this bug I've seen, btw.

@jlebar
Copy link

jlebar commented Mar 29, 2018

FYI, @timshen91 is rolling out an incomplete fix for this in LLVM, and working on the full fix. He'll post details in the bug.

Empirically, the partial fix he has in hand fixes this problem for everything we've seen on our end. We'd be curious to hear if it fixes anything for you all.

@maleadt
Copy link
Member Author

maleadt commented Mar 30, 2018

Oh cool, thanks for the ping! I'll have a look about reproducing, since it's a while ago since I last looked at this. We also mentioned this issue to NVIDIA and they were going to look into giving us more info; if that happens I'll update here.

@timshen91
Copy link

The partial fix is https://reviews.llvm.org/D45008 and https://reviews.llvm.org/D45070. Once they are committed, I'll update with the revision number that needs to be sync'ed pass.

@timshen91
Copy link

Any LLVM who's revision is larger than or equal to r328885 should include my partial fix.

I tried to use 367.48 nvcc and ptxas (but with newer driver) to reproduce the bug but failed. I'll wait for @maleadt for a short period of time and see what will happen. :)

@maleadt
Copy link
Member Author

maleadt commented Apr 3, 2018

Similarly, I had to revert to 375.66, as I could not reproduce the issue on 384.111 (Debian stable BPO).

Testing on r329021, it seems like the bug is still there though (on sm_61).
I'll recreate a full non-Julia MWE here so that you can test for yourself:

working.ll:

target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
target triple = "nvptx64-nvidia-cuda"

@shmem = internal addrspace(3) global [4 x i32] zeroinitializer, align 4

define void @kernel(i32*) {
top:
  %1 = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  %2 = and i32 %1, 31
  %3 = icmp eq i32 %2, 0
  br i1 %3, label %lane0_boundscheck, label %sync_shfl

lane0_boundscheck:
  %4 = icmp ugt i32 %1, 3
  br i1 %4, label %lane0_oob, label %lane0_shmem

lane0_oob:
  tail call void @llvm.trap()
  unreachable

sync_shfl:
  tail call void @llvm.nvvm.barrier0()
  %5 = tail call i32 @llvm.nvvm.shfl.down.i32(i32 32, i32 1, i32 7199)
  br i1 %3, label %lane0_writeback, label %end

lane0_shmem:
  %6 = getelementptr [4 x i32], [4 x i32] addrspace(3)* @shmem, i32 0, i32 %1
  store i32 0, i32 addrspace(3)* %6, align 8
  br label %sync_shfl

lane0_writeback:
  store i32 %5, i32* %0, align 8
  br label %end

end:
  ret void
}

declare void @llvm.trap()
declare i32 @llvm.nvvm.shfl.down.i32(i32, i32, i32)
declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()
declare void @llvm.nvvm.barrier0()

!nvvm.annotations = !{!0}
!0 = !{void (i32*)* @kernel, !"kernel", i32 1}

broken.ll (only difference is the ordering of the lane0_oob and sync_shfl BBs):

target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
target triple = "nvptx64-nvidia-cuda"

@shmem = internal addrspace(3) global [4 x i32] zeroinitializer, align 4

define void @kernel(i32*) {
top:
  %1 = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  %2 = and i32 %1, 31
  %3 = icmp eq i32 %2, 0
  br i1 %3, label %lane0_boundscheck, label %sync_shfl

lane0_boundscheck:
  %4 = icmp ugt i32 %1, 3
  br i1 %4, label %lane0_oob, label %lane0_shmem

sync_shfl:
  tail call void @llvm.nvvm.barrier0()
  %5 = tail call i32 @llvm.nvvm.shfl.down.i32(i32 32, i32 1, i32 7199)
  br i1 %3, label %lane0_writeback, label %end

lane0_oob:
  tail call void @llvm.trap()
  unreachable

lane0_shmem:
  %6 = getelementptr [4 x i32], [4 x i32] addrspace(3)* @shmem, i32 0, i32 %1
  store i32 0, i32 addrspace(3)* %6, align 8
  br label %sync_shfl

lane0_writeback:
  store i32 %5, i32* %0, align 8
  br label %end

end:
  ret void
}

declare void @llvm.trap()
declare i32 @llvm.nvvm.shfl.down.i32(i32, i32, i32)
declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()
declare void @llvm.nvvm.barrier0()

!nvvm.annotations = !{!0}
!0 = !{void (i32*)* @kernel, !"kernel", i32 1}

Compile to PTX:

llc -mcpu=sm_30 broken.ll -o working.ptx
llc -mcpu=sm_30 broken.ll -o broken.ptx

Loader for PTX code:

#include <stdio.h>

#include <cuda.h>

#define CHECK(err) __check(err, __FILE__, __LINE__)
inline void __check(CUresult err, const char *file, const int line) {
  if (CUDA_SUCCESS != err) {
    const char *name, *descr;
    cuGetErrorName(err, &name);
    cuGetErrorString(err, &name);
    fprintf(stderr, "CUDA error #%s: %s at %s:%i\n", name, descr, file, line);
    abort();
  }
}

int test(const char *path) {
  CUmodule mod;
  CHECK(cuModuleLoad(&mod, path));

  CUfunction fun;
  CHECK(cuModuleGetFunction(&fun, mod, "kernel"));

  int *gpu_val;
  CHECK(cuMemAlloc((CUdeviceptr*) &gpu_val, sizeof(int)));

  void *args[1] = {&gpu_val};
  CHECK(cuLaunchKernel(fun, 1, 1, 1, 4, 1, 1, 0, NULL, args, NULL));

  int val;
  CHECK(cuMemcpyDtoH(&val, (CUdeviceptr) gpu_val, sizeof(int)));

  CHECK(cuModuleUnload(mod));

  return val;
}

int main() {
  CHECK(cuInit(0));

  CUdevice dev;
  CHECK(cuDeviceGet(&dev, 0));

  CUcontext ctx;
  CHECK(cuCtxCreate(&ctx, 0, dev));

  printf("working: %d\n", test("working.ptx"));
  printf("broken: %d\n", test("broken.ptx"));

  CHECK(cuCtxDestroy(ctx));

  return 0;
}

Output:

$ clang++ ptx_loader.cpp -o ptx_loader -lcuda
$ ./ptx_loader
working: 32
broken: 0

Even though the generated PTX does differ between LLVM 6.0 and LLVM ToT (but differs identically wrt. the working or broken versions):

--- working_6.0.ptx        2018-04-03 10:34:01.000000000 +0200
+++ working_ToT.ptx        2018-04-03 09:57:20.000000000 +0200
@@ -39,12 +39,12 @@
        mov.u32         %r5, 32;
        shfl.down.b32 %r3, %r5, 1, 7199;
        @%p3 bra        LBB0_5;
-// %bb.6:                               // %end
-       ret;
+       bra.uni         LBB0_6;
 LBB0_5:                                 // %lane0_writeback
        ld.param.u64    %rd2, [kernel_param_0];
        cvta.to.global.u64      %rd1, %rd2;
        st.global.u32   [%rd1], %r3;
+LBB0_6:                                 // %end
        ret;
 LBB0_2:                                 // %lane0_oob
        trap;

@jlebar
Copy link

jlebar commented Apr 3, 2018

I had to revert to 375.66, as I could not reproduce the issue on 384.111 (Debian stable BPO).

I suspect that this is because the driver contains a copy of ptxas, so changing the driver version changes the ptxas version you're using. If you compiled all the way to SASS for your GPU (dunno if your frontend does this) ahead of time using ptxas, then the driver version shouldn't matter.

I can link you to how we do this in XLA if it'd be helpful.

Will leave the analysis here to @timshen91.

@timshen91
Copy link

I also reproduced the ptxas miscompile on sm_61 with ptxas 8.0. I modified the launcher to call kernel<<<...>>>(...), and link the pre-compiled ptx into the launcher.

It looks like the lane0_oob block breaks the region structure (roughly a single-entry, single-exit set of basic blocks) of the program control flow graph (CFG). It has a trap instruction.

I attempted four different variations:
a) add a ret after trap.
b) add a bra.uni THE_RET_BLOCK after trap.
c) At ptx level, "inline" the trapping block into the predecessor(s).
d) replace the trap with a ret.

(a) and (b) attempted to fix the control flow graph (CFG) region structure, but they didn't work. Both (c) and (d) work, but I can' extract a principled heuristic from (c) or (d). Hopefully the new ptxas fixes this kind of issue(s) once for all.

@maleadt
Copy link
Member Author

maleadt commented Apr 4, 2018

I suspect that this is because the driver contains a copy of ptxas, so changing the driver version changes the ptxas version you're using.

Yeah, I've been deliberately using the driver for this because I assume it to be faster than having to call ptxas (we generate code at run-time, so we care about compiler performance). But with issues like this one, https://github.com/JuliaGPU/CUDAnative.jl/issues/165 (device support of the driver's embedded ptxas not matching that of CUDA's ptxas, despite reporting the same version), and the fact that its not possible to probe the embedded compiler's version in order to work around or guard against bugs like this one, maybe I should consider the manual approach.

It has a trap instruction.

Right, I assume this breaks the structured CFG requirement. I'll just avoid emitting trap for now, thanks for looking into alternatives though.

By the way, any suggestions on similar fatal error reporting mechanisms? trap isn't ideal, both because of this issue, and because it leaves CUDA in an unrecoverable state.
I guess XLA doesn't require such functionality though.

@jlebar
Copy link

jlebar commented Apr 4, 2018

By the way, any suggestions on similar fatal error reporting mechanisms? trap isn't ideal, both because of this issue, and because it leaves CUDA in an unrecoverable state. I guess XLA doesn't require such functionality though.

XLA doesn't require this functionality at the moment, but we have talked about adding an assert/trap instruction to XLA. Our idea for implementing it was to use a global variable. Which is ugly for sure. But I'm not sure how to do the global variable and prevent future kernels from running. That's really what trap is for. I guess we could dereference a null pointer or something, although who knows what ptxas will do when it sees that. :-/

@maleadt
Copy link
Member Author

maleadt commented Aug 29, 2018

Pretty sure I just ran into another occurrence of this bug:

using CUDAnative, CUDAdrv

function cpu(input)
    output = Vector{Cint}(2)
    
    for i in 1:2
        output[i] = input[1]
    end
    
    return output
end

function kernel(input, output, n)
    i = threadIdx().x

    temp = @cuStaticSharedMem(Cint, 1)
    if i == 1
        1 <= n || ccall("llvm.trap", llvmcall, Cvoid, ())
        temp[1] = input
    end
    sync_threads()

    i <= n || ccall("llvm.trap", llvmcall, Cvoid, ())
    unsafe_store!(output, temp[1], i)
end

function gpu(input)
    output_gpu = Mem.alloc(Cint, 2)

    @cuda threads=2 kernel(input, convert(Ptr{eltype(input)}, output_gpu.ptr), 42)

    return Mem.download(Cint, output_gpu, 2)
end


using Test

function main()
    input = rand(Cint(1):Cint(100))

    @test cpu(input) == gpu(input)
end

kernel copies input[1] to output[1] and output[2], but doesn't produce the correct results when those calls to llvm.trap are present (even though the branches are never taken). This is on 396.54 with sm_35.
The following PTX is generated:

//
// Generated by LLVM NVPTX Back-End
//

.version 6.0
.target sm_35
.address_size 64

// shmem1 has been demoted
                                        // @ptxcall_kernel_1
.visible .entry ptxcall_kernel_1(
        .param .u32 ptxcall_kernel_1_param_0,
        .param .u64 ptxcall_kernel_1_param_1,
        .param .u64 ptxcall_kernel_1_param_2
)
{
        .reg .pred      %p<2>;
        .reg .b32       %r<7>;
        .reg .b64       %rd<4>;
        // demoted variable
        .shared .align 16 .b8 shmem1[4];
// %bb.0:                               // %entry
        ld.param.u64    %rd1, [ptxcall_kernel_1_param_1];
        mov.u32         %r1, %tid.x;
        setp.ne.s32     %p1, %r1, 0;
        @%p1 bra        LBB0_2;
// %bb.1:                               // %L19.i
        ld.param.u32    %r2, [ptxcall_kernel_1_param_0];
        st.shared.u32   [shmem1], %r2;
LBB0_2:                                 // %julia_kernel_37106.exit
        bar.sync        0;
        ld.shared.u32   %r3, [shmem1];
        mul.wide.u32    %rd2, %r1, 4;
        add.s64         %rd3, %rd1, %rd2;
        shr.u32         %r4, %r3, 24;
        st.u8   [%rd3+3], %r4;
        shr.u32         %r5, %r3, 16;
        st.u8   [%rd3+2], %r5;
        shr.u32         %r6, %r3, 8;
        st.u8   [%rd3+1], %r6;
        st.u8   [%rd3], %r3;
        ret;
}
                                        // -- End function
//
// Generated by LLVM NVPTX Back-End
//

.version 6.0
.target sm_35
.address_size 64

// shmem1 has been demoted
                                        // @ptxcall_kernel_1
.visible .entry ptxcall_kernel_1(
        .param .u32 ptxcall_kernel_1_param_0,
        .param .u64 ptxcall_kernel_1_param_1,
        .param .u64 ptxcall_kernel_1_param_2
)
{
        .reg .pred      %p<4>;
        .reg .b32       %r<7>;
        .reg .b64       %rd<6>;
        // demoted variable
        .shared .align 16 .b8 shmem1[4];
// %bb.0:                               // %entry
        ld.param.u64    %rd3, [ptxcall_kernel_1_param_2];
        mov.u32         %r1, %tid.x;
        setp.ne.s32     %p1, %r1, 0;
        @%p1 bra        LBB0_4;
// %bb.1:                               // %L19.i
        setp.gt.s64     %p2, %rd3, 0;
        @%p2 bra        LBB0_3;
        bra.uni         LBB0_2;
LBB0_3:                                 // %L23.i
        ld.param.u32    %r2, [ptxcall_kernel_1_param_0];
        st.shared.u32   [shmem1], %r2;
LBB0_4:                                 // %L40.i
        cvt.u64.u32     %rd1, %r1;
        bar.sync        0;
        setp.lt.s64     %p3, %rd1, %rd3;
        @%p3 bra        LBB0_6;
        bra.uni         LBB0_5;
LBB0_6:                                 // %julia_kernel_37106.exit
        ld.param.u64    %rd2, [ptxcall_kernel_1_param_1];
        ld.shared.u32   %r3, [shmem1];
        shl.b64         %rd4, %rd1, 2;
        add.s64         %rd5, %rd2, %rd4;
        shr.u32         %r4, %r3, 24;
        st.u8   [%rd5+3], %r4;
        shr.u32         %r5, %r3, 16;
        st.u8   [%rd5+2], %r5;
        shr.u32         %r6, %r3, 8;
        st.u8   [%rd5+1], %r6;
        st.u8   [%rd5], %r3;
        ret;
LBB0_5:                                 // %L44.i
        trap;
LBB0_2:                                 // %L22.i
        trap;
}
                                        // -- End function

Was going to reduce this further (SASS, C++ loader) but these PTX files now seems to hang in both cuModuleLoad and ptxas, not sure what's going on there, but this has cost me enough time already.

@maleadt maleadt changed the title Bounds checking + shared memory sometimes causes invalid results Shared memory + multiple function exits cause invalid results Aug 29, 2018
@jlebar
Copy link

jlebar commented Sep 10, 2018

Ugh, these ptxas bugs are the worst. :(

The ptx LLVM is generating here does not look particularly well-structurized to me, though. It's conceivable that better structurization in LLVM would resolve this. I think @timshen91 had been hoping that the current amount of structurization we apply (really, iirc, it's that we turned off passes that would make the graph less structured) would be sufficient, but maybe you're proving that's not the case.

@timshen91
Copy link

Hi @maleadt,

Do you have the LLVM IR, and possibly the set of LLVM flags used to generate the ptx?

@maleadt
Copy link
Member Author

maleadt commented Sep 10, 2018

Sure. I'll dump as much relevant info as possible. The original high-level source code is as follows:

function kernel(input::Int32, output::Ptr{Int32}, yes::Bool=true)
    i = threadIdx().x

    temp = @cuStaticSharedMem(Cint, 1)
    if i == 1
        yes || no()
        temp[1] = input
    end
    sync_threads()

    yes || no()
    unsafe_store!(output, temp[1], i)
end

function no()
    ccall("llvm.trap", llvmcall, Cvoid, ())
end

That is compiled to the following LLVM IR:

; ModuleID = 'KernelWrapper'
source_filename = "KernelWrapper"
target triple = "nvptx64-nvidia-cuda"

%jl_value_t = type opaque

@shmem1 = addrspace(3) global [1 x i32] zeroinitializer, align 16

define i64 @julia_kernel_36616(i32, i64, i8) local_unnamed_addr {
top:
  %3 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x(), !range !2
  %4 = icmp eq i32 %3, 0
  %5 = and i8 %2, 1
  br i1 %4, label %L17, label %L27

L17:                                              ; preds = %top
  %6 = icmp eq i8 %5, 0
  br i1 %6, label %L19, label %L22

L19:                                              ; preds = %L17
  call void @llvm.trap()
  unreachable

L22:                                              ; preds = %L17
  store i32 %0, i32 addrspace(3)* getelementptr inbounds ([1 x i32], [1 x i32] addrspace(3)* @shmem1, i64 0, i64 0), align 16, !tbaa !3
  br label %L27

L27:                                              ; preds = %top, %L22
  call void @llvm.nvvm.barrier0()
  %7 = icmp eq i8 %5, 0
  br i1 %7, label %L30, label %L33

L30:                                              ; preds = %L27
  call void @llvm.trap()
  unreachable

L33:                                              ; preds = %L27
  %8 = load i32, i32 addrspace(3)* getelementptr inbounds ([1 x i32], [1 x i32] addrspace(3)* @shmem1, i64 0, i64 0), align 16, !tbaa !3
  %9 = zext i32 %3 to i64
  %10 = inttoptr i64 %1 to i32*
  %11 = getelementptr inbounds i32, i32* %10, i64 %9
  store i32 %8, i32* %11, align 1, !tbaa !6
  ret i64 %1
}

; Function Attrs: noreturn nounwind
declare void @llvm.trap() #0

; Function Attrs: convergent nounwind
declare void @llvm.nvvm.barrier0() #1

; Function Attrs: nounwind readnone
declare i32 @llvm.nvvm.read.ptx.sreg.tid.x() #2

define void @ptxcall_kernel_1(i32, i64, i8) local_unnamed_addr {
entry:
  %3 = call i64 @julia_kernel_36616(i32 %0, i64 %1, i8 %2)
  ret void
}

attributes #0 = { noreturn nounwind }
attributes #1 = { convergent nounwind }
attributes #2 = { nounwind readnone }
attributes #3 = { allocsize(1) }

!llvm.module.flags = !{!0}
!nvvm.annotations = !{!1}

!0 = !{i32 1, !"Debug Info Version", i32 3}
!1 = !{void (i32, i64, i8)* @ptxcall_kernel_1, !"kernel", i32 1}
!2 = !{i32 0, i32 1023}
!3 = !{!4, !4, i64 0, i64 0}
!4 = !{!"ptxtbaa_shared", !5, i64 0}
!5 = !{!"ptxtbaa"}
!6 = !{!7, !7, i64 0}
!7 = !{!"jtbaa_data", !8, i64 0}
!8 = !{!"jtbaa"}

Which in turn generates the following PTX:

//
// Generated by LLVM NVPTX Back-End
//

.version 6.0
.target sm_35
.address_size 64

.visible .shared .align 16 .b8 shmem1[4];
                                        // @julia_kernel_36783
.visible .func  (.param .b64 func_retval0) julia_kernel_36783(
    .param .b32 julia_kernel_36783_param_0,
    .param .b64 julia_kernel_36783_param_1,
    .param .b32 julia_kernel_36783_param_2
)
{
    .reg .pred  %p<4>;
    .reg .b16   %rs<3>;
    .reg .b32   %r<7>;
    .reg .b64   %rd<4>;

// %bb.0:                               // %top
    mov.u32     %r1, %tid.x;
    setp.ne.s32     %p1, %r1, 0;
    ld.param.u8     %rs2, [julia_kernel_36783_param_2];
    and.b16     %rs1, %rs2, 1;
    @%p1 bra    LBB0_4;
// %bb.1:                               // %L17
    setp.ne.s16     %p2, %rs1, 0;
    @%p2 bra    LBB0_3;
    bra.uni     LBB0_2;
LBB0_3:                                 // %L22
    ld.param.u32    %r2, [julia_kernel_36783_param_0];
    st.shared.u32   [shmem1], %r2;
LBB0_4:                                 // %L27
    bar.sync    0;
    setp.ne.s16     %p3, %rs1, 0;
    @%p3 bra    LBB0_6;
    bra.uni     LBB0_5;
LBB0_6:                                 // %L33
    ld.param.u64    %rd1, [julia_kernel_36783_param_1];
    ld.shared.u32   %r3, [shmem1];
    mul.wide.u32    %rd2, %r1, 4;
    add.s64     %rd3, %rd1, %rd2;
    shr.u32     %r4, %r3, 24;
    st.u8   [%rd3+3], %r4;
    shr.u32     %r5, %r3, 16;
    st.u8   [%rd3+2], %r5;
    shr.u32     %r6, %r3, 8;
    st.u8   [%rd3+1], %r6;
    st.u8   [%rd3], %r3;
    st.param.b64    [func_retval0+0], %rd1;
    ret;
LBB0_5:                                 // %L30
    trap;
LBB0_2:                                 // %L19
    trap;
}
                                        // -- End function
    // .globl   ptxcall_kernel_2 // -- Begin function ptxcall_kernel_2
.visible .entry ptxcall_kernel_2(
    .param .u32 ptxcall_kernel_2_param_0,
    .param .u64 ptxcall_kernel_2_param_1,
    .param .u8 ptxcall_kernel_2_param_2
)                                       // @ptxcall_kernel_2
{
    .reg .b32   %r<3>;
    .reg .b64   %rd<3>;

// %bb.0:                               // %entry
    ld.param.u32    %r1, [ptxcall_kernel_2_param_0];
    ld.param.u64    %rd1, [ptxcall_kernel_2_param_1];
    ld.param.u8     %r2, [ptxcall_kernel_2_param_2];
    { // callseq 3, 0
    .reg .b32 temp_param_reg;
    .param .b32 param0;
    st.param.b32    [param0+0], %r1;
    .param .b64 param1;
    st.param.b64    [param1+0], %rd1;
    .param .b32 param2;
    st.param.b32    [param2+0], %r2;
    .param .b64 retval0;
    call.uni (retval0), 
    julia_kernel_36783, 
    (
    param0, 
    param1, 
    param2
    );
    ld.param.b64    %rd2, [retval0+0];
    } // callseq 3
    ret;
}
                                        // -- End function

This code generates broken SASS:

code for sm_35
    Function : ptxcall_kernel_3
.headerflags    @"EF_CUDA_SM35 EF_CUDA_PTX_SM(EF_CUDA_SM35)"
                                                                           
    /*0008*/                   MOV R1, c[0x0][0x44];                       
    /*0010*/                   MOV R0, c[0x0][0x150];                      
    /*0018*/                   LOP32I.AND R0, R0, 0xff;                    
    /*0020*/                   CAL 0x38;                                   
    /*0028*/                   MOV RZ, RZ;                                 
    /*0030*/                   EXIT;                                       
    /*0038*/                   S2R R2, SR_TID.X;                           
                                                                           
    /*0048*/                   BFE.U32 R0, R0, 0x800;                      
    /*0050*/                   SSY 0xc0;                                   
    /*0058*/                   ISETP.NE.AND P0, PT, R2, RZ, PT;            
    /*0060*/                   LOP32I.AND R0, R0, 0x1;                     
    /*0068*/               @P0 BRA 0x98;                                   
    /*0070*/                   ISETP.NE.AND P0, PT, R0, RZ, PT;            
    /*0078*/              @!P0 NOP.S;                                      
                                                                           
    /*0088*/                   MOV R3, c[0x0][0x140];                      
    /*0090*/                   STS [RZ], R3;                               
    /*0098*/                   ISETP.NE.AND P0, PT, R0, RZ, PT;            
    /*00a0*/                   BAR.SYNC 0x0;                               
    /*00a8*/               @P0 BRA 0xd8;                                   
    /*00b0*/                   BPT.TRAP 0x1;                               
    /*00b8*/                   NOP.S;                                      
                                                                           
    /*00c8*/                   BPT.TRAP 0x1;                               
    /*00d0*/                   RET;                                        
    /*00d8*/                   LDS R0, [RZ];                               
    /*00e0*/                   ISCADD R5.CC, R2, c[0x0][0x148], 0x2;       
    /*00e8*/                   MOV32I R3, 0x4;                             
    /*00f0*/                   IMAD.U32.U32.HI.X R3, R2, R3, c[0x0][0x14c];
    /*00f8*/                   MOV R2, R5;                                 
                                                                           
    /*0108*/                   PRMT R4, RZ, 0x7, R0;                       
    /*0110*/                   PRMT R5, RZ, 0x76, R0;                      
    /*0118*/                   PRMT R6, RZ, 0x765, R0;                     
    /*0120*/                   PRMT R7, R0, 0x7610, R7;                    
    /*0128*/                   ST.E.U8 [R2+0x3], R4;                       
    /*0130*/                   ST.E.U8 [R2+0x2], R5;                       
    /*0138*/                   ST.E.U8 [R2+0x1], R6;                       
                                                                           
    /*0148*/                   ST.E.U8 [R2], R7;                           
    /*0150*/                   RET;                                        
    /*0158*/                   BRA 0x158;                                  
    /*0160*/                   NOP;                                        
    /*0168*/                   NOP;                                        
    /*0170*/                   NOP;                                        
    /*0178*/                   NOP;                                        

Interestingly, changing the kernel wrapper to pass a literal 1 (the result of invoking this kernel with 2 arguments, having the yes parameter default to true) results in this 1 embedded in the LLVM IR and PTX (which didn't influence optimization since we don't inline at the LLVM level):

define void @ptxcall_kernel_1(i32, i64) local_unnamed_addr {
entry:
  %2 = call i64 @julia_kernel_36630(i32 %0, i64 %1, i8 1), !dbg !65
  ret void
}
    // .globl   ptxcall_kernel_2 // -- Begin function ptxcall_kernel_2
.visible .entry ptxcall_kernel_2(
    .param .u32 ptxcall_kernel_2_param_0,
    .param .u64 ptxcall_kernel_2_param_1
)                                       // @ptxcall_kernel_2
{
    .reg .b32   %r<3>;
    .reg .b64   %rd<3>;

// %bb.0:                               // %entry
    ld.param.u32    %r1, [ptxcall_kernel_2_param_0];
    ld.param.u64    %rd1, [ptxcall_kernel_2_param_1];
    mov.u32     %r2, 1;
    { // callseq 0, 0
    .reg .b32 temp_param_reg;
    .param .b32 param0;
    st.param.b32    [param0+0], %r1;
    .param .b64 param1;
    st.param.b64    [param1+0], %rd1;
    .param .b32 param2;
    st.param.b32    [param2+0], %r2;
    .param .b64 retval0;
    call.uni (retval0), 
    julia_kernel_36676, 
    (
    param0, 
    param1, 
    param2
    );
    ld.param.b64    %rd2, [retval0+0];
    } // callseq 0
    ret;
}
                                        // -- End function

ptxas works with this info and now generates valid SASS:

code for sm_35
    Function : ptxcall_kernel_3
.headerflags    @"EF_CUDA_SM35 EF_CUDA_PTX_SM(EF_CUDA_SM35)"
                                                                           
    /*0008*/                   MOV R1, c[0x0][0x44];                       
    /*0010*/                   CAL 0x28;                                   
    /*0018*/                   MOV RZ, RZ;                                 
    /*0020*/                   EXIT;                                       
    /*0028*/                   S2R R2, SR_TID.X;                           
    /*0030*/                   MOV32I R3, 0x4;                             
    /*0038*/                   ISETP.NE.AND P0, PT, R2, RZ, PT;            
                                                                           
    /*0048*/              @!P0 MOV R6, c[0x0][0x140];                      
    /*0050*/                   ISCADD R4.CC, R2, c[0x0][0x148], 0x2;       
    /*0058*/              @!P0 STS [RZ], R6;                               
    /*0060*/                   IMAD.U32.U32.HI.X R3, R2, R3, c[0x0][0x14c];
    /*0068*/                   BAR.SYNC 0x0;                               
    /*0070*/                   LDS R0, [RZ];                               
    /*0078*/                   MOV R2, R4;                                 
                                                                           
    /*0088*/                   PRMT R4, RZ, 0x7, R0;                       
    /*0090*/                   PRMT R5, RZ, 0x76, R0;                      
    /*0098*/                   PRMT R6, RZ, 0x765, R0;                     
    /*00a0*/                   PRMT R7, R0, 0x7610, R7;                    
    /*00a8*/                   ST.E.U8 [R2+0x3], R4;                       
    /*00b0*/                   ST.E.U8 [R2+0x2], R5;                       
    /*00b8*/                   ST.E.U8 [R2+0x1], R6;                       
                                                                           
    /*00c8*/                   ST.E.U8 [R2], R7;                           
    /*00d0*/                   RET;                                        
    /*00d8*/                   BRA 0xd8;                                   
    /*00e0*/                   NOP;                                        
    /*00e8*/                   NOP;                                        
    /*00f0*/                   NOP;                                        
    /*00f8*/                   NOP;                                        

You actually don't need the trap, a call to vprintf + unreachable seems sufficient to end up with "unsupported" control flow:

@0 = internal unnamed_addr constant [27 x i8] c"go home ptxas you're drunk\00"

define i64 @julia_kernel_36648(i32, i64, i8) local_unnamed_addr {
top:
  %3 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x(), !range !2
  %4 = icmp eq i32 %3, 0
  %5 = and i8 %2, 1
  br i1 %4, label %L17, label %L30

L17:                                              ; preds = %top
  %6 = icmp eq i8 %5, 0
  br i1 %6, label %L19, label %L25

L19:                                              ; preds = %L17
  %7 = call i32 @vprintf(i8* getelementptr inbounds ([27 x i8], [27 x i8]* @0, i64 0, i64 0), i8* null)
  unreachable

L25:                                              ; preds = %L17
  store i32 %0, i32 addrspace(3)* getelementptr inbounds ([1 x i32], [1 x i32] addrspace(3)* @shmem1, i64 0, i64 0), align 16, !tbaa !3
  br label %L30

L30:                                              ; preds = %top, %L25
  call void @llvm.nvvm.barrier0()
  %8 = icmp eq i8 %5, 0
  br i1 %8, label %L33, label %L39

L33:                                              ; preds = %L30
  %9 = call i32 @vprintf(i8* getelementptr inbounds ([27 x i8], [27 x i8]* @0, i64 0, i64 0), i8* null)
  unreachable

L39:                                              ; preds = %L30
  %10 = load i32, i32 addrspace(3)* getelementptr inbounds ([1 x i32], [1 x i32] addrspace(3)* @shmem1, i64 0, i64 0), align 16, !tbaa !3
  %11 = zext i32 %3 to i64
  %12 = inttoptr i64 %1 to i32*
  %13 = getelementptr inbounds i32, i32* %12, i64 %11
  store i32 %10, i32* %13, align 1, !tbaa !6
  ret i64 %1
}

All this is done with LLVM 6.0, with quite some patches but none specific to NVPXT. What LLVM flags are relevant here? PTX MC target is initialized with only a ISA flag set , targeting sm_35 in this case. I also set --nvptx-fma-level=1.

@maleadt
Copy link
Member Author

maleadt commented Nov 8, 2018

Observations after having lost some more time on this:

  1. exit behaves differently than trap or brkpt: replacing calls to llvm.trap with inline asm "exit;" yielded invalid control flow post ptxas, while trap or brkpt didn't.
  2. the bug seems particularly sensitive to thread-divergent branches rather than multiple function exits. however, many of the examples here used trap or otherwise unreachable code, in which case LLVM often restructured the CFG to contain these invalid branches.
  3. instances of the bug can sometimes be spotted with cuda-memcheck --tool=synccheck (if the code used synchronization of course), printing Barrier error detected. Divergent thread(s) in warp

Currently trying out some fairly horrible transformations that replace llvm.trap with an inline assembly equivalent, and replace other sources of unreachable with branches to whatever's close and hopefully synchronizing or not thread-divergent.

It seems to work OK and passes our fairly comprehensive tests, some of which consistently fail without these transformations (toolkit v10, driver v410.57). Closing this for now, I don't think we can do much better (apart from improving the transformations / moving to LLVM / convincing NVIDIA to fix their stuff).

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

5 participants