Simplify rewriting of unreachable blocks. #277

maleadt · 2018-11-05T11:56:10Z

No description provided.

maleadt · 2018-11-06T08:36:33Z

@vchuravy could you try this PR to see how it affects register pressure on your application?

I've switched approaches: instead of removing the trap and trying to replace the subsequent unreachable with regular control flow, I'm now calling exit again, while trying to hide that control flow from both LLVM and ptxas (by reading from a global variable). Hopefully that confuses ptxas enough that it doesn't mess up the control flow.

maleadt · 2018-11-06T08:49:55Z

This is what it looks like:

julia> kernel(a) = (a[1]=1; nothing)
kernel (generic function with 1 method)

julia> CUDAnative.code_llvm(kernel, Tuple{CuDeviceArray{Int,2,AS.Global}})

define void @julia_kernel_1({ [2 x i64], { i64 } } addrspace(11)* nocapture nonnull readonly dereferenceable(24)) local_unnamed_addr {
top:
  %1 = getelementptr { [2 x i64], { i64 } }, { [2 x i64], { i64 } } addrspace(11)* %0, i64 0, i32 0, i64 0
  %2 = getelementptr { [2 x i64], { i64 } }, { [2 x i64], { i64 } } addrspace(11)* %0, i64 0, i32 0, i64 1
  %3 = load i64, i64 addrspace(11)* %1, align 8, !tbaa !1, !invariant.load !4
  %4 = load i64, i64 addrspace(11)* %2, align 8, !tbaa !1, !invariant.load !4
  %5 = mul i64 %4, %3
  %6 = icmp slt i64 %5, 1
  br i1 %6, label %L14, label %L17

L14:                                              ; preds = %top
  call fastcc void @ptx_throw_boundserror()
  call fastcc void @opaque_exit()
  br label %opaque_unreachable

L17:                                              ; preds = %top
  %7 = getelementptr inbounds { [2 x i64], { i64 } }, { [2 x i64], { i64 } } addrspace(11)* %0, i64 0, i32 1, i32 0
  %8 = bitcast i64 addrspace(11)* %7 to i64* addrspace(11)*
  %9 = load i64*, i64* addrspace(11)* %8, align 8, !tbaa !1, !invariant.load !4
  %10 = addrspacecast i64* %9 to i64 addrspace(1)*
  store i64 1, i64 addrspace(1)* %10, align 8, !tbaa !5
  ret void

opaque_unreachable:                                ; preds = %L14, %opaque_unreachable
  br label %opaque_unreachable
}

define internal fastcc void @opaque_exit() unnamed_addr {
entry:
  %0 = icmp eq i32 1, 1
  br i1 %0, label %trap.preheader, label %loop.preheader

loop.preheader:                                   ; preds = %entry
  br label %loop

trap.preheader:                                   ; preds = %entry
  br label %trap

trap:                                             ; preds = %trap.preheader, %trap
  call void asm sideeffect "exit;", ""() #0
  br label %trap

loop:                                             ; preds = %loop.preheader, %loop
  br label %loop
}

opaque_exit performs an exit without ptxas (hopefully) realizing it, and opaque_unreachable is an infinite loop that hides the unreachable from LLVM.

vchuravy · 2018-11-06T15:22:17Z

This is looking good. It reduced the register pressure even more (by another 6 registers, if I recall correctly) in the kernels I was looking at.

The only snag I hit was ptxas complaining about:

┌ Info: kernel configuration
│   N = 2
│   threads = 100
│   blocks = 1
│   CUDAnative.maxthreads(kernel) = 1024
│   CUDAnative.registers(kernel) = 62
└   CUDAnative.memory(kernel) = (local = 248, shared = 0, constant = 0)
ptxas warning : Unresolved extern variable 'breaker_of_controlflow' in whole program compilation, ignoring extern qualifier

maleadt · 2018-11-06T15:30:33Z

The following code now fails on this branch:

using Test
using Random
using CuArrays
using GPUArrays
using CUDAnative

function main()
    @eval CUDAnative globalUnique=0
    empty!(CUDAnative.compilecache)
    Random.seed!(0)

    A = rand(1:10, 100)
    @show cpu = mapreduce(identity, +, A)

    dA = CuArray(A)
    @show gpu = mapreduce(identity, +, dA)

    @test cpu ≈ gpu
end

function Base.mapreduce(f::Function, op::Function, A::CuArray{T, N}) where {T, N}
    OT = Int
    v0 = 0

    out = CuArray{OT,1}(undef, 1)
    @cuda threads=64 reduce_kernel(f, op, v0, A, out)
    Array(out)[1]
end

function reduce_kernel(f, op, v0::T, A, result) where {T}
    tmp_local = @cuStaticSharedMem(T, 64)
    acc = v0

    # Loop sequentially over chunks of input vector
    i = threadIdx().x
    while i <= length(A)
        element = f(A[i])
        acc = op(acc, element)
        i += blockDim().x
    end

    # Perform parallel reduction
    @inbounds tmp_local[threadIdx().x] = acc
    sync_threads()

    offset = blockDim().x ÷ 2
    while offset > 0
        @inbounds if threadIdx().x <= offset
            other = tmp_local[(threadIdx().x - 1) + offset + 1]
            mine = tmp_local[threadIdx().x]
            tmp_local[threadIdx().x] = op(mine, other)
        end
        sync_threads()
        offset = offset ÷ 2
    end

    if threadIdx().x == 1
        result[blockIdx().x] = @inbounds tmp_local[1]
    end

    return
end

Eerily similar to #4. It even fails without the call to exit, but works when the unreachable transform is not applied. So either my transformation is wrong, or ptxas also fails on divergent branches...
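For reference, here is a CPU-only sketch of what reduce_kernel is expected to compute for a single block, which can help when checking whether a miscompiled kernel returns a wrong sum. It is not part of this PR: the names reduce_model and nthreads are hypothetical, nthreads stands in for blockDim().x and is assumed to be a power of two, and the simulated threads run sequentially.

```julia
# CPU model of the single-block reduction done by `reduce_kernel`.
# `reduce_model` and `nthreads` are illustrative names, not CUDAnative API.
function reduce_model(f, op, v0, A; nthreads::Int = 64)
    tmp_local = fill(v0, nthreads)   # stands in for the shared-memory buffer

    # Phase 1: each simulated thread accumulates a strided slice of the input.
    for tid in 1:nthreads
        acc = v0
        i = tid
        while i <= length(A)
            acc = op(acc, f(A[i]))
            i += nthreads
        end
        tmp_local[tid] = acc
    end

    # Phase 2: tree reduction. Finishing the inner loop before halving
    # `offset` plays the role of sync_threads() in the kernel.
    offset = nthreads ÷ 2
    while offset > 0
        for tid in 1:offset
            tmp_local[tid] = op(tmp_local[tid], tmp_local[tid + offset])
        end
        offset ÷= 2
    end

    return tmp_local[1]
end

A = collect(1:100)
@assert reduce_model(identity, +, 0, A) == sum(A)
```

In the real kernel every thread has to reach each sync_threads() call, so a thread that leaves the kernel early through a rewritten error path could plausibly break phase 2; the sequential model above has no such hazard.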

maleadt · 2018-11-06T15:54:43Z

Running under cuda-memcheck --tool synccheck sometimes reports "Barrier error detected. Divergent thread(s) in warp".

maleadt · 2018-11-08T14:10:04Z

Turns out we can leave the exit/trap in place, and ptxas isn't confused as long as there is no thread-divergent control flow! That makes rewriting the control flow easier: it doesn't need to be valid or non-looping, because by then we will already have trapped the GPU.

@vchuravy could you have another look at how this impacts register usage?

EDIT: aw crap this still causes test failures. argh.

maleadt · 2018-11-08T14:40:03Z

OK, works again when using trap instead of exit...
Also passes CuArrays tests, so that's looking good.

maleadt · 2018-11-08T15:26:37Z

Compared to current master, this gives almost identical register usage, except for the mapreduce kernel, where it is slightly lower. So this seems good to go.

maleadt mentioned this pull request Nov 5, 2018

Warning: unreachable control flow with multiple predecessors #275

Closed

maleadt force-pushed the tb/nothrow branch from bf78951 to 7c2deb8 Compare November 5, 2018 12:02

Simplify rewriting of unreachable blocks.

cbadebd

maleadt force-pushed the tb/nothrow branch from 7c2deb8 to cbadebd Compare November 5, 2018 12:58

maleadt added 2 commits November 5, 2018 13:59

Dump both optimized and unoptimized LLVM IR.

8d69eb2

Postpone rewriting terminators to avoid analysis confusion.

e7b1ee1

vchuravy mentioned this pull request Nov 5, 2018

Unhandled IR pattern in throw rewriting #258

Closed

Overhaul approach: try to hide control flow.

f8be7dd

maleadt force-pushed the tb/nothrow branch from c7b5655 to f8be7dd Compare November 6, 2018 08:34

maleadt added 4 commits November 6, 2018 10:35

Clean-ups.

9099bb6

Use externally-initialized memory (that defaults to 0).

d9c3639

Comments.

24eacbc

split pass.

cd737ff

maleadt force-pushed the tb/nothrow branch from 71c666c to cd737ff Compare November 6, 2018 10:06

Fix print.

31c8bc1

maleadt added 3 commits November 8, 2018 12:46

Fix error message.

815b0aa

Go back to redirecting control flow.

103753e

Restructure and update docs.

f9c24b3

maleadt added 3 commits November 8, 2018 15:20

Apparently exit behaves differently than trap......

cc3be63

Change pass order.

bc65295

Add tests.

3ad64a5

maleadt force-pushed the tb/nothrow branch from 74901d1 to c1d7b71 Compare November 8, 2018 14:37

maleadt mentioned this pull request Nov 8, 2018

Shared memory + multiple function exits cause invalid results #4

Closed

Fix comment.

cea3ef9

maleadt force-pushed the tb/nothrow branch from c1d7b71 to cea3ef9 Compare November 8, 2018 15:02

maleadt added bug codegen labels Nov 8, 2018

maleadt merged commit 87c8b5d into master Nov 8, 2018

maleadt deleted the tb/nothrow branch November 8, 2018 15:26
