julep: extended iteration API, with proof-of-principle for fixing performance of Cartesian iteration #16878

Closed
timholy opened this issue Jun 11, 2016 · 10 comments
Labels: julep Julia Enhancement Proposal

Comments

@timholy
Member

timholy commented Jun 11, 2016

This proposes a change in our iteration API (refs #9182, #9178, #6125) that narrows the performance gap described in #9080 (which seems to have gotten worse), following the convoluted process of discovery in #16035. Let's suppose this function:

function mysum(A)
    s = zero(eltype(A))
    for I in eachindex(A)
        @inbounds s += A[I]
    end
    s
end

got expanded to something like this (note: I'm passing in the iterator as an argument so I can compare performance of existing and new implementations):

function itergeneral(A, iter)
    # We'll sum over the elements of A that lie within bounds of iter
    s = zero(eltype(A))
    state = start(iter)
    isdone = maybe_done(iter, state)
    if isdone
        isdone, state = done(iter, state, true)  # use dummy 3rd arg to indicate 2 returns
    end
    if !isdone
        while true
            item, state = next(iter, state)
            @inbounds s += A[item]
            if maybe_done(iter, state)
                isdone, state = done(iter, state, true)
                isdone && break
            end
        end
    end
    s
end

To avoid breakage of current code, we need the following fallback definitions:

import Base: start, next, done, getindex

# fallback definitions
maybe_done(iter, state) = true
# For 3-argument code when only the 2-argument version is defined
done(iter, state, ::Bool) = done(iter, state), state
# For explicit 2-argument code when only the 3-argument version is defined
done(iter, state) = done(iter, state, true)[1]

The only downside I see is that if no valid definition of done exists, you now get a StackOverflowError rather than a MethodError. Aside from that, this appears to be non-breaking. If the compiler knows that maybe_done always returns true for iter, the code above simplifies to

function itergeneral(A, iter)
    # We'll sum over the elements of A that lie within bounds of iter
    s = zero(eltype(A))
    state = start(iter)
    isdone, state = done(iter, state, true)  # use dummy 3rd arg to indicate 2 returns
    if !isdone
        while true
            item, state = next(iter, state)
            @inbounds s += A[item]
            isdone, state = done(iter, state, true)
            isdone && break
        end
    end
    s
end

An advantage of this more symmetric interface is that either next or done can increment the state; therefore, this subsumes the nextval/nextstate split first proposed in #6125. But unlike the implementation in #9182, thanks to maybe_done this also introduces a key advantage for #9080: it ensures that typically (when dimension 1 has size larger than 1) there's only one branch per iteration, using two or more branches only when the "carry" operation needs to be performed.

Proof that this helps #9080 (with the previous definitions all loaded):

using Base: tail, @propagate_inbounds

immutable CartIdx{N}
    I::NTuple{N,Int}
    CartIdx(index::NTuple{N,Integer}) = new(index)
end
CartIdx{N}(index::NTuple{N,Integer}) = CartIdx{N}(index)
getindex(ci::CartIdx, d::Integer) = ci.I[d]
@propagate_inbounds getindex(A::AbstractArray, ci::CartIdx{2}) = A[ci.I[1], ci.I[2]]

immutable CartRange{I}
    start::I
    stop::I
end

@inline start(cr::CartRange) = cr.start
@inline next(cr::CartRange, state::CartIdx) = state, inc1(state)
@inline inc1(state) = CartIdx((state[1]+1, tail(state.I)...))  # a "dumb" incrementer
@inline maybe_done(cr::CartRange, state::CartIdx) = state[1] > cr.stop[1]
@inline function done{N}(cr::CartRange, state::CartIdx{N}, ::Bool)
    # handle the carry operation (clean up after the "dumb" incrementer)
    state = CartIdx(carry((), state.I, cr.start.I, cr.stop.I))
    state[N] > cr.stop[N], state
end
@inline carry(out, ::Tuple{}, ::Tuple{}, ::Tuple{}) = out
# the last dimension is allowed to run past `stop`; `done` checks it
@inline carry(out, s::Tuple{Int}, start::Tuple{Int}, stop::Tuple{Int}) = (out..., s[1])
@inline function carry(out, s, start, stop)
    if s[1] > stop[1]
        # wrap this dimension back to its start and propagate the carry
        return carry((out..., start[1]), (s[2]+1, tail(tail(s))...), tail(start), tail(stop))
    end
    return (out..., s...)  # no carry needed here; keep the already-processed leading dims
end

with the following test script (sumcart_manual and sumcart_iter are from #9080; a rough sketch of them is included below for reference):
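A sketch of those two functions (assuming the straightforward definitions from #9080, not copied verbatim):

function sumcart_manual(A)
    s = zero(eltype(A))
    @inbounds for j = 1:size(A,2), i = 1:size(A,1)
        s += A[i,j]
    end
    s
end

function sumcart_iter(A)
    s = zero(eltype(A))
    @inbounds for I in CartesianRange(size(A))
        s += A[I]
    end
    s
end

The script itself: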

A = rand(10^4, 10^4)
sumcart_manual(A)
sumcart_iter(A)
iter = CartRange(CartIdx((1,1)), CartIdx(size(A)))
R = CartesianRange(size(A))
itergeneral(A, R)
itergeneral(A, iter)
@time 1
@time sumcart_manual(A)
@time sumcart_iter(A)
@time itergeneral(A, R)
@time itergeneral(A, iter)

with results

  0.000003 seconds (149 allocations: 9.013 KB)
  0.106422 seconds (5 allocations: 176 bytes)
  0.127288 seconds (5 allocations: 176 bytes)
  0.228374 seconds (5 allocations: 176 bytes)
  0.106804 seconds (5 allocations: 176 bytes)

It's awesome that this is almost 3x faster than our current scheme. It stinks that it's still not as good as the manual version. I profiled it, and almost all the time is spent on @inbounds s += A[item]. Other than the possibility that somehow the CPU isn't as good at cache prefetch with this code as with traditional for-loop code, I'm at a loss to explain the gap.

Note that we currently have one other way of circumventing #9080: add @simd. Even when this doesn't actually turn on vectorization, the @simd macro splits the Cartesian iterator into an "inner index" and a "remaining index," and thus achieves parity with the manual version (i.e., is better than this julep) even without vectorization. Now, @simd is limited, so this julep is still attractive, but I would feel more strongly about pushing for it if we could achieve parity.
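For reference, a sketch of that workaround (mysum from above with @simd added; mysum_simd is a made-up name, and this assumes @simd accepts a CartesianRange loop as described above):

function mysum_simd(A)
    s = zero(eltype(A))
    @simd for I in CartesianRange(size(A))
        @inbounds s += A[I]
    end
    s
end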

@eschnett
Contributor

If I understand this correctly, instead of maybe_done and done, you could have chosen fast_done and slow_done as names, or inner_done and outer_done. I assume it is this two-level structure that allows the compiler to produce better code.

There is much compiler technology for handling polyhedral loop nests, in particular two nested loops traversing a 2d array. I don't think you can generate the same structure (in terms of basic blocks) from a single loop that instead uses more complex conditions on which counter to increment; it might be possible in theory, but compiler optimizers aren't smart enough to recognize this. (I tried with templates in C++.) Instead (in C++), I ended up writing recursive code, where an n-dimensional array is traversed via n nested function calls; after inlining, you recover the usual n nested loops.
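To illustrate the recursive idea in Julia terms (a sketch with made-up helper names, not the actual C++ code):

# Drop the last element of a tuple (small helper so the sketch is self-contained).
@inline _front(t::Tuple) = _droplast((), t...)
@inline _droplast(out, x) = out
@inline _droplast(out, x, rest...) = _droplast((out..., x), rest...)

# Innermost case: loop over dimension 1 with the trailing indices held fixed.
@inline function _sumrec(A, sz::Tuple{Int}, tailinds::Int...)
    s = zero(eltype(A))
    @inbounds for i = 1:sz[1]
        s += A[i, tailinds...]
    end
    s
end

# General case: loop over the last remaining dimension and recurse on the rest;
# after inlining this becomes an ordinary nest of ndims(A) loops.
@inline function _sumrec(A, sz::Tuple, tailinds::Int...)
    s = zero(eltype(A))
    for i = 1:sz[end]
        s += _sumrec(A, _front(sz), i, tailinds...)
    end
    s
end

sumrec(A) = _sumrec(A, size(A))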

In Julia, we could write an LLVM pass that detects an "n nested loops traversed in a single loop" pattern and transforms it into something that later LLVM passes (loop blocking, vectorization, other polyhedral optimizations) can recognize.

Otherwise, it is probably simpler to modify how for loops are translated: We can explicitly introduce "nested iterators", and translate them into a nested set of for loops.

[Half-cooked thoughts follow.] In Julia 0.5, we can use closures for this, giving iteration control to the iterator and thus forgoing the start, done, next calls that are so very tedious to implement:

for x in col
    body
end

translates syntactically (!) to

forall(col) do x
    body
end

where forall is implemented by the various collections.
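A minimal sketch of what a collection-specific forall might look like (forall is hypothetical; for a 2-d array it writes out the nested loops explicitly and hands each element to the body closure):

function forall(f, A::AbstractMatrix)
    for j = 1:size(A, 2), i = 1:size(A, 1)
        f(A[i, j])   # the collection drives iteration; the body is just a callback
    end
    nothing
end

function mysum_forall(A)
    s = zero(eltype(A))
    forall(A) do x
        s += x    # the closure captures and updates the local s
    end
    s
end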

@timholy
Member Author

timholy commented Jun 11, 2016

Yes, your understanding is correct: maybe_done is essentially inner_done.

@nloops generates nested loops, but of course you have to know the number of loops at parse time, so it effectively requires a @generated function. The idea of generating closures is interesting; it would be worth baking this a bit more to see what comes out. Of course, with #15276 it's almost certain that the large majority of cases will encounter an inference problem.
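For concreteness, a sketch of the @nloops route (the usual Base.Cartesian idiom, written here for the mysum example; sumcart_nloops is a made-up name):

using Base.Cartesian

# The number of loops must be known when the body is generated, hence @generated.
@generated function sumcart_nloops{T,N}(A::AbstractArray{T,N})
    quote
        s = zero(T)
        @inbounds @nloops $N i A begin
            s += @nref $N A i
        end
        s
    end
end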

@mbauman
Member

mbauman commented Jun 11, 2016

I ended up writing recursive code, where an n-dimensional array is traversed via n nested function calls; after inlining, you recover the usual n nested loops.

I tested this a long time ago with non-scalar indexing. It was quite a bit slower back then, but I've not tried it since. https://gist.github.com/mbauman/6434fff5f793cf0414559fa554822578

@eschnett
Contributor

Did you check the generated assembler code? In my case, it was very similar to a hand-written Fortran loop nest.

@timholy
Member Author

timholy commented Jun 11, 2016

@eschnett, if you have some code it would be great to post it for comparison.

@mbauman
Member

mbauman commented Jun 12, 2016

Yeah, it's still a little slower. I think there are some splatting allocations happening somewhere.

@timholy
Member Author

timholy commented Jun 12, 2016

In my top post I seem to have missed some optimization opportunities by not being aggressive enough with @inline. If I force inlining on every function related to iteration (maybe_done, start, next, and done), it in fact appears to completely solve #9080.
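Concretely, that means annotating the fallbacks from the top post as well, something like the following (a sketch; the exact set of annotations that was needed isn't spelled out here):

@inline maybe_done(iter, state) = true
@inline done(iter, state, ::Bool) = done(iter, state), state
@inline done(iter, state) = done(iter, state, true)[1]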

@timholy
Member Author

timholy commented Jun 12, 2016

Ah, but now the concern is that when you don't have maybe_done doing something useful (i.e., when you're relying on the fallback definitions), it's considerably slower. This suggests that LLVM isn't taking advantage of the fact that maybe_done always returns true to elide the check. Hmmm.

@tkelman
Contributor

tkelman commented Jun 12, 2016

Maybe related to #16753: are Julia booleans not getting optimized the way they should be?

@kshyatt kshyatt added the julep Julia Enhancement Proposal label Jun 14, 2016
@timholy
Member Author

timholy commented Jan 5, 2017

Closing in favor of #18823.

@timholy timholy closed this as completed Jan 5, 2017