julep: extended iteration API, with proof-of-principle for fixing performance of Cartesian iteration #16878
There is much compiler technology to handle polyhedral loop nests, in particular two nested loops traversing a 2d array. I think it is impossible to generate the same structure (in terms of basic blocks) with a single loop by instead using more complex conditions on which counter to increment. It might be possible in theory, but I don't think compiler optimizers are smart enough to recognize this. (I tried with templates in C++.) Instead (in C++), I ended up writing recursive code, where an […]

In Julia, we could write an LLVM pass that detects a "[…]". Otherwise, it is probably simpler to modify how […]

[Half-cooked thoughts follow.] In Julia 0.5, we can use closures for this, giving iteration control to the iterator, thus foregoing the […]

If I understand this correctly, instead of […], […] translates syntactically (!) to […] where […]
Yes, your understanding is correct: […]
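A minimal sketch of closure-driven, recursive iteration in this spirit, assuming a hypothetical `foreach_index` helper (the name and the exact translation are assumptions, not code from this thread):

```julia
# Hypothetical helper: the iterator drives the loop and calls the body (a
# closure) once per index. Internally it recurses over dimensions, so the
# compiler sees an ordinary nest of counted loops with dimension 1 innermost.
foreach_index(f, A::AbstractArray) = _foreach_index(f, size(A), ())
_foreach_index(f, ::Tuple{}, tail) = f(CartesianIndex(tail))
function _foreach_index(f, sz::Tuple, tail)
    for i in 1:sz[end]
        _foreach_index(f, Base.front(sz), (i, tail...))
    end
end

# Usage: the loop body becomes a do-block closure over `s` and `A`.
function sumcart_closure(A)
    s = zero(eltype(A))
    foreach_index(A) do I
        @inbounds s += A[I]
    end
    return s
end
```

The recursion depth is the (static) number of dimensions, so it can be fully unrolled into a plain loop nest rather than a single loop with carry logic.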
I tested this a long time ago with non-scalar indexing. It was quite a bit slower back then, but I've not tried it since. https://gist.github.com/mbauman/6434fff5f793cf0414559fa554822578
Did you check the generated assembler code? In my case, it was very similar to a hand-written Fortran loop nest.
@eschnett, if you have some code it would be great to post it for comparison.
Yeah, it's still a little slower. I think there's some splatting allocations happening somewhere.
In my top post I seem to have missed some optimization opportunities by not being aggressive enough with […]
Ah, but now the concern is that when you don't have […]
Maybe related to #16753: are Julia booleans not getting optimized like they should?
Closing in favor of #18823.
This proposes a change in our iteration API (refs #9182, #9178, #6125) that narrows the gap on #9080 (which seems to have gotten worse), following the convoluted process of discovery in #16035.

Let's suppose this function:

[…]

got expanded to something like this (note: I'm passing in the iterator as an argument so I can compare performance of existing and new implementations):

[…]
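A minimal sketch of the kind of expansion being proposed, assuming `maybe_done(iter, state)` is a cheap check that must return `true` whenever iteration is finished but is also allowed to return `true` spuriously, with the full `done(iter, state)` consulted only in that case (the helper name `sumcart_iter_split` and the exact expansion are assumptions, not the proposal's literal code):

```julia
# Sketch: sum over an iterator using a split termination test.
# `maybe_done` is the cheap per-iteration branch; the (possibly expensive)
# `done` runs only when the cheap check fires.
function sumcart_iter_split(iter, A)
    s = zero(eltype(A))
    state = start(iter)
    done(iter, state) && return s
    while true
        item, state = next(iter, state)
        @inbounds s += A[item]
        if maybe_done(iter, state)      # cheap check, usually false
            done(iter, state) && break  # full check, rarely reached
        end
    end
    return s
end
```

If `maybe_done` is defined to always return `true`, this structure reduces to the behavior of the usual `while !done(iter, state)` loop.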
To avoid breakage of current code, we need the following fallback definitions:
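One plausible shape for such fallbacks (a guess consistent with the `StackOverflow`-vs-`MethodError` remark below, not necessarily the definitions actually proposed): each function defaults to the other, so existing iterators that only define `done` keep working unchanged.

```julia
# Hypothetical fallbacks (these would live in Base; shown here for illustration):
maybe_done(iter, state) = done(iter, state)   # old iterators: cheap check = full check
done(iter, state) = maybe_done(iter, state)   # mutual fallback: if neither method is
                                              # defined, these recurse until the stack overflows
```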
The only downside of this that I see is that if no valid definition of `done` exists, you now get a `StackOverflow` rather than a `MethodError`. Aside from that, this appears to be non-breaking. If the compiler knows that `maybe_done` always returns `true` for `iter`, then note that the code above simplifies to […]

An advantage of this more symmetric interface is that either `next` or `done` can increment the `state`; therefore, this subsumes the `nextval`/`nextstate` split first proposed in #6125. But unlike the implementation in #9182, thanks to `maybe_done` this also introduces a key advantage for #9080: it ensures that typically (when dimension 1 has size larger than 1) there's only one branch per iteration, using two or more branches only when the "carry" operation needs to be performed.

Proof that this helps #9080 (with the previous definitions all loaded):

[…]
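As a self-contained 2-d illustration of the split termination test, here is a hypothetical `Cart2` iterator (not the actual `CartesianRange` methods from the proposal; in this sketch the carry still happens inside `next` rather than in `done`, so it only illustrates the cheap/full check split, not the full one-branch-per-iteration scheme):

```julia
# Hypothetical 2-d iterator over the indices of an m×n array, written for the
# split protocol sketched above. The state is an (i, j) tuple.
# (For simplicity this sketch assumes m >= 1 and n >= 1.)
struct Cart2
    m::Int
    n::Int
end

start(r::Cart2) = (1, 1)

function next(r::Cart2, s)
    i, j = s
    if i > r.m                       # carry deferred from the previous column
        i, j = 1, j + 1
    end
    return (CartesianIndex(i, j), (i + 1, j))
end

maybe_done(r::Cart2, s) = s[1] > r.m             # cheap: ran off the end of dim 1?
done(r::Cart2, s) = s[1] > r.m && s[2] >= r.n    # full: and it was the last column?
```

Iterating `Cart2(size(A)...)` with the `sumcart_iter_split` sketch above visits every index exactly once, and the full `done` comparison runs only once per column rather than once per element.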
with the following test script (`sumcart_manual` and `sumcart_iter` are from #9080):

[…]

with results

[…]
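For reference, the two #9080 benchmark functions are roughly of this shape (a sketch, not necessarily the exact code used here; `CartesianRange` was the name at the time, `CartesianIndices` in current Julia):

```julia
# Hand-written 2-d loop nest versus Cartesian iteration.
function sumcart_manual(A::AbstractMatrix)
    s = 0.0
    @inbounds for j = 1:size(A, 2), i = 1:size(A, 1)
        s += A[i, j]
    end
    return s
end

function sumcart_iter(A)
    s = 0.0
    @inbounds for I in CartesianRange(size(A))
        s += A[I]
    end
    return s
end
```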
It's awesome that this is almost 3x faster than our current scheme. It stinks that it's still not as good as the manual version. I profiled it, and almost all the time is spent on `@inbounds s += A[item]`. Other than the possibility that somehow the CPU isn't as good at cache prefetch with this code as with traditional for-loop code, I'm at a loss to explain the gap.

Note that currently we have one other way of circumventing #9080: add `@simd`. Even when this doesn't actually turn on vectorization, the `@simd` macro splits out the cartesian iterator into "inner index" and "remaining index," and thus achieves parity with the manual version (i.e., is better than this julep) even without vectorization. Now, `@simd` is limited, so this julep is still attractive, but I would feel stronger about pushing for it if we could achieve parity.
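For concreteness, the `@simd` workaround mentioned above looks like this (again using the 2016-era `CartesianRange` name):

```julia
# Existing workaround: @simd on a Cartesian loop splits the iteration into an
# inner 1-d loop over dimension 1 plus an outer loop over the remaining
# dimensions, even when no actual SIMD vectorization happens.
function sumcart_simd(A)
    s = 0.0
    @inbounds @simd for I in CartesianRange(size(A))
        s += A[I]
    end
    return s
end
```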