
improved product iterator #22989

Merged · 3 commits · Aug 7, 2017

Conversation

@Jutho (Contributor) commented Jul 27, 2017:

This is basically a complete replacement of the implementation of the product iterator in Base.Iterators. It is more efficient and remains type stable for a larger number of iterators: the current product iterator starts allocating for 6 or more iterators, whereas the implementation in this PR remains type stable, and therefore allocation-free, for up to 14 iterators.
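To make the recursion-over-tuples idea concrete, here is a minimal sketch (not this PR's actual code; prodcollect and _prodcollect are hypothetical names) of computing a product by peeling one iterator at a time off a tuple:

# Base case: the product of no iterators contains exactly one element, the empty tuple.
_prodcollect(::Tuple{}) = ((),)
# Recursive case: pair each element of the first iterator with each element
# of the product of the remaining iterators.
_prodcollect(t::Tuple) =
    [(x, rest...) for x in t[1] for rest in _prodcollect(Base.tail(t))]
prodcollect(iters...) = _prodcollect(iters)

prodcollect(1:2, 'a':'b')  # [(1,'a'), (1,'b'), (2,'a'), (2,'b')]
# (note: Iterators.product cycles the first iterator fastest, so its order differs)

Because the recursion is over the tuple type, the compiler can unroll it completely, which is what keeps the real iterator type stable.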

In the benchmark below, I also compare to the more specialised, and therefore more efficient, CartesianRange iterator. The implementation in this PR seems almost as efficient for the case where all iterators are ranges, so CartesianRange could perhaps become a disguised product iterator that wraps the output tuple in a CartesianIndex.
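A sketch of that idea, as a hypothetical helper built only from what already exists:

cartesian(rs::AbstractUnitRange...) =
    (CartesianIndex(t) for t in Base.Iterators.product(rs...))  # yields CartesianIndex instead of tuples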

I think @JeffBezanson implemented the current product iterator in master, and @timholy did most of the CartesianRange work.

Benchmark:

using BenchmarkTools
using Base.Iterators: product

@noinline dosomething(x) = x  # prevent the loop body from being optimized away

function mycount(itr)
    n = 0
    for x in itr
        dosomething(x)
        n += 1
    end
    return n
end

times = Vector{Any}(14)  # uninitialized length-14 vector (Julia 0.6-era constructor)
for N = 1:14
    iter = product(ntuple(n -> 1:4, Val(N))...)  # or iter = CartesianRange(...)
    times[N] = @benchmark mycount($iter)
end

and results in

N    PR product           CartesianRange       master product
1    Trial(8.665 ns)      Trial(7.146 ns)      Trial(7.766 ns)
2    Trial(93.655 ns)     Trial(83.618 ns)     Trial(88.639 ns)
3    Trial(367.870 ns)    Trial(345.069 ns)    Trial(700.325 ns)
4    Trial(1.774 μs)      Trial(1.718 μs)      Trial(4.662 μs)
5    Trial(7.556 μs)      Trial(6.608 μs)      Trial(26.906 μs)
6    Trial(35.315 μs)     Trial(32.732 μs)     Trial(15.894 ms)
7    Trial(142.293 μs)    Trial(130.331 μs)    Trial(64.895 ms)
8    Trial(612.727 μs)    Trial(576.491 μs)    Trial(279.602 ms)
9    Trial(2.643 ms)      Trial(2.326 ms)      Trial(1.231 s)
10   Trial(11.177 ms)     Trial(10.756 ms)     Trial(4.361 s)
11   Trial(54.838 ms)     Trial(47.823 ms)     Trial(19.058 s)
12   Trial(245.494 ms)    Trial(234.108 ms)    ???
13   Trial(1.095 s)       Trial(949.563 ms)    ???
14   Trial(5.247 s)       Trial(4.466 s)       ???

@Jutho (Contributor, Author) commented Jul 27, 2017:

Also, fewer lines of code.

And this version supports product(), which produces a single empty tuple () and has size(product()) == (), the equivalent of a zero-dimensional array.
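Concretely, assuming the semantics just described:

size(Base.Iterators.product())     # (): like a zero-dimensional array
collect(Base.Iterators.product())  # a single element: the empty tuple ()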

@JeffBezanson (Member):

Awesome! What a nice surprise. A better product iterator is always on my wish list.

end
size(P::ProductIterator) = _prod_size(P.iterators)
_prod_size(::Tuple{}) = ()
_prod_size(t::Tuple) = tuple(_prod_size1(t[1], iteratorsize(t[1]))..., _prod_size(tail(t))...)
Review comment (Member):

tuple is not necessary; you can just use parens.
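That is, splatting into a tuple literal already builds a tuple:

tuple(1, (2, 3)...) === (1, (2, 3)...)  # true; the parens suffice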

return (state, tailstates...), (val, tailvals...)
end
_prod_next(iterators::Tuple{}, states::Tuple{}, values::Tuple{}) = true, (), ()
function _prod_next(iterators, states, values)
Review comment (Member):

I assume this is getting inlined? Kind of amazing that it doesn't need @inline. I might add it just to be safe.

@Jutho (Contributor, Author), Jul 28, 2017:

I will test again because it was quite late yesterday evening, but I think I benchmarked with and without @inline and didn't see a difference.

@timholy (Member), Jul 28, 2017:

This is one of the differences between the new inliner and the old one: yes, this is a lot of statements, but each one of them essentially boils down to a couple of intrinsics/primitives taking very few CPU cycles and hence is "cheap" by the metric used in the new inliner.

We've long had a special inlining bonus for functions like next; I kept that and it may be contributing here, but I'm not actually sure it's necessary (or even helpful) anymore.

Review comment (Member):

Oh, duh, it's definitely not helping here because this function is not named next. This is pretty good evidence we could take that bonus out and use the "pure" algorithm.

Review comment (Contributor, Author):

I don't know the details of how the inliner works, but is there a limit to the inlining of recursive definitions? In this case, the code of next for the product of 14 iterators is already gigantic, though it still seems more efficient than with an explicit @noinline, as indicated by my follow-up benchmark in the main discussion.

However, in my own project I have another iterator, somewhat similar to product but where the different iterators are coupled in a tree structure, such that the actual nth iterator depends on the state of the previous n-1. In that case, I noticed that with explicit @inline the compilation time became huge for N larger than 10 or so, so I decided to get rid of these explicit @inlines and trust the inliner. That's why I didn't use explicit @inline in this PR either.

@timholy (Member), Jul 28, 2017:

> I don't know the details of how the inliner works, but is there a limit to the inlining of recursive definitions?

Yes, but it falls out of the general algorithm. Writeup is here. Perhaps the most succinct statement of the overall design is in NEWS.

@ararslan added the labels collections (Data structures holding multiple items, e.g. sets) and performance (Must go faster) on Jul 27, 2017
@JeffBezanson (Member):

Interesting 32-bit AV failure.

@JeffBezanson (Member):

cc @yuyichao

@iamed2 (Contributor) commented Jul 28, 2017:

Awesome! I think this will make IterTools.product obsolete.

@@ -4,7 +4,7 @@ module Iterators

 import Base: start, done, next, isempty, length, size, eltype, iteratorsize, iteratoreltype, indices, ndims

-using Base: tuple_type_cons, SizeUnknown, HasLength, HasShape, IsInfinite, EltypeUnknown, HasEltype, OneTo, @propagate_inbounds
+using Base: tail, tuple_type_head, tuple_type_tail, tuple_type_cons, SizeUnknown, HasLength, HasShape, IsInfinite, EltypeUnknown, HasEltype, OneTo, @propagate_inbounds
Review comment (Contributor):

line wrap

Review comment (Contributor, Author):

Thanks, I will fix this in the next commit, after having received advice on the two questions I have just posted.

@Jutho (Contributor, Author) commented Jul 28, 2017:

@JeffBezanson: @code_typed on start or next shows that the definitions are indeed automatically inlined. Here is another benchmark (different machine), comparing against explicitly adding @noinline to the definitions on lines 665 and 670 (_prod_start and _prod_next).

N    Master product       CartesianRange       PR product           PR product + @noinline
1    Trial(8.764 ns)      Trial(7.996 ns)      Trial(9.532 ns)      Trial(28.084 ns)
2    Trial(117.982 ns)    Trial(89.971 ns)     Trial(95.366 ns)     Trial(283.031 ns)
3    Trial(715.138 ns)    Trial(369.146 ns)    Trial(391.787 ns)    Trial(1.091 μs)
4    Trial(6.382 μs)      Trial(1.610 μs)      Trial(1.928 μs)      Trial(4.985 μs)
5    Trial(24.553 μs)     Trial(6.697 μs)      Trial(7.370 μs)      Trial(25.849 μs)
6    Trial(18.614 ms)     Trial(28.126 μs)     Trial(31.573 μs)     Trial(116.198 μs)
7    Trial(82.078 ms)     Trial(119.982 μs)    Trial(137.337 μs)    Trial(370.166 μs)
8    Trial(346.659 ms)    Trial(579.314 μs)    Trial(614.885 μs)    Trial(1.498 ms)
9    Trial(1.417 s)       Trial(2.389 ms)      Trial(2.715 ms)      Trial(6.720 ms)
10   Trial(5.322 s)       Trial(9.921 ms)      Trial(18.652 ms)     Trial(28.896 ms)
11   Trial(22.612 s)      Trial(72.069 ms)     Trial(50.371 ms)     Trial(119.649 ms)
12   Trial(92.437 s)      Trial(212.206 ms)    Trial(207.817 ms)    Trial(513.239 ms)
13   #undef               Trial(850.373 ms)    Trial(941.750 ms)    Trial(2.060 s)
14   #undef               Trial(4.012 s)       Trial(4.170 s)       Trial(9.121 s)
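For reference, a quick way to spot-check the inlining claim (hypothetical session; requires this PR's definitions to be loaded):

iter = Base.Iterators.product(1:4, 1:4, 1:4)
@code_typed start(iter)              # no remaining calls to _prod_start
@code_typed next(iter, start(iter))  # likewise, no remaining calls to _prod_next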

@Jutho (Contributor, Author) commented Jul 28, 2017:

Two questions:

  1. Is the 32-bit failure due to this PR?
  2. For a product of HasLength() iterators, iteratorsize of the product iterator is HasShape() (with this PR, even for the zero-dimensional case of no iterators). The only exception is the product of a single iterator, where iteratorsize of the product is just that of its single item, and thus HasLength(). I made it HasShape() consistently at first, but that violates a test. However, I find this somewhat ambiguous: for consistency, I would expect the product of a couple of plain simple iterators to be HasShape(), even if there is just a single one. Is there any difference between an iterator with HasLength() and one with HasShape() whose size is just one-dimensional, (d,)?
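For context, the practical difference being asked about: HasShape() carries a size that collect uses to preserve dimensionality, while HasLength() only promises a length. A quick illustration:

using Base: iteratorsize
iteratorsize(Base.Iterators.product(1:3, 1:2))  # HasShape()
collect(Base.Iterators.product(1:3, 1:2))       # a 3×2 matrix of tuples, thanks to size
length(Base.Iterators.product(1:3, 1:2))        # 6; all a HasLength() iterator would offer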

@iamed2 (Contributor) commented Jul 28, 2017:

With HasShape() does that mean e.g. collect(product((1,2), (3,))) == [1 3; 2 3]?

@Jutho (Contributor, Author) commented Jul 28, 2017:

Not quite what you write, but that's already the case on master:

collect(Base.Iterators.product((1,2),(3,)))
-> 2×1 Array{Tuple{Int64,Int64},2}:
 (1, 3)
 (2, 3)

@JeffBezanson (Member):

  1. Probably not due to this PR. Even if this PR somehow exposes it, it's not at fault, since all-Julia code shouldn't cause segfaults.
  2. Making it HasShape() should be harmless; let's try making that change.

@KristofferC (Member):

Might as well @nanosoldier runbenchmarks(ALL, vs = ":master")

@Jutho (Contributor, Author) commented Jul 28, 2017:

Thanks @JeffBezanson for the response. Implementing your answer to question 2 actually amounts to a code reduction, namely removing line 609 (iteratorsize(::Type{ProductIterator{Tuple{I}}}) where {I}).

Another question before preparing the next commit: currently, I use tuple_type_head and tuple_type_tail to compute properties in the type domain:

iteratorsize(::Type{ProductIterator{T}}) where {T<:Tuple} =
    prod_iteratorsize( iteratorsize(tuple_type_head(T)), iteratorsize(ProductIterator{tuple_type_tail(T)}) )

Could this also be written using the @pure macro, e.g. something like

@pure iteratorsize(::Type{ProductIterator{T}}) where {T<:Tuple} = 
    mapreduce(iteratorsize, prod_iteratorsize, T.parameters)

However, this does not seem to make iteratorsize type stable. I still have to learn how to use the @pure macro correctly. Why does it not work here, and is there a better approach, or is the current style (using tuple_type_head and tuple_type_tail) the preferred one?

@JeffBezanson (Member):

tuple_type_head and tuple_type_tail are definitely better. Even knowing T<:Tuple does not imply that T.parameters will work, since T might be a Union or UnionAll type.
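An illustration of that caveat, as a hypothetical REPL session:

T = Tuple{S,S} where S
T <: Tuple        # true
T.parameters      # ERROR: type UnionAll has no field parameters

whereas the tuple_type_* helpers are written to cope with such types.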

@nanosoldier (Collaborator):

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

@KristofferC (Member):

Could be worth looking into the spellcheck benchmark?

@Jutho (Contributor, Author) commented Jul 30, 2017:

Yes, there are indeed more allocations in that test, also when I run the benchmarks locally. I will try to understand how it is affected by product, and why in the wrong direction.

@ararslan (Member) commented Aug 2, 2017:

@nanosoldier runbenchmarks(ALL, vs=":master")

@Jutho (Contributor, Author) commented Aug 2, 2017:

Thanks for restarting the benchmarks.

The implementation has changed quite a bit in the meantime, and is not yet fully finalized or cleaned up. In particular:

  • I am no longer using a custom struct for the iterator state, but only tuples with Nullables. The reason is that if any of the states or values of the individual iterators was not isbits, the whole state suffered from this in the old implementation and created allocations. That was causing the slowdown in the spellcheck benchmark, where the values are strings.
  • I am treating the first iterator differently, so that its value is generated upon calling next and does not need to be stored in the state. This further reduces allocations if it is not isbits. (See the sketch after this list.)
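A minimal illustration of the machinery referred to above (not the PR's exact state layout): the cached values of the trailing iterators live in a tuple of Nullables, read with unsafe_get only when the iterator is known not to be done:

nvalues = (Nullable(2), Nullable("b"))  # hypothetical cached values of the trailing iterators
map(isnull, nvalues)                    # (false, false), so safe to read
map(unsafe_get, nvalues)                # (2, "b")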

And one more comment: I think I came across a bug in Julia (or LLVM) code generation/optimization; see my comment in the code for details. I will test some more and file a separate issue if I am convinced of my case.

iter1 = first(iterators)
value1, state1 = next(iter1, states[1])
tailstates = tail(states)
values = (value1, map(unsafe_get, state[3])...) # safe if not done(P, state)
Review comment (Contributor, Author):

state[3] should be equal to nvalues here. However, I came across a bug where, if I write nvalues here, it seems like it is taking the "value" of nvalues after it has been overwritten in the next block of code. In particular, in the final state before being done, none of the elements in nvalues will be isnull, but then, after the next block of code, they are.

Put differently, this function produced a different answer when I simply called it versus when I ran it line by line in the REPL. Or, put yet differently: if you replace state[3] by nvalues in this place, which should be a valid replacement, the tests fail outside of an actual test. I will analyse further and file a more complete bug report if I am convinced of my case.

@nanosoldier (Collaborator):

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

@Jutho (Contributor, Author) commented Aug 4, 2017:

I think the benchmarks are OK now? The ["scalar","cos"] benchmark seems to be an anomaly; I cannot reproduce it locally. Sometimes another test out of that group differs, but these are elementary cos calculations in which iterators are not involved, and whose timings are very short and probably fluctuate a lot.

The ["random","ranges",...] benchmark does indeed allocate 4 more bytes (on a total of 420 bytes). I don't see immediately where product iterator is involved, but the effect on the run time is actually beneficial (though probably negligible).

All the other differences seem to be improvements.

@StefanKarpinski (Member):

LGTM, but I'll leave it to @JeffBezanson to pull the trigger on this one.

@Jutho (Contributor, Author) commented Aug 4, 2017:

That's great, but please hold off merging until I have further investigated the issue I described in my comment in the code.

@Jutho (Contributor, Author) commented Aug 4, 2017:

OK, I am not able to reproduce it on my desktop computer; maybe it was something in my environment. I've pushed the last commit restoring the code to how it was before the error (also removing the superfluous use of tuple).

@Jutho (Contributor, Author) commented Aug 4, 2017:

I don't know what happened with AppVeyor. On Travis, 64-bit Linux errored on an ARPACK failure in a test of eigs, and 32-bit Linux timed out (even though it seems to have actually finished successfully).

Regarding the eigs error, which is definitely unrelated to this PR:

(eigs(speye(50), nev=10))[1] ≈ ones(10)

seems prone to give errors. Trying to build a Krylov subspace with an identity matrix is asking for trouble, as every new vector will be identical to the previous one and therefore become zero after orthogonalization. I'm not saying that ARPACK shouldn't account for that, but it can be rather buggy software, and this test will essentially trigger that behaviour.
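The degenerate case in one computation: for A = I, the candidate Krylov vector A*v is v itself, so orthogonalizing it against the existing basis annihilates it:

v = rand(50); v /= norm(v)   # a normalized starting vector
w = v - dot(v, v) * v        # orthogonalize A*v == v against v
norm(w)                      # 0.0 up to roundoff: the Krylov subspace never grows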

I would recommend using another diagonal matrix as a test.

@JeffBezanson (Member):

How does this version do on the original benchmarks in #22989 (comment)? No need to post numbers; I just want to double-check that the story is roughly the same.

@Jutho (Contributor, Author) commented Aug 7, 2017:

Roughly the same; maybe a slight regression with respect to CartesianRange, but certainly still type stable and usable up to N = 14. This is the comparison with CartesianRange (on a different machine than the previous benchmarks):

N    PR product           CartesianRange
1    Trial(9.278 ns)      Trial(7.994 ns)
2    Trial(113.876 ns)    Trial(89.971 ns)
3    Trial(391.584 ns)    Trial(369.146 ns)
4    Trial(1.842 μs)      Trial(1.938 μs)
5    Trial(9.488 μs)      Trial(6.697 μs)
6    Trial(43.086 μs)     Trial(28.883 μs)
7    Trial(184.241 μs)    Trial(123.112 μs)
8    Trial(740.347 μs)    Trial(545.719 μs)
9    Trial(4.873 ms)      Trial(2.255 ms)
10   Trial(15.256 ms)     Trial(9.922 ms)
11   Trial(70.886 ms)     Trial(73.226 ms)
12   Trial(332.448 ms)    Trial(208.451 ms)
13   Trial(1.324 s)       Trial(851.376 ms)
14   Trial(5.863 s)       Trial(3.862 s)

Unfortunately, trying to run the above benchmark on the latest master yields:

ERROR: syntax: invalid syntax (escape (call (outerref mycount) ##iter1#776))

which I guess is an issue with BenchmarkTools.jl.

@JeffBezanson JeffBezanson merged commit 66a505d into JuliaLang:master Aug 7, 2017
@Jutho (Contributor, Author) commented Aug 7, 2017:

Thanks!
