
improved product iterator #22989

Merged · 3 commits · Aug 7, 2017

Conversation

@Jutho (Contributor) commented Jul 27, 2017:

This is basically a complete replacement of the implementation of the product iterator in Base.Iterators. It is more efficient and remains type stable for a larger number of iterators: the current product iterator starts allocating for 6 or more iterators, whereas the implementation in this PR remains type stable, and therefore allocation-free, for up to 14 iterators.
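To make the recursion-over-tuples idea concrete, here is a minimal sketch (not this PR's actual code; prodcollect and _prodcollect are hypothetical names) of computing a product by peeling one iterator at a time off a tuple:

# Base case: the product of no iterators contains exactly one element, the empty tuple.
_prodcollect(::Tuple{}) = ((),)
# Recursive case: pair each element of the first iterator with each element
# of the product of the remaining iterators.
_prodcollect(t::Tuple) =
    [(x, rest...) for x in t[1] for rest in _prodcollect(Base.tail(t))]
prodcollect(iters...) = _prodcollect(iters)

prodcollect(1:2, 'a':'b')  # [(1,'a'), (1,'b'), (2,'a'), (2,'b')]
# (note: Iterators.product cycles the first iterator fastest, so its order differs)

Because the recursion is over the tuple type, the compiler can unroll it completely, which is what keeps the real iterator type stable.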

In the benchmark below, I also compare to the more specialised, and therefore more efficient, CartesianRange iterator. The implementation in this PR seems almost as efficient for the case where all iterators are ranges, so CartesianRange could perhaps become a disguised product iterator that wraps the output tuple in a CartesianIndex.
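A sketch of that idea, as a hypothetical helper built only from what already exists:

cartesian(rs::AbstractUnitRange...) =
    (CartesianIndex(t) for t in Base.Iterators.product(rs...))  # yields CartesianIndex instead of tuples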

I think @JeffBezanson implemented the current product iterator in master, and @timholy did most of the CartesianRange work.

Benchmark:

using BenchmarkTools
using Base.Iterators: product

@noinline dosomething(x) = x  # prevent the loop body from being optimized away

function mycount(itr)
    n = 0
    for x in itr
        dosomething(x)
        n += 1
    end
    return n
end

times = Vector{Any}(14)  # uninitialized length-14 vector (Julia 0.6-era constructor)
for N = 1:14
    iter = product(ntuple(n -> 1:4, Val(N))...)  # or iter = CartesianRange(...)
    times[N] = @benchmark mycount($iter)
end

and results in

N    PR product           CartesianRange       master product
1    Trial(8.665 ns)      Trial(7.146 ns)      Trial(7.766 ns)
2    Trial(93.655 ns)     Trial(83.618 ns)     Trial(88.639 ns)
3    Trial(367.870 ns)    Trial(345.069 ns)    Trial(700.325 ns)
4    Trial(1.774 μs)      Trial(1.718 μs)      Trial(4.662 μs)
5    Trial(7.556 μs)      Trial(6.608 μs)      Trial(26.906 μs)
6    Trial(35.315 μs)     Trial(32.732 μs)     Trial(15.894 ms)
7    Trial(142.293 μs)    Trial(130.331 μs)    Trial(64.895 ms)
8    Trial(612.727 μs)    Trial(576.491 μs)    Trial(279.602 ms)
9    Trial(2.643 ms)      Trial(2.326 ms)      Trial(1.231 s)
10   Trial(11.177 ms)     Trial(10.756 ms)     Trial(4.361 s)
11   Trial(54.838 ms)     Trial(47.823 ms)     Trial(19.058 s)
12   Trial(245.494 ms)    Trial(234.108 ms)    ???
13   Trial(1.095 s)       Trial(949.563 ms)    ???
14   Trial(5.247 s)       Trial(4.466 s)       ???

@Jutho (Contributor, Author) commented Jul 27, 2017:

Also, fewer lines of code.

And this version supports product(), which produces a single empty tuple () and has size(product()) == (), the equivalent of a zero-dimensional array.
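Concretely, assuming the semantics just described:

size(Base.Iterators.product())     # (): like a zero-dimensional array
collect(Base.Iterators.product())  # a single element: the empty tuple ()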

@JeffBezanson (Member):

Awesome! What a nice surprise. A better product iterator is always on my wish list.

end
size(P::ProductIterator) = _prod_size(P.iterators)
_prod_size(::Tuple{}) = ()
_prod_size(t::Tuple) = tuple(_prod_size1(t[1], iteratorsize(t[1]))..., _prod_size(tail(t))...)
Review comment (Member):

tuple is not necessary; you can just use parens.
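That is, splatting into a tuple literal already builds a tuple:

tuple(1, (2, 3)...) === (1, (2, 3)...)  # true; the parens suffice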

return (state, tailstates...), (val, tailvals...)
end
_prod_next(iterators::Tuple{}, states::Tuple{}, values::Tuple{}) = true, (), ()
function _prod_next(iterators, states, values)
Review comment (Member):

I assume this is getting inlined? Kind of amazing that it doesn't need @inline. I might add it just to be safe.

@Jutho (Contributor, Author), Jul 28, 2017:

I will test again because it was quite late yesterday evening, but I think I benchmarked with and without @inline and didn't see a difference.

@timholy (Member), Jul 28, 2017:

This is one of the differences between the new inliner and the old one: yes, this is a lot of statements, but each one of them essentially boils down to a couple of intrinsics/primitives taking very few CPU cycles and hence is "cheap" by the metric used in the new inliner.

We've long had a special inlining bonus for functions like next; I kept that and it may be contributing here, but I'm not actually sure it's necessary (or even helpful) anymore.

Review comment (Member):

Oh, duh, it's definitely not helping here because this function is not named next. This is pretty good evidence we could take that bonus out and use the "pure" algorithm.

Review comment (Contributor, Author):

I don't know the details of how the inliner works, but is there a limit to the inlining of recursive definitions? In this case, the code of next for the product of 14 iterators is already gigantic, though it still seems more efficient than with an explicit @noinline, as indicated by my follow-up benchmark in the main discussion.

However, in my own project I have another iterator, somewhat similar to product but where the different iterators are coupled in a tree structure, such that the actual nth iterator depends on the state of the previous n-1. In that case, I noticed that with explicit @inline the compilation time became huge for N larger than 10 or so, so I decided to get rid of these explicit @inlines and trust the inliner. That's why I didn't use explicit @inline in this PR either.

@timholy (Member), Jul 28, 2017:

> I don't know the details of how the inliner works, but is there a limit to the inlining of recursive definitions?

Yes, but it falls out of the general algorithm. Writeup is here. Perhaps the most succinct statement of the overall design is in NEWS.

@ararslan added the labels collections (Data structures holding multiple items, e.g. sets) and performance (Must go faster) on Jul 27, 2017
@JeffBezanson (Member):

Interesting 32-bit AV failure.

@JeffBezanson (Member):

cc @yuyichao

@iamed2 (Contributor) commented Jul 28, 2017:

Awesome! I think this will make IterTools.product obsolete.

@@ -4,7 +4,7 @@ module Iterators

 import Base: start, done, next, isempty, length, size, eltype, iteratorsize, iteratoreltype, indices, ndims

-using Base: tuple_type_cons, SizeUnknown, HasLength, HasShape, IsInfinite, EltypeUnknown, HasEltype, OneTo, @propagate_inbounds
+using Base: tail, tuple_type_head, tuple_type_tail, tuple_type_cons, SizeUnknown, HasLength, HasShape, IsInfinite, EltypeUnknown, HasEltype, OneTo, @propagate_inbounds
Review comment (Contributor):

line wrap

Review comment (Contributor, Author):

Thanks, I will fix this in the next commit, after having received advice on the two questions I have just posted.

@Jutho (Contributor, Author) commented Jul 28, 2017:

@JeffBezanson: @code_typed on start or next shows that the definitions are indeed automatically inlined. Here is another benchmark (different machine), comparing against explicitly adding @noinline to the definitions on lines 665 and 670 (_prod_start and _prod_next).

N    Master product       CartesianRange       PR product           PR product + @noinline
1    Trial(8.764 ns)      Trial(7.996 ns)      Trial(9.532 ns)      Trial(28.084 ns)
2    Trial(117.982 ns)    Trial(89.971 ns)     Trial(95.366 ns)     Trial(283.031 ns)
3    Trial(715.138 ns)    Trial(369.146 ns)    Trial(391.787 ns)    Trial(1.091 μs)
4    Trial(6.382 μs)      Trial(1.610 μs)      Trial(1.928 μs)      Trial(4.985 μs)
5    Trial(24.553 μs)     Trial(6.697 μs)      Trial(7.370 μs)      Trial(25.849 μs)
6    Trial(18.614 ms)     Trial(28.126 μs)     Trial(31.573 μs)     Trial(116.198 μs)
7    Trial(82.078 ms)     Trial(119.982 μs)    Trial(137.337 μs)    Trial(370.166 μs)
8    Trial(346.659 ms)    Trial(579.314 μs)    Trial(614.885 μs)    Trial(1.498 ms)
9    Trial(1.417 s)       Trial(2.389 ms)      Trial(2.715 ms)      Trial(6.720 ms)
10   Trial(5.322 s)       Trial(9.921 ms)      Trial(18.652 ms)     Trial(28.896 ms)
11   Trial(22.612 s)      Trial(72.069 ms)     Trial(50.371 ms)     Trial(119.649 ms)
12   Trial(92.437 s)      Trial(212.206 ms)    Trial(207.817 ms)    Trial(513.239 ms)
13   #undef               Trial(850.373 ms)    Trial(941.750 ms)    Trial(2.060 s)
14   #undef               Trial(4.012 s)       Trial(4.170 s)       Trial(9.121 s)
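For reference, a quick way to spot-check the inlining claim (hypothetical session; requires this PR's definitions to be loaded):

iter = Base.Iterators.product(1:4, 1:4, 1:4)
@code_typed start(iter)              # no remaining calls to _prod_start
@code_typed next(iter, start(iter))  # likewise, no remaining calls to _prod_next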

@Jutho (Contributor, Author) commented Jul 28, 2017:

Two questions:

  1. Is the 32-bit failure due to this PR?
  2. For a product of HasLength() iterators, iteratorsize of the product iterator is HasShape() (with this PR, even for the zero-dimensional case of no iterators). The only exception is the product of a single iterator, where iteratorsize of the product is just that of its single item, and thus HasLength(). I made it HasShape() consistently at first, but that violates a test. However, I find this somewhat ambiguous: for consistency, I would expect the product of a couple of plain simple iterators to be HasShape(), even if there is just a single one. Is there any difference between an iterator with HasLength() and one with HasShape() whose size is just one-dimensional, (d,)?
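For context, the practical difference being asked about: HasShape() carries a size that collect uses to preserve dimensionality, while HasLength() only promises a length. A quick illustration:

using Base: iteratorsize
iteratorsize(Base.Iterators.product(1:3, 1:2))  # HasShape()
collect(Base.Iterators.product(1:3, 1:2))       # a 3×2 matrix of tuples, thanks to size
length(Base.Iterators.product(1:3, 1:2))        # 6; all a HasLength() iterator would offer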

@iamed2 (Contributor) commented Jul 28, 2017:

With HasShape() does that mean e.g. collect(product((1,2), (3,))) == [1 3; 2 3]?

@Jutho (Contributor, Author) commented Jul 28, 2017:

Not quite what you write, but that's already the case on master:

collect(Base.Iterators.product((1,2),(3,)))
-> 2×1 Array{Tuple{Int64,Int64},2}:
 (1, 3)
 (2, 3)

@JeffBezanson (Member):

  1. Probably not due to this PR. Even if this PR somehow exposes it, it's not at fault, since all-Julia code shouldn't cause segfaults.
  2. Making it HasShape() should be harmless; let's try making that change.

@KristofferC (Member):

Might as well @nanosoldier runbenchmarks(ALL, vs = ":master")

@Jutho (Contributor, Author) commented Jul 28, 2017:

Thanks @JeffBezanson for the response. Implementing your answer to question 2 actually amounts to a code reduction, namely removing line 609 (iteratorsize(::Type{ProductIterator{Tuple{I}}}) where {I}).

Another question before preparing the next commit: currently, I use tuple_type_head and tuple_type_tail to compute properties in the type domain:

iteratorsize(::Type{ProductIterator{T}}) where {T<:Tuple} =
    prod_iteratorsize( iteratorsize(tuple_type_head(T)), iteratorsize(ProductIterator{tuple_type_tail(T)}) )

Could this also be written using the @pure macro, e.g. something like

@pure iteratorsize(::Type{ProductIterator{T}}) where {T<:Tuple} = 
    mapreduce(iteratorsize, prod_iteratorsize, T.parameters)

However, this does not seem to make iteratorsize type stable. I still have to learn how to use the @pure macro correctly. Why does it not work here, and is there a better approach, or is the current style (using tuple_type_head and tuple_type_tail) the preferred one?

@JeffBezanson (Member):

tuple_type_head and tuple_type_tail are definitely better. Even knowing T<:Tuple does not imply that T.parameters will work, since T might be a Union or UnionAll type.
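An illustration of that caveat, as a hypothetical REPL session:

T = Tuple{S,S} where S
T <: Tuple        # true
T.parameters      # ERROR: type UnionAll has no field parameters

whereas the tuple_type_* helpers are written to cope with such types.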

@nanosoldier (Collaborator):

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

@KristofferC (Member):

Could be worth looking into the spellcheck benchmark?

@Jutho (Contributor, Author) commented Jul 30, 2017:

Yes, there are indeed more allocations in that test, also when I run the benchmarks locally. I will try to understand how it is affected by product, and why in the wrong direction.

@ararslan (Member) commented Aug 2, 2017:

@nanosoldier runbenchmarks(ALL, vs=":master")

@Jutho (Contributor, Author) commented Aug 2, 2017:

Thanks for restarting the benchmarks.

The implementation has changed quite a bit in the meantime, and is not yet fully finalized or cleaned up. In particular:

  • I am no longer using a custom struct for the iterator state, but only tuples with Nullables. The reason is that if any of the states or values of the individual iterators was not isbits, the whole state suffered from this in the old implementation and created allocations. That was causing the slowdown in the spellcheck benchmark, where the values are strings.
  • I am treating the first iterator differently, so that its value is generated upon calling next and does not need to be stored in the state. This further reduces allocations if it is not isbits. (See the sketch after this list.)
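A minimal illustration of the machinery referred to above (not the PR's exact state layout): the cached values of the trailing iterators live in a tuple of Nullables, read with unsafe_get only when the iterator is known not to be done:

nvalues = (Nullable(2), Nullable("b"))  # hypothetical cached values of the trailing iterators
map(isnull, nvalues)                    # (false, false), so safe to read
map(unsafe_get, nvalues)                # (2, "b")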

And one more comment: I think I came across a bug in Julia (or LLVM) code generation/optimization; see my comment in the code for details. I will test some more and file a separate issue if I am convinced of my case.

iter1 = first(iterators)
value1, state1 = next(iter1, states[1])
tailstates = tail(states)
values = (value1, map(unsafe_get, state[3])...) # safe if not done(P, state)
Review comment (Contributor, Author):

state[3] should be equal to nvalues here. However, I came across a bug where, if I write nvalues here, it seems like it is taking the "value" of nvalues after it has been overwritten in the next block of code. In particular, in the final state before being done, none of the elements in nvalues will be isnull, but then, after the next block of code, they are.

Put differently, this function produced a different answer when I simply called it versus when I ran it line by line in the REPL. Or, put yet differently: if you replace state[3] by nvalues in this place, which should be a valid replacement, the tests fail outside of an actual test. I will analyse further and file a more complete bug report if I am convinced of my case.

@nanosoldier (Collaborator):

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

@Jutho (Contributor, Author) commented Aug 4, 2017:

I think the benchmarks are OK now? The ["scalar","cos"] benchmark seems to be an anomaly; I cannot reproduce it locally. Sometimes another test out of that group differs, but these are elementary cos calculations in which iterators are not involved, and whose timings are very short and probably fluctuate a lot.

The ["random","ranges",...] benchmark does indeed allocate 4 more bytes (on a total of 420 bytes). I don't see immediately where product iterator is involved, but the effect on the run time is actually beneficial (though probably negligible).

All the other differences seem to be improvements.

@StefanKarpinski (Member):

LGTM, but I'll leave it to @JeffBezanson to pull the trigger on this one.

@Jutho (Contributor, Author) commented Aug 4, 2017:

That's great, but please hold off merging until I have further investigated the issue I described in my comment in the code.

@Jutho (Contributor, Author) commented Aug 4, 2017:

OK, I am not able to reproduce it on my desktop computer; maybe it was something in my environment. I've pushed the last commit restoring the code to how it was before the error (also removing the superfluous use of tuple).

@Jutho (Contributor, Author) commented Aug 4, 2017:

I don't know what happened with AppVeyor. On Travis, 64-bit Linux errored on an ARPACK failure in a test of eigs, and 32-bit Linux timed out (even though it seems to have actually finished successfully).

Regarding the eigs error, which is definitely unrelated to this PR:

(eigs(speye(50), nev=10))[1] ≈ ones(10)

seems prone to give errors. Trying to build a Krylov subspace with an identity matrix is asking for trouble, as every new vector will be identical to the previous one and therefore become zero after orthogonalization. I'm not saying that ARPACK shouldn't account for that, but it can be rather buggy software, and this test will essentially trigger that behaviour.
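The degenerate case in one computation: for A = I, the candidate Krylov vector A*v is v itself, so orthogonalizing it against the existing basis annihilates it:

v = rand(50); v /= norm(v)   # a normalized starting vector
w = v - dot(v, v) * v        # orthogonalize A*v == v against v
norm(w)                      # 0.0 up to roundoff: the Krylov subspace never grows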

I would recommend using another diagonal matrix as a test.

@JeffBezanson (Member):

How does this version do on the original benchmarks in #22989 (comment)? No need to post numbers; I just want to double-check that the story is roughly the same.

@Jutho (Contributor, Author) commented Aug 7, 2017:

Roughly the same; maybe a slight regression with respect to CartesianRange, but certainly still type stable and usable up to N = 14. This is the comparison with CartesianRange (on a different machine than the previous benchmarks):

N    PR product           CartesianRange
1    Trial(9.278 ns)      Trial(7.994 ns)
2    Trial(113.876 ns)    Trial(89.971 ns)
3    Trial(391.584 ns)    Trial(369.146 ns)
4    Trial(1.842 μs)      Trial(1.938 μs)
5    Trial(9.488 μs)      Trial(6.697 μs)
6    Trial(43.086 μs)     Trial(28.883 μs)
7    Trial(184.241 μs)    Trial(123.112 μs)
8    Trial(740.347 μs)    Trial(545.719 μs)
9    Trial(4.873 ms)      Trial(2.255 ms)
10   Trial(15.256 ms)     Trial(9.922 ms)
11   Trial(70.886 ms)     Trial(73.226 ms)
12   Trial(332.448 ms)    Trial(208.451 ms)
13   Trial(1.324 s)       Trial(851.376 ms)
14   Trial(5.863 s)       Trial(3.862 s)

Unfortunately, trying to run the above benchmark on the latest master yields:

ERROR: syntax: invalid syntax (escape (call (outerref mycount) ##iter1#776))

which I guess is an issue with BenchmarkTools.jl.

@JeffBezanson JeffBezanson merged commit 66a505d into JuliaLang:master Aug 7, 2017
@Jutho (Contributor, Author) commented Aug 7, 2017:

Thanks!
