Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Faster reverse for tuples #16604

Closed
wants to merge 3 commits into from
Closed

Faster reverse for tuples #16604

wants to merge 3 commits into from

Conversation

bkamins
Copy link
Member

@bkamins bkamins commented May 26, 2016

The proposed change in reverse(t::Tuple) avoids recursion and in particular calling revargs for varying number of arguments which especially slows down the first execution of calls like @time reverse(tuple(1:1000...));. The change makes definition of revargs unnecessary.

The proposed change in `reverse(t::Tuple)` avoids recursion and in particular calling `revargs` for varying number of arguments which especially slows down the first execution of calls like `@time reverse(tuple(1:1000...));`. The change makes definition of `revargs` unnecessary.
@TotalVerb
Copy link
Contributor

TotalVerb commented May 26, 2016

👍. The original code was a cute Scheme analogy, but I don't think it's right for Julia.

My benchmarking indicates that

reverse(t::Tuple) = ([t[i] for i in length(t):-1:1]...)

is slightly faster. It's a bit messier though.

@yuyichao
Copy link
Contributor

#15695 note that this is not type stable and will be very bad for small tuples

@yuyichao yuyichao added performance Must go faster potential benchmark Could make a good benchmark in BaseBenchmarks labels May 26, 2016
@JeffBezanson
Copy link
Member

Yes, unfortunately this definition will be a disaster for small tuples.

@TotalVerb
Copy link
Contributor

TotalVerb commented May 26, 2016

For (heterogenous and homogenous) tuples of length 5 or greater, the current reverse is worse than the modified proposed one. You're right that it's better for small tuples, by quite a bit for size 1 and 2 tuples. Presumably those tuples are the most important ones.

@TotalVerb
Copy link
Contributor

TotalVerb commented May 27, 2016

This seems to work:

reverse(t::Tuple{}) = t
reverse{T}(t::Tuple{T}) = t
reverse{T,U}(t::Tuple{T,U}) = t[2], t[1]
reverse{T,U,V}(t::Tuple{T,U,V}) = t[3], t[2], t[1]
reverse{T,U,V,W}(t::Tuple{T,U,V,W}) = t[4], t[3], t[2], t[1]
reverse(t::Tuple) = ([t[i] for i in length(t):-1:1]...)

Is this considered too messy?

@JeffBezanson
Copy link
Member

Yes, we will probably need to go with something like that. A slightly better trick can be seen in https://github.com/JuliaLang/julia/pull/16460/files#diff-fd99c9a15dadc23d9200069dadafb8bbR90
It's messy but I don't know a better way to do it so far.

@TotalVerb
Copy link
Contributor

That is indeed a nice trick. So only one additional line is needed:

reverse(t::Tuple{Any,Any,Any,Any,Any,Vararg{Any}}) = ([t[i] for i in length(t):-1:1]...)

This seems to have the best performance. I wonder why the size of tuple to reach performance parity is so much smaller in this case than with map.

@Jutho
Copy link
Contributor

Jutho commented May 27, 2016

It seems that the definition:

reverse{N}(t::NTuple{N,Any}) = tuple(reverse(collect(t))...)::NTuple{N,Any}

is at least type-stable for homogeneous tuples. It infers the type from the reverse(collect(t)) construction and combines it with the length in the type assertion.

@bkamins
Copy link
Member Author

bkamins commented May 28, 2016

Thank you all for dedication to squeeze maximum performance out of every part of Julia 👍.

Having learned again how little I know about Julia I have filed an issue #16631, and not PR :), for two other cases that have performance issues related to large tuples.

@musm
Copy link
Contributor

musm commented Dec 4, 2016

the proposals in this thread seem to be an improvement over the original, should someone update this PR?

@bkamins
Copy link
Member Author

bkamins commented Dec 4, 2016

👍. I did not update PR myself as I feel that someone with better understanding of Julia should choose the best approach to fix reverse.

@JeffBezanson
Copy link
Member

The idea here looks good to me, since we just have to add one line: #16604 (comment)

@TotalVerb
Copy link
Contributor

@JeffBezanson There is still the type stability problem to deal with though, which I didn't measure in my benchmarks (since it only affects downstream consumers) and is harder to measure in general. Perhaps we should keep the same cutoff as map?

@TotalVerb
Copy link
Contributor

There is also @Jutho's suggestion which preserves type stability for homogenous tuples

reverse{N}(t::NTuple{N,Any}) = tuple(reverse(collect(t))...)::NTuple{N,Any}

which may perhaps be useful to work in.

@JeffBezanson
Copy link
Member

I think it will work to add ::NTuple{nfields(t),Any} to the comprehension version.

@martinholters
Copy link
Member

How about

@generated reverse(x::Tuple) = Expr(:tuple, (:(x[$i]) for i in nfields(x):-1:1)...)

? Fast and type-stable.

@TotalVerb
Copy link
Contributor

This issue is mostly about compile time, and it's highly doubtful that has better compile time.

@martinholters
Copy link
Member

This issue is mostly about compile time, and it's highly doubtful that has better compile time.

Oh. Really? With the OP's example:

julia> @benchmark reverse($(tuple(1:1000...))) # master
BenchmarkTools.Trial: 
  memory estimate:  18.17 mb
  allocs estimate:  373638
  --------------
  minimum time:     18.886 ms (0.00% GC)
  median time:      20.347 ms (0.00% GC)
  mean time:        23.854 ms (14.99% GC)
  maximum time:     105.331 ms (81.43% GC)
  --------------
  samples:          210
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark myreverse($(tuple(1:1000...))) # as proposed above
BenchmarkTools.Trial: 
  memory estimate:  0.00 bytes
  allocs estimate:  0
  --------------
  minimum time:     709.319 ns (0.00% GC)
  median time:      714.773 ns (0.00% GC)
  mean time:        727.437 ns (0.00% GC)
  maximum time:     1.767 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     141
  time tolerance:   5.00%
  memory tolerance: 1.00%

Of course, this does not include compile time, but the execution time is improved dramatically.

@TotalVerb
Copy link
Contributor

Yikes, that is a big difference.

@KristofferC
Copy link
Member

KristofferC commented Dec 7, 2016

This reminds me of: #15702 (comment).

Here is another benchmark calling the reverse function for the first time (so runtime + compilation time):

v = Float64[]
for i in 1:100
   t = (rand(i)...)    
   push!(v, @elapsed reverse(t))
end

image

@bkamins
Copy link
Member Author

bkamins commented Sep 13, 2017

I wanted to go back to this PR and resolve it.
Given the above discussion my recommendation would be:

  • I opt against @generated version as it generates StackOverflow for very large tuples (and is unacceptably slow to compile for for large tuples below StackOverflow level)
  • Have specialized methods for small tuples (I chose up to 4 elements - as in recommendation above) which is type stable
  • Have a separate method for NTuple as it is type stable and the most probable use case for large tuples will be NTuple
  • Accept that for large tuples reverse might not be type stable

As an additional issue I would like to ask what is the recommended place to document reverse in general, as now ?reverse prints help only for AbstractString.

base/tuple.jl Outdated

reverse(t::Tuple) = revargs(t...)
reverse(t::Tuple{}) = t
reverse(t::Tuple{T}) where {T} = t
Copy link
Member

@mbauman mbauman Sep 15, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or more simply t::NTuple{1} and so on.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand it correctly Tuple{T} is equivalent to NTuple{1}, but for larger number of arguments it would not be equivalent. I add a specialized function for NTuple{N} as it ensures type stability for large N.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, right, I think it needs to be NTuple{N, Any} to allow for heterogeneous tuples. I think that should be equivalent.

@TotalVerb
Copy link
Contributor

As @KristofferC pointed out, the cutoff here should probably be something bigger than 18 and not 4.

@bkamins
Copy link
Member Author

bkamins commented Sep 15, 2017

@TotalVerb I agree that the break-point is at ~20 for single execution of reverse, and even for 100 elements in a tuple the generation of specialized function is not prohibitively slow (which would give a benefit when reverse is called many times).
I thought that as we have to settle for a suboptimal solution then it is best to keep it relatively simple.

Of course it is simple to extend the definitions to generation of a larger number predefined reverses - would you judge it worthwhile? (and the question is if having very many such definitions would not have a negative impact on Julia method dispatch code speed in general - here I do not know)

@TotalVerb
Copy link
Contributor

@bkamins As Jeff mentioned, we may be able to keep the existing definition and simply add a new one, like

reverse(t::Tuple{Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Any,Vararg{Any}}) = ([t[i] for i in length(t):-1:1]...)::NTuple{nfields(t), Any}

to cover long cases. However this doesn't preserve the type stability of homogenous tuples.

@bkamins
Copy link
Member Author

bkamins commented Sep 16, 2017

@TotalVerb I have now unrolled the definitions for tuples not greater than 20. For greater ones the code for homogeneous tuples is type stable. Is this is what you had in mind?

@TotalVerb
Copy link
Contributor

Looks much better. @JeffBezanson and others more familiar with this should probably have a look.

@Jutho
Copy link
Contributor

Jutho commented Sep 24, 2017

Why don't follow the example of map, i.e. just keep the old code and add your definition using the many Anys trick:

revargs() = ()
revargs(x, r...) = (revargs(r...)..., x)
reverse(t::Tuple) = revargs(t...)
function reverse(t::Tuple{Any,Any,Any,Any,Any,Any,Any,Any,
                         Any,Any,Any,Any,Any,Any,Any,Any,Vararg{Any}})
    ([t[i] for i in length(t):-1:1]...)::NTuple{nfields(t)}
end

In fact, I would argue that there is one too many Any in both the code above and in the code of map, since type inference stops working for the lisp-like definition beyond 14 arguments. But other than that, for the cases where the last definition kicks in, return type declaration gives the correct number of fields, whereas type inference is still able to infer the type of the tuple is homogeneous. So the return type will be concrete for all NTuple{N,T}s.

This seems much better than code generation for generating all cases up to N=20 explicitly, in addition to the act of introducing an arbitrary new constant (i.e. 20).

@bkamins
Copy link
Member Author

bkamins commented Sep 24, 2017

Small comment - you need NTuple{nfields(t), Any} (as you proposed above) for it to work.

The reason is performance. This is a test I use (new_reverse is my proposed implementation):

function test(i)
    println("Testing tuple size ", i+1)
    for (e, s) in [("a", "mixed"), (0, "homogeneous")]
        t = (e, 1:i...)
        print("\tOld $s:")
        @btime reverse($t)
        print("\tNew $s:")
        @btime new_reverse($t)
    end
end

foreach(test, 1:3:25)

which produces:

Testing tuple size 2
        Old mixed:  10.264 ns (1 allocation: 32 bytes)
        New mixed:  10.263 ns (1 allocation: 32 bytes)
        Old homogeneous:  1.866 ns (0 allocations: 0 bytes)
        New homogeneous:  1.866 ns (0 allocations: 0 bytes)
Testing tuple size 5
        Old mixed:  1.819 μs (9 allocations: 240 bytes)
        New mixed:  12.596 ns (1 allocation: 48 bytes)
        Old homogeneous:  1.773 μs (9 allocations: 240 bytes)
        New homogeneous:  2.332 ns (0 allocations: 0 bytes)
Testing tuple size 8
        Old mixed:  2.955 μs (14 allocations: 480 bytes)
        New mixed:  18.681 ns (1 allocation: 80 bytes)
        Old homogeneous:  2.903 μs (13 allocations: 400 bytes)
        New homogeneous:  4.198 ns (0 allocations: 0 bytes)
Testing tuple size 11
        Old mixed:  5.443 μs (21 allocations: 816 bytes)
        New mixed:  21.525 ns (1 allocation: 96 bytes)
        Old homogeneous:  5.521 μs (21 allocations: 816 bytes)
        New homogeneous:  6.531 ns (0 allocations: 0 bytes)
Testing tuple size 14
        Old mixed:  6.881 μs (26 allocations: 1.17 KiB)
        New mixed:  22.860 ns (1 allocation: 128 bytes)
        Old homogeneous:  6.881 μs (25 allocations: 1.05 KiB)
        New homogeneous:  8.864 ns (0 allocations: 0 bytes)
Testing tuple size 17
        Old mixed:  3.791 μs (13 allocations: 1.08 KiB)
        New mixed:  22.393 ns (1 allocation: 144 bytes)
        Old homogeneous:  1.120 μs (2 allocations: 368 bytes)
        New homogeneous:  11.196 ns (0 allocations: 0 bytes)
Testing tuple size 20
        Old mixed:  4.332 μs (13 allocations: 1.14 KiB)
        New mixed:  27.992 ns (1 allocation: 176 bytes)
        Old homogeneous:  1.260 μs (2 allocations: 416 bytes)
        New homogeneous:  13.529 ns (0 allocations: 0 bytes)
Testing tuple size 23
        Old mixed:  4.199 μs (13 allocations: 1.34 KiB)
        New mixed:  3.674 μs (13 allocations: 1.34 KiB)
        Old homogeneous:  1.399 μs (2 allocations: 528 bytes)
        New homogeneous:  884.458 ns (2 allocations: 528 bytes)
Testing tuple size 26
        Old mixed:  4.332 μs (13 allocations: 1.38 KiB)
        New mixed:  3.791 μs (13 allocations: 1.38 KiB)
        Old homogeneous:  1.540 μs (2 allocations: 560 bytes)
        New homogeneous:  979.700 ns (2 allocations: 560 bytes)

The proposed method is uniformly faster for mixed tuples and for NTuples of small, moderate and large sizes. For tuples of moderate size (~10) the difference is very significant.

I agree that additional constant N=20 is arbitrary. I have chosen it so that compile time is still low, but it can be changed if such a high N is an issue. I am not qualified to fully understand if having too many methods for such a function is problematic (as I believe that reverse for tuples is not a top priority functionality).

@bkamins
Copy link
Member Author

bkamins commented Nov 11, 2021

I am not sure if anyone is willing to go back to this very old PR but I would add adding reverse to NamedTuple to the list, as currently:

julia> reverse((a=1,b=2))
ERROR: MethodError: no method matching reverse(::NamedTuple{(:a, :b), Tuple{Int64, Int64}})

(I can make a separate PR for reverse for NamedTuple if you feel the tuple issue that was originally raised can be left implemented as it is now and then I would close this PR)

@oscardssmith
Copy link
Member

Honestly, at this point, I think the best approach might be to close this PR and remake it. Performance characteristics change over time, so it's probably good to separate the new correct data from the original.

@bkamins
Copy link
Member Author

bkamins commented Nov 11, 2021

I would just close it then. Probably the issue should be resolved when someone hits this problem in practice (given that this PR is open for so long and no one complains probably this is not a serous practical issue).

Regarding reverse for NamedTuples I think it should be added but first an issue I have just opened #43045 probably should be resolved.

@bkamins bkamins closed this Nov 11, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Must go faster potential benchmark Could make a good benchmark in BaseBenchmarks
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants