map on tuples is prohibitively slow #15695

Closed
shashi opened this issue Mar 30, 2016 · 16 comments
Labels
performance Must go faster potential benchmark Could make a good benchmark in BaseBenchmarks

@shashi
Contributor

shashi commented Mar 30, 2016

               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.5.0-dev+3333 (2016-03-30 03:42 UTC)
 _/ |\__'_|_|_|\__'_|  |  Commit 4a13500* (0 days old master)
|__/                   |  x86_64-linux-gnu

julia> t = ntuple(x->x,1000);

julia> @time map(-, t)
488.491589 seconds (4.79 M allocations: 227.712 MB, 0.01% gc time)

julia> @time map(-, t)
  0.065904 seconds (620.06 k allocations: 26.750 MB, 6.09% gc time)

Two alternative implementations (for tuples with > 3 elements) would be:

julia> function map1(f, t::Tuple)
           ([f(x) for x in t]...)
       end
map1 (generic function with 1 method)

julia> @time map1(-, t);
  0.028094 seconds (9.86 k allocations: 449.154 KB)

julia> @time map1(-, t);
  0.000096 seconds (498 allocations: 39.500 KB)

# this one is by @SimonDanisch
@generated function map2(f, x::Tuple)
    len = length(x.parameters)
    expr = [:(f(x[$i])) for i=1:len]
    :(tuple($(expr...)))
end

julia> @time map2(-, t);
 10.158751 seconds (233.77 k allocations: 10.337 MB)

julia> @time map2(-, t);
  0.000030 seconds (6 allocations: 8.047 KB)
@simonster simonster added the performance Must go faster label Mar 30, 2016
@SimonDanisch
Contributor

Is that a case for this tail call optimization everyone's talking about?!

@JeffBezanson
Member

No.

@Keno
Member

Keno commented Mar 30, 2016

+1. I had one map on tuples in JuliaParser and it slowed down the whole thing by about 2-3 orders of magnitude.

@SimonDanisch
Contributor

Oh I see, constructing the tuple is the last action, not the call to map...

@JeffBezanson
Member

We could probably get the equivalent of the generated map2 function by adding inline declarations, but at some point we'll still want an iterative version.
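A sketch of what such an inline-annotated definition might look like (the name `tmap` and the exact form are illustrative, not Base's actual code):

```julia
# Hypothetical sketch: recursion over the tuple, with @inline so the
# compiler unrolls the calls, giving much the same effect as the
# @generated map2 above.
@inline tmap(f, t::Tuple{}) = ()
@inline tmap(f, t::Tuple) = (f(t[1]), tmap(f, Base.tail(t))...)
```

For long tuples this still recurses deeply at compile time, which is why an iterative version would eventually be wanted.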

@SimonDanisch
Contributor

So a stack-allocated Ref plus setindex!(::Ref{Tuple})? Is that still the trick we want to apply for this kind of problem?

@Keno
Member

Keno commented Mar 30, 2016

Yes
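In later Julia syntax, that Ref-based trick might be sketched roughly like this (hypothetical name `rmap`, not Base's implementation; it assumes f preserves the element type, and uses raw pointer stores since Ref{Tuple} has no indexed setindex!):

```julia
# Hypothetical sketch: build the result by overwriting a Ref-backed copy
# of the tuple in place. Assumes f(x)::T for a homogeneous NTuple{N,T}.
function rmap(f, t::NTuple{N,T}) where {N,T}
    r = Ref(t)  # mutable buffer holding the tuple
    GC.@preserve r begin
        p = convert(Ptr{T}, Base.unsafe_convert(Ptr{NTuple{N,T}}, r))
        for i in 1:N
            unsafe_store!(p, f(t[i]), i)  # overwrite slot i
        end
    end
    return r[]
end
```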

@JeffBezanson
Member

Yes, but the comprehension version might have to do in the meantime.

related: #14126 #15516

Probably all the various definitions for map on tuples should go away, and be replaced by #15516 with a generator. That way we can consolidate a clever implementation in one place.
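A generator-based consolidation might look roughly like this (illustrative only; `gmap` is a hypothetical name, and the splat-into-tuple syntax shown is the later form):

```julia
# Hypothetical sketch: one definition that splats a generator back into a
# tuple, so a single clever implementation covers all tuple lengths.
gmap(f, t::Tuple) = ((f(x) for x in t)...,)
```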

shashi added a commit to JuliaParallel/Dagger.jl that referenced this issue Mar 30, 2016
@shashi
Contributor Author

shashi commented Mar 30, 2016

Similarly, the map(f, ::Tuple...) method seems to be slower than the comprehension alternative:

function map2(f, ts::Tuple...)
    ([f([t[i] for t in ts]...) for i in 1:length(ts[1])]...)
end
julia> function foo(f, n)
           t = ntuple(x->x, n)
           @time f(+, t,t)
           @time f(+, t,t,t)
           @time f(+, t,t,t,t)
       end
julia> foo(map, 2)
  0.003090 seconds (725 allocations: 35.893 KB)
  1.004426 seconds (1.73 M allocations: 66.629 MB, 1.60% gc time)
  0.002223 seconds (162 allocations: 7.375 KB)
(4,8)

julia> foo(map, 2)
  0.000001 seconds (1 allocation: 32 bytes) # this has a special case
  0.000133 seconds (86 allocations: 3.516 KB)
  0.000146 seconds (117 allocations: 4.375 KB)
(4,8)
julia> foo(map2, 2)
  0.024458 seconds (23.03 k allocations: 997.798 KB)
  0.000011 seconds (4 allocations: 352 bytes)
  0.000006 seconds (4 allocations: 352 bytes)
(4,8)

julia> foo(map2, 2)
  0.000023 seconds (4 allocations: 320 bytes)
  0.000006 seconds (4 allocations: 352 bytes)
  0.000005 seconds (4 allocations: 352 bytes)
(4,8)
julia> foo(map, 3)
  0.008495 seconds (4.66 k allocations: 232.674 KB)
  0.450469 seconds (699.39 k allocations: 27.190 MB, 2.75% gc time)
  0.000150 seconds (186 allocations: 7.500 KB)
(4,8,12)

julia> foo(map, 3)
  0.000039 seconds (11 allocations: 672 bytes)
  0.000144 seconds (132 allocations: 5.500 KB)
  0.000132 seconds (182 allocations: 7.125 KB)
(4,8,12)
julia> foo(map2, 3)
  0.025736 seconds (23.19 k allocations: 1007.829 KB)
  0.000010 seconds (5 allocations: 480 bytes)
  0.000005 seconds (5 allocations: 480 bytes)
(4,8,12)

julia> foo(map2, 3)
  0.000017 seconds (5 allocations: 432 bytes)
  0.000006 seconds (5 allocations: 480 bytes)
  0.000005 seconds (5 allocations: 480 bytes)
(4,8,12)

@JeffBezanson
Member

Out of curiosity --- how did this come up? Was there an interesting use case?

@tkelman tkelman added the potential benchmark Could make a good benchmark in BaseBenchmarks label Mar 30, 2016
@shashi
Contributor Author

shashi commented Mar 30, 2016

In ComputeFramework there's a type called Thunk which holds a function to run and the arguments to that function, stored in a tuple: https://github.com/shashi/ComputeFramework.jl/blob/07fd1a415f666a04ae431862405c660df525aab6/src/thunk.jl#L10-L11

The arguments can themselves be thunks. The scheduler runs a thunk once all of its input thunks have been computed: JuliaParallel/Dagger.jl@93c7e79 (it was using map to look up the results, for no particular reason).

Sometimes the inputs tuple can become pretty big. For example, sum on a matrix of size 10^5x10^4, split up into 10^3x10^3 chunks, produces 10^3 inputs to a single thunk - so it would just get stuck at the map in the last step of consolidating the sums.

I picked tuples because I could do things like this: https://github.com/shashi/ComputeFramework.jl/blob/master/src/thunk.jl#L28-L44 (fuse two thunks if the second thunk is the only input to the first one). But I don't really use that, because it subverts other optimizations by the scheduler which might use the second thunk.

Do you think inputs as Array{Any} would be a better alternative? If tuples are faster in the most common cases (1 or 2 inputs), I should keep them. I'll need to take a closer look before changing anything.

tl;dr: I didn't need to use map here.

@timholy
Member

timholy commented Mar 31, 2016

There's a tension between compilation efficiency and runtime efficiency. Currently there's an easy test case:

julia> @time ntuple(identity, 200);
  0.010773 seconds (4.50 k allocations: 324.949 KB)

julia> @time ntuple(identity, Val{200});
  4.913740 seconds (359.06 k allocations: 23.602 MB, 0.48% gc time)

The Val version uses recursive compilation, so this is just hammering the compiler.
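The recursive scheme behind the Val-based path can be sketched like so (hypothetical name `val_ntuple`, written in the later Val(N) calling syntax): every distinct N triggers compilation of a fresh specialization, which is what makes the first call at N = 200 so expensive.

```julia
# Hypothetical sketch of Val-driven recursion: a new method instance is
# compiled for every N, giving fast runtime but slow first-call compilation.
val_ntuple(f, ::Val{0}) = ()
val_ntuple(f, ::Val{N}) where {N} = (val_ntuple(f, Val(N - 1))..., f(N))
```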

However, for those of us who work on core array code, tuple length tends to match the dimensionality of an array rather than hold its contents, so our tuples are typically of length 2-4. In that case, this is informative:

julia> sum3(t::NTuple{3}) = t[1] + t[2] + t[3]
sum3 (generic function with 1 method)

julia> function tuple1(n)
           s = 0
           for i = 1:n
               t = ntuple(identity, 3)
               s += sum3(t)
           end
           s
       end
tuple1 (generic function with 1 method)

julia> function tuple2(n)
           s = 0
           for i = 1:n
               t = ntuple(identity, Val{3})
               s += sum3(t)
           end
           s
       end
tuple2 (generic function with 1 method)

julia> tuple1(1)
6

julia> tuple2(1)
6

julia> @time tuple1(10^6)
  0.145727 seconds (2.00 M allocations: 30.515 MB, 12.31% gc time)
6000000

julia> @time tuple2(10^6)
  0.005430 seconds (7 allocations: 208 bytes)
6000000

In other words, there is currently a tension between the needs for users of small tuples and those hoping to use large tuples.

@shashi
Contributor Author

shashi commented Mar 31, 2016

Just to be clear, I didn't need to use large tuples in my case. I just happened to because I didn't know this would be a problem.

I agree that for smaller tuples there should be specific methods (or generated ones) so that core array code dealing with small tuples is really fast, but if a user does (maybe inadvertently) use a bigger tuple, something like map should not be so surprisingly slow.

@timholy
Member

timholy commented Mar 31, 2016

Not a bad plan. I was hoping to get away from "breaks" in performance at n=5 (for example), but maybe the alternative is worse.

@timholy
Member

timholy commented Mar 31, 2016

I should acknowledge that one valid criticism of my benchmark above is that in tuple1, t is not type-stable. So I tested this:

julia> ntuple2{N}(f, ::Type{Val{N}}) = ([f(i) for i = 1:N]...)::NTuple{N,typeof(f(1))}
ntuple2 (generic function with 1 method)

and verified that the return type is predictable, switched tuple1 to use it, and verified that tuple1 and tuple2 produce machine code that looks very similar upon casual inspection. However, the same benchmark ran in 0.5 seconds, actually worse than the type-unstable version (100-fold slower than tuple2), presumably because of the intermediate array creation.

@shashi
Contributor Author

shashi commented Jun 15, 2016

#16460 fixes this...

@shashi shashi closed this as completed Jun 15, 2016