
n-arg map performance #17321

Closed
davidagold opened this issue Jul 7, 2016 · 6 comments
Labels: arrays, fold, performance, regression

Comments

@davidagold
Contributor

While working on map for NullableArrays (JuliaStats/NullableArrays.jl#128 (comment)), I came across some evidence that the current n-arg map implementation in Base might be leaving performance on the table. Here's a simple benchmark for the current Base implementation:

using BenchmarkTools
n = 1_000_000
As = [ rand(n) for i in 1:5 ]
f(xs...) = prod(xs)

julia> @benchmark map(f, As...)
BenchmarkTools.Trial: 
  samples:          155
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  7.63 mb
  allocs estimate:  10
  minimum time:     27.25 ms (0.00% GC)
  median time:      30.81 ms (0.00% GC)
  mean time:        32.41 ms (2.55% GC)
  maximum time:     67.57 ms (0.00% GC)

Here's the same benchmark for an alternative implementation (mymap in https://gist.github.com/davidagold/d7088aae22f23d383e5bf1f26aa1a045) that (1) avoids using ith_all to index into the As and (2) avoids zipping the As together when constructing a Generator(f, As...) object:

julia> @benchmark mymap(f, As...)
BenchmarkTools.Trial: 
  samples:          814
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  7.63 mb
  allocs estimate:  8
  minimum time:     4.58 ms (0.00% GC)
  median time:      5.44 ms (0.00% GC)
  mean time:        6.15 ms (13.94% GC)
  maximum time:     13.49 ms (22.69% GC)

julia> mymap(f, As...) == map(f, As...)
true

In this case, mymap is 5x faster. However, its implementation relies on a macro and generated functions in place of ith_all. Is the speedup here worth introducing such changes into the Base implementation?
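For readers who don't want to follow the gist link, here is a hedged sketch of the kind of approach described above: a generated function that unrolls indexing over the argument tuple, so each iteration reads As[1][i], ..., As[N][i] directly instead of materializing tuples through zip. The name mymap_sketch is illustrative and this is not the gist's actual code:

```julia
# Sketch: index every argument array directly via an unrolled @generated body.
# Assumes all arrays are non-empty and have matching shapes (eachindex checks
# the latter). This is a minimal illustration, not the gist's implementation.
@generated function mymap_sketch(f, As::Vararg{AbstractArray,N}) where {N}
    quote
        # Eagerly compute the output element type from the first elements;
        # a production version would need to handle empty/heterogeneous cases.
        T = typeof(f(map(first, As)...))
        dest = similar(As[1], T)
        for i in eachindex(As...)
            # Unrolled at compile time into f(As[1][i], As[2][i], ..., As[N][i])
            dest[i] = f($([:(As[$j][i]) for j in 1:N]...))
        end
        dest
    end
end
```

The generated body avoids building an intermediate tuple per element via zip, which is the property the benchmark above appears to reward.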

cc @nalimilan

@Sacha0
Member

Sacha0 commented Jul 7, 2016

broadcast performance appears much better:

julia> @benchmark map(f, As...)
BenchmarkTools.Trial:
  samples:          165
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  7.63 mb
  allocs estimate:  10
  minimum time:     27.52 ms (0.00% GC)
  median time:      30.26 ms (0.00% GC)
  mean time:        30.37 ms (2.57% GC)
  maximum time:     39.38 ms (7.84% GC)

julia> @benchmark broadcast(f, As...)
BenchmarkTools.Trial:
  samples:          887
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  7.63 mb
  allocs estimate:  83
  minimum time:     4.57 ms (0.00% GC)
  median time:      4.89 ms (0.00% GC)
  mean time:        5.63 ms (13.98% GC)
  maximum time:     9.60 ms (26.82% GC)

Again I wonder whether map and broadcast should be collapsed (#4883 (comment)). Best!

@JeffBezanson
Member

Is the speedup here worth introducing such changes into the Base implementation?

No. I'm convinced we can optimize zip and generators sufficiently. See for example #15648, which discusses more general problems with iterating over multiple arrays. There are also already significant improvements in #16622 just from adding some inline declarations.

In any case, performance hacks should be directed at zip. For example, we might need a specialization that shares an index among multiple arrays, as LLVM (understandably) seems to have a hard time proving that two counters always have the same value.

Also noting that ith_all is only used by map!, not map.

@davidagold
Contributor Author

Got it. Thank you for the resources, Jeff!

@KristofferC
Member

This regressed quite a lot:

julia> @benchmark map(f, As...)
BenchmarkTools.Trial:
  memory estimate:  1.02 GiB
  allocs estimate:  30998464
  --------------
  minimum time:     1.932 s (5.44% GC)

@KristofferC KristofferC added the regression label Oct 18, 2018
@oscardssmith
Member

Bumping this. Anything we can do here?

@oscardssmith oscardssmith added the fold label Dec 31, 2020
@vtjnash
Member

vtjnash commented Jan 8, 2021

Is there anything to do here? I just ran the above on my machine, and saw 2-4 ms on everything posted above.

@vtjnash vtjnash closed this as completed Jan 30, 2021