fast_maximum also for logsumexp #456

Conversation
In the spirit of FluxML#450 ...
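The analogous change for `logsumexp` can be sketched as follows (a hypothetical reconstruction, not the exact PR diff; `new_logsumexp` is the name used in the benchmarks below, and `fast_maximum` is the definition quoted later in this thread):

```julia
# @fastmath rewrites `max` to a faster, non-IEEE-strict reduction.
fast_maximum(x::AbstractArray{T}; dims) where {T} =
    @fastmath reduce(max, x; dims, init = float(T)(-Inf))

# Numerically stable logsumexp: subtract the maximum before exponentiating.
function new_logsumexp(x::AbstractArray; dims = :)
    max_ = fast_maximum(x; dims)
    max_ .+ log.(sum(exp.(x .- max_); dims))
end

x = randn(Float32, 100, 1000)
# On well-scaled input this agrees with the naive formula:
@assert new_logsumexp(x; dims=1) ≈ log.(sum(exp.(x); dims=1))
```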
Can you make some benchmarks? Just to gauge the impact of this.
Hmm... It turns out that #450 actually is a performance regression in some cases. While we have

```julia
julia> let x = randn(Float32, 100, 1000)
           @btime old_softmax($x; dims=1)  # with maximum
           @btime softmax($x; dims=1)      # with fast_maximum
       end;
  1.155 ms (7 allocations: 398.91 KiB)
  504.926 μs (3 allocations: 394.73 KiB)
```

the ranking is reversed when I set `dims=:`:

```julia
julia> let x = randn(Float32, 100, 1000)
           @btime old_softmax($x; dims=:)
           @btime softmax($x; dims=:)
       end;
  659.604 μs (4 allocations: 390.70 KiB)
  691.194 μs (13 allocations: 390.92 KiB)
```

The same pattern can be seen for `logsumexp`:

```julia
julia> let x = randn(Float32, 100, 1000)
           @btime logsumexp($x; dims=1)      # with maximum
           @btime new_logsumexp($x; dims=1)  # with fast_maximum
           @btime logsumexp($x; dims=:)
           @btime new_logsumexp($x; dims=:)
       end;
  833.593 μs (8 allocations: 402.97 KiB)
  616.084 μs (5 allocations: 402.86 KiB)
  578.131 μs (4 allocations: 390.70 KiB)
  654.552 μs (18 allocations: 391.02 KiB)
```

This is because `maximum(x; dims=:)` is faster than `fast_maximum(x; dims=:)`:

```julia
julia> let x = randn(Float32, 100, 1000)
           @btime maximum($x; dims=1)
           @btime fast_maximum($x; dims=1)
           @btime maximum($x; dims=:)
           @btime fast_maximum($x; dims=:)
       end;
  416.025 μs (4 allocations: 4.17 KiB)
  73.618 μs (1 allocation: 4.06 KiB)
  55.362 μs (0 allocations: 0 bytes)   # maximum(x; dims=:) is faster than ...
  111.392 μs (0 allocations: 0 bytes)  # ... fast_maximum(x; dims=:)
```

So maybe we should make … ?
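For what it's worth, one way to get the best of both (a hypothetical sketch, not code from this PR): keyword arguments don't participate in dispatch in Julia, so branch on the type of `dims` and fall back to Base's `maximum` for the whole-array case:

```julia
# Sketch: keep the @fastmath reduction for dimensional reductions, where it
# wins, but use Base's maximum for dims = :, which was measured faster above.
function fast_maximum(x::AbstractArray{T}; dims = :) where {T}
    if dims isa Colon
        maximum(x)  # whole-array case: Base's reduction is faster here
    else
        @fastmath reduce(max, x; dims, init = float(T)(-Inf))
    end
end
```

Since the type of `dims` is known after specialization, the `dims isa Colon` branch is resolved at compile time and costs nothing at runtime.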
Seems worth doing, but in the meantime let's merge this and use fast_maximum consistently in the library.
In … I didn't see such clear improvements in …. The speeds of things are very computer-dependent, e.g. with the above test:

```julia
julia> fast_maximum(x::AbstractArray{T}; dims) where {T} = @fastmath reduce(max, x; dims, init = float(T)(-Inf))
fast_maximum (generic function with 1 method)

julia> let x = randn(Float32, 100, 1000)
           @btime maximum($x; dims=1)
           @btime fast_maximum($x; dims=1)
           @btime maximum($x; dims=:)
           @btime fast_maximum($x; dims=:)
       end;
  min 432.375 μs, mean 436.053 μs (4 allocations, 4.17 KiB)  # M1 mac
  min 7.552 μs, mean 8.109 μs (1 allocation, 4.06 KiB)
  min 63.583 μs, mean 64.105 μs (0 allocations)
  min 7.771 μs, mean 7.815 μs (0 allocations)

  425.427 μs (4 allocations: 4.17 KiB)  # ancient xeon
  29.582 μs (1 allocation: 4.06 KiB)
  97.440 μs (0 allocations: 0 bytes)
  17.075 μs (0 allocations: 0 bytes)

  32.003 μs (54 allocations: 2.36 KiB)  # CUDA.randn, timed with @btime CUDA.@sync
  31.400 μs (54 allocations: 2.36 KiB)
  67.226 μs (117 allocations: 5.52 KiB)
  66.480 μs (117 allocations: 5.52 KiB)
```

and:

```julia
julia> let x = randn(Float32, 100, 1000)
           @btime logsumexp($x; dims=1)      # with maximum
           @btime new_logsumexp($x; dims=1)  # with fast_maximum
           @btime logsumexp($x; dims=:)
           @btime new_logsumexp($x; dims=:)
       end;
  min 804.250 μs, mean 850.143 μs (8 allocations, 402.97 KiB)  # M1
  min 379.125 μs, mean 409.269 μs (5 allocations, 402.86 KiB)
  min 427.333 μs, mean 461.674 μs (2 allocations, 390.67 KiB)
  min 373.291 μs, mean 401.812 μs (17 allocations, 390.97 KiB)

  1.922 ms (8 allocations: 402.97 KiB)  # xeon
  1.524 ms (5 allocations: 402.86 KiB)
  1.552 ms (2 allocations: 390.67 KiB)
  1.615 ms (17 allocations: 390.97 KiB)

  85.918 μs (184 allocations: 8.47 KiB)  # CUDA.randn, timed with @btime CUDA.@sync
  85.080 μs (184 allocations: 8.47 KiB)
  179.253 μs (274 allocations: 12.75 KiB)
  194.902 μs (290 allocations: 13.06 KiB)
```

Am slightly disturbed that these have differing allocations.
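Tangentially, before using fast_maximum everywhere it may be worth a sanity check that the `@fastmath` reduction agrees with Base on ordinary, NaN-free input. A minimal sketch, reusing the definition quoted above:

```julia
# fast_maximum as defined earlier in this thread.
fast_maximum(x::AbstractArray{T}; dims) where {T} =
    @fastmath reduce(max, x; dims, init = float(T)(-Inf))

x = randn(Float32, 100, 1000)

# On NaN-free input the fast path returns exactly the same elements as Base,
# since a max-reduction always yields one of the inputs.
@assert fast_maximum(x; dims = 1) == maximum(x; dims = 1)
@assert fast_maximum(x; dims = 2) == maximum(x; dims = 2)
@assert fast_maximum(x; dims = :) == maximum(x)
```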