fast_maximum also for logsumexp #456

Merged 1 commit into FluxML:master on Jan 4, 2023
Conversation

Sleort (Contributor) commented Jan 4, 2023

In the spirit of #450 ...
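For reference, #450 replaced the maximum call inside softmax with a @fastmath reduction. A minimal sketch of the same change applied to logsumexp (here called new_logsumexp, matching the benchmarks below; the PR itself edits logsumexp in place, so see the actual diff for the real thing):

fast_maximum(x::AbstractArray{T}; dims) where {T} = @fastmath reduce(max, x; dims, init = float(T)(-Inf))

# Sketch: logsumexp with the stabilizing maximum call swapped out.
# The per-slice maximum is subtracted before exponentiating, for stability.
function new_logsumexp(x::AbstractArray; dims = :)
    max_ = fast_maximum(x; dims)  # was: maximum(x; dims)
    return max_ .+ log.(sum(exp.(x .- max_); dims))
end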

CarloLucibello (Member)

Can you run some benchmarks, just to gauge the impact of this?

Sleort (Contributor, Author) commented Jan 4, 2023

Hmm... It turns out that #450 actually introduces a performance regression in some cases. While we have

julia> let x = randn(Float32, 100, 1000)
           @btime old_softmax($x; dims=1)  #with maximum
           @btime softmax($x; dims=1)  #with fast_maximum
       end;
  1.155 ms (7 allocations: 398.91 KiB)
  504.926 μs (3 allocations: 394.73 KiB)
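Here old_softmax is just a local name for the pre-#450 implementation, roughly:

# pre-#450 softmax with a plain maximum call (name local to this benchmark):
function old_softmax(x::AbstractArray; dims = 1)
    max_ = maximum(x; dims)
    out = exp.(x .- max_)
    out ./= sum(out; dims)  # normalizes in place and returns out
end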

the ranking is reversed when I set dims=: instead:

julia> let x = randn(Float32, 100, 1000)
           @btime old_softmax($x; dims=:)
           @btime softmax($x; dims=:)
       end;
  659.604 μs (4 allocations: 390.70 KiB)
  691.194 μs (13 allocations: 390.92 KiB)

The same pattern can be seen for logsumexp:

julia> let x = randn(Float32, 100, 1000)
           @btime logsumexp($x; dims=1)  #with maximum
           @btime new_logsumexp($x; dims=1) #with fast_maximum
           @btime logsumexp($x; dims=:)
           @btime new_logsumexp($x; dims=:)
       end;
  833.593 μs (8 allocations: 402.97 KiB)
  616.084 μs (5 allocations: 402.86 KiB)
  578.131 μs (4 allocations: 390.70 KiB)
  654.552 μs (18 allocations: 391.02 KiB)

This is because fast_maximum is not always the fastest (!):

julia> let x = randn(Float32, 100, 1000)
           @btime maximum($x; dims=1)
           @btime fast_maximum($x; dims=1)
           @btime maximum($x; dims=:)
           @btime fast_maximum($x; dims=:)
       end;
  416.025 μs (4 allocations: 4.17 KiB)
  73.618 μs (1 allocation: 4.06 KiB)
  55.362 μs (0 allocations: 0 bytes)  # maximum(x; dims=:) is faster than
  111.392 μs (0 allocations: 0 bytes)  # ... fast_maximum(x; dims=:)

So maybe we should make fast_maximum dispatch on dims, like

fast_maximum(x; dims) = fast_maximum(x, dims)  # forward the keyword to positional dispatch
fast_maximum(x::AbstractArray{T}, dims) where {T} = @fastmath reduce(max, x; dims, init = float(T)(-Inf))
fast_maximum(x::AbstractArray, ::Colon) = maximum(x)  # whole-array case: plain maximum wins

?
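With that dispatch in place, both calls should take the faster route; a quick (machine-dependent) way to confirm:

julia> let x = randn(Float32, 100, 1000)
           @btime fast_maximum($x; dims=1)  # @fastmath reduce path
           @btime fast_maximum($x; dims=:)  # falls back to plain maximum
       end;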

Sleort mentioned this pull request on Jan 4, 2023
CarloLucibello (Member)

> So maybe we should make fast_maximum dispatch on dims, like

Seems worth doing, but in the meantime let's merge this and use fast_maximum consistently in the library.

CarloLucibello merged commit ccf1732 into FluxML:master on Jan 4, 2023
mcabbott (Member) commented Jan 4, 2023

In softmax I'm reasonably sure that fastmax doesn't change the NaN behaviour. Is this true of logsumexp too?
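One way to check (a hypothetical test, not from this PR): whatever the fastmath max returns on NaN input, exp.(x .- max_) still contains NaN, which survives the sum, so the result should be NaN either way.

using Test
x = Float32[1, 2, NaN]
@test isnan(logsumexp(x))      # maximum-based version
@test isnan(new_logsumexp(x))  # fast_maximum-based version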

I didn't see such clear improvements in logsumexp when I tried, so I thought it simpler to leave it alone. The improvements shown above are modest.

These timings are very machine-dependent, e.g. with the above test:

julia> fast_maximum(x::AbstractArray{T}; dims) where {T} = @fastmath reduce(max, x; dims, init = float(T)(-Inf))
fast_maximum (generic function with 1 method)

julia> let x = randn(Float32, 100, 1000)
           @btime maximum($x; dims=1)
           @btime fast_maximum($x; dims=1)
           @btime maximum($x; dims=:)
           @btime fast_maximum($x; dims=:)
       end;
  min 432.375 μs, mean 436.053 μs (4 allocations, 4.17 KiB)  # M1 mac
  min 7.552 μs, mean 8.109 μs (1 allocation, 4.06 KiB)
  min 63.583 μs, mean 64.105 μs (0 allocations)
  min 7.771 μs, mean 7.815 μs (0 allocations)

  425.427 μs (4 allocations: 4.17 KiB)  # ancient xeon
  29.582 μs (1 allocation: 4.06 KiB)
  97.440 μs (0 allocations: 0 bytes)
  17.075 μs (0 allocations: 0 bytes)

  32.003 μs (54 allocations: 2.36 KiB)  # CUDA.randn, timed with @btime CUDA.@sync
  31.400 μs (54 allocations: 2.36 KiB)
  67.226 μs (117 allocations: 5.52 KiB)
  66.480 μs (117 allocations: 5.52 KiB)

and

julia> let x = randn(Float32, 100, 1000)
           @btime logsumexp($x; dims=1)  #with maximum
           @btime new_logsumexp($x; dims=1) #with fast_maximum
           @btime logsumexp($x; dims=:)
           @btime new_logsumexp($x; dims=:)
       end;
  min 804.250 μs, mean 850.143 μs (8 allocations, 402.97 KiB)  # M1
  min 379.125 μs, mean 409.269 μs (5 allocations, 402.86 KiB)
  min 427.333 μs, mean 461.674 μs (2 allocations, 390.67 KiB)
  min 373.291 μs, mean 401.812 μs (17 allocations, 390.97 KiB)

  1.922 ms (8 allocations: 402.97 KiB)  # xeon
  1.524 ms (5 allocations: 402.86 KiB)
  1.552 ms (2 allocations: 390.67 KiB)
  1.615 ms (17 allocations: 390.97 KiB)

  85.918 μs (184 allocations: 8.47 KiB)  # CUDA.randn, timed with @btime CUDA.@sync
  85.080 μs (184 allocations: 8.47 KiB)
  179.253 μs (274 allocations: 12.75 KiB)
  194.902 μs (290 allocations: 13.06 KiB)

I'm slightly disturbed that these have differing allocations.

Sleort deleted the patch-2 branch on January 5, 2023.