fast_maximum also for logsumexp #456

Merged 1 commit into FluxML:master on Jan 4, 2023
Conversation

Sleort (Contributor) commented Jan 4, 2023

In the spirit of #450 ...
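For reference, #450 replaced the maximum call inside softmax with a @fastmath reduction. A minimal sketch of the same change applied to logsumexp (here called new_logsumexp, matching the benchmarks below; the PR itself edits logsumexp in place, so see the actual diff for the real thing):

fast_maximum(x::AbstractArray{T}; dims) where {T} = @fastmath reduce(max, x; dims, init = float(T)(-Inf))

# Sketch: logsumexp with the stabilizing maximum call swapped out.
# The per-slice maximum is subtracted before exponentiating, for stability.
function new_logsumexp(x::AbstractArray; dims = :)
    max_ = fast_maximum(x; dims)  # was: maximum(x; dims)
    return max_ .+ log.(sum(exp.(x .- max_); dims))
end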

CarloLucibello (Member)

Can you run some benchmarks, just to gauge the impact of this?

Sleort (Contributor, Author) commented Jan 4, 2023

Hmm... It turns out that #450 actually introduces a performance regression in some cases. While we have

julia> let x = randn(Float32, 100, 1000)
           @btime old_softmax($x; dims=1)  #with maximum
           @btime softmax($x; dims=1)  #with fast_maximum
       end;
  1.155 ms (7 allocations: 398.91 KiB)
  504.926 μs (3 allocations: 394.73 KiB)
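Here old_softmax is just a local name for the pre-#450 implementation, roughly:

# pre-#450 softmax with a plain maximum call (name local to this benchmark):
function old_softmax(x::AbstractArray; dims = 1)
    max_ = maximum(x; dims)
    out = exp.(x .- max_)
    out ./= sum(out; dims)  # normalizes in place and returns out
end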

the ranking is reversed when I set dims=: instead:

julia> let x = randn(Float32, 100, 1000)
           @btime old_softmax($x; dims=:)
           @btime softmax($x; dims=:)
       end;
  659.604 μs (4 allocations: 390.70 KiB)
  691.194 μs (13 allocations: 390.92 KiB)

The same pattern can be seen for logsumexp:

julia> let x = randn(Float32, 100, 1000)
           @btime logsumexp($x; dims=1)  #with maximum
           @btime new_logsumexp($x; dims=1) #with fast_maximum
           @btime logsumexp($x; dims=:)
           @btime new_logsumexp($x; dims=:)
       end;
  833.593 μs (8 allocations: 402.97 KiB)
  616.084 μs (5 allocations: 402.86 KiB)
  578.131 μs (4 allocations: 390.70 KiB)
  654.552 μs (18 allocations: 391.02 KiB)

This is because fast_maximum is not always the fastest (!):

julia> let x = randn(Float32, 100, 1000)
           @btime maximum($x; dims=1)
           @btime fast_maximum($x; dims=1)
           @btime maximum($x; dims=:)
           @btime fast_maximum($x; dims=:)
       end;
  416.025 μs (4 allocations: 4.17 KiB)
  73.618 μs (1 allocation: 4.06 KiB)
  55.362 μs (0 allocations: 0 bytes)  # maximum(x; dims=:) is faster than
  111.392 μs (0 allocations: 0 bytes)  # ... fast_maximum(x; dims=:)

So maybe we should make fast_maximum dispatch on dims, like

fast_maximum(x; dims) = fast_maximum(x, dims)  # forward the keyword to positional dispatch
fast_maximum(x::AbstractArray{T}, dims) where {T} = @fastmath reduce(max, x; dims, init = float(T)(-Inf))
fast_maximum(x::AbstractArray, ::Colon) = maximum(x)  # whole-array case: plain maximum wins

?
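With that dispatch in place, both calls should take the faster route; a quick (machine-dependent) way to confirm:

julia> let x = randn(Float32, 100, 1000)
           @btime fast_maximum($x; dims=1)  # @fastmath reduce path
           @btime fast_maximum($x; dims=:)  # falls back to plain maximum
       end;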

Sleort mentioned this pull request on Jan 4, 2023
CarloLucibello (Member)

> So maybe we should make fast_maximum dispatch on dims, like

Seems worth doing, but in the meantime let's merge this and use fast_maximum consistently in the library.

CarloLucibello merged commit ccf1732 into FluxML:master on Jan 4, 2023
mcabbott (Member) commented Jan 4, 2023

In softmax I'm reasonably sure that fastmax doesn't change the NaN behaviour. Is this true of logsumexp too?
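One way to check (a hypothetical test, not from this PR): whatever the fastmath max returns on NaN input, exp.(x .- max_) still contains NaN, which survives the sum, so the result should be NaN either way.

using Test
x = Float32[1, 2, NaN]
@test isnan(logsumexp(x))      # maximum-based version
@test isnan(new_logsumexp(x))  # fast_maximum-based version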

I didn't see such clear improvements in logsumexp when I tried, so I thought it simpler to leave it alone. The improvements shown above are modest.

These timings are very machine-dependent, e.g. with the above test:

julia> fast_maximum(x::AbstractArray{T}; dims) where {T} = @fastmath reduce(max, x; dims, init = float(T)(-Inf))
fast_maximum (generic function with 1 method)

julia> let x = randn(Float32, 100, 1000)
           @btime maximum($x; dims=1)
           @btime fast_maximum($x; dims=1)
           @btime maximum($x; dims=:)
           @btime fast_maximum($x; dims=:)
       end;
  min 432.375 μs, mean 436.053 μs (4 allocations, 4.17 KiB)  # M1 mac
  min 7.552 μs, mean 8.109 μs (1 allocation, 4.06 KiB)
  min 63.583 μs, mean 64.105 μs (0 allocations)
  min 7.771 μs, mean 7.815 μs (0 allocations)

  425.427 μs (4 allocations: 4.17 KiB)  # ancient xeon
  29.582 μs (1 allocation: 4.06 KiB)
  97.440 μs (0 allocations: 0 bytes)
  17.075 μs (0 allocations: 0 bytes)

  32.003 μs (54 allocations: 2.36 KiB)  # CUDA.randn, timed with @btime CUDA.@sync
  31.400 μs (54 allocations: 2.36 KiB)
  67.226 μs (117 allocations: 5.52 KiB)
  66.480 μs (117 allocations: 5.52 KiB)

and

julia> let x = randn(Float32, 100, 1000)
           @btime logsumexp($x; dims=1)  #with maximum
           @btime new_logsumexp($x; dims=1) #with fast_maximum
           @btime logsumexp($x; dims=:)
           @btime new_logsumexp($x; dims=:)
       end;
  min 804.250 μs, mean 850.143 μs (8 allocations, 402.97 KiB)  # M1
  min 379.125 μs, mean 409.269 μs (5 allocations, 402.86 KiB)
  min 427.333 μs, mean 461.674 μs (2 allocations, 390.67 KiB)
  min 373.291 μs, mean 401.812 μs (17 allocations, 390.97 KiB)

  1.922 ms (8 allocations: 402.97 KiB)  # xeon
  1.524 ms (5 allocations: 402.86 KiB)
  1.552 ms (2 allocations: 390.67 KiB)
  1.615 ms (17 allocations: 390.97 KiB)

  85.918 μs (184 allocations: 8.47 KiB)  # CUDA.randn, timed with @btime CUDA.@sync
  85.080 μs (184 allocations: 8.47 KiB)
  179.253 μs (274 allocations: 12.75 KiB)
  194.902 μs (290 allocations: 13.06 KiB)

I'm slightly disturbed that these have differing allocations.

Sleort deleted the patch-2 branch on January 5, 2023.