Add gradients for `conv_bias_act`, and a similar `dense_bias_act` #346
Conversation
Is the original message up top still accurate? It looks like the implementation is there. What help is necessary to get this through?
My memory is that this basically worked, but the performance was disappointing due to JuliaLang/julia#43153. Writing back into the same […]

Edit: ok, I've updated things. I think the most honest benchmark looks like this, and shows a serious improvement from […]:

```julia
julia> w, b = rand(Float32, 100, 100), rand(Float32, 100); x = rand(Float32, size(w)...);

julia> @btime gradient((w,x,b) -> sum(abs2, dense_bias_act(tanh, w, x, b)), wr[], $x, $b)  setup=(wr=Ref(randn(Float32,100,100)))  evals=1;
  min 44.792 μs, mean 79.901 μs (71 allocations, 198.37 KiB)

julia> @btime gradient((w,x,b) -> sum(abs2, tanh.((w * x) .+ b)), wr[], $x, $b)  setup=(wr=Ref(randn(Float32,100,100)))  evals=1;
  min 114.583 μs, mean 158.989 μs (39 allocations, 275.25 KiB)

julia> @btime gradient((w,x,b) -> sum(abs2, tanh_fast.((w * x) .+ b)), wr[], $x, $b)  setup=(wr=Ref(randn(Float32,100,100)))  evals=1;
  min 40.125 μs, mean 75.140 μs (39 allocations, 275.25 KiB)
```

Would be worthwhile to benchmark on other computers. (This is M1 + Apple's BLAS.) And on GPUs. And […]
Rebased at https://github.com/mcabbott/NNlib.jl/tree/bias_act_22 after squashing, but its own tests fail.
This aims to add gradient definitions for the existing `conv_bias_act`. That is, however, very much WIP, and I don't recommend anyone try to read it just yet. It also adds an analogous `dense_bias_act`, which is closer to done.

What this gains you over `σ.(w*x .+ b)` is memory savings. Zygote will by default un-fuse the broadcast, allocating 3 arrays on the forward pass, but in fact we can often over-write the result of `w*x`, saving 2 copies. This should happen both on CPU and GPU. There is one more copy you could save on the reverse pass, bringing you to 1/2 the memory usage of before, but only if you were sure that the pullback would only be called once. That isn't true for say `Zygote.jacobian`, and I don't think there's a way to know when it will be safe. So we save 1/3 not 1/2, when inside Zygote.
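For concreteness, here is a minimal sketch of the idea. It is not the PR's code: `dense_bias_act_sketch` is a made-up name, and it is specialised to `tanh`, whose derivative can be recovered from the output alone (more on that below). The forward pass writes bias and activation back into the array holding `w*x`, and the pullback re-uses that same output.

```julia
using ChainRulesCore

# Hypothetical stand-in for dense_bias_act, specialised to tanh for this sketch.
function dense_bias_act_sketch(σ::typeof(tanh), w, x, b)
    y = w * x               # the only allocation on the forward pass
    y .= σ.(y .+ b)         # bias + activation written back into the same array
    return y
end

function ChainRulesCore.rrule(::typeof(dense_bias_act_sketch), σ::typeof(tanh), w, x, b)
    y = dense_bias_act_sketch(σ, w, x, b)
    function sketch_pullback(dy)
        dz = unthunk(dy) .* (1 .- y .^ 2)   # tanh'(z) = 1 - tanh(z)^2, needs only the output y
        dw = dz * x'
        dx = w' * dz
        db = vec(sum(dz; dims = 2))
        return (NoTangent(), NoTangent(), dw, dx, db)
    end
    return y, sketch_pullback
end
```

Zygote would pick this rule up via ChainRules, so something like `gradient((w,x,b) -> sum(abs2, dense_bias_act_sketch(tanh, w, x, b)), w, x, b)` should allocate one forward-pass array instead of three.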
I say "often" because over-writing `w*x` only works when the gradient of `σ` can be written in terms of its output, without saving its input. That's true for `tanh` and `relu` and some others, which are ~~explicitly whitelisted here as `INPLACE_ACTS`~~ now handled using JuliaDiff/ChainRulesCore.jl#453. Surely a more extensible method for that could be invented.
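To spell out what "written in terms of its output" means (the actual mechanism lives in JuliaDiff/ChainRulesCore.jl#453; the helper name below is invented purely for illustration):

```julia
# For these activations, σ'(x) is a function of y = σ(x) alone, so the input x
# need not be kept around and y may safely over-write it.
deriv_from_output(::typeof(tanh), y)     = 1 - y^2                  # tanh'(x) = 1 - tanh(x)^2
deriv_from_output(::typeof(identity), y) = one(y)
myrelu(x) = max(x, zero(x))                                         # stand-in for NNlib.relu
deriv_from_output(::typeof(myrelu), y)   = ifelse(y > 0, one(y), zero(y))

# By contrast, sin does not qualify: sin'(x) = cos(x) = ±sqrt(1 - sin(x)^2),
# and the sign cannot be recovered from sin(x) alone.
```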
This was written before seeing FluxML/NNlibCPU.jl#1. But they may work well together -- for instance the function `dense!` there could (after we adjust signatures a little) simply overload a function here, providing a fast path when that package is loaded. Likewise it can overload `conv_bias_act!` to run a fused activation-and-convolution on the CPU, a bit like the existing NNlibCUDA routine. (From a first glance it looks like `dense!` has a trait for deciding which functions are in-place-safe, which is good.) Again, not fully baked, but opened now to start discussing.
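A rough sketch of how that layering might look; every name here (the `dense_bias_act!` signature, the overload) is assumed for illustration, not taken from either package:

```julia
using LinearAlgebra: mul!

# Generic in-place fallback which NNlib could own:
function dense_bias_act!(y::AbstractMatrix, σ, w::AbstractMatrix, x::AbstractMatrix, b)
    mul!(y, w, x)        # y = w * x without allocating a new array
    y .= σ.(y .+ b)      # bias + activation fused into the same array
    return y
end

# A CPU package like NNlibCPU could then provide the fast path simply by overloading it,
# for example (names hypothetical):
#     NNlib.dense_bias_act!(y, σ, w, x, b) = dense!(y, σ, w, x, b)
# and similarly for conv_bias_act!, giving fused CPU kernels whenever that package is loaded.
```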