Define activation functions taking arrays as input #423

Merged (2 commits merged into FluxML:master on Jun 19, 2022)

Conversation

theabhirath
Member

An attempt to fix #422...hopefully this doesn't break anything

@darsnack
Member

Convolution fuzzing tests are already failing on master

@DhairyaLGandhi
Member

DhairyaLGandhi commented Jun 19, 2022 via email

@darsnack
Member

What do you mean? The canonical definition of what?

If it's the layer forward definitions, then this doesn't change those. All this does is allow relu directly in a Chain instead of x -> relu.(x). And since the implementation is just broadcasting the function, it will hit the same paths.
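
For concreteness, the change under discussion amounts to something like the following (a rough sketch of the idea, not the literal diff):

```julia
# Inside NNlib, roughly: the scalar method already exists, and the new method
# just broadcasts it, so that `relu` can sit directly in a Chain.
relu(x::Real) = max(x, zero(x))      # existing scalar definition (schematic)
relu(x::AbstractArray) = relu.(x)    # new: apply elementwise to arrays

# With this, Chain(Dense(2 => 2), relu) works without writing x -> relu.(x).
```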

@DhairyaLGandhi
Member

The forward passes in many cases look like

act.(f(x))

I.e. with the activation broadcast over an object. In this case, the activation function provided has to be relu itself. When something like x -> relu.(x) is provided with the above form of the forward pass, the anonymous function is itself broadcast and generates

(x -> relu.(x)).(f(x))

Thus what actually happens is that the anonymous function receives a scalar, and it works anyway since numbers themselves support broadcasting. I think in most cases it doesn't make much difference, but that is what is happening unless the compiler can optimise the extra broadcast away. In AD, we would actually have to see both the outer broadcast and the inner broadcast and generate a pullback for each (this hopefully isn't the case with a lens-based system that can interleave optimisation and compilation, but it is the case elsewhere).
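
As a small standalone illustration of that double broadcast (relu here comes from NNlib; y is just a stand-in for f(x)):

```julia
using NNlib: relu

act = x -> relu.(x)          # user-supplied activation that already broadcasts internally
y   = randn(Float32, 3, 3)   # stand-in for f(x)

act.(y)                      # the outer broadcast hands each scalar element to act,
                             # whose inner relu.(x) then broadcasts over that scalar
```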

To be clear, I'm in favour of picking points of optimisation and simplification; I just wanted to clarify that this is mostly useful for cases where the activation function sees a collection of arrays (like a tuple or vector of arrays). And if there are specific advantages to automatically calling broadcasted operations (say, for fusion), then perhaps overloading broadcasted would give the compiler more hints.
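
For reference, the broadcasted overload mentioned above could look something like this for a made-up activation (an illustrative sketch only, not something Flux or NNlib is known to do for its activations):

```julia
import Base.Broadcast: broadcasted

# A made-up scalar activation:
myact(x::Real) = max(x, zero(x))

# Intercept `myact.(x)` on arrays and route it to a specialised (e.g. fused) kernel:
broadcasted(::typeof(myact), x::AbstractArray) = fused_myact(x)

# Stand-in for the real fused implementation:
fused_myact(x) = max.(x, zero(eltype(x)))
```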

@darsnack darsnack merged commit aad08b5 into FluxML:master Jun 19, 2022
@theabhirath theabhirath deleted the broadcast-act branch June 19, 2022 16:27
@theabhirath
Member Author

Could we get a patch release with this? Would be helpful 😅

@mcabbott
Member

mcabbott commented Jun 23, 2022

I missed this, but before we release, is it a good idea?

It means Chain(Dense(2 => 2), relu)(rand(2, 2)) will work, instead of giving an explanatory error. But Chain(Dense(2 => 2), tanh)(rand(2, 2)) will do something quite different, and using your own function f will probably not work.
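
For anyone following along, the "quite different" behaviour comes from LinearAlgebra already defining tanh of a square matrix as the matrix function:

```julia
using LinearAlgebra

A = rand(2, 2)
tanh(A)    # matrix hyperbolic tangent (a matrix function), not elementwise
tanh.(A)   # elementwise tanh, which is what the layer is meant to compute
```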

Another level at which this "do what I mean" could be implemented is to make the Chain constructor replace any lonely activation functions with Base.Fix1(broadcast, f). Perhaps with a warning. Then when you enter such a chain at the REPL, at least you know that it's been fixed up somehow.
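
To make that concrete, Base.Fix1(broadcast, f) is simply a callable that broadcasts f over its argument (relu here comes from NNlib):

```julia
using NNlib: relu

g = Base.Fix1(broadcast, relu)   # g(x) == broadcast(relu, x) == relu.(x)
x = rand(Float32, 2, 2)
g(x) == relu.(x)                 # true

# The idea above: a Chain constructor could wrap a lone activation function in
# such an object, perhaps emitting a warning while doing so.
```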

@darsnack
Member

Unfortunately, we already released this change.

It means Chain(Dense(2 => 2), relu)(rand(2, 2)) will work, instead of giving an explanatory error. But Chain(Dense(2 => 2), tanh)(rand(2, 2)) will do something quite different, and using your own function f will probably not work.

My immediate thought is: when is activation(x::AbstractArray) not an element-wise operation? In almost any description of the model at the start of a paper, this is the assumed understanding. And this is clearly true for all the activations changed in this PR. So, my gut says there is almost no case where broadcasting isn't the correct and only thing to do, making the interpretation in this PR fairly safe. I'm not entirely clear on why the examples you gave should error.

Random thoughts:

  • Doing a replacement in a Chain constructor excludes Parallel, etc. We could do this in Parallel's constructor too, but then we exclude custom layers.
  • You could make it work for any model by doing a walk and replacing (a rough sketch is shown just after this list), but then the user has to do that to get more informative errors.
  • Doing things this way means that Dense(2 => 2, tanh) and Chain(Dense(2 => 2), tanh) do different things under the hood. Maybe Chain(..., tanh, ...) should be able to replace tanh with tanh_fast. Doesn't Chain(..., x -> tanh.(x), ...) also suffer from this issue?
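
A rough sketch of the walk-and-replace idea from the second bullet, using Functors.fmap (illustrative only; the wrapping rule and the exclude predicate here are assumptions, not an existing Flux API):

```julia
using Flux, Functors

# Wrap a bare `tanh` wherever it appears as a node in the model; leave everything else alone.
wrap(f::typeof(tanh)) = Base.Fix1(broadcast, f)
wrap(x) = x

model = Chain(Dense(2 => 2), tanh)
fixed = fmap(wrap, model; exclude = x -> x isa Function)

fixed(rand(Float32, 2, 3))   # the lone tanh is now broadcast over the Dense output
```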

@mcabbott
Member

this is the assumed understanding

I agree it's what's normally meant, it just seems a bit contrary to Julia's normal behaviour.

My tanh example is meant to highlight that some functions already have matrix definitions, which are quite different. Any function you write yourself out of exp, log etc. will also tend to have such a definition. Although in practice getting a square matrix input is going to be unlikely.

My Chain constructor idea may indeed not be a great one. If it's noisy, then it could be an earlier place to remind people "you need to broadcast that!" than waiting until a runtime error. There would be less expectation that such a "training wheels" feature also apply to Parallel etc.

@darsnack
Member

darsnack commented Jun 23, 2022

My tanh example is meant to highlight that some functions already have matrix definitions, which are quite different.

Good point, and actually these cases should probably be reverted as piracy? Still a good point, but I misread the source; there is no explicit piracy here.

@ToucheSir
Member

The nuclear option is to do what PyTorch does and make vectorized activation layer types/constructors, then not export any of the activation functions themselves so that users are incentivized to use the former. That gets around some of the confusion with act.(x) vs act(x) not always being equivalent, but historically we've not wanted to go down this route.
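
For reference, "vectorized activation layer types" here means something along the lines of the following hypothetical sketch (not an existing Flux type):

```julia
# A layer type that wraps a scalar activation and always applies it elementwise:
struct Activation{F}
    f::F
end
(a::Activation)(x::AbstractArray) = a.f.(x)

Activation(tanh)(rand(2, 2))   # elementwise tanh, never the matrix function

# In a model this would read Chain(Dense(2 => 2), Activation(relu)),
# analogous to PyTorch's nn.ReLU().
```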

Successfully merging this pull request may close these issues.

Activation functions have to be broadcasted by the user to act on arrays