Add MaxOut layer #647
Conversation
src/layers/basic.jl
Outdated
```julia
end

"""
    MaxOut(f, n_alts, args...; kwargs...)
```
Why not just pass in those layers, like `MaxOut(a, b, c)`? Or at least avoid simulating a closure by writing `MaxOut(n -> Dense(...), 4)`. We could have both, but then we would have to pass a tuple in the first case.
In general, using different layers is not a thing, and using layers that are not dense/linear is not a thing either. The paper is just about identically structured dense linear layers. Allowing them to be different, and allowing them to be any kind of layer, felt like a natural generalisation, and it made testing easier.

Here is the line in my code using this:

```julia
MaxOut(Dense, 4, insize, hlsize)
```

Here is the Keras code this is replacing:

```python
MaxoutDense(hlsize, input_dim=input_units, nb_feature=4, bias=True)
```

What are your proposals?
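(For context, here is a minimal sketch of what the paper's maxout unit computes: an elementwise max over k identically structured affine layers. The sizes and names below are illustrative, mirroring the `MaxOut(Dense, 4, insize, hlsize)` call above, not code from this PR.)

```julia
using Flux

# k identically structured affine maps (no nonlinearity), as in the paper.
insize, hlsize, k = 10, 32, 4
pieces = Tuple(Dense(insize, hlsize) for _ in 1:k)

# A maxout unit takes an elementwise max over the k affine outputs.
maxout(x) = mapreduce(d -> d(x), (a, b) -> max.(a, b), pieces)

x = rand(Float32, insize)
size(maxout(x))   # (32,)
```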
Makes sense to optimise for the common case, of course. But in that case I think `MaxOut(_ -> Dense(insize, outsize), 4)` gives a clearer sense that this is doing something with 4 dense layers (it looks like `ntuple`, which is basically what it is).
This is growing on me.

Another option is `MaxOut(Dense(insize, outsize), 4)`, and then we call `deepcopy` on the first argument to get 4 versions of it. Not sure I like that. Very hard to know if it would do it correctly.
Ok, I've sold myself on your idea. It separates which things are arguments of the inner layer, and which things are arguments of the outer layer. Going forward we should make this a convention if it comes up again.
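(A sketch of the convention agreed on here, using a hypothetical `MyMaxout` rather than restating the PR's actual code: inner-layer arguments live in the closure, and outer-layer arguments, i.e. the number of alternatives, go to the constructor.)

```julia
using Flux

# Hypothetical illustration of the "closure for the inner layer" convention.
struct MyMaxout{T<:Tuple}
    over::T
end

# The closure carries the inner layer's arguments; n_alts is the outer layer's.
MyMaxout(f, n_alts::Integer) = MyMaxout(Tuple(f() for _ in 1:n_alts))

m = MyMaxout(() -> Dense(5, 7), 4)   # four independently initialised Dense(5, 7) layers
```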
```julia
end

function (mo::MaxOut)(input::AbstractArray)
    mapreduce(f -> f(input), (acc, out) -> max.(acc, out), mo.over)
```
I'd find `max.(map(f -> f(input), mo.over)...)` clearer.
Hmmm... will the compiler unroll the `map` and the splat? It should be possible to do so, since everything is tuples.
Mapreduce is noticeably faster: nearly 2x faster.

```julia
julia> foo(fs,x) = max.(map(f->f(x), fs)...)
foo (generic function with 1 method)

julia> bar(fs,x) = mapreduce(f->f(x), (a,b)->max.(a,b), fs)
bar (generic function with 1 method)

julia> @btime foo((x->2x, x->0.5x, x->3x, x->4x, x->6x), $[1:1000.0;]);
  9.469 μs (9 allocations: 47.72 KiB)

julia> @btime bar((x->2x, x->0.5x, x->3x, x->4x, x->6x), $[1:1000.0;]);
  4.989 μs (11 allocations: 71.47 KiB)
```
Just to record what I mentioned on slack, tracked arrays might have slightly different perf here (broadcast is much heavier and SIMD probably won't kick in either way), so it'd be worth giving that case a quick test.
Even for TrackedArrays:

```julia
julia> data = Flux.TrackedArray(collect(1:1000));

julia> using BenchmarkTools

julia> foo(fs,x) = max.(map(f->f(x), fs)...)
foo (generic function with 1 method)

julia> bar(fs,x) = mapreduce(f->f(x), (a,b)->max.(a,b), fs)
bar (generic function with 1 method)

julia> @btime bar((x->2x, x->0.5x, x->3x, x->4x, x->6x), $data);
  7.891 μs (65 allocations: 73.08 KiB)

julia> @btime foo((x->2x, x->0.5x, x->3x, x->4x, x->6x), $data);
  12.359 μs (56 allocations: 49.13 KiB)
```

We have resolved on Slack that we will use `mapreduce`, and if anyone complains that it is too slow, rewrite it using a generated function.
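(For reference, a rough sketch of the generated-function fallback mentioned above, in case `mapreduce` ever proves too slow. The function name is hypothetical and this is not part of the PR.)

```julia
# Because the layers live in a tuple, the number of alternatives is known at
# compile time, so the chain of elementwise maxes can be unrolled explicitly.
@generated function maxout_unrolled(fs::Tuple, x)
    n = length(fs.parameters)          # tuple length is encoded in the type
    ex = :(fs[1](x))
    for i in 2:n
        ex = :(max.($ex, fs[$i](x)))   # max.(max.(f1(x), f2(x)), f3(x)), ...
    end
    return ex
end

maxout_unrolled((x -> 2x, x -> 3x), [1.0, 2.0])   # == [3.0, 6.0]
```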
Very minor: the name

Done
src/layers/basic.jl
Outdated
````julia
```
"""
function Maxout(f, n_alts, args...; kwargs...)
    over = Tuple(f(args...; kwargs...) for _ in 1:n_alts)
````
The signature needs an update now that we're supporting the closure-ised version instead of passing args through. Also I think it'd make sense to use `ntuple(f, n_alts)` to construct the tuple. It's clean, and it means you can write `_ -> Dense(...)`, which saves a character :) but the `n` might be useful in some cases also.
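(A sketch of the `ntuple`-based construction suggested above, with a hypothetical helper name; `ntuple(f, n)` calls `f(1), ..., f(n)`, so the closure receives the index and may ignore it.)

```julia
using Flux

# Hypothetical helper: build the tuple of inner layers with ntuple.
maxout_layers(f, n_alts::Integer) = ntuple(f, n_alts)

maxout_layers(_ -> Dense(5, 7), 4)                                      # ignore the index
maxout_layers(n -> Dense(5, 7, (relu, tanh, identity, sigmoid)[n]), 4)  # or use it
```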
It could be useful. I worry that we are overengineering generality into this that is not present in the paper. If we wanted to make it easier/less complex to just do the thing in the paper (and the thing Keras lets you do), we could add:

```julia
MaxoutDense(insize, outsize, n_alts) = Maxout(_ -> Dense(insize, outsize), n_alts)
```

But looking at that written, I don't know that I like the LHS more than the RHS anyway.
I've thought about it more. I can't imagine passing in the number would ever be useful. Maxout is defined as the max over structurally identical internal layers. Most things you might think about doing with it are:

- going to break it: `Maxout(n -> Dense(128, 128n), 4)`
- pointless: `Maxout(n -> n*Dense(128, 128), 4)`
- weird: `Maxout(n -> Dense(128, 128; σ=(relu, identity, tanh, sigmoid)[n]), 4)`
- complex enough that they should be written separately:

  `Maxout(n -> Chain(Dense(128, 128n, σ=sigmoid), Dense(128n, 128)), 4)`

  vs

  ```julia
  inner_layers = ntuple(n -> Chain(Dense(128, 128n, σ=sigmoid), Dense(128n, 128)), 4)
  Maxout(inner_layers)
  ```

I think it is better to be able to say that if you want to construct differing internal layers, then you should do so explicitly and pass them in as a tuple. This extra flexibility I think makes the code harder to read (even if it can be 1 character shorter). It is readily apparent what a `() -> f(z)` does, and how I can change/tweak it, whereas the purpose of the argument in `_ -> f(z)` is not clear, and it isn't obvious what the `n` is doing in `n -> g(z, n)`.
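(The two construction routes argued for above, side by side; this assumes a constructor that also accepts a pre-built tuple of layers, as in this PR.)

```julia
using Flux

# Identical inner layers: closure plus a count.
m1 = Maxout(() -> Dense(128, 128), 4)

# Differing inner layers: build them explicitly and pass them in as a tuple.
inner_layers = ntuple(n -> Chain(Dense(128, 128n, sigmoid), Dense(128n, 128)), 4)
m2 = Maxout(inner_layers)
```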
Co-Authored-By: oxinabox <[email protected]>
Any further thoughts?

I think we are in general agreement on how to tweak the API; I think it'll be good to go once that's done.

API tweaks are done.
src/layers/basic.jl
Outdated
```julia
end

"""
    Maxout(f, n_alts, args...; kwargs...)
```
If we're supporting the closure-based API, we can remove the splatting from the signature. I think we may as well document the signature as `Maxout(() -> ...)` if that's always the intended use.
Ah yes, my bad, I forgot to remove the arguments.
It is `Maxout(f, n_alts)`. I feel like `Maxout(() -> ..., n_alts)` is a bit less clear than having `f` and then being able to reference `f` in the docstring below. The docstring does include an example:

```julia
Maxout(() -> Dense(insize, outsize), 4)
```
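(A quick end-to-end example of the documented constructor; sizes are arbitrary and the output-shape check is just for illustration.)

```julia
using Flux

insize, outsize = 5, 7
m = Maxout(() -> Dense(insize, outsize), 4)   # four independently initialised Dense layers

x = rand(Float32, insize, 3)   # a batch of three inputs
y = m(x)                       # elementwise max over the four Dense outputs
size(y)                        # (7, 3)
```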
Alright, I think this is pretty perfect. Thanks a lot @oxinabox!
1794: Tidy up `Maxout` r=mcabbott a=mcabbott

Maxout is from #698. This:

* adds pretty printing
* changes the explicit signature to `Maxout(layer, layer, layer)`, rather than providing a tuple, to be more like other layers (with deprecation)
* adds more examples to the docstring, and combines the two
* changes not to use `mapreduce`. I see now this was a performance choice at the time, discussed here #647 (comment), but with Zygote this is much slower.

Before:

```
julia> using Flux

julia> m3 = Maxout(() -> Dense(5, 7, tanh), 3)
Maxout{Tuple{Dense{typeof(tanh), Matrix{Float32}, Vector{Float32}}, Dense{typeof(tanh), Matrix{Float32}, Vector{Float32}}, Dense{typeof(tanh), Matrix{Float32}, Vector{Float32}}}}((Dense(5, 7, tanh), Dense(5, 7, tanh), Dense(5, 7, tanh)))

julia> x = rand(Float32, 5, 11);

julia> @btime gradient(sum∘m3, $x);
  min 112.792 μs, mean 123.774 μs (930 allocations, 49.09 KiB. GC mean 3.71%)
```

After:

```
julia> m3 = Maxout(() -> Dense(5, 7, tanh), 3)
Maxout(
  Dense(5, 7, tanh),  # 42 parameters
  Dense(5, 7, tanh),  # 42 parameters
  Dense(5, 7, tanh),  # 42 parameters
)                     # Total: 6 arrays, 126 parameters, 888 bytes.

julia> x = rand(Float32, 5, 11);

julia> @btime gradient(sum∘m3, $x);
  min 34.541 μs, mean 38.448 μs (493 allocations, 32.48 KiB. GC mean 6.63%)
```

Co-authored-by: Michael Abbott <[email protected]>