
Add MaxOut layer #647

Merged: 13 commits merged into FluxML:master on Mar 22, 2019

Conversation

oxinabox (Member)

No description provided.

Resolved review threads (outdated): docs/src/models/layers.md, src/layers/basic.jl (×2)
end

"""
MaxOut(f, n_alts, args...; kwargs...)
@MikeInnes (Member), Feb 27, 2019

Why not just pass in those layers, like MaxOut(a, b, c)? Or at least avoid simulating a closure by writing MaxOut(n -> Dense(...), 4). We could have both but then would have to pass a tuple in the first case.

oxinabox (Member Author)

In general, using different layers is not a thing,
and using layers that are not dense/linear is not a thing either.
The paper is just about identically structured dense linear layers.

Allowing them to be different, and allowing them to be any kind of layer,
felt like a natural generalisation, and it made testing easier.

Here is the line in my code using this:

 MaxOut(Dense, 4, insize, hlsize)

Here is the Keras code this is replacing:

MaxoutDense(hlsize, input_dim=input_units, nb_feature=4, bias=True)

What are your proposals?

Member

Makes sense to optimise for the common case of course. But in that case I think MaxOut(_ -> Dense(insize, outsize), 4) gives a clearer sense that this is doing something with 4 dense layers (it looks like ntuple, which is basically what it is).
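For illustration, a rough sketch (with made-up sizes, not code from this PR) of the ntuple analogy mentioned above:

```
using Flux

# ntuple calls the given function once per index, so this builds four
# independently initialised Dense layers of identical shape, which is the
# structure the closure-based MaxOut constructor is meant to produce.
insize, outsize = 10, 5
layers = ntuple(_ -> Dense(insize, outsize), 4)
```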

@oxinabox (Member Author), Feb 28, 2019

This is growing on me.

Another option is:
MaxOut(Dense(insize, outsize), 4),
and then we call deepcopy on the first argument to get 4 versions of it.
Not sure I like that. Very hard to know if it would do it correctly.

oxinabox (Member Author)

Ok, I've sold myself on your idea.
It separates which things are arguments of the inner layer, and which things are arguments of the outer layer.
Going forward we should make this a convention if it comes up again.
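A minimal sketch of that convention (assumed names, not the exact diff in this PR): the closure captures everything the inner layer needs, while the outer constructor only takes the closure and the number of alternatives.

```
# Sketch only: a stripped-down MaxOut with the closure-based outer constructor.
struct MaxOut{T<:Tuple}
    over::T
end

# f is a zero-argument closure, e.g. () -> Dense(insize, outsize); its captured
# variables are the inner layer's arguments, n_alts is the outer layer's argument.
MaxOut(f, n_alts) = MaxOut(Tuple(f() for _ in 1:n_alts))
```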

end

function (mo::MaxOut)(input::AbstractArray)
mapreduce(f -> f(input), (acc, out) -> max.(acc, out), mo.over)
Member

I'd find max.(map(f -> f(input), mo.over)...) clearer.

oxinabox (Member Author)

hmmm...
Will the compiler unroll the map and the splat?

It should be possible to do so since everything is tuples.

oxinabox (Member Author)

mapreduce is noticeably faster, close to 2x faster in this microbenchmark:

julia> foo(fs,x) = max.(map(f->f(x), fs)...)
foo (generic function with 1 method)

julia> bar(fs,x) = mapreduce(f->f(x), (a,b)->max.(a,b), fs)
bar (generic function with 1 method)

julia> @btime foo((x->2x, x->0.5x, x->3x, x->4x, x->6x), $[1:1000.0;]);
  9.469 μs (9 allocations: 47.72 KiB)

julia> @btime bar((x->2x, x->0.5x, x->3x, x->4x, x->6x), $[1:1000.0;]);
  4.989 μs (11 allocations: 71.47 KiB)

Member

Just to record what I mentioned on slack, tracked arrays might have slightly different perf here (broadcast is much heavier and SIMD probably won't kick in either way), so it'd be worth giving that case a quick test.

oxinabox (Member Author)

Even for TrackedArrays

julia> data = Flux.TrackedArray(collect(1:1000));

julia> using BenchmarkTools

julia> foo(fs,x) = max.(map(f->f(x), fs)...)
foo (generic function with 1 method)

julia> bar(fs,x) = mapreduce(f->f(x), (a,b)->max.(a,b), fs)
bar (generic function with 1 method)

julia> @btime bar((x->2x, x->0.5x, x->3x, x->4x, x->6x), $data);
  7.891 μs (65 allocations: 73.08 KiB)

julia> @btime foo((x->2x, x->0.5x, x->3x, x->4x, x->6x), $data);
  12.359 μs (56 allocations: 49.13 KiB)

We have resolved on Slack that we will use mapreduce,
and if anyone complains that it is too slow we will rewrite it using a generated function.
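For reference, a hedged sketch (assumed names, not code from this PR) of what such a generated-function fallback could look like, unrolling the elementwise max over a tuple of layers:

```
# Sketch of an unrolled alternative to mapreduce; fs is a tuple of layers/functions.
@generated function unrolled_maxout(fs::Tuple, x)
    n = length(fs.parameters)         # tuple length is known from the type
    ex = :(fs[1](x))
    for i in 2:n
        ex = :(max.($ex, fs[$i](x)))  # build max.(max.(...), fs[i](x)) explicitly
    end
    return ex
end

unrolled_maxout((x -> 2x, x -> 0.5x, x -> 3x), [1.0, 2.0, 3.0])  # == [3.0, 6.0, 9.0]
```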

@nickrobinson251 (Contributor)

Very minor: the name Maxout would be consistent with the paper (and with Dropout), as opposed to MaxOut :)

@oxinabox (Member Author) commented Mar 6, 2019

Done

```
"""
function Maxout(f, n_alts, args...; kwargs...)
over = Tuple(f(args...; kwargs...) for _ in 1:n_alts)
Member

The signature needs an update now that we're supporting the closure-ised version instead of passing args through.

Also I think it'd make sense to use ntuple(f, n_alts) to construct the tuple. It's clean and it means you can write _ -> Dense(...), which saves a character :) but the n might be useful in some cases also.
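To make the trade-off concrete, a tiny sketch (placeholder builders, not PR code) of the two construction styles being compared:

```
n_alts = 4
build() = rand(3)        # placeholder for a zero-argument layer builder, e.g. () -> Dense(...)
build_i(i) = fill(i, 3)  # placeholder for a builder that uses the index, e.g. i -> Dense(...)

over_a = Tuple(build() for _ in 1:n_alts)  # builder takes no arguments
over_b = ntuple(build_i, n_alts)           # builder is called as build_i(1), ..., build_i(n_alts)
```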

@oxinabox (Member Author), Mar 7, 2019

It could be useful.
I worry that we are overengineering generality into this that is not present in the paper.

If we wanted to make it easier/less complex to just do the thing in the paper (and the thing Keras lets you do),
we could add:

MaxoutDense(insize, outsize, n_alts) = Maxout(_->Dense(insize, outsize), n_alts)

But looking at that written out, I don't know that I like the LHS more than the RHS anyway.

@oxinabox (Member Author), Mar 11, 2019

I've thought about it more.
I can't imagine that passing in the number would ever be useful.
Maxout is defined as the max over structurally identical internal layers.

Most things you might think about doing with it are

  • going to break it: Maxout(n->Dense(128, 128n), 4)
  • pointless: Maxout(n->n*Dense(128, 128), 4)
  • weird: Maxout(n->Dense(128, 128; σ=(relu, identity, tanh, sigmoid)[n]), 4)
  • complex enough that they should be written separately:
Maxout(n->Chain(Dense(128, 128n, σ=sigmoid), Dense(128n, 128)), 4)

vs

inner_layers = ntuple(n->Chain(Dense(128, 128n, σ=sigmoid), Dense(128n, 128)), 4)
Maxout(inner_layers)

I think it is better to say that if you want to construct differing internal layers,
you should do so explicitly and pass them in as a tuple.

This extra flexibility, I think, makes the code harder to read (even if it can be one character shorter).
It is readily apparent what ()->f(z) does, and how I can change/tweak it,
whereas the purpose of the argument in _->f(z) is not clear, and it isn't obvious what the n is doing in n->g(z,n).

Resolved review threads (outdated): docs/src/models/layers.md (×2), src/Flux.jl
@oxinabox (Member Author)

Any further thoughts?

@MikeInnes (Member)

I think we are in general agreement on how to tweak the API; I think it'll be good to go once that's done.

@oxinabox (Member Author)

API tweaks are done.
It matches as discussed in #647 (comment)

end

"""
Maxout(f, n_alts, args...; kwargs...)
Member

If we're supporting the closure-based API we can remove the splatting from the signature.

I think we may as well document the signature as Maxout(() -> ...) if that's always the intended use.

oxinabox (Member Author)

Ah yes, my bad, I forgot to remove the arguments.

oxinabox (Member Author)

It is Maxout(f, n_alts).
I feel like Maxout(()->..., n_alts) is a bit less clear than having f and then being able to reference f in the docstring below.

The docstring does include an example:

Maxout(()->Dense(insize, outsize), 4)
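A small usage sketch based on that docstring example (insize, outsize and the batch size here are placeholders):

```
using Flux

insize, outsize = 10, 5
m = Maxout(() -> Dense(insize, outsize), 4)  # 4 independently initialised Dense layers

x = rand(Float32, insize, 8)  # a batch of 8 inputs
y = m(x)                      # size (outsize, 8): elementwise max over the 4 Dense outputs
```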

@MikeInnes (Member)

Alright, I think this is pretty perfect. Thanks a lot @oxinabox!

@MikeInnes MikeInnes merged commit b637311 into FluxML:master Mar 22, 2019
@oxinabox oxinabox mentioned this pull request Mar 25, 2019
BerenMillidge pushed a commit to BerenMillidge/Flux.jl that referenced this pull request Dec 20, 2019
@mcabbott mcabbott mentioned this pull request Nov 30, 2021
bors bot added a commit that referenced this pull request Dec 13, 2021
1794: Tidy up `Maxout` r=mcabbott a=mcabbott

Maxout is from #698 . This:

* adds pretty printing
* changes the explicit signature to `Maxout(layer, layer, layer)`, rather than providing a tuple, to be more like other layers (with deprecation)
* adds more examples to the docstring, and combines the two
* changes not to use `mapreduce`. I see now this was a performance choice at the time, discussed here #647 (comment), but with Zygote this is much slower.

Before:
```
julia> using Flux

julia> m3 = Maxout(() -> Dense(5, 7, tanh), 3)
Maxout{Tuple{Dense{typeof(tanh), Matrix{Float32}, Vector{Float32}}, Dense{typeof(tanh), Matrix{Float32}, Vector{Float32}}, Dense{typeof(tanh), Matrix{Float32}, Vector{Float32}}}}((Dense(5, 7, tanh), Dense(5, 7, tanh), Dense(5, 7, tanh)))

julia> x = rand(Float32, 5, 11);

julia> @btime gradient(sum∘m3, $x);
  min 112.792 μs, mean 123.774 μs (930 allocations, 49.09 KiB. GC mean 3.71%)
```
After:
```
julia> m3 = Maxout(() -> Dense(5, 7, tanh), 3)
Maxout(
  Dense(5, 7, tanh),                    # 42 parameters
  Dense(5, 7, tanh),                    # 42 parameters
  Dense(5, 7, tanh),                    # 42 parameters
)                   # Total: 6 arrays, 126 parameters, 888 bytes.

julia> x = rand(Float32, 5, 11);

julia> @btime gradient(sum∘m3, $x);
  min 34.541 μs, mean 38.448 μs (493 allocations, 32.48 KiB. GC mean 6.63%)
```

Co-authored-by: Michael Abbott <[email protected]>