More complex models: multiple inputs, multiple outputs, and parallel paths inside a model #1289
Conversation
… custom layer description, and in the src as layers 'Join', 'Split', and 'Parallel'
I'd forgo the layers for now and focus on the design aspect of it for this PR, and can revisit the layers in a separate PR.
Is it not a performance problem that the Customxxx structs are not parametric? How about defining the struct with a type parameter, so that the compiler can generate optimized code? My bad, I just saw that the code in your local branch has parametric types.
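For illustration, a parametric version could look roughly like the sketch below; the `Split` name and `fs` field mirror the code under review, but the exact layout is an assumption:

```julia
# Hedged sketch: storing the branches in a concretely-typed tuple lets the
# compiler specialize the forward pass on the exact branch types.
struct Split{T<:Tuple}
    fs::T
end
Split(fs...) = Split(fs)
```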
I would also suggest to change …; this makes the function type-stable and therefore probably much faster :)
Can all types of connections be expressed with these types of layers, or will there still be 'entangled' cases which can't? What's the advantage vs. just going full static computation graph like Keras/TorchScript/ONNX?
@HenriDeh
I have only tested this with dense layers and CUDA. Because of vcat, I am unsure whether conv or other layers may work. Concerning your second point: I am not sure why Flux.jl shouldn't have these features?
One more then: I just read in the Flux documentation performance tips that using `reduce` with the concatenation function (e.g. `reduce(vcat, xs)`) is better optimized than splatting (`vcat(xs...)`).
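A small illustration of the difference, assuming the tip refers to concatenating many arrays:

```julia
xs = [rand(3) for _ in 1:100]

# Splatting compiles a method specialized on 100 arguments:
y_splat = vcat(xs...)

# reduce(vcat, xs) dispatches to an optimized method instead:
y_reduce = reduce(vcat, xs)

y_splat == y_reduce   # true
```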
Bump
I'd be happy to help out here, but I don't want to get in the way or cause divergence. What are the design aspects which need to be discussed?

I have a framework-agnostic implementation of a standard static computation graph here. Its current home package comes with a lot of stuff for programmatically modifying the graph, but it could easily be moved to its own package with very few (if any) dependencies if there is interest. Flux could then either depend on it or (perhaps better) just direct people to that package.

There is nothing stopping anyone from trying it out today though. The API is a bit clunky as it was not designed to be used by humans, but one can easily make their own leaner version like I have done below:

```julia
using NaiveNASflux
# Works for all Flux layers, but lets make a simple dense example here
julia> dense(insize::Integer, outsize, act) = dense(inputvertex("in", insize), outsize, act)
julia> dense(input::AbstractVertex, outsize, act) = mutable(Dense(nout(input), outsize, act), input)
dense (generic function with 2 methods)
julia> v1 = dense(3, 4, relu);
julia> v2 = dense(v1, 3, elu);
julia> v3 = concat(v1, v2);
julia> v4 = dense(v3, 4, relu);
julia> v5 = v1 + v4;
julia> graph = CompGraph(inputs(v1), v5);
julia> graph(ones(3, 2))
4×2 Array{Float32,2}:
0.289443 0.289443
1.15558 1.15558
0.852968 0.852968
0.0 0.0
# Vertices can ofc also be called as functions
julia> v1(ones(3,2))
4×2 Array{Float32,2}:
0.0 0.0
1.15558 1.15558
0.612048 0.612048
0.0 0.0
```
Making such layers composable, allowing parallelization over tasks and reduction via user-defined functions, and performance considerations would be some of the things to get a grip on.
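As a rough sketch of the task-parallelism and user-defined-reduction points (not part of the PR; names are hypothetical):

```julia
using Base.Threads: @spawn

# Sketch only: evaluate each branch on its own task, then combine the results
# with a user-defined reduction (vcat here). Whether this pays off depends on
# how expensive the individual branches are.
function parallel_apply(reduce_op, fs, x)
    tasks = map(f -> @spawn(f(x)), fs)
    return reduce(reduce_op, map(fetch, tasks))
end

parallel_apply(vcat, (x -> 2 .* x, x -> x .+ 1), rand(3))
```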
Those are indeed important issues. I see it as one advantage of the traditional static computation graph structure (e.g. TensorFlow, ONNX, Dagger, etc.) that its properties are well known, so assessing those issues ought to be easier. Are there any particular drawbacks which make it something to stay clear of?

On a higher level, what are the missing things in Flux which one seeks to address here? For example, one drawback of just using raw Julia functions (which is a very appealing property of frameworks like Flux and PyTorch imo) is that things like …

I'd be happy to do some analysis of the mentioned issues for the standard static graph case, but I'm thinking it will be more than a couple of lines, so maybe an issue tracker is not the right place to post it.

Oh, and if these questions have already been discussed and decided upon in some other forum then just let me know and I'll stop pestering you guys about this matter. No feelings hurt, I promise :)
I added some comments about performance changes.
I think I might be missing something, but I don't see the purpose of the docs addition. I agree there should be a section about using these advanced layers, but as far as I can tell, the current docs addition just walks through reimplementing the layers already in `structure.jl`. It would be more useful to walk through how these layers can be used for things like inception modules.

On that last point, can we update these layers to support arbitrary reduce operators? If so, then `Parallel` could replace `SkipConnection`. The existence of both seems redundant to me.
```julia
function (w::Split)(x::AbstractArray)
    # may have performance issues ... some julia/flux pro should maybe rework this
    tuple([w.fs[i](x) for i in 1:length(w.fs)]...)
end
```
This can be improved by using `map`:
```julia
julia> using BenchmarkTools

julia> fs = [sum for _ in 1:10]
julia> @benchmark tuple([fs[i]($(rand(1000))) for i in 1:length(fs)]...)
BenchmarkTools.Trial:
memory estimate: 720 bytes
allocs estimate: 27
--------------
minimum time: 2.177 μs (0.00% GC)
median time: 2.292 μs (0.00% GC)
mean time: 2.437 μs (0.77% GC)
maximum time: 192.556 μs (97.19% GC)
--------------
samples: 10000
evals/sample: 9
julia> @benchmark tuple(map(f -> f($(rand(1000))), fs)...)
BenchmarkTools.Trial:
memory estimate: 464 bytes
allocs estimate: 14
--------------
minimum time: 1.658 μs (0.00% GC)
median time: 1.717 μs (0.00% GC)
mean time: 1.832 μs (0.92% GC)
maximum time: 172.004 μs (97.52% GC)
--------------
samples: 10000
evals/sample: 10
```
""" | ||
Nop() - No-OPeration | ||
|
||
Create a new 'Nop' layer that does nothing and just passes the value on. | ||
This can be useful if you want to pass the output of a split layer directly | ||
to the output. | ||
|
||
The input 'x' must be a common input format. The out 'y' has the same format | ||
as the input format. | ||
|
||
The internal action is `x->x` | ||
|
||
# Example | ||
``` | ||
julia> using flux | ||
|
||
julia> using CUDA | ||
|
||
julia> model = Chain( | ||
Dense(1, 2), | ||
Split( | ||
Dense(2, 1), | ||
Nop | ||
) | ||
) | ||
|
||
julia> julia> model(cu(rand(1))) | ||
(Float32[0.12], (Float32[0.52], Float32[0.23])) | ||
``` | ||
""" | ||
|
||
struct Nop | ||
empt # workaround?, 'empt' ~ empty | ||
end | ||
|
||
function Nop() | ||
Nop(nothing) | ||
end | ||
|
||
function (w::Nop)(x::AbstractArray) | ||
return x | ||
end | ||
|
||
function Base.show(io::IO, j::Nop) | ||
print(io, "Nop") | ||
end |
Can't `identity` serve as `Nop`? Then there is no need to create an additional type.
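For example, assuming the `Split` constructor proposed in this PR, the pass-through branch could simply be `identity`:

```julia
using Flux  # Split is the layer proposed in this PR, not (yet) part of Flux

model = Chain(
    Dense(1, 2),
    Split(
        Dense(2, 1),
        identity   # does x -> x, so no dedicated Nop type is needed
    )
)
```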
```julia
function (w::Parallel)(x::AbstractArray)
    # may have performance issues ... some julia/flux pro should maybe rework this
    vcat([w.fs[i](x) for i in 1:length(w.fs)]...)
end
```
Same thing here. It would be faster as:

```julia
mapreduce(f -> f(x), vcat, w.fs)
```
I agree on this point. Of course, we need the reduction operator to be user defined, but that's separate.
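As a sketch of what a user-supplied reduction could look like (field and constructor names are assumptions, not the final API):

```julia
using Flux: Dense  # only for the example layers below

# Hedged sketch: let the caller choose how branch outputs are combined.
struct Parallel{F, T<:Tuple}
    connection::F   # user-supplied reduction, e.g. vcat, +, or a channel-wise cat
    layers::T
end
Parallel(connection, layers...) = Parallel(connection, layers)

(m::Parallel)(x) = mapreduce(f -> f(x), m.connection, m.layers)

Parallel(vcat, Dense(2, 3), Dense(2, 1))(rand(Float32, 2))  # 4-element output
```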
```julia
function (w::Join)(t::Tuple)
    # may have performance issues ... some julia/flux pro should maybe rework this
    @assert length(w.fs) == length(t)
    vcat([w.fs[i](t[i]) for i in 1:length(w.fs)]...)
end
```
Same again, but map over both the functions and the input:

```julia
mapreduce((f, x) -> f(x), vcat, w.fs, t)
```
@darsnack Yes, the docs duplicate the implementation. When I created this, the idea was that the docs might be accepted more quickly, so users don't have to search for everything. Nice optimizations btw.
I guess the silence around my questions about a more traditional DAG format means "no" then :)

For the proposal here, I think you need to let the user supply the reduction method. For example, conv outputs need to use something like …
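For instance, a hypothetical reduction for convolutional branches would concatenate along the channel dimension rather than `vcat`ing:

```julia
# Hypothetical helper: merge conv branch outputs along the channel dimension
# (Flux's conv layout is width × height × channels × batch).
channel_cat(xs...) = cat(xs...; dims = 3)

a = rand(Float32, 8, 8, 4, 1)   # two branch outputs with matching spatial dims
b = rand(Float32, 8, 8, 6, 1)
size(channel_cat(a, b))         # (8, 8, 10, 1)
```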
Yeah I think once this PR makes the changes to allow user-supplied reduce operators and the optimizations, then it should be ready to merge (probably want to refactor the docs additions first).
😄 I think I posted some longer thoughts here already, but I feel that DAG-style execution should be in packages built on top of Flux. Similar to your comments on the linked issue, I think that the simplicity of Flux's layer API is a good thing. But the brilliance of Flux/Zygote is that …
Definitely agree, but why should the layers proposed here be held to a different standard, as they are clearly a way to express the DAG and contribute to making Flux less lean and simple? One risk I see with this approach is the death by a thousand cuts, since just as I basically just blurted out a suggestion that doing something like Keras …
Maybe we should just document some generic pattern to create arbitrarily complex models. For instance, this would be very familiar to PyTorch users:

```julia
using Flux

struct NN
    layers
end

Flux.@functor NN

function NN()
    layers = [Dense(5, 5), Dense(5, 3)]
    return NN(layers)
end

function (m::NN)(x)
    x = relu.(x + m.layers[1](x))
    return m.layers[2](x)
end

model = NN()
@assert length(Flux.params(model)) == 4

ŷ = model(randn(5, 3))
@assert size(ŷ) == (3, 3)
```

Additionally, we could implement a …
This is a great point, and I agree that it is generally better not to introduce new layers (or types in general) unless the benefit crosses some criteria. For me, that criterion is: does a new layer collect parameters in a useful way (e.g. …)? The ability to dispatch is crucial for writing functions in code of the form

```python
def f(m):
    if isinstance(m, BasicBlock):
        # code
    elif isinstance(m, Bottleneck):
        # code
    elif isinstance(m, InceptionModule):
        # code
    elif isinstance(m, Conv2d):
        # code
    elif isinstance(m, Linear):
        # code ... you get the idea
    return m
```

The … refactoring of this would be something like

```python
def f_basicblock(m):
    # code

def f_bottleneck(m):
    # code

def f_inception(m):
    # code

def f_conv(m):
    # code

def f_linear(m):
    # code

def f(m):
    if isinstance(m, BasicBlock):
        return f_basicblock(m)
    elif isinstance(m, Bottleneck):
        return f_bottleneck(m)
    elif isinstance(m, InceptionModule):
        return f_inception(m)
    elif isinstance(m, Conv2d):
        return f_conv(m)
    elif isinstance(m, Linear):
        return f_linear(m)
```

At this point you've basically implemented multiple dispatch by hand; in Julia this is just

```julia
f(m::BasicBlock) = # code
f(m::Bottleneck) = # code
f(m::InceptionModule) = # code
f(m::Conv2d) = # code
f(m::Linear) = # code
```

But the real problem is that …

```python
def f_parallel(branches, reduce_op):
    # shared code

def f_basicblock(m):
    # de-sugars the fields of m to call the generic f_parallel
    # ... omitted to be concise
```

```julia
f(branches::AbstractVector, reduce_op) = # shared code
f(m::BasicBlock) = # de-sugar
# ...
```

Still, if all three layers were the same type, we could write just one generic function in Julia that worked for all layers instead of a generic function + 3 routing functions:

```julia
f(m::Parallel) = # shared code
```

Okay, so hopefully you can see that having a single `Parallel` type is useful.

At the end of the day, a neural network is a graph by definition. So I don't think that just because a certain layer can be seen as representing a subgraph it should be excluded from Flux. Taking that to the extreme, we would have no layers and just closures over calls to NNlib. At the other extreme, we would have every model represented by a graph data structure instead of functions and structs. Clearly, a happy medium is probably going to be most pleasurable to use, and hopefully I have convinced everyone that …
1462: Add Parallel layer r=DhairyaLGandhi a=darsnack

Since #1289 stalled, I have added an implementation of `Parallel` with some of the changes we discussed during ML calls. This version excludes most of the structural layers in #1289 like `Join`, `Split`, and `Nop`. I also added the ability for the user to specify the reduction operator. If it is acceptable, I would like to remap `SkipConnection` to `Parallel` (not a deprecation exactly).

The reason for submitting this PR now is because I am creating pre-trained weights for the networks in FluxML/Metalhead.jl#70, and there is a lot of code that can be replaced with a `Parallel`. So, I'd like to have `Parallel` in Flux before continuing with training to make the process easier.

### PR Checklist

- [x] Tests are added
- [x] Entry in NEWS.md
- [x] Documentation, if applicable
- [x] Final review from @DhairyaLGandhi (for API changes).

cc @CarloLucibello

Co-authored-by: Kyle Daruwalla <[email protected]>
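For reference, the merged layer can be used along these lines (a small usage sketch, not taken from the PR itself):

```julia
using Flux

# Two branches combined with a user-chosen reduction (+ here); the same pattern
# covers SkipConnection, since Parallel(+, layer, identity) behaves like
# SkipConnection(layer, +).
m = Chain(
    Dense(5, 4, relu),
    Parallel(+, Dense(4, 2), Dense(4, 2)),
)

size(m(rand(Float32, 5, 3)))   # (2, 3)
```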
I'd like to bump this. Looks like there are still some outstanding fixes that need to be made to this PR, right? I'm having to write my own …
A `Split` can be written as `Split(branch, branches...) = Parallel(tuple, branch, branches...)`. Calling this with a single input will pass the same input to each branch.
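A quick usage sketch of that definition (output shapes shown in the comment):

```julia
using Flux

Split(branch, branches...) = Parallel(tuple, branch, branches...)

two_headed = Chain(Dense(5, 4, relu), Split(Dense(4, 1), Dense(4, 3)))

y1, y2 = two_headed(rand(Float32, 5))
size(y1), size(y2)   # ((1,), (3,))
```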
I think his point is that it should be pre-built in Flux, not because it's hard to do it yourself (it is not), but because it is sufficiently frequently used in ML to justify it. And I agree: many models predict Gaussian means and standard deviations, and these have two heads. Pre-building this makes Flux more accessible to newcomers and to ML practitioners who are not that proficient in Julia.
I agree with @darsnack that Flux should not (except for very specific circumstances) be in the business of adding aliases purely for convenience. IMO …

That said, …
Reusing code from the right place is important, and I agree …
I'm closing this, as most of the work has been merged into Flux for quite some time and the original author does not appear to be responsive. As for …, I would suggest opening an issue (or filing a PR) if there's some piece here that anyone still wants to see completed.
In coordination with @DhairyaLGandhi I added new parts to the docs and added new layers for a feature that I have been missing for some time now:
The documentation has been extended with examples for custom implementations of these layers (see "advanced model building"), and the new layers Join, Split and Parallel have been added to the src.
The new layers run on GPU (not yet tested with more complex layers, e.g. conv). Any number of inputs, outputs, or parallel paths should be possible. Simple tests and documentation are also included. Maybe a real Julia pro can take a look at the layers to improve the performance or replace vcat with something better.
Additionally, a "No-OPeration" layer ("Nop") is added, so that paths without an action are also possible, instead of writing "x->x".
I tried to keep the design simple. I would be happy about good ideas and comments, and hopeful that these changes are accepted in some form so that we can use this or a similar implementation soon.
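As a sketch of the kind of model this enables (the `Split`/`Join`/`Nop` constructors are assumed to take their branch layers as arguments, as in the diff):

```julia
using Flux  # plus the Split, Join, and Nop layers proposed in this PR

# One input fans out into two paths, which are then joined back together.
model = Chain(
    Dense(4, 8, relu),
    Split(Dense(8, 4), Dense(8, 4)),   # one input -> tuple of two outputs
    Join(Dense(4, 2), Nop()),          # per-path layers, outputs vcat'ed -> length 6
    Dense(6, 1),
)
```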
PR Checklist
- Final review from @dhairyagandhi96 (for API changes).