Fix bug that caused Flux.params(x) call to not be cached (Closes issue #2040) #2048
Conversation
I implemented the DenseArray solution, performance is still fixed, and the tests all passed locally.

I turned on downstream tests for this commit since
@@ -36,16 +36,14 @@ Possible values include:
"""
trainmode!(m, mode = true) = mode isa Bool ? testmode!(m, !mode) : testmode!(m, mode)

params!(p::Params, x::DenseArray{<:Number}, seen = IdSet()) = Functors.isleaf(x) && push!(p, x)
Can a DenseArray{<:Number} ever not be a leaf?
Suggested change:
params!(p::Params, x::DenseArray{<:Number}, seen = IdSet()) = Functors.isleaf(x) && push!(p, x)
params!(p::Params, x::DenseArray{<:Number}, seen) = push!(p, x)
Technically yes: Base.Experimental.Const is a pure wrapper type and subtypes DenseArray. I've seen it used in JuliaGPU libraries, but am unsure if those would ever come in contact with params.
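For reference, a quick check of that claim as it relates to the leaf question (assuming a Julia version where Base.Experimental.Const is defined, and Functors' default leaf behaviour; not part of the original thread):

julia> using Functors

julia> Base.Experimental.Const <: DenseArray   # a DenseArray subtype that merely wraps an Array
true

julia> c = Base.Experimental.Const([1.0, 2.0]);

julia> Functors.isleaf(c)                      # no @functor definition, so Functors currently sees a leaf
true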
That's an interesting type. But I guess it will always be leaf-like; Functors should treat it as it would an SArray, right?

More broadly, if this method has a test for isleaf, then it has to do something with the other branch, and then it is the other method. I guess it could assert isleaf just to make sure you get an error if someone does something really weird.
I think it has to be a non-leaf for the same reason Transpose does: shared inner arrays.

RE the other branch, I thought the latest change addressed that but it appears I misremembered. Silently dropping an array instead of recursing is definitely not good.
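For concreteness, a small sketch of that sharing, with the behaviour the later comments attribute to Flux 0.13.5 / Functors 0.3 (not re-verified here):

julia> using Flux, LinearAlgebra

julia> W = rand(2, 2);

julia> ps = Flux.params((W, transpose(W)));   # Transpose is non-leaf, so params! recurses to its parent

julia> length(ps)                             # expected: 1, W and W' resolve to the same array
1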
I guess. Transposing one of two shared arrays is common, but marking only one of the two as Const seems perverse.
Let me update my suggestion. I think this ought to be safe, and will at least throw an error should someone ever @functor Base.Experimental.Const (or its CUDA analogue):
Suggested change:
params!(p::Params, x::DenseArray{<:Number}, seen = IdSet()) = Functors.isleaf(x) && push!(p, x)
function params!(p::Params, x::DenseArray{<:Number}, seen = IdSet())
  # Fast path for the most common case, Array & CuArray. Solves issue 2040.
  Functors.isleaf(x) || error("For efficiency, params believes every DenseArray of numbers is leaflike")
  push!(p, x)
end
This code returns size.(Flux.params((x = view([1,2,3]pi, 1:2), y = transpose([4 5]pi)))) == [(1, 2)], i.e. the view is silently dropped and only the parent of the transpose (size (1, 2)) is collected.
This code and what else?
This code with the above suggestion.
Ok. On this example, the suggestion changes nothing compared to the PR; it just turns the isleaf test from a silent ignore into an error.

I think such a fast method should exist alongside the method which was here before the PR, which handles all cases (but has more branches). That should be correct. Whether it still solves 2040 I don't know.
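A rough sketch of what that combination might look like (the names params!, trainable, Params and IdSet are taken from the PR's code; this is one possible shape, not the final implementation):

# Fast path for ordinary dense numeric arrays (Array, CuArray): always treated as a leaf.
params!(p::Params, x::DenseArray{<:Number}, seen = IdSet()) = push!(p, x)

# General method: handles wrappers (Transpose, views, ...), containers and shared arrays.
function params!(p::Params, x, seen = IdSet())
  x in seen && return p
  push!(seen, x)
  if x isa AbstractArray{<:Number} && Functors.isleaf(x)
    push!(p, x)            # leaf-like but non-dense array, e.g. a range or NamedDimsArray
  else
    for child in trainable(x)
      params!(p, child, seen)
    end
  end
  return p
end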
push!(seen, x)
for child in trainable(x)
    params!(p, child, seen)
end
What if I have a leaf type which isn't a DenseArray? The current behaviour is:
julia> using NamedDims, StaticArrays
julia> Flux.params((SA[2.2], 3:4.0, NamedDimsArray([5.0], :x)))
Params([[2.2], 3.0:1.0:4.0, NamedDimsArray([5.0], :x)])
What I meant with the DenseArray idea was that this method could be a short-cut for the common case, in addition to the existing method.
Of course the tests as always don't try very hard. But I do think that it ought to keep working with wrappers like NamedDims.
Are there any wrappers already on the dependency chain which have this same behaviour outside of NamedDims?
Maybe SubArray? ReshapedArray, SymTridiagonal... for tests I guess you want something unlikely to be @functor-ed in the future.
julia> Flux.params(view([1,2,3]pi, 1:2))
Params([[3.141592653589793, 6.283185307179586]])
julia> ans[1] isa DenseArray
false
SubArrays are a bit of a landmine IMO because they don't "cover" the entirety of the wrapped array. ReshapedArray makes sense though. Was it that or PermutedDimsArray that we found couldn't have its transform easily reversed?
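As a tiny illustration of the coverage point (plain Base, nothing Flux-specific):

julia> A = [1.0, 2.0, 3.0];

julia> v = view(A, 1:2);

julia> parent(v) === A          # recursing to the parent would expose all three elements
true

julia> (length(v), length(A))   # while the view itself only covers two of them
(2, 3)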
IIRC ReshapedArray was the tricky one, as its type doesn't have the shape.
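For reference, the type parameters bear that out (standard Base types; output as printed on recent Julia versions):

julia> typeof(reshape(1:6, 2, 3))                      # the dims live in a field, not the type
Base.ReshapedArray{Int64, 2, UnitRange{Int64}, Tuple{}}

julia> typeof(PermutedDimsArray([1 2; 3 4], (2, 1)))   # the permutation is a type parameter
PermutedDimsArray{Int64, 2, (2, 1), (2, 1), Matrix{Int64}}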
@ToucheSir @mcabbott This is great! I'm learning a lot reading your discussions. I'm quite new to Flux and its inner workings. I'm more than happy to continue working on this pull request, but I'll need guidance.
IMO you should probably restore the method to work the way it did, but add to this "slow" case a fast path which will be taken by ordinary CuArrays. As you can see, Brian and I have been down the rabbit hole of what should be leaflike before... but I think that testing something like
I'm adding the test that you mentioned. I checked the behaviour with 0.13.4 and 0.13.5, and I'm wondering which one is the intended behaviour.

0.13.4:

0.13.5:

Both find all the parameters, but the shapes are different.
I think that's expected. With 0.13.5 (Functors 0.3), it recurses inside Transpose (to see W and W' as the same parameter), while before it didn't.
Okay I'll add the test to check for 0.13.5 behaviour.
This wouldn't fix the issue on CPUs though. The root issue seems to be that the compiler doesn't cache the result of the function call when that call contains if statements. Now that I understand the cause, I can easily work around it by calling

Should I continue working on this, or should I submit a PR with just the new test and close this issue?
Have you verified this (for the proposed solution, that is)? It certainly isn't true in general (otherwise most Flux models would have this problem), so tweaking things a bit may be all that's required.
Good point, I'll edit my comment to specify that it's only in situations where

I tested with both CPU and GPU, with my convolutional variational autoencoder and a variational autoencoder (Dense layers only), and the regularization that calls
I should clarify: we know that is true for the current implementation on 0.13.5/master, but is it still true if you implement the suggestion here?
Oh I see, so would that look something like this:
I tried it and it's still spending most of each step compiling.
If I understand right, you have something like

If that's true, then maybe we shouldn't be trying to make the construction of
julia> using Flux: params, gradient
julia> model = (x=rand(3), y=rand(3)); tot(m) = sum(m.x + m.y);
julia> g = gradient(params(model)) do
tot(model)
end
Grads(...)
julia> g[model.x]
3-element Fill{Float64}, with entries equal to 1.0
julia> g2 = gradient(params(model)) do
tot(model) + sum(sum, params(model))
end
Grads(...)
julia> g2[model.x]
3-element Fill{Float64}, with entries equal to 2.0
julia> g3 = gradient(params(model)) do
tot(model) + sum(sum, ps)
end
Grads(...)
julia> g3[model.x]
3-element Fill{Float64}, with entries equal to 2.0
julia> @eval Zygote function params(m...)
ps = Params()
ignore_derivatives() do
params!(ps, m)
end
return ps
end
params (generic function with 1 method)
julia> Zygote.refresh()
julia> g4 = gradient(params(model)) do
tot(model) + sum(sum, params(model))
end
Grads(...)
julia> g4[model.x]
3-element Fill{Float64}, with entries equal to 2.0
I'm all for this if it can be done while keeping existing nested AD code working. Tricky bits include making sure to call
@ToucheSir Is there a test for nested gradient calls? If not, could you provide a sample use-case for my understanding (and to turn into a test)?
I'm not aware of any, so I tried coming up with one. Funnily enough, all the examples I could think of either didn't work on master or behaved as if

A couple of working examples from my testing which behave exactly the same if

using Flux, LinearAlgebra
x = ones(1, 1)
d = Dense([2.0;;], [3.0])
gradient(() -> sum(d(x)) + sum(p -> 2norm(p), Flux.params(d)), Flux.params(d)).grads
gradient(() -> sum(d(x)) + sum(gradient(() -> sum(d.weight), Flux.params(d))[d.weight]), Flux.params(d))
It seems like adding
Closing this since issue #2054 supersedes it and was merged!
Should this PR be accepted, it would close issue #2040, in which constant recompilation caused massive slowdowns when using the GPU.
I am pretty sure that this has the exact same behaviour as in v0.13.5. I reverted to the v0.13.4 version of params!() and added a check for isleaf(x) when x is an AbstractArray{<:Number}.

Let me know if any changes need to be made.