Freezing layer parameters still computes all gradients #1688
It is required to calculate partials over the parameters, because Zygote may need those partials elsewhere in the computation in order to return gradients for the requested set of parameters. To get the expected results, it is recommended to avoid global variables and accesses to them, both in ordinary Julia code and in differentiation. Removing the global accesses also comes with a performance bump and the correct gradients being returned.

julia> loss(m, x, y) = sum(Flux.crossentropy(m(x), y));
julia> function get_grads(m, data, labels, s)
           gs = gradient(Flux.params(m[s:end])) do
               l = loss(m, data, labels)
           end
       end
julia> get_grads(m, data, labels, 1).grads
IdDict{Any, Any} with 4 entries:
Float32[-0.16087 -0.0740108 … 0.122463 0.… => Float32[0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.…
Float32[-0.279719 0.265606 … 0.0243676 0.… => Float32[0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0]
Float32[0.0, 0.0] => Float32[0.0, 0.0]
Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0… => Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 … 0.0, 0.0, 0.0, 0.0, …
julia> get_grads(m, data, labels, 2).grads
IdDict{Any, Any} with 2 entries:
Float32[-0.279719 0.265606 … 0.0243676 0.0286161; -0.148928 -0.0405258 … 0.295… => Float32[0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0]
Float32[0.0, 0.0] => Float32[0.0, 0.0]
julia> get_grads(m, data, labels, 3).grads
IdDict{Any, Any}()
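If the goal is then to update only the unfrozen layers, the same parameter set and the returned gradients can be handed to an optimiser; a minimal sketch, reusing get_grads from above and assuming an ADAM optimiser (the optimiser choice is an assumption, not from this thread):

opt = ADAM()
ps = Flux.params(m[2:end])          # first layer frozen
gs = get_grads(m, data, labels, 2)  # gradients only for the parameters in ps
Flux.update!(opt, ps, gs)           # the frozen layer is left untouched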
Great advice!
Which, as you mentioned, is because the gradients are still computed in the background.
loss(m, x, y) = Flux.crossentropy(m(x), y) # no need for sum, crossentropy aggregates by default
function get_grads(trunk, head, data, labels)
    z = trunk(data)
    gradient(Flux.params(head)) do
        loss(head, z, labels)
    end
end
get_grads(m[1:end-1], m[end], data, labels)
get_grads(m[1:end-2], m[end-1:end], data, labels)
get_grads(m[1:end-3], m[end-2:end], data, labels)

You see this pattern quite a bit with …
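To take an optimisation step with this pattern, only the head's parameters go to the optimiser; a sketch assuming an ADAM optimiser and a two-layer head (both are assumptions, not from the comment above):

opt = ADAM()
head = m[end-1:end]                       # slicing a Chain shares the underlying layers and arrays
gs = get_grads(m[1:end-2], head, data, labels)
Flux.update!(opt, Flux.params(head), gs)  # trunk parameters are never differentiated or updated

Because z = trunk(data) is computed outside the gradient call, Zygote never builds a pullback through the trunk at all, which is what removes the wasted gradient work.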
@ToucheSir this is exactly what I was looking for! Thanks so much for the help :)
I have been looking at this as a way to train only some of the layer parameters while freezing the others. More specifically, passing a subset of the model (e.g. m[3:end]) to Flux.params. This does work, but I noticed that the runtime was surprisingly the same regardless of how many layers I froze. Below is a simple example showing this behaviour:
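A sketch of the kind of example meant here, with hypothetical layer sizes and data, and with the loss closing over the global model:

using Flux

m = Chain(Dense(10, 32, relu), Dense(32, 2), softmax)   # hypothetical sizes
data = rand(Float32, 10, 64)
labels = Flux.onehotbatch(rand(1:2, 64), 1:2)

loss(x, y) = sum(Flux.crossentropy(m(x), y))   # note: reads the global m

function get_grads(s)
    gradient(Flux.params(m[s:end])) do
        loss(data, labels)
    end
end

@time get_grads(1);   # all layers requested
@time get_grads(3);   # only the parameter-free softmax requested; runtime stays about the same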
This seems to be happening because the gradients for all the layers are still being computed regardless of what is passed to Flux.params. For example: the gradients move to :(Main.m), but are still computed. This behaviour is not ideal for runtime, especially when only the last layer is being updated.
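For instance, continuing the sketch above (the :(Main.m) key is the one reported here; the exact printed form may differ):

gs = get_grads(3)        # no trainable parameters requested at all
collect(keys(gs.grads))  # still holds an entry for the global model, shown here as :(Main.m),
                         # because loss reads the global m inside the gradient call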
Is this intended and is there a way to stop the frozen layer gradients from being computed?