Motivating example:

using Flux

d1 = Dense(2, 2)
d2 = Dense(d1.weight, d1.bias)
c = Chain(d1, d2)
x = rand(2, 1)

julia> gradient(c) do m
           sum(m(x))
       end
((layers = ((weight = [0.05483223428139582 0.4982746718431635; 0.007355667052721003 0.06684284590835438], bias = Float32[1.0233454, 0.13728034], σ = nothing), (weight = [0.17342295769893964 -0.05174787750918945; 0.17342295769893964 -0.05174787750918945], bias = Float32[1.0, 1.0], σ = nothing)),),)
julia> gradient(params(c)) do
           sum(c(x))
       end.grads
IdDict{Any, Any} with 3 entries:
Float32[1.01322 0.244673; 0.0101244 -0.107393] => [0.228255 0.446527; 0.180779 0.015095]
:(Main.x) => [1.03826; 0.235642;;]
Float32[0.0, 0.0] => Float32[2.02335, 1.13728]
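As a sanity check (same d1, c, x as above), the two per-location structural entries sum to the single accumulated implicit entry:

eg = gradient(m -> sum(m(x)), c)[1]        # explicit/structural gradient
ig = gradient(() -> sum(c(x)), params(c))  # implicit gradient over shared params
eg.layers[1].weight .+ eg.layers[2].weight ≈ ig[d1.weight]  # true: implicit grads accumulate across shared uses
eg.layers[1].bias .+ eg.layers[2].bias ≈ ig[d1.bias]        # likewise for the tied bias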
On one hand, you could argue that having different gradients at different locations is semantically more correct for the structural case. On the other hand, using 2x (or more, if parameters are shared more than twice) as much valuable (V)RAM is going to make one gradient array per location a hard sell for models that are already up against a resource limit. Do we have a mechanism for providing both implicit and explicit params (the former for identity tracking/accumulation, the latter for specifying the structure of the gradient and handling immutable params)?
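For concreteness, a rough sketch (not an existing Flux/Zygote API, just an illustration over the Chain-of-Dense example above) of recovering identity-based accumulation from the structural gradient by keying an IdDict on the parameter arrays themselves:

g = gradient(m -> sum(m(x)), c)[1]
acc = IdDict{Any,Any}()
for (layer, lg) in zip(c.layers, g.layers)
    for f in (:weight, :bias)
        p, dp = getfield(layer, f), getproperty(lg, f)
        dp === nothing && continue
        acc[p] = haskey(acc, p) ? acc[p] .+ dp : dp   # accumulate shared params by identity
    end
end
# acc now holds one gradient per distinct array: the tied d1.weight appears once,
# with both per-location contributions summed, matching the implicit result above.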
What should we expect? That in your example we would have:
g = gradient(c) do m
    sum(m(x))
end[1]

g.layers[1].weight === g.layers[2].weight  # same object
g.layers[1].weight == [0.228255 0.446527; 0.180779 0.015095]  # same as implicit gradient, already accumulated
Oh definitely, and if not for the memory factor I think that would be ideal. All it needs is some additions in Optimisers.jl to define/accumulate gradients intelligently.
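A minimal sketch of the kind of accumulation-aware step that could build on this (assuming the acc IdDict from the sketch earlier in the thread and Flux's legacy Flux.Optimise.update! API; not current Optimisers.jl behaviour):

opt = Descent(0.1)
for (p, dp) in acc                     # acc :: IdDict built from the structural gradient
    Flux.Optimise.update!(opt, p, dp)  # each distinct (possibly tied) array is stepped exactly once
end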