
Structural gradients, parameter sharing and DAGs #1092

Open
ToucheSir opened this issue Oct 3, 2021 · 3 comments

Comments

ToucheSir (Member) commented Oct 3, 2021

Motivating example:

using Flux

d1 = Dense(2, 2)
d2 = Dense(d1.weight, d1.bias)

c = Chain(d1, d2)
x = rand(2, 1)

julia> gradient(c) do m
         sum(m(x))
       end
((layers = ((weight = [0.05483223428139582 0.4982746718431635; 0.007355667052721003 0.06684284590835438], bias = Float32[1.0233454, 0.13728034], σ = nothing), (weight = [0.17342295769893964 -0.05174787750918945; 0.17342295769893964 -0.05174787750918945], bias = Float32[1.0, 1.0], σ = nothing)),),)

julia> gradient(params(c)) do
         sum(c(x))
       end.grads
IdDict{Any, Any} with 3 entries:
  Float32[1.01322 0.244673; 0.0101244 -0.107393] => [0.228255 0.446527; 0.180779 0.015095]
  :(Main.x)                                      => [1.03826; 0.235642;;]
  Float32[0.0, 0.0]                              => Float32[2.02335, 1.13728]

On the one hand, you could argue that having different gradients at different locations is semantically more correct for the structural case. On the other hand, using 2x (or more, if parameters are shared at more than two locations) the already scarce (V)RAM makes one gradient array per location a hard sell for models that are already pushing up against a resource limit. Do we have a mechanism for providing both implicit and explicit params (the former for identity tracking/accumulation, the latter for specifying the structure of the gradient/handling immutable params)?
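
For concreteness, one way to recover implicit-style accumulation from a structural gradient would be to walk the model and gradient together and sum leaves keyed by the identity of the parameter arrays. The accum_shared! helper below is only a hypothetical sketch (not existing Flux API), reusing d1, c and x from above:

# Hypothetical helper: walk model and structural gradient in parallel and sum
# gradient leaves keyed by the identity of the parameter arrays, so tied
# parameters end up with a single accumulated gradient (as the implicit-params
# IdDict does).
function accum_shared!(acc::IdDict, model, grad)
    grad === nothing && return acc
    if model isa AbstractArray && grad isa AbstractArray
        acc[model] = haskey(acc, model) ? acc[model] + grad : grad
    elseif grad isa Union{Tuple,NamedTuple}
        for k in keys(grad)
            accum_shared!(acc, getfield(model, k), grad[k])
        end
    end
    return acc
end

g   = gradient(m -> sum(m(x)), c)[1]  # structural: one gradient leaf per location
acc = accum_shared!(IdDict(), c, g)
acc[d1.weight]                        # summed gradient from both locations

This keeps the per-location structure available while still producing one accumulated gradient per unique array.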

CarloLucibello (Member) commented Oct 3, 2021

What should we expect? That in your example we would have

g = gradient(c) do m
         sum(m(x))
       end[1]

g.layers[1].weight === g.layers[2].weight # same object
g.layers[1].weight ==  [0.228255 0.446527; 0.180779 0.015095] # same as implicit gradient, already accumulated

?
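
For reference, the two forms above are already consistent: the per-location structural gradients sum to the single accumulated implicit gradient (e.g. 0.0548 + 0.1734 ≈ 0.2283 for the first weight entry). A minimal sketch checking this, reusing c, x and d1 from the first comment:

# Sketch: the sum of the per-location structural gradients should match the
# accumulated implicit-params gradient for each shared array.
g_struct = gradient(m -> sum(m(x)), c)[1]
g_impl   = gradient(() -> sum(c(x)), params(c))

g_struct.layers[1].weight + g_struct.layers[2].weight ≈ g_impl[d1.weight]  # true
g_struct.layers[1].bias   + g_struct.layers[2].bias   ≈ g_impl[d1.bias]    # true

Under the proposal above, both locations would instead hold that already-summed array.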

darsnack (Member) commented Oct 3, 2021

Bumping my comment from a similar issue. There, I suggested that the optimizers should handle this.

But the memory issue is a valid concern. Still, handling this with the correct array types + accum seems preferable to me.

ToucheSir (Member, Author)

Oh definitely, and if not for the memory factor I think that would be ideal. All it needs is some additions in Optimisers.jl to define/accumulate gradients intelligently.
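
As an illustration of what that could look like, the hypothetical update_shared! below accumulates the structural gradient per unique parameter array (using the accum_shared! sketch from the first comment) and then applies Flux's in-place update! once per shared array; an out-of-place Optimisers.jl rule would need an equivalent identity-aware step. This is only a sketch, not a proposed API:

# Hypothetical: accumulate the structural gradient per unique parameter array,
# then apply the optimiser rule exactly once to each shared array.
function update_shared!(opt, model, grad)
    acc = accum_shared!(IdDict(), model, grad)
    for (p, dp) in acc
        Flux.Optimise.update!(opt, p, dp)   # in-place step with the total gradient
    end
    return model
end

update_shared!(Descent(0.1), c, gradient(m -> sum(m(x)), c)[1])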
