
Structural gradients, parameter sharing and DAGs #1092

Open
ToucheSir opened this issue Oct 3, 2021 · 3 comments

Comments

ToucheSir (Member) commented Oct 3, 2021

Motivating example:

using Flux

d1 = Dense(2, 2)
d2 = Dense(d1.weight, d1.bias)

c = Chain(d1, d2)
x = rand(2, 1)

julia> gradient(c) do m
         sum(m(x))
       end
((layers = ((weight = [0.05483223428139582 0.4982746718431635; 0.007355667052721003 0.06684284590835438], bias = Float32[1.0233454, 0.13728034], σ = nothing), (weight = [0.17342295769893964 -0.05174787750918945; 0.17342295769893964 -0.05174787750918945], bias = Float32[1.0, 1.0], σ = nothing)),),)

julia> gradient(params(c)) do
         sum(c(x))
       end.grads
IdDict{Any, Any} with 3 entries:
  Float32[1.01322 0.244673; 0.0101244 -0.107393] => [0.228255 0.446527; 0.180779 0.015095]
  :(Main.x)                                      => [1.03826; 0.235642;;]
  Float32[0.0, 0.0]                              => Float32[2.02335, 1.13728]

On the one hand, you could argue that having different gradients at different locations is semantically more correct for the structural case. On the other hand, using 2x (or more, if parameters are shared at more than two locations) the already scarce (V)RAM makes one gradient array per location a hard sell for models that are already pushing up against a resource limit. Do we have a mechanism for providing both implicit and explicit params (the former for identity tracking/accumulation, the latter for specifying the structure of the gradient/handling immutable params)?
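
For concreteness, one way to recover implicit-style accumulation from a structural gradient would be to walk the model and gradient together and sum leaves keyed by the identity of the parameter arrays. The accum_shared! helper below is only a hypothetical sketch (not existing Flux API), reusing d1, c and x from above:

# Hypothetical helper: walk model and structural gradient in parallel and sum
# gradient leaves keyed by the identity of the parameter arrays, so tied
# parameters end up with a single accumulated gradient (as the implicit-params
# IdDict does).
function accum_shared!(acc::IdDict, model, grad)
    grad === nothing && return acc
    if model isa AbstractArray && grad isa AbstractArray
        acc[model] = haskey(acc, model) ? acc[model] + grad : grad
    elseif grad isa Union{Tuple,NamedTuple}
        for k in keys(grad)
            accum_shared!(acc, getfield(model, k), grad[k])
        end
    end
    return acc
end

g   = gradient(m -> sum(m(x)), c)[1]  # structural: one gradient leaf per location
acc = accum_shared!(IdDict(), c, g)
acc[d1.weight]                        # summed gradient from both locations

This keeps the per-location structure available while still producing one accumulated gradient per unique array.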

CarloLucibello (Member) commented Oct 3, 2021

What should we expect? That in your example we would have

g = gradient(c) do m
         sum(m(x))
       end[1]

g.layers[1].weight === g.layers[2].weight # same object
g.layers[1].weight ==  [0.228255 0.446527; 0.180779 0.015095] # same as implicit gradient, already accumulated

?
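
For reference, the two forms above are already consistent: the per-location structural gradients sum to the single accumulated implicit gradient (e.g. 0.0548 + 0.1734 ≈ 0.2283 for the first weight entry). A minimal sketch checking this, reusing c, x and d1 from the first comment:

# Sketch: the sum of the per-location structural gradients should match the
# accumulated implicit-params gradient for each shared array.
g_struct = gradient(m -> sum(m(x)), c)[1]
g_impl   = gradient(() -> sum(c(x)), params(c))

g_struct.layers[1].weight + g_struct.layers[2].weight ≈ g_impl[d1.weight]  # true
g_struct.layers[1].bias   + g_struct.layers[2].bias   ≈ g_impl[d1.bias]    # true

Under the proposal above, both locations would instead hold that already-summed array.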

darsnack (Member) commented Oct 3, 2021

Bumping my comment from a similar issue. There, I suggested that the optimizers should handle this.

But the memory issue is a valid concern. Still, handling this with the correct array types + accum seems preferable to me.

ToucheSir (Member, Author)

Oh definitely, and if not for the memory factor I think that would be ideal. All it needs is some additions in Optimisers.jl to define/accumulate gradients intelligently.
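
As an illustration of what that could look like, the hypothetical update_shared! below accumulates the structural gradient per unique parameter array (using the accum_shared! sketch from the first comment) and then applies Flux's in-place update! once per shared array; an out-of-place Optimisers.jl rule would need an equivalent identity-aware step. This is only a sketch, not a proposed API:

# Hypothetical: accumulate the structural gradient per unique parameter array,
# then apply the optimiser rule exactly once to each shared array.
function update_shared!(opt, model, grad)
    acc = accum_shared!(IdDict(), model, grad)
    for (p, dp) in acc
        Flux.Optimise.update!(opt, p, dp)   # in-place step with the total gradient
    end
    return model
end

update_shared!(Descent(0.1), c, gradient(m -> sum(m(x)), c)[1])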
