tied weights (by transposition) are not tied when sent to gpu #1504

Open
CarloLucibello opened this issue Feb 12, 2021 · 18 comments

Comments

@CarloLucibello
Member

While looking at #488, I discovered the following problem:

using Flux
encoder = Dense(2, 3)
decoder = Dense(transpose(encoder.W), zeros(Float32, 2)) 
mcpu = Chain(encoder, decoder)
mgpu = mcpu |> gpu
mcpu.layers[2].W.parent === mcpu.layers[1].W # = true, OK
mgpu.layers[2].W.parent === mgpu.layers[1].W # = false, NOT OK, weights not tied

So behavior during training is going to be completely different on the GPU than on the CPU.

Not sure where the problem is. fmap? cc @ToucheSir

@darsnack
Member

This is more fundamental. It has to do with how data is moved to the GPU.

julia> using CUDA

julia> A = rand(3, 3);

julia> B = A;

julia> C, D = (A, B) .|> cu
(Float32[0.939074 0.34757903 0.3379287; 0.18965888 0.723545 0.73492056; 0.045836147 0.9611524 0.28369328], Float32[0.939074 0.34757903 0.3379287; 0.18965888 0.723545 0.73492056; 0.045836147 0.9611524 0.28369328])

julia> A == B
true

julia> C == D
true

julia> A === B
true

julia> C === D
false

Calling gpu (i.e. cu) on the weight array in encoder creates one CuArray in GPU memory; calling gpu again on the weight array in decoder then creates a second, separate CuArray for the same underlying CPU array.

I'm sure there is a way to detect when you are trying to allocate GPU memory for something that's already been allocated, but it would require some CUDA.jl knowledge that I lack.

cc @maleadt
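
For illustration, here is a minimal sketch of that de-duplication idea, using a hypothetical helper (gpu_cached is not a Flux or CUDA.jl function) and an objectid-keyed cache:

using CUDA

# Hypothetical helper: convert to the GPU, but reuse the result when the
# same CPU array (by objectid) has already been converted.
function gpu_cached(x::AbstractArray, cache::IdDict)
    get!(cache, x) do
        cu(x)
    end
end

A = rand(Float32, 3, 3)
B = A                      # same object
cache = IdDict()
C = gpu_cached(A, cache)
D = gpu_cached(B, cache)
C === D                    # true: the cache preserves identity across conversions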

@ToucheSir
Member

I actually don't think it's a GPU problem. To wit:

using Flux
encoder = Dense(2, 3)
decoder = Dense(transpose(encoder.W), zeros(Float32, 2)) 
m = Chain(encoder, decoder)
m64 = m |> f64
m64.layers[2].W.parent === m64.layers[1].W # = false, NOT OK, weights not tied

The real culprit is here. Essentially, Functors does not attempt to traverse into wrapper array types. This means that the fmap cache (which operates by objectid) will not detect the tied weights if they're behind such a wrapper. I think the easiest way to fix this would be to @functor Transpose et al, but there may be undesirable side effects that arise if we do so.
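
For concreteness, here is a sketch of what that suggestion amounts to, assuming the @functor macro re-exported by Flux (whether the side effects are acceptable is exactly the open question):

using Flux, LinearAlgebra

# Sketch only: declare the wrapper as a functor so that fmap recurses into
# its parent field instead of treating the whole Transpose as a leaf.
Flux.@functor Transpose

encoder = Dense(2, 3)
decoder = Dense(transpose(encoder.W), zeros(Float32, 2))
m64 = Chain(encoder, decoder) |> f64
# This should now hold, since both layers' conversions go through the same
# fmap cache entry for the shared parent array:
m64.layers[2].W.parent === m64.layers[1].W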

@darsnack
Member

I stand corrected! Hmm, so maybe we should handle this in Functors.jl somehow. Could we dispatch on nested types (e.g. <:AbstractArray{<:AbstractArray}) separately?

@DhairyaLGandhi
Member

Not on nested types like that. Arrays are considered leaves.

For the same reason I wouldn't functor Transpose, and I'd also like to stop functoring Cholesky: it can have undocumented side effects.

@darsnack
Member

Arrays are currently considered leaves, but from the perspective of fmap, should wrapper arrays and plain arrays be treated distinctly? Agreed that we shouldn't functor them, for sure.

I'm thinking of the comparison with something like StructArrays.jl: replace_storage is analogous to gpu or f64. The intent is to modify the lowest-level leaves (i.e. a plain collection of numbers). Obviously I'm not saying we should restrict the type to arrays of numbers, but it does imply that wrapper arrays are not leaves.
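
To make the analogy concrete, a rough example of the StructArrays.jl pattern referred to above (assuming its documented replace_storage):

using StructArrays, CUDA

s = StructArray(a = rand(Float32, 4), b = rand(Float32, 4))
# The wrapper structure is preserved; only the underlying plain arrays
# (the lowest-level leaves) are swapped out for CuArrays.
d = StructArrays.replace_storage(CuArray, s)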

@DhairyaLGandhi
Member

DhairyaLGandhi commented Feb 12, 2021

Ah, that's an interesting point. Think of something like an image: an array of RGB. We don't want to functor RGB, because it messes with the semantics of operations on that element type; this is why ColorVectorSpace.jl exists. Functoring it basically means taking charge of the properties of a type we have no knowledge of, so we hit incorrect methods when we optimise or AD with these structs, or more generally perform invalid operations on them. What we want is to reconstruct the type so it hits the functions it was intended to hit, preserving its semantics.

Same goes for wrapper types: no, the wrapper is not the leaf, but operations on it are specialised precisely because it is a wrapper type. If those types want to expose functionality to manage their resources (like CUDA does with Adapt), that's okay and encouraged, but it is hard to impose generically.
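
For reference, a rough sketch of the Adapt.jl mechanism mentioned above, using a made-up wrapper type: the wrapper opts in by describing how to rebuild itself around adapted storage.

using Adapt

# Hypothetical wrapper type, for illustration only.
struct MyWrapper{A}
    data::A
end

# Opt in: rebuild the wrapper around whatever adapt does to the inner array
# (e.g. converting it to a CuArray when adapting to the GPU).
Adapt.adapt_structure(to, w::MyWrapper) = MyWrapper(adapt(to, w.data))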

@ToucheSir
Member

I personally would find it surprising if transpose(W) behaved differently from transpose(copy(W)) in the context of weight tying. That said, the docs do say:

Lazy transpose. Mutating the returned object should appropriately mutate A.

If we want to respect those semantics, then I think our hands are tied with respect to maintaining object identity.
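
A quick illustration of those documented semantics, since the lazy transpose shares storage with its parent:

W = zeros(Float32, 2, 3)
Wt = transpose(W)    # lazy wrapper, no copy
Wt[1, 2] = 1f0       # writes into the shared storage
W[2, 1] == 1f0       # true: the parent array was mutated through the view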

@ToucheSir
Member

#1138 has a pretty good back and forth on functoring array wrappers.

@darsnack
Member

Same goes for wrapper types: no, the wrapper is not the leaf, but operations on it are specialised precisely because it is a wrapper type. If those types want to expose functionality to manage their resources (like CUDA does with Adapt), that's okay and encouraged, but it is hard to impose generically.

Yeah, I agree with this. Doing it generically for any wrapped array is probably also bad, but for certain wrapped arrays we can safely do the thing that makes sense.

In any case, there should be some documentation for this. For this example, say, we could suggest permutedims (see the sketch below).
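
As a sketch of what such documentation could suggest (reusing the example from this issue): permutedims copies eagerly, so there is no hidden aliasing to lose when moving to the GPU.

using Flux

encoder = Dense(2, 3)
# Explicit copy: the decoder gets its own 2×3 matrix, deliberately untied,
# so CPU and GPU behave the same way.
decoder = Dense(permutedims(encoder.W), zeros(Float32, 2))
decoder.W isa Matrix{Float32}   # plain array, no parent to alias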

@darsnack
Member

I personally would find it surprising if transpose(W) behaved differently from transpose(copy(W)) in the context of weight tying.

Are you wanting the weights not to be tied?

@ToucheSir
Member

Are you wanting the weights not to be tied?

Yes, with the caveat that my mental model of transpose assumes an immutable view of the original data (which is not the case in Julia). At the end of the day, I don't think it matters which approach wins out, as long as we're consistent and document it up front.

@bhvieira
Contributor

Not advocating for it, but custom layers/models would solve this (noting it here in case someone is searching for a workaround).

@DhairyaLGandhi
Member

How do you mean?

@bhvieira
Contributor

I mean explicitly reusing the parameters in the forward pass, instead of depending on CUDA to figure out that the parameters of the second layer should refer to the first. That can be done with a struct that defines the model, for example, instead of relying on Flux machinery (like Dense).

@DhairyaLGandhi
Member

DhairyaLGandhi commented Feb 13, 2021

I think the first part is fair, as long as you can avoid copies. A regular function would be good.

As for the struct part, are you saying something like:

l1 = L1(...)
l2 = L2(l1.w, ...)

@bhvieira
Contributor

No, an actual struct that captures the whole forward pass without implying a copy.
Something like PairOfLayers(W), and then (f::PairOfLayers)(x) = "some function that uses W and transpose(W)".
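
Here is a minimal sketch of that idea with hypothetical names (TiedAutoencoder is not existing Flux machinery): one struct holds W once and reuses it, lazily transposed, in its forward pass, so there is nothing for gpu or f64 to untie.

using Flux

struct TiedAutoencoder{M,V}
    W::M
    b1::V
    b2::V
end

Flux.@functor TiedAutoencoder

function (m::TiedAutoencoder)(x)
    h = relu.(m.W * x .+ m.b1)    # encoder uses W
    return m.W' * h .+ m.b2       # decoder reuses the same W, transposed lazily
end

m = TiedAutoencoder(randn(Float32, 3, 2), zeros(Float32, 3), zeros(Float32, 2))
m(rand(Float32, 2, 5))            # m |> gpu keeps a single weight array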

@CarloLucibello
Member Author

CarloLucibello commented Feb 16, 2021

@ToucheSir I think we should define functor for Transpose, Adjoint, Diagonal and any other array wrapper we can think of, as you suggest in #1504 (comment).
I don't see any harm in doing that.
Having radically different behavior after |> f64 or |> gpu cannot be considered anything but a bug.
Would you mind filing a PR to Functors.jl?

@ToucheSir
Member

Another observation I found today while testing examples for FluxML/Zygote.jl#1092: even when using implicit params, Zygote will return separate gradients for the original and transposed arrays. So if any changes are to be made, they will have to start at the AD level.
