tied weights (by transposition) are not tied when sent to gpu #1504

Open
CarloLucibello opened this issue Feb 12, 2021 · 18 comments

Comments

@CarloLucibello
Member

While looking at #488, I discovered the following problem:

using Flux
encoder = Dense(2, 3)
decoder = Dense(transpose(encoder.W), zeros(Float32, 2)) 
mcpu = Chain(encoder, decoder)
mgpu = mcpu |> gpu
mcpu.layers[2].W.parent === mcpu.layers[1].W # = true, OK
mgpu.layers[2].W.parent === mgpu.layers[1].W # = false, NOT OK, weights not tied

So behavior during training is going to be completely different on the GPU than on the CPU.

Not sure where the problem is. fmap? cc @ToucheSir

@darsnack
Member

This is more fundamental. It has to do with how data is moved to the GPU.

julia> using CUDA

julia> A = rand(3, 3);

julia> B = A;

julia> C, D = (A, B) .|> cu
(Float32[0.939074 0.34757903 0.3379287; 0.18965888 0.723545 0.73492056; 0.045836147 0.9611524 0.28369328], Float32[0.939074 0.34757903 0.3379287; 0.18965888 0.723545 0.73492056; 0.045836147 0.9611524 0.28369328])

julia> A == B
true

julia> C == D
true

julia> A === B
true

julia> C === D
false

Calling gpu (i.e. cu) on the weight array in encoder creates one CuArray in GPU memory; calling gpu again on the weight array in decoder then creates a second, separate CuArray for the same underlying CPU array.

I'm sure there is a way to detect when you are trying to allocate GPU memory for something that's already been allocated, but it would require some CUDA.jl knowledge that I lack.

cc @maleadt
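
For illustration, here is a minimal sketch of that de-duplication idea, using a hypothetical helper (gpu_cached is not a Flux or CUDA.jl function) and an objectid-keyed cache:

using CUDA

# Hypothetical helper: convert to the GPU, but reuse the result when the
# same CPU array (by objectid) has already been converted.
function gpu_cached(x::AbstractArray, cache::IdDict)
    get!(cache, x) do
        cu(x)
    end
end

A = rand(Float32, 3, 3)
B = A                      # same object
cache = IdDict()
C = gpu_cached(A, cache)
D = gpu_cached(B, cache)
C === D                    # true: the cache preserves identity across conversions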

@ToucheSir
Member

I actually don't think it's a GPU problem. To wit:

using Flux
encoder = Dense(2, 3)
decoder = Dense(transpose(encoder.W), zeros(Float32, 2)) 
m = Chain(encoder, decoder)
m64 = m |> f64
m64.layers[2].W.parent === m64.layers[1].W # = false, NOT OK, weights not tied

The real culprit is here. Essentially, Functors does not attempt to traverse into wrapper array types. This means that the fmap cache (which operates by objectid) will not detect the tied weights if they're behind such a wrapper. I think the easiest way to fix this would be to @functor Transpose et al, but there may be undesirable side effects that arise if we do so.
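
For concreteness, here is a sketch of what that suggestion amounts to, assuming the @functor macro re-exported by Flux (whether the side effects are acceptable is exactly the open question):

using Flux, LinearAlgebra

# Sketch only: declare the wrapper as a functor so that fmap recurses into
# its parent field instead of treating the whole Transpose as a leaf.
Flux.@functor Transpose

encoder = Dense(2, 3)
decoder = Dense(transpose(encoder.W), zeros(Float32, 2))
m64 = Chain(encoder, decoder) |> f64
# This should now hold, since both layers' conversions go through the same
# fmap cache entry for the shared parent array:
m64.layers[2].W.parent === m64.layers[1].W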

@darsnack
Member

I stand corrected! Hmm, so maybe we should handle this in Functors.jl somehow. Could we dispatch on nested types (e.g. <:AbstractArray{<:AbstractArray}) separately?

@DhairyaLGandhi
Member

Not on nested types like that. Arrays are considered leaves.

For the same reason I wouldn't functor Transpose, and I'd also like to stop functoring Cholesky: it can have undocumented side effects.

@darsnack
Member

Arrays are currently considered leaves, but from the perspective of fmap, should wrapper arrays and plain arrays be treated distinctly? Agreed that we shouldn't functor them, for sure.

I'm thinking of the comparison with something like StructArrays.jl: replace_storage is analogous to gpu or f64. The intent is to modify the lowest-level leaves (i.e. a plain collection of numbers). Obviously I'm not saying we should restrict the type to arrays of numbers, but it does imply that wrapper arrays are not leaves.
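
To make the analogy concrete, a rough example of the StructArrays.jl pattern referred to above (assuming its documented replace_storage):

using StructArrays, CUDA

s = StructArray(a = rand(Float32, 4), b = rand(Float32, 4))
# The wrapper structure is preserved; only the underlying plain arrays
# (the lowest-level leaves) are swapped out for CuArrays.
d = StructArrays.replace_storage(CuArray, s)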

@DhairyaLGandhi
Member

DhairyaLGandhi commented Feb 12, 2021

Ah, that's an interesting point. Think of something like an image: an array of RGB. We don't want to functor RGB, because it messes with the semantics of operations on that element type; this is why ColorVectorSpace.jl exists. Functoring it basically means taking charge of the properties of a type we have no knowledge of, so we hit incorrect methods when we optimise or AD with these structs, or more generally perform invalid operations on them. What we want is to reconstruct the type so it hits the functions it was intended to hit, preserving its semantics.

Same goes for wrapper types: no, the wrapper is not the leaf, but operations on it are specialised precisely because it is a wrapper type. If those types want to expose functionality to manage their resources (like CUDA does with Adapt), that's okay and encouraged, but it is hard to impose generically.
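
For reference, a rough sketch of the Adapt.jl mechanism mentioned above, using a made-up wrapper type: the wrapper opts in by describing how to rebuild itself around adapted storage.

using Adapt

# Hypothetical wrapper type, for illustration only.
struct MyWrapper{A}
    data::A
end

# Opt in: rebuild the wrapper around whatever adapt does to the inner array
# (e.g. converting it to a CuArray when adapting to the GPU).
Adapt.adapt_structure(to, w::MyWrapper) = MyWrapper(adapt(to, w.data))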

@ToucheSir
Member

I personally would find it surprising if transpose(W) behaved differently from transpose(copy(W)) in the context of weight tying. That said, the docs do say:

Lazy transpose. Mutating the returned object should appropriately mutate A.

If we want to respect those semantics, then I think our hands are tied with respect to maintaining object identity.
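
A quick illustration of those documented semantics, since the lazy transpose shares storage with its parent:

W = zeros(Float32, 2, 3)
Wt = transpose(W)    # lazy wrapper, no copy
Wt[1, 2] = 1f0       # writes into the shared storage
W[2, 1] == 1f0       # true: the parent array was mutated through the view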

@ToucheSir
Member

#1138 has a pretty good back and forth on functoring array wrappers.

@darsnack
Member

Same goes for wrapper types: no, the wrapper is not the leaf, but operations on it are specialised precisely because it is a wrapper type. If those types want to expose functionality to manage their resources (like CUDA does with Adapt), that's okay and encouraged, but it is hard to impose generically.

Yeah, I agree with this. Doing it generically for any wrapped array is probably also bad, but for certain wrapped arrays we can safely do the thing that makes sense.

In any case, there should be some documentation for this. For this example, say, we could suggest permutedims (see the sketch below).
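
As a sketch of what such documentation could suggest (reusing the example from this issue): permutedims copies eagerly, so there is no hidden aliasing to lose when moving to the GPU.

using Flux

encoder = Dense(2, 3)
# Explicit copy: the decoder gets its own 2×3 matrix, deliberately untied,
# so CPU and GPU behave the same way.
decoder = Dense(permutedims(encoder.W), zeros(Float32, 2))
decoder.W isa Matrix{Float32}   # plain array, no parent to alias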

@darsnack
Member

I personally would find it surprising if transpose(W) behaved differently from transpose(copy(W)) in the context of weight tying.

Are you wanting the weights not to be tied?

@ToucheSir
Member

Are you wanting the weights not to be tied?

Yes, with the caveat that my mental model of transpose assumes an immutable view of the original data (which is not the case in Julia). At the end of the day, I don't think it matters which approach wins out, as long as we're consistent and document it up front.

@bhvieira
Contributor

Not advocating for it, but custom layers/models would solve this (noting it here in case someone is searching for a workaround).

@DhairyaLGandhi
Member

How do you mean?

@bhvieira
Contributor

I mean explicitly reusing the parameters in the forward pass, instead of depending on CUDA to figure out that the parameters of the second layer should refer to the first. That can be done with a struct that defines the model, for example, instead of relying on Flux machinery (like Dense).

@DhairyaLGandhi
Member

DhairyaLGandhi commented Feb 13, 2021

I think the first part is fair, as long as you can avoid copies. A regular function would be good.

As for the struct part, are you saying something like:

l1 = L1(...)
l2 = L2(l1.w, ...)

@bhvieira
Contributor

No, an actual struct that captures the whole forward pass without implying a copy.
Something like PairOfLayers(W), and then (f::PairOfLayers)(x) = "some function that uses W and transpose(W)".
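
Here is a minimal sketch of that idea with hypothetical names (TiedAutoencoder is not existing Flux machinery): one struct holds W once and reuses it, lazily transposed, in its forward pass, so there is nothing for gpu or f64 to untie.

using Flux

struct TiedAutoencoder{M,V}
    W::M
    b1::V
    b2::V
end

Flux.@functor TiedAutoencoder

function (m::TiedAutoencoder)(x)
    h = relu.(m.W * x .+ m.b1)    # encoder uses W
    return m.W' * h .+ m.b2       # decoder reuses the same W, transposed lazily
end

m = TiedAutoencoder(randn(Float32, 3, 2), zeros(Float32, 3), zeros(Float32, 2))
m(rand(Float32, 2, 5))            # m |> gpu keeps a single weight array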

@CarloLucibello
Member Author

CarloLucibello commented Feb 16, 2021

@ToucheSir I think we should define functor for Transpose, Adjoint, Diagonal and any other array wrapper we can think of, as you suggest in #1504 (comment).
I don't see any harm in doing that.
Having radically different behavior after |> f64 or |> gpu cannot be considered anything but a bug.
Would you mind filing a PR to Functors.jl?

@ToucheSir
Member

Another observation I found today while testing examples for FluxML/Zygote.jl#1092: even when using implicit params, Zygote will return separate gradients for the original and transposed arrays. So if any changes are to be made, they will have to start at the AD level.
