tied weights (by transposition) are not tied when sent to gpu #1504
This is more fundamental. It has to do with how data is moved to the GPU.

```julia
julia> using CUDA

julia> A = rand(3, 3);

julia> B = A;

julia> C, D = (A, B) .|> cu
(Float32[0.939074 0.34757903 0.3379287; 0.18965888 0.723545 0.73492056; 0.045836147 0.9611524 0.28369328], Float32[0.939074 0.34757903 0.3379287; 0.18965888 0.723545 0.73492056; 0.045836147 0.9611524 0.28369328])

julia> A == B
true

julia> C == D
true

julia> A === B
true

julia> C === D
false
```

I'm sure there is a way to detect when you are trying to allocate GPU memory for something that's already been allocated, but it would require some CUDA.jl knowledge that I lack. cc @maleadt
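For concreteness, here is a minimal sketch (not from the thread, and `cu_tied` is a made-up helper name) of the kind of identity-aware conversion being suggested: cache the converted arrays in an `IdDict` keyed by object identity, which is the same idea `fmap` uses for plain arrays.

```julia
using CUDA

# Hypothetical helper: move a tuple of arrays to the GPU while preserving
# aliasing, by reusing an objectid-keyed cache of already-converted arrays.
function cu_tied(xs::Tuple; cache = IdDict())
    map(xs) do x
        get!(() -> cu(x), cache, x)
    end
end

A = rand(3, 3); B = A
C, D = cu_tied((A, B))
C === D   # true: both aliases map to the same CuArray
```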
I actually don't think it's a GPU problem. To wit:

```julia
using Flux

encoder = Dense(2, 3)
decoder = Dense(transpose(encoder.W), zeros(Float32, 2))
m = Chain(encoder, decoder)
m64 = m |> f64
m64.layers[2].W.parent === m64.layers[1].W # = false, NOT OK, weights not tied
```

The real culprit is here. Essentially, Functors does not attempt to traverse into wrapper array types. This means that the fmap cache (which operates by objectid) will not detect the tied weights if they're behind such a wrapper. I think the easiest way to fix this would be to …
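As a small illustration of that caching behaviour (my own example, not from the thread; it assumes current Functors.jl defaults, where any array type that has not been explicitly functored counts as a leaf):

```julia
using Functors, LinearAlgebra

A = rand(Float32, 3, 2)

# fmap's objectid-keyed cache does preserve ties between plain arrays...
plain = (enc = A, dec = A)
moved = fmap(x -> Float64.(x), plain)
moved.enc === moved.dec            # true: the second reference hits the cache

# ...but a Transpose wrapper is itself a leaf, so fmap never sees that its
# parent is the same array and converts it independently.
Functors.isleaf(transpose(A))      # true
wrapped = (enc = A, dec = transpose(A))
moved2 = fmap(x -> Float64.(x), wrapped)
moved2.enc === parent(moved2.dec)  # false (the plain broadcast also drops the wrapper)
```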
I stand corrected! Hmm, so maybe we should handle this in Functors.jl somehow. Could we dispatch on nested types (e.g. …)?
Not on nested types like that. Arrays are considered leaves. Same reason why I wouldn't functor Transpose, and also why I want to remove Cholesky from the same treatment. It can have undocumented side effects.
Arrays are currently considered leaves, but from the context of … I'm just thinking in comparison to something like StructArrays.jl.
Ah, so that's an interesting point. Think of something like an image: an array of RGB values. Here we don't want to functor RGB because it messes with the semantics of what operations on element types are supposed to mean. This is why ColorVectorSpace.jl exists. Functoring that basically means we are taking charge of the properties of a type we have no knowledge of, meaning we hit incorrect methods when we optimise/AD with these structs, or more generally perform incorrect/invalid operations with them. What we want is to reconstruct the type so it hits the functions it was intended to hit, preserving the semantics.

Same goes for wrapper types. No, it's not the leaf, but operations related to it are specialised precisely because it is a wrapper type. If those types want to expose functionality to manage their resources (like CUDA does with Adapt), then that's okay and encouraged, but it is hard to impose generically.
I personally would find it surprising if … From the `transpose` docstring:

> Lazy transpose. Mutating the returned object should appropriately mutate `A`.

If we want to respect those semantics, then I think our hands are tied wrt maintaining object identity.
#1138 has a pretty good back and forth on functoring array wrappers.
Yeah, I agree with this. Doing it generically for any wrapped array is probably also bad, but for certain wrapped arrays we can safely do the thing that makes sense. In any case, there should be some documentation for this. Like in this example, we could suggest …
Are you wanting the weights not to be tied?

Yes, with the caveat being that my mental model of transpose assumes an immutable view of the original data (which is not the case in Julia). At the end of the day, I don't think it matters what approach wins out as long as we're consistent and document it up front.
Not advocating for it, but custom layers/models would solve this (in case someone is searching for it).

How do you mean?

Actually explicitly reusing the parameters in the forward pass, instead of depending on CUDA to figure out that the parameters of the second layer should refer to the first. This can be done with a struct that defines a model, for example, instead of depending on Flux machinery (like Dense).
I think the first part is fair, as long as you can avoid copies. A regular function would be good. The part with the struct, are you saying something like:

```julia
l1 = L1(...)
l2 = L2(l1.w, ...)
```
No, an actual struct that registers the whole forward pass without implying a copy.
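As a concrete illustration of that idea (my own sketch, not code from this thread; the `TiedAutoencoder` name and layout are made up): the struct holds a single weight matrix, and the forward pass reuses it via transposition, so there is nothing for `gpu`/`fmap` to accidentally untie.

```julia
using Flux

struct TiedAutoencoder{M,V1,V2}
    W::M        # single shared weight matrix
    b_enc::V1   # encoder bias
    b_dec::V2   # decoder bias
end

Flux.@functor TiedAutoencoder

function (m::TiedAutoencoder)(x)
    h = relu.(m.W * x .+ m.b_enc)   # encoder
    return m.W' * h .+ m.b_dec      # decoder explicitly reuses W, no second copy
end

m = TiedAutoencoder(randn(Float32, 3, 2), zeros(Float32, 3), zeros(Float32, 2))
m_gpu = m |> gpu                    # only one weight matrix is moved to the GPU
```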
@ToucheSir I think we should define …
Another observation I found today while testing examples for FluxML/Zygote.jl#1092: even when using implicit params, Zygote will return separate gradients for the original and transposed arrays. So if any changes are to be made, they will have to start at the AD level.
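A rough sketch of how one might probe that observation (my own, with no outputs asserted; the exact behaviour depends on the Zygote and Flux versions in use):

```julia
using Flux, Zygote

W  = rand(Float32, 2, 3)
Wt = transpose(W)
ps = Flux.params(W, Wt)

# Loss that uses both the original array and its lazy transpose.
gs = gradient(() -> sum(W * ones(Float32, 3)) + sum(Wt * ones(Float32, 2)), ps)

# Inspect whether the two uses are accumulated together or kept separate.
gs[W], gs[Wt]
```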
When looking at #488, I discovered the following problem.
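A minimal reproduction sketch along the lines of the f64 example above (the original snippet is not preserved here verbatim; this assumes a Flux version where `Dense` stores its weight in the `W` field, plus an available GPU):

```julia
using Flux, CUDA

encoder = Dense(2, 3)
decoder = Dense(transpose(encoder.W), zeros(Float32, 2))
m = Chain(encoder, decoder)

m.layers[2].W.parent === m.layers[1].W          # true: weights are tied on the CPU

m_gpu = m |> gpu
m_gpu.layers[2].W.parent === m_gpu.layers[1].W  # false: the tie is lost on the GPU
```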
So gpu behavior during training is going to be completely different.
Not sure where the problem is. `fmap`? cc @ToucheSir