RNN on GPU fails on first backward call #1114
Seems like we're sending something off in our parameters in the backwards pass.
Smaller repro:

```julia
using Flux
using Statistics
using Random

function main()
    Random.seed!(1)
    rnn = GRU(1, 1)
    X = [rand(1, 2) for i in 1:2]
    Y = rand(2, 2) ./ 10

    drnn = rnn |> gpu
    dX = gpu(X)
    dY = gpu(Y)
    θ = Flux.params(drnn)

    loss(x, y) = mean((Flux.stack(drnn.(dX), 2) .- y) .^ 2f0)
    opt = ADAM(1e-3)

    try
        Flux.train!(loss, θ, [(dX, dY)], opt)
    catch ex
        @error "First training failed" exception=(ex, catch_backtrace())
    end

    Flux.train!(loss, θ, [(dX, dY)], opt)
    @info "Second training succeeded"
end

isinteractive() || main()
```

Looks like the dimensionality of the initial state is wrong. With `drnn.state = CuArrays.zeros(Float32, (1,2))` set initially,

```julia
Flux.train!(loss, θ, [(dX, dY)], opt)
```

works at the first invocation. I'm not sure how/where that data is set and used, so I'll leave that to Flux experts 🙂
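For completeness, here is a minimal sketch of how the hidden state can be inspected and pre-set along the lines of the comment above. The variable names reuse `main()` from the repro; the use of `CUDA.zeros` (instead of the older `CuArrays.zeros`) and the stated sizes are assumptions, not taken from the original thread:

```julia
# Hypothetical illustration, assuming drnn = GRU(1, 1) |> gpu and a batch of 2, as in the repro.
# In Flux 0.11, Recur is a mutable struct, so its state can be read and overwritten directly.
@show size(drnn.state)                  # fresh model / after Flux.reset!: a length-1 vector
drnn.state = CUDA.zeros(Float32, 1, 2)  # pre-expand the state to (hidden, batch)
Flux.train!(loss, θ, [(dX, dY)], opt)   # per the comment above, the first call then succeeds
```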
The issue seems more to do with that our …

(Or …
The package status:

```
(Flux) pkg> st
Project Flux v0.11.1
Status `D:\Github\Flux.jl\Project.toml`
  [1520ce14] AbstractTrees v0.3.3
  [79e6a3ab] Adapt v2.3.0
  [052768ef] CUDA v1.3.3
  [944b1d66] CodecZlib v0.7.0
  [5ae59095] Colors v0.12.4
  [d9f16b24] Functors v0.1.0
  [e5e0dc1b] Juno v0.8.4
  [1914dd2f] MacroTools v0.5.6
  [872c559c] NNlib v0.7.5
  [189a3867] Reexport v0.2.0
  [2913bbd2] StatsBase v0.33.2
  [a5390f91] ZipFile v0.9.3
  [e88e6eb3] Zygote v0.5.9
  [8bb1440f] DelimitedFiles
  [37e2e46d] LinearAlgebra
  [44cfe95a] Pkg
  [de0858da] Printf
  [9a3f8284] Random
  [ea8e919c] SHA
  [10745b16] Statistics
  [8dfed614] Test
```

MRE:
The error shows that it is the pullback step that fails: Line 85 in c5c35cc, which relates to the following line in CUDA.jl: https://github.com/JuliaGPU/CUDA.jl/blob/31f67d6caf7fca566a9c7b36b850ce31df450ac6/lib/cudnn/rnn.jl#L195

I couldn't figure out what's different between the RNN/GRU vs the LSTM that would cause a bug only in the latter. I'm not clear about this line where the cell state …
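To isolate the failing pullback without going through `Flux.train!`, the gradient can be taken directly. This is only a sketch reusing the names from the repro above (`loss`, `θ`, `dX`, `dY`, `opt`); it is not code from the thread:

```julia
# Flux.train! is essentially gradient + update!; calling gradient directly hits the same
# pullback that reportedly raises CUDNN_STATUS_BAD_PARAM on the first pass.
gs = Flux.gradient(() -> loss(dX, dY), θ)  # θ::Params, implicit-parameter style
Flux.Optimise.update!(opt, θ, gs)          # apply the ADAM step, mirroring train!
```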
1367: RNN update to drop CUDNN, fix LSTM bug and output type stability r=CarloLucibello a=jeremiedb

PR related to #1114 #1360 #1365

Some experimentation with RNN handling. The hidden state of each cell structure was dropped, as it wasn't needed (AFAIK, only needed for size inference for CUDNN, but the bias size could be used as a substitute for the cells' `h` there as well). Looked to drop the dependence on CUDNN entirely, so it's pure Flux/CUDA.jl. The file `src/cuda/curnn.jl` is no longer used. No modifications were made to the cell computations. Initial tests seem to show decent performance, but it has yet to be benchmarked.

Pending issue: despite having dropped the CUDNN dependency completely, there's still an instability issue that seems present when running on GPU. This is illustrated by the test at lines 1-50 of file `test\rnn-test-jdb.jl`. If that test runs on CPU, it goes well through the 100 iterations. However, the same test on GPU will throw NaNs after a couple dozen iterations. My only hypothesis so far: when performing the iteration over the sequence through `m.(x)` or `map(rnn, x)`, is the order of execution safe? I.e., is it possible that there isn't a `sync()` on the CUDA side between those sequence steps, which may mess up the state?

### PR Checklist

- [x] Tests are added
- [ ] Entry in NEWS.md
- [ ] Documentation, if applicable
- [ ] Final review from `@dhairyagandhi96` (for API changes).

Co-authored-by: jeremiedb <[email protected]>
Co-authored-by: jeremie.db <[email protected]>
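To make the PR's question about execution order concrete, here is a small sketch of three equivalent ways of iterating a stateful recurrent layer over a sequence; the model and data are made up for illustration and are not taken from the PR's test file:

```julia
using Flux

rnn = GRU(1, 1)
xs  = [rand(Float32, 1, 2) for _ in 1:3]  # sequence of 3 steps, batch of 2

y_broadcast = rnn.(xs)              # the m.(x) form questioned in the PR
Flux.reset!(rnn)
y_map       = map(rnn, xs)          # the map(rnn, x) form
Flux.reset!(rnn)
y_loop      = [rnn(x) for x in xs]  # explicit loop: the sequential state dependence is obvious here
```

Each call mutates `rnn.state`, so the time steps must execute in order; the PR's open question is whether that ordering (and the corresponding synchronization on the CUDA side) is guaranteed for the broadcast and `map` forms when running on GPU.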
@jeremiedb is this fixed by your PR?
Yes, it's now working on master. Closing the issue.
In the following minimal RNN example, the first call to `Flux.train!` fails with `CUDNN_STATUS_BAD_PARAM`. The same error is raised whether the cell is `GRU`, `RNN`, or `LSTM`.

It can be observed that both prior to and after `reset!`, the rnn state is of size (8), while after a call to `train!` on GPU the state becomes the expected size (8,10). After each call to `reset!`, the CUDNN_STATUS_BAD_PARAM error pops up on the first call to `train!`, but subsequent calls are fine as the state size stays (8,10). I can't confirm whether that state size is the root cause, but it appears closely tied to the bug. Also, a call to `loss(X,Y)` results in a proper state dimension of (8,10). Running on CPU doesn't result in any error/warning.

Pkg info (same error also raised with the latest Zygote release):
CUDA: 10.1.168
CUDNN: 7.6.5
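A sketch of the size observation described above. The hidden size of 8 and batch of 10 are inferred from the reported shapes; all other names, the input size, and the loss are made up for illustration, since the original MRE is not reproduced here:

```julia
using Flux, Statistics, CUDA

rnn = GRU(4, 8) |> gpu                          # hidden size 8; input size 4 assumed
X   = [gpu(rand(Float32, 4, 10)) for _ in 1:5]  # 5 time steps, batch of 10
Y   = gpu(rand(Float32, 8, 10))
θ   = Flux.params(rnn)
loss(x, y) = mean((rnn.(x)[end] .- y) .^ 2)     # compare only the last output, for brevity
opt = ADAM()

@show size(rnn.state)                # (8,)  both before and after Flux.reset!(rnn)
Flux.train!(loss, θ, [(X, Y)], opt)  # per the issue, this first call on GPU raised CUDNN_STATUS_BAD_PARAM
@show size(rnn.state)                # (8, 10) afterwards; subsequent train! calls were reported fine
Flux.reset!(rnn)                     # back to (8,), after which the next train! failed again
```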