RNN test failures with CUDNN #267
This is an issue with the CUDNN RNN APIs, which oddly enough I've never seen when actually using them, but which comes up regularly in the tests. As CUDNN has just added API logging, I'm hoping that will help me debug this when I get round to it.
FYI, GPU CI has moved, and somebody with Flux.jl ownership permissions should add it to the JuliaGPU GitLab group. See https://github.com/JuliaGPU/gitlab-ci
Just thought I'd add in here that I am receiving the same error as @CarloLucibello. The (truncated) stacktrace is below:
[ Info: Testing Flux/CUDNN
batch_size = 1: Error During Test at /home/jacobr/.julia/packages/Flux/jsf3Y/test/cuda/cudnn.jl:9
Got exception outside of a @test
CUDNNError(code 3, CUDNN_STATUS_BAD_PARAM)
Stacktrace:
[1] macro expansion at /home/jacobr/.julia/packages/CuArrays/f4Eke/src/dnn/error.jl:19 [inlined]
[2] cudnnRNNBackwardData(::Flux.CUDA.RNNDesc{Float32}, ::Int64, ::Array{CuArrays.CUDNN.TensorDesc,1}, ::CuArray{Float32,1}, ::Array{CuArrays.CUDNN.TensorDesc,1}, ::CuArray{Float32,1}, ::CuArrays.CUDNN.TensorDesc, ::CuArray{Float32,1}, ::Ptr{Nothing}, ::Ptr{Nothing}, ::CuArrays.CUDNN.FilterDesc, ::CuArray{Float32,1}, ::CuArrays.CUDNN.TensorDesc, ::CuArray{Float32,1}, ::Ptr{Nothing}, ::Ptr{Nothing}, ::Array{CuArrays.CUDNN.TensorDesc,1}, ::CuArray{Float32,1}, ::CuArrays.CUDNN.TensorDesc, ::CuArray{Float32,1}, ::Ptr{Nothing}, ::Ptr{Nothing}, ::CuArray{UInt8,1}, ::CuArray{UInt8,1}) at /home/jacobr/.julia/packages/Flux/jsf3Y/src/cuda/cudnn.jl:193
[3] backwardData(::Flux.CUDA.RNNDesc{Float32}, ::CuArray{Float32,1}, ::CuArray{Float32,1}, ::CuArray{Float32,1}, ::Nothing, ::CuArray{Float32,1}, ::Nothing, ::CuArray{UInt8,1}) at /home/jacobr/.julia/packages/Flux/jsf3Y/src/cuda/cudnn.jl:210
[4] backwardData(::Flux.CUDA.RNNDesc{Float32}, ::CuArray{Float32,1}, ::CuArray{Float32,1}, ::CuArray{Float32,1}, ::CuArray{Float32,1}, ::CuArray{UInt8,1}) at /home/jacobr/.julia/packages/Flux/jsf3Y/src/cuda/cudnn.jl:218
[5] (::getfield(Flux.CUDA, Symbol("##11#12")){Flux.GRUCell{TrackedArray{…,CuArray{Float32,2}},TrackedArray{…,CuArray{Float32,1}}},TrackedArray{…,CuArray{Float32,1}},TrackedArray{…,CuArray{Float32,1}},CuArray{UInt8,1},Tuple{CuArray{Float32,1},CuArray{Float32,1}}})(::Tuple{CuArray{Float32,1},CuArray{Float32,1}}) at /home/jacobr/.julia/packages/Flux/jsf3Y/src/cuda/cudnn.jl:329
[6] back_(::Flux.Tracker.Call{getfield(Flux.CUDA, Symbol("##11#12")){Flux.GRUCell{TrackedArray{…,CuArray{Float32,2}},TrackedArray{…,CuArray{Float32,1}}},TrackedArray{…,CuArray{Float32,1}},TrackedArray{…,CuArray{Float32,1}},CuArray{UInt8,1},Tuple{CuArray{Float32,1},CuArray{Float32,1}}},Tuple{Flux.Tracker.Tracked{CuArray{Float32,1}},Flux.Tracker.Tracked{CuArray{Float32,1}},Flux.Tracker.Tracked{CuArray{Float32,2}},Flux.Tracker.Tracked{CuArray{Float32,2}},Flux.Tracker.Tracked{CuArray{Float32,1}}}}, ::Tuple{CuArray{Float32,1},CuArray{Float32,1}}) at /home/jacobr/.julia/packages/Flux/jsf3Y/src/tracker/back.jl:23
[7] back(::Flux.Tracker.Tracked{Tuple{CuArray{Float32,1},CuArray{Float32,1}}}, ::Tuple{CuArray{Float32,1},Int64}) at /home/jacobr/.julia/packages/Flux/jsf3Y/src/tracker/back.jl:43
[8] foreach(::Function, ::Tuple{Flux.Tracker.Tracked{Tuple{CuArray{Float32,1},CuArray{Float32,1}}},Nothing}, ::Tuple{Tuple{CuArray{Float32,1},Int64},Nothing}) at ./abstractarray.jl:1836
[9] back_(::Flux.Tracker.Call{getfield(Flux.Tracker, Symbol("##328#330")){Flux.Tracker.TrackedTuple{Tuple{CuArray{Float32,1},CuArray{Float32,1}}},Int64},Tuple{Flux.Tracker.Tracked{Tuple{CuArray{Float32,1},CuArray{Float32,1}}},Nothing}}, ::CuArray{Float32,1}) at /home/jacobr/.julia/packages/Flux/jsf3Y/src/tracker/back.jl:26
[10] back(::Flux.Tracker.Tracked{CuArray{Float32,1}}, ::CuArray{Float32,1}) at /home/jacobr/.julia/packages/Flux/jsf3Y/src/tracker/back.jl:45
[11] back!(::TrackedArray{…,CuArray{Float32,1}}, ::CuArray{Float32,1}) at /home/jacobr/.julia/packages/Flux/jsf3Y/src/tracker/back.jl:62
[12] macro expansion at /home/jacobr/.julia/packages/Flux/jsf3Y/test/cuda/cudnn.jl:25 [inlined]
[13] macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Test/src/Test.jl:1156 [inlined]
[14] macro expansion at /home/jacobr/.julia/packages/Flux/jsf3Y/test/cuda/cudnn.jl:9 [inlined]
[15] macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Test/src/Test.jl:1156 [inlined]
[16] macro expansion at /home/jacobr/.julia/packages/Flux/jsf3Y/test/cuda/cudnn.jl:6 [inlined]
[17] macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Test/src/Test.jl:1083 [inlined]
[18] top-level scope at /home/jacobr/.julia/packages/Flux/jsf3Y/test/cuda/cudnn.jl:6
[19] include at ./boot.jl:317 [inlined]
[20] include_relative(::Module, ::String) at ./loading.jl:1044
[21] include(::Module, ::String) at ./sysimg.jl:29
[22] include(::String) at ./client.jl:392
[23] top-level scope at none:0
[24] include at ./boot.jl:317 [inlined]
[25] include_relative(::Module, ::String) at ./loading.jl:1044
[26] include(::Module, ::String) at ./sysimg.jl:29
[27] include(::String) at ./client.jl:392
[28] macro expansion at /home/jacobr/.julia/packages/Flux/jsf3Y/test/runtests.jl:45 [inlined]
[29] macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Test/src/Test.jl:1083 [inlined]
[30] top-level scope at /home/jacobr/.julia/packages/Flux/jsf3Y/test/runtests.jl:26
[31] include at ./boot.jl:317 [inlined]
[32] include_relative(::Module, ::String) at ./loading.jl:1044
[33] include(::Module, ::String) at ./sysimg.jl:29
[34] exec_options(::Base.JLOptions) at ./client.jl:266
[35] _start() at ./client.jl:425
batch_size = 5: Test Failed at /home/jacobr/.julia/packages/Flux/jsf3Y/test/cuda/cudnn.jl:28
Expression: ((rnn.cell).Wi).grad ≈ collect(((curnn.cell).Wi).grad)
Evaluated: [-0.00367632 -0.00351105 … -0.00199258 -0.00363324; 0.0218888 0.0180698 … 0.0237639 0.0238772; … ; -1.5432 -1.33318 … -1.58462 -1.74017; -1.05911 -1.06296 … -1.05237 -1.35042] ≈ Float32[-0.00157259 -0.00118586 … -0.00106842 -0.0011475; 0.0174023 0.0131109 … 0.021793 0.018576; … ; -1.82545 -1.64515 … -1.70861 -2.07368; -0.937059 -0.928057 … -0.998756 -1.2062]
Stacktrace:
[1] macro expansion at /home/jacobr/.julia/packages/Flux/jsf3Y/test/cuda/cudnn.jl:28 [inlined]
[2] macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Test/src/Test.jl:1156 [inlined]
[3] macro expansion at /home/jacobr/.julia/packages/Flux/jsf3Y/test/cuda/cudnn.jl:9 [inlined]
[4] macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Test/src/Test.jl:1156 [inlined]
[5] macro expansion at /home/jacobr/.julia/packages/Flux/jsf3Y/test/cuda/cudnn.jl:6 [inlined]
[6] macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Test/src/Test.jl:1083 [inlined]
[7] top-level scope at /home/jacobr/.julia/packages/Flux/jsf3Y/test/cuda/cudnn.jl:6
batch_size = 5: Test Failed at /home/jacobr/.julia/packages/Flux/jsf3Y/test/cuda/cudnn.jl:29
Expression: ((rnn.cell).Wh).grad ≈ collect(((curnn.cell).Wh).grad)
Evaluated: [-0.00167179 -0.000571634 … -0.00737623 -0.00230114; 0.00472922 -0.000953073 … -0.00443951 0.0117585; … ; -0.0160675 0.0515638 … 0.208142 -0.385053; 0.0282091 0.044212 … 0.154527 -0.288217] ≈ Float32[-0.00680693 -0.00125408 … -0.0138485 0.00544874; 0.00551455 -0.000585967 … -0.0031398 0.0100578; … ; -0.00432964 0.0610827 … 0.232323 -0.418382; 0.0216776 0.0394996 … 0.14176 -0.270818]
Stacktrace:
[1] macro expansion at /home/jacobr/.julia/packages/Flux/jsf3Y/test/cuda/cudnn.jl:29 [inlined]
[2] macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Test/src/Test.jl:1156 [inlined]
[3] macro expansion at /home/jacobr/.julia/packages/Flux/jsf3Y/test/cuda/cudnn.jl:9 [inlined]
[4] macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Test/src/Test.jl:1156 [inlined]
[5] macro expansion at /home/jacobr/.julia/packages/Flux/jsf3Y/test/cuda/cudnn.jl:6 [inlined]
[6] macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Test/src/Test.jl:1083 [inlined]
[7] top-level scope at /home/jacobr/.julia/packages/Flux/jsf3Y/test/cuda/cudnn.jl:6
batch_size = 5: Test Failed at /home/jacobr/.julia/packages/Flux/jsf3Y/test/cuda/cudnn.jl:30
Expression: ((rnn.cell).b).grad ≈ collect(((curnn.cell).b).grad)
Evaluated: [-0.00485622, 0.0356765, 0.0566717, -0.0967002, -0.0736521, 0.415462, 0.169964, -0.183906, 1.05687, -0.911806, 0.00403009, -1.50994, -0.834565, -2.4546, -1.56953] ≈ Float32[-0.00230207, 0.0302294, 0.0633711, -0.100256, -0.0690399, 0.270476, 0.153396, -0.207993, 1.27673, -0.803211, 0.344462, -0.880549, -1.12154, -2.79729, -1.42135]
Stacktrace:
[1] macro expansion at /home/jacobr/.julia/packages/Flux/jsf3Y/test/cuda/cudnn.jl:30 [inlined]
[2] macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Test/src/Test.jl:1156 [inlined]
[3] macro expansion at /home/jacobr/.julia/packages/Flux/jsf3Y/test/cuda/cudnn.jl:9 [inlined]
[4] macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Test/src/Test.jl:1156 [inlined]
[5] macro expansion at /home/jacobr/.julia/packages/Flux/jsf3Y/test/cuda/cudnn.jl:6 [inlined]
[6] macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Test/src/Test.jl:1083 [inlined]
[7] top-level scope at /home/jacobr/.julia/packages/Flux/jsf3Y/test/cuda/cudnn.jl:6
batch_size = 5: Test Failed at /home/jacobr/.julia/packages/Flux/jsf3Y/test/cuda/cudnn.jl:31
Expression: ((rnn.cell).h).grad ≈ collect(((curnn.cell).h).grad)
Evaluated: [-0.236697, -0.686411, -0.373723, -0.637998, -1.26944] ≈ Float32[-0.0674129, -0.592256, -0.351478, -0.822845, -1.20205]
Stacktrace:
[1] macro expansion at /home/jacobr/.julia/packages/Flux/jsf3Y/test/cuda/cudnn.jl:31 [inlined]
[2] macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Test/src/Test.jl:1156 [inlined]
[3] macro expansion at /home/jacobr/.julia/packages/Flux/jsf3Y/test/cuda/cudnn.jl:9 [inlined]
[4] macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Test/src/Test.jl:1156 [inlined]
[5] macro expansion at /home/jacobr/.julia/packages/Flux/jsf3Y/test/cuda/cudnn.jl:6 [inlined]
[6] macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Test/src/Test.jl:1083 [inlined]
[7] top-level scope at /home/jacobr/.julia/packages/Flux/jsf3Y/test/cuda/cudnn.jl:6
Test Summary: | Pass Fail Error Total
Flux | 399 4 1 404
Throttle | 11 11
Jacobian | 1 1
Initialization | 14 14
Params | 2 2
onecold | 4 4
Optimise | 10 10
Training Loop | 1 1
basic | 17 17
Dropout | 8 8
BatchNorm | 13 13
losses | 12 12
Pooling | 2 2
CNN | 1 1
Tracker | 248 248
CuArrays | 7 7
RNN | 40 4 1 45
R = Flux.RNN | 16 16
R = Flux.GRU | 6 4 1 11
batch_size = 1 | 2 1 3
batch_size = 5 | 4 4 8
R = Flux.LSTM | 18 18
ERROR: LoadError: Some tests did not pass: 399 passed, 4 failed, 1 errored, 0 broken.
in expression starting at /home/jacobr/.julia/packages/Flux/jsf3Y/test/runtests.jl:24
ERROR: LoadError: failed process: Process(`/home/jacobr/code/julia-1.0.3/bin/julia -Cnative -J/home/jacobr/code/julia-1.0.3/lib/julia/sys.so --compile=yes --depwarn=yes --color=yes --compiled-modules=yes --startup-file=no --code-coverage=none /home/jacobr/.julia/packages/Flux/jsf3Y/test/runtests.jl`, ProcessExited(1)) [1]
Stacktrace:
[1] error(::String, ::Base.Process, ::String, ::Int64, ::String) at ./error.jl:42
[2] pipeline_error at ./process.jl:705 [inlined]
[3] #run#503(::Bool, ::Function, ::Cmd) at ./process.jl:663
[4] run(::Cmd) at ./process.jl:661
[5] top-level scope at /home/jacobr/.julia/packages/Flux/jsf3Y/test/runtests.jl:5
[6] include at ./boot.jl:317 [inlined]
[7] include_relative(::Module, ::String) at ./loading.jl:1044
[8] include(::Module, ::String) at ./sysimg.jl:29
[9] include(::String) at ./client.jl:392
[10] top-level scope at none:0
in expression starting at /home/jacobr/.julia/packages/Flux/jsf3Y/test/runtests.jl:3
ERROR: Package Flux errored during testing
I'm using julia v1.0.3 and the package versions are:
Status `~/.julia/environments/v1.0/Project.toml`
[3895d2a7] CUDAapi v0.5.3+ #master (https://github.com/JuliaGPU/CUDAapi.jl.git)
[3a865a2d] CuArrays v0.8.1
[587475ba] Flux v0.6.10
Is this an actual problem with some part of the Flux implementation, or is it just a problem with the unit tests?
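For what it's worth, the `batch_size = 5` failures do not look like rounding noise. Below is a rough check, not part of the Flux test suite: it assumes the `≈` in those tests is Base's `isapprox` with default tolerances, and that the left-hand values come from the CPU `rnn` and the right-hand values from the CUDNN `curnn`. The numbers are the `h` gradients copied from the log above; the variable names are made up for illustration.

```julia
using LinearAlgebra

# Gradient of `h` as printed in the failing test above.
# Assumption: left-hand side is the CPU result, right-hand side the CUDNN result.
cpu_grad = [-0.236697, -0.686411, -0.373723, -0.637998, -1.26944]
gpu_grad = Float32[-0.0674129, -0.592256, -0.351478, -0.822845, -1.20205]

# For arrays, isapprox checks norm(x - y) <= rtol * max(norm(x), norm(y))
# when no atol is given. With mixed Float64/Float32 inputs the default rtol
# should be the looser of the two, i.e. sqrt(eps(Float32)).
rel_err = norm(cpu_grad - gpu_grad) / max(norm(cpu_grad), norm(gpu_grad))
default_rtol = sqrt(eps(Float32))

println("relative error = ", rel_err)
println("default rtol   = ", default_rtol)
println("isapprox       = ", cpu_grad ≈ gpu_grad)
```

The relative error comes out far above the default tolerance, which suggests the two code paths are producing genuinely different gradients rather than differing only in Float32 precision.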
That test failing with
Unfortunately this error / family of errors is really difficult to debug: it's not deterministic, doesn't show up in interactive sessions, and CUDNN API logging doesn't reveal anything insightful. I'm vaguely hoping that once we unify the Knet and Flux RNN wrappers into CuArrays, this will magically go away.
I got this error not in CI, but when running a real model. I had to manually edit https://discourse.julialang.org/t/flux-rnns-leak-into-curnns/25661
Should this be closed as well?
In the last month, CI on GPU has been consistently failing with the following error: