CUDNN convolution allocates outside of the memory pool #111

Closed · jonathan-laurent opened this issue Nov 13, 2019 · 10 comments
Labels: bug (Something isn't working), cuda array (Stuff about CuArray.)

@jonathan-laurent
When training a ResNet using either Flux or Knet, I am encountering runtime errors due to the GPU running out of memory. Typical errors include:

ERROR: LoadError: CUDAdrv.CuError(701, nothing)

with Flux and

ERROR: LoadError: cudnnFindConvolutionForwardAlgorithm: 2: CUDNN_STATUS_ALLOC_FAILED

with Knet (full stack traces are available below). The only way I found to eliminate these errors was to insert explicit periodic calls to the garbage collector in my code using GC.gc().
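For concreteness, the workaround looks roughly like the following (a minimal sketch, not my actual training code; train_with_gc!, train_step! and batches are placeholder names):

# Minimal sketch of the workaround: force a full collection every `gc_every`
# minibatches so that unreachable CuArrays are finalized and their memory
# returned to the pool.
function train_with_gc!(train_step!, batches; gc_every = 2000)
    for (i, batch) in enumerate(batches)
        train_step!(batch)
        i % gc_every == 0 && GC.gc()   # without this, the GPU eventually runs out of memory
    end
end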

My question is the following: why are these calls necessary and why isn't CuArrays calling the GC automatically when running out of GPU memory?

Also, note that after inserting explicit calls to the garbage collector, I am still encountering the seemingly common issue (JuliaGPU/CuArrays.jl#323, FluxML/Flux.jl#736, JuliaGPU/CuArrays.jl#273) where training slows down considerably after a few training epochs (~4x performance hit in my case).

Details

  • I am using Julia 1.2.0 with CuArrays 1.3.0, Flux 0.9, and Knet 1.2.7.
  • My GPU is an Nvidia RTX 2070 with 8 GB of memory.

The two runtime errors below arise after a few seconds, which corresponds to evaluating a ResNet (10 convolutional layers, 500K parameters) on about 40,000 samples (with minibatches of size 64). A quick back-of-the-envelope calculation shows that running inference on a single sample should allocate at most 0.2 MB; at 0.2 MB × 40,000 ≈ 8 GB, this is roughly how long it should take to fill my 8 GB of GPU memory if the GC is never called.

I am currently working on providing an easy way to replicate this result. In the meantime, here are two typical stack traces.

Flux stack trace

ERROR: LoadError: CUDAdrv.CuError(701, nothing)
Stacktrace:
 [1] (::getfield(CUDAdrv, Symbol("##25#26")){Bool,Int64,CUDAdrv.CuStream,CUDAdrv.CuFunction})(::Array{Ptr{Nothing},1}) at /home/jonathan/.julia/packages/CUDAdrv/ADRHQ/src/base.jl:145
 [2] macro expansion at /home/jonathan/.julia/packages/CUDAdrv/ADRHQ/src/execution.jl:63 [inlined]
 [3] macro expansion at ./gcutils.jl:87 [inlined]
 [4] macro expansion at /home/jonathan/.julia/packages/CUDAdrv/ADRHQ/src/execution.jl:61 [inlined]
 [5] pack_arguments(::getfield(CUDAdrv, Symbol("##25#26")){Bool,Int64,CUDAdrv.CuStream,CUDAdrv.CuFunction}, ::CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global}, ::CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global}, ::CartesianIndices{2,Tuple{Base.OneTo{Int64},Base.OneTo{Int64}}}, ::Int64, ::Int64) at /home/jonathan/.julia/packages/CUDAdrv/ADRHQ/src/execution.jl:40
 [6] #launch#24(::Int64, ::Tuple{Int64,Int64,Int64}, ::Bool, ::Int64, ::CUDAdrv.CuStream, ::typeof(CUDAdrv.launch), ::CUDAdrv.CuFunction, ::CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global}, ::Vararg{Any,N} where N) at /home/jonathan/.julia/packages/CUDAdrv/ADRHQ/src/execution.jl:90
 [7] #launch at ./none:0 [inlined]
 [8] #30 at /home/jonathan/.julia/packages/CUDAdrv/ADRHQ/src/execution.jl:179 [inlined]
 [9] macro expansion at /home/jonathan/.julia/packages/CUDAdrv/ADRHQ/src/execution.jl:140 [inlined]
 [10] macro expansion at ./gcutils.jl:87 [inlined]
 [11] macro expansion at /home/jonathan/.julia/packages/CUDAdrv/ADRHQ/src/execution.jl:139 [inlined]
 [12] convert_arguments at /home/jonathan/.julia/packages/CUDAdrv/ADRHQ/src/execution.jl:123 [inlined]
 [13] #cudacall#29 at /home/jonathan/.julia/packages/CUDAdrv/ADRHQ/src/execution.jl:178 [inlined]
 [14] #cudacall at ./none:0 [inlined]
 [15] #cudacall#175 at /home/jonathan/.julia/packages/CUDAnative/Lr0yj/src/execution.jl:280 [inlined]
 [16] #cudacall at ./none:0 [inlined]
 [17] macro expansion at /home/jonathan/.julia/packages/CUDAnative/Lr0yj/src/execution.jl:261 [inlined]
 [18] #call#163(::Base.Iterators.Pairs{Symbol,Any,Tuple{Symbol,Symbol},NamedTuple{(:threads, :blocks),Tuple{Tuple{Int64,Int64,Int64},Int64}}}, ::typeof(CUDAnative.call), ::CUDAnative.HostKernel{CuArrays.mapreducedim_kernel_parallel,Tuple{typeof(identity),typeof(Base.add_sum),CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global},CartesianIndices{2,Tuple{Base.OneTo{Int64},Base.OneTo{Int64}}},Int64,Int64}}, ::typeof(identity), ::typeof(Base.add_sum), ::CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global}, ::CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global}, ::CartesianIndices{2,Tuple{Base.OneTo{Int64},Base.OneTo{Int64}}}, ::Int64, ::Int64) at /home/jonathan/.julia/packages/CUDAnative/Lr0yj/src/execution.jl:238
 [19] (::getfield(CUDAnative, Symbol("#kw##call")))(::NamedTuple{(:threads, :blocks),Tuple{Tuple{Int64,Int64,Int64},Int64}}, ::typeof(CUDAnative.call), ::CUDAnative.HostKernel{CuArrays.mapreducedim_kernel_parallel,Tuple{typeof(identity),typeof(Base.add_sum),CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global},CartesianIndices{2,Tuple{Base.OneTo{Int64},Base.OneTo{Int64}}},Int64,Int64}}, ::Function, ::Vararg{Any,N} where N) at ./none:0
 [20] #call#178(::Base.Iterators.Pairs{Symbol,Any,Tuple{Symbol,Symbol},NamedTuple{(:threads, :blocks),Tuple{Tuple{Int64,Int64,Int64},Int64}}}, ::CUDAnative.HostKernel{CuArrays.mapreducedim_kernel_parallel,Tuple{typeof(identity),typeof(Base.add_sum),CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global},CartesianIndices{2,Tuple{Base.OneTo{Int64},Base.OneTo{Int64}}},Int64,Int64}}, ::Function, ::Vararg{Any,N} where N) at /home/jonathan/.julia/packages/CUDAnative/Lr0yj/src/execution.jl:407
 [21] (::getfield(CUDAnative, Symbol("#kw#HostKernel")))(::NamedTuple{(:threads, :blocks),Tuple{Tuple{Int64,Int64,Int64},Int64}}, ::CUDAnative.HostKernel{CuArrays.mapreducedim_kernel_parallel,Tuple{typeof(identity),typeof(Base.add_sum),CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global},CartesianIndices{2,Tuple{Base.OneTo{Int64},Base.OneTo{Int64}}},Int64,Int64}}, ::Function, ::Vararg{Any,N} where N) at ./none:0
 [22] macro expansion at /home/jonathan/.julia/packages/CuArrays/kOUu1/src/mapreduce.jl:87 [inlined]
 [23] macro expansion at ./gcutils.jl:87 [inlined]
 [24] _mapreducedim!(::Function, ::Function, ::CuArrays.CuArray{Float32,2}, ::CuArrays.CuArray{Float32,2}) at /home/jonathan/.julia/packages/CuArrays/kOUu1/src/mapreduce.jl:65
 [25] mapreducedim! at ./reducedim.jl:274 [inlined]
 [26] _mapreduce_dim at ./reducedim.jl:317 [inlined]
 [27] mapreduce_impl at /home/jonathan/.julia/packages/GPUArrays/tIMl5/src/mapreduce.jl:79 [inlined]
 [28] #mapreduce#50 at /home/jonathan/.julia/packages/GPUArrays/tIMl5/src/mapreduce.jl:65 [inlined]
 [29] #mapreduce at ./none:0 [inlined]
 [30] _sum at ./reducedim.jl:679 [inlined]
 [31] _sum at ./reducedim.jl:678 [inlined]
 [32] #sum#558 at ./reducedim.jl:652 [inlined]
 [33] #sum at ./none:0 [inlined]
 [34] #_forward#481 at /home/jonathan/.julia/packages/Tracker/m6d46/src/lib/array.jl:315 [inlined]
 [35] #_forward at ./none:0 [inlined]
 [36] #track#1 at /home/jonathan/.julia/packages/Tracker/m6d46/src/Tracker.jl:52 [inlined]
 [37] #track at ./none:0 [inlined]
 [38] #sum#480 at /home/jonathan/.julia/packages/Tracker/m6d46/src/lib/array.jl:312 [inlined]
 [39] #sum at ./none:0 [inlined]
 [40] unbroadcast(::Tracker.TrackedArray{Float32,1,CuArrays.CuArray{Float32,1}}, ::Tracker.TrackedArray{Float32,2,CuArrays.CuArray{Float32,2}}) at /home/jonathan/.julia/packages/Tracker/m6d46/src/lib/array.jl:492
 [41] back_(::Tracker.Call{getfield(Tracker, Symbol("#back#548")){2,getfield(Base.Broadcast, Symbol("##2#4")){getfield(Base.Broadcast, Symbol("##8#10")){getfield(Base.Broadcast, Symbol("##1#3")),getfield(Base.Broadcast, Symbol("##5#6")){getfield(Base.Broadcast, Symbol("##5#6")){getfield(Base.Broadcast, Symbol("##7#9"))}},getfield(Base.Broadcast, Symbol("##11#12")){getfield(Base.Broadcast, Symbol("##11#12")){getfield(Base.Broadcast, Symbol("##13#14"))}},getfield(Base.Broadcast, Symbol("##15#16")){getfield(Base.Broadcast, Symbol("##15#16")){getfield(Base.Broadcast, Symbol("##17#18"))}},typeof(+)},typeof(CUDAnative.tanh)},Tuple{Tracker.TrackedArray{Float32,2,CuArrays.CuArray{Float32,2}},Tracker.TrackedArray{Float32,1,CuArrays.CuArray{Float32,1}}}},Tuple{Tracker.Tracked{CuArrays.CuArray{Float32,2}},Tracker.Tracked{CuArrays.CuArray{Float32,1}}}}, ::CuArrays.CuArray{Float32,2}, ::Bool) at ./tuple.jl:159
 [42] back(::Tracker.Tracked{CuArrays.CuArray{Float32,2}}, ::CuArrays.CuArray{Float32,2}, ::Bool) at /home/jonathan/.julia/packages/Tracker/m6d46/src/back.jl:58
 [43] (::getfield(Tracker, Symbol("##13#14")){Bool})(::Tracker.Tracked{CuArrays.CuArray{Float32,2}}, ::CuArrays.CuArray{Float32,2}) at /home/jonathan/.julia/packages/Tracker/m6d46/src/back.jl:38
 [44] foreach(::Function, ::Tuple{Tracker.Tracked{CuArrays.CuArray{Float32,2}},Nothing,Tracker.Tracked{CuArrays.CuArray{Float32,2}},Nothing,Nothing}, ::NTuple{5,CuArrays.CuArray{Float32,2}}) at ./abstractarray.jl:1921
 [45] back_(::Tracker.Call{getfield(Tracker, Symbol("#back#548")){5,getfield(Base.Broadcast, Symbol("##2#4")){getfield(Base.Broadcast, Symbol("##8#10")){getfield(Base.Broadcast, Symbol("##5#6")){getfield(Base.Broadcast, Symbol("##1#3"))},getfield(Base.Broadcast, Symbol("##8#10")){getfield(Base.Broadcast, Symbol("##8#10")){getfield(Base.Broadcast, Symbol("##7#9")),getfield(Base.Broadcast, Symbol("##5#6")){getfield(Base.Broadcast, Symbol("##5#6")){getfield(Base.Broadcast, Symbol("##7#9"))}},getfield(Base.Broadcast, Symbol("##11#12")){getfield(Base.Broadcast, Symbol("##11#12")){getfield(Base.Broadcast, Symbol("##13#14"))}},getfield(Base.Broadcast, Symbol("##15#16")){getfield(Base.Broadcast, Symbol("##15#16")){getfield(Base.Broadcast, Symbol("##17#18"))}},typeof(-)},getfield(Base.Broadcast, Symbol("##5#6")){getfield(Base.Broadcast, Symbol("##5#6")){getfield(Base.Broadcast, Symbol("##7#9"))}},getfield(Base.Broadcast, Symbol("##11#12")){getfield(Base.Broadcast, Symbol("##11#12")){getfield(Base.Broadcast, Symbol("##13#14"))}},getfield(Base.Broadcast, Symbol("##15#16")){getfield(Base.Broadcast, Symbol("##15#16")){getfield(Base.Broadcast, Symbol("##17#18"))}},typeof(-)},getfield(Base.Broadcast, Symbol("##11#12")){getfield(Base.Broadcast, Symbol("##11#12")){getfield(Base.Broadcast, Symbol("##13#14"))}},getfield(Base.Broadcast, Symbol("##15#16")){getfield(Base.Broadcast, Symbol("##15#16")){getfield(Base.Broadcast, Symbol("##17#18"))}},typeof(*)},typeof(*)},Tuple{Tracker.TrackedArray{Float32,2,CuArrays.CuArray{Float32,2}},CuArrays.CuArray{Float32,2},Tracker.TrackedArray{Float32,2,CuArrays.CuArray{Float32,2}},CuArrays.CuArray{Float32,2},CuArrays.CuArray{Float32,2}}},Tuple{Tracker.Tracked{CuArrays.CuArray{Float32,2}},Nothing,Tracker.Tracked{CuArrays.CuArray{Float32,2}},Nothing,Nothing}}, ::CuArrays.CuArray{Float32,2}, ::Bool) at /home/jonathan/.julia/packages/Tracker/m6d46/src/back.jl:38
 [46] back(::Tracker.Tracked{CuArrays.CuArray{Float32,2}}, ::CuArrays.CuArray{Float32,2}, ::Bool) at /home/jonathan/.julia/packages/Tracker/m6d46/src/back.jl:58
 [47] #13 at /home/jonathan/.julia/packages/Tracker/m6d46/src/back.jl:38 [inlined]
 [48] foreach at ./abstractarray.jl:1921 [inlined]
 ... (the last 4 lines are repeated 5 more times)
 [69] back_(::Tracker.Call{getfield(Tracker, Symbol("##279#282")){Float32},Tuple{Nothing,Tracker.Tracked{Float32}}}, ::Float32, ::Bool) at /home/jonathan/.julia/packages/Tracker/m6d46/src/back.jl:38
 [70] back(::Tracker.Tracked{Float32}, ::Int64, ::Bool) at /home/jonathan/.julia/packages/Tracker/m6d46/src/back.jl:58
 [71] #back!#15 at /home/jonathan/.julia/packages/Tracker/m6d46/src/back.jl:77 [inlined]
 [72] #back! at ./none:0 [inlined]
 [73] #back!#32 at /home/jonathan/.julia/packages/Tracker/m6d46/src/lib/real.jl:16 [inlined]
 [74] back!(::Tracker.TrackedReal{Float32}) at /home/jonathan/.julia/packages/Tracker/m6d46/src/lib/real.jl:14
 [75] gradient_(::getfield(Flux.Optimise, Symbol("##14#20")){getfield(AlphaZero, Symbol("#loss#42")){AlphaZero.Trainer},Tuple{CuArrays.CuArray{Float32,2},CuArrays.CuArray{Float32,4},CuArrays.CuArray{Float32,2},CuArrays.CuArray{Float32,2},CuArrays.CuArray{Float32,2}}}, ::Tracker.Params) at /home/jonathan/.julia/packages/Tracker/m6d46/src/back.jl:4
 [76] #gradient#24(::Bool, ::typeof(Tracker.gradient), ::Function, ::Tracker.Params) at /home/jonathan/.julia/packages/Tracker/m6d46/src/back.jl:164
 [77] gradient at /home/jonathan/.julia/packages/Tracker/m6d46/src/back.jl:164 [inlined]
 [78] macro expansion at /home/jonathan/.julia/packages/Flux/dkJUV/src/optimise/train.jl:71 [inlined]
 [79] macro expansion at /home/jonathan/.julia/packages/Juno/oLB1d/src/progress.jl:119 [inlined]
 [80] #train!#12(::getfield(Flux.Optimise, Symbol("##16#22")), ::typeof(Flux.Optimise.train!), ::Function, ::Tracker.Params, ::Base.Generator{Base.Generator{Array{Tuple{Array{Float32,2},Array{Float32,4},Array{Float32,2},Array{Float32,2},Array{Float32,2}},1},getfield(AlphaZero.Util, Symbol("##7#9")){getfield(AlphaZero, Symbol("##40#43")){AlphaZero.Trainer}}},getfield(AlphaZero.Util, Symbol("#process#10")){getfield(AlphaZero, Symbol("##41#44")){AlphaZero.Trainer},Int64}}, ::Flux.Optimise.ADAM) at /home/jonathan/.julia/packages/Flux/dkJUV/src/optimise/train.jl:69
 [81] train! at /home/jonathan/.julia/packages/Flux/dkJUV/src/optimise/train.jl:67 [inlined]
 [82] train!(::ResNet{Game}, ::Function, ::Base.Generator{Base.Generator{Array{Tuple{Array{Float32,2},Array{Float32,4},Array{Float32,2},Array{Float32,2},Array{Float32,2}},1},getfield(AlphaZero.Util, Symbol("##7#9")){getfield(AlphaZero, Symbol("##40#43")){AlphaZero.Trainer}}},getfield(AlphaZero.Util, Symbol("#process#10")){getfield(AlphaZero, Symbol("##41#44")){AlphaZero.Trainer},Int64}}, ::Float32) at /home/jonathan/AlphaZero.jl/src/Flux/FluxNets.jl:81
 [83] training_epoch!(::AlphaZero.Trainer) at /home/jonathan/AlphaZero.jl/src/Learning.jl:95
 [84] macro expansion at ./util.jl:213 [inlined]
 [85] learning!(::Env{Game,ResNet{Game},Board}, ::Session{Env{Game,ResNet{Game},Board}}) at /home/jonathan/AlphaZero.jl/src/Training.jl:91
 [86] top-level scope at /home/jonathan/.julia/packages/JuliaInterpreter/MXq3U/src/construct.jl:35
 [87] include at ./boot.jl:328 [inlined]
 [88] include_relative(::Module, ::String) at ./loading.jl:1094
 [89] include(::Module, ::String) at ./Base.jl:31
 [90] include(::String) at ./client.jl:431

Knet stack trace

ERROR: LoadError: cudnnFindConvolutionForwardAlgorithm: 2: CUDNN_STATUS_ALLOC_FAILED
 [1] error(::String) at ./error.jl:33
 [2] macro expansion at /home/jonathan/.julia/packages/Knet/IIjk8/src/gpu.jl:33 [inlined]
 [3] #conv4_algo#459(::Ptr{Nothing}, ::Base.Iterators.Pairs{Symbol,Int64,Tuple{Symbol},NamedTuple{(:padding,),Tuple{Int64}}}, ::typeof(Knet.conv4_algo), ::Knet.KnetArray{Float32,4}, ::Knet.KnetArray{Float32,4}, ::Knet.KnetArray{Float32,4}) at /home/jonathan/.julia/packages/Knet/IIjk8/src/conv.jl:518
 [4] #conv4#261(::Ptr{Nothing}, ::Int64, ::Base.Iterators.Pairs{Symbol,Int64,Tuple{Symbol},NamedTuple{(:padding,),Tuple{Int64}}}, ::typeof(Knet.conv4), ::Knet.KnetArray{Float32,4}, ::Knet.KnetArray{Float32,4}) at ./none:0
 [5] (::getfield(Knet, Symbol("#kw##conv4")))(::NamedTuple{(:padding,),Tuple{Int64}}, ::typeof(Knet.conv4), ::Knet.KnetArray{Float32,4}, ::Knet.KnetArray{Float32,4}) at ./none:0
 [6] #forw#1(::Base.Iterators.Pairs{Symbol,Int64,Tuple{Symbol},NamedTuple{(:padding,),Tuple{Int64}}}, ::typeof(AutoGrad.forw), ::Function, ::AutoGrad.Param{Knet.KnetArray{Float32,4}}, ::Vararg{Any,N} where N) at /home/jonathan/.julia/packages/AutoGrad/9MrCC/src/core.jl:66
 [7] #forw at ./none:0 [inlined]
 [8] #conv4#264 at ./none:0 [inlined]
 [9] (::getfield(Knet, Symbol("#kw##conv4")))(::NamedTuple{(:padding,),Tuple{Int64}}, ::typeof(Knet.conv4), ::AutoGrad.Param{Knet.KnetArray{Float32,4}}, ::AutoGrad.Result{Knet.KnetArray{Float32,4}}) at ./none:0
 [10] (::AlphaZero.KNets.Conv)(::AutoGrad.Result{Knet.KnetArray{Float32,4}}) at /home/jonathan/AlphaZero.jl/src/Knet/Layers.jl:58
 [11] (::AlphaZero.KNets.Chain)(::AutoGrad.Result{Knet.KnetArray{Float32,4}}) at /home/jonathan/AlphaZero.jl/src/Knet/Layers.jl:17
 [12] forward(::ResNet{Game}, ::Knet.KnetArray{Float32,4}) at /home/jonathan/AlphaZero.jl/src/Knet/KNets.jl:101
 [13] evaluate(::ResNet{Game}, ::Knet.KnetArray{Float32,4}, ::Knet.KnetArray{Float32,2}) at /home/jonathan/AlphaZero.jl/src/Network.jl:208
 [14] losses(::ResNet{Game}, ::LearningParams, ::Float32, ::Float32, ::Tuple{Knet.KnetArray{Float32,2},Knet.KnetArray{Float32,4},Knet.KnetArray{Float32,2},Knet.KnetArray{Float32,2},Knet.KnetArray{Float32,2}}) at /home/jonathan/AlphaZero.jl/src/Learning.jl:44
 [15] (::getfield(AlphaZero, Symbol("#loss#42")){AlphaZero.Trainer})(::Knet.KnetArray{Float32,2}, ::Vararg{Any,N} where N) at /home/jonathan/AlphaZero.jl/src/Learning.jl:87
 [16] (::getfield(Knet, Symbol("##695#696")){Knet.Minimize{Base.Generator{Base.Generator{Array{Tuple{Array{Float32,2},Array{Float32,4},Array{Float32,2},Array{Float32,2},Array{Float32,2}},1},getfield(AlphaZero.Util, Symbol("##7#9")){getfield(AlphaZero, Symbol("##40#43")){AlphaZero.Trainer}}},getfield(AlphaZero.Util, Symbol("#process#10")){getfield(AlphaZero, Symbol("##41#44")){AlphaZero.Trainer},Int64}}},Tuple{Knet.KnetArray{Float32,2},Knet.KnetArray{Float32,4},Knet.KnetArray{Float32,2},Knet.KnetArray{Float32,2},Knet.KnetArray{Float32,2}}})() at /home/jonathan/.julia/packages/AutoGrad/9MrCC/src/core.jl:205
 [17] #differentiate#3(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(AutoGrad.differentiate), ::Function) at /home/jonathan/.julia/packages/AutoGrad/9MrCC/src/core.jl:144
 [18] differentiate at /home/jonathan/.julia/packages/AutoGrad/9MrCC/src/core.jl:135 [inlined]
 [19] iterate at /home/jonathan/.julia/packages/Knet/IIjk8/src/train.jl:23 [inlined]
 [20] _collect(::UnitRange{Int64}, ::Knet.Minimize{Base.Generator{Base.Generator{Array{Tuple{Array{Float32,2},Array{Float32,4},Array{Float32,2},Array{Float32,2},Array{Float32,2}},1},getfield(AlphaZero.Util, Symbol("##7#9")){getfield(AlphaZero, Symbol("##40#43")){AlphaZero.Trainer}}},getfield(AlphaZero.Util, Symbol("#process#10")){getfield(AlphaZero, Symbol("##41#44")){AlphaZero.Trainer},Int64}}}, ::Base.EltypeUnknown, ::Base.HasShape{1}) at ./array.jl:619
 [21] collect at ./array.jl:544 [inlined]
 [22] |> at ./operators.jl:854 [inlined]
 [23] train!(::ResNet{Game}, ::Function, ::Base.Generator{Base.Generator{Array{Tuple{Array{Float32,2},Array{Float32,4},Array{Float32,2},Array{Float32,2},Array{Float32,2}},1},getfield(AlphaZero.Util, Symbol("##7#9")){getfield(AlphaZero, Symbol("##40#43")){AlphaZero.Trainer}}},getfield(AlphaZero.Util, Symbol("#process#10")){getfield(AlphaZero, Symbol("##41#44")){AlphaZero.Trainer},Int64}}, ::Float32) at /home/jonathan/AlphaZero.jl/src/Knet/KNets.jl:84
 [24] training_epoch!(::AlphaZero.Trainer) at /home/jonathan/AlphaZero.jl/src/Learning.jl:95
 [25] macro expansion at ./util.jl:213 [inlined]
 [26] learning!(::Env{Game,ResNet{Game},Board}, ::Session{Env{Game,ResNet{Game},Board}}) at /home/jonathan/AlphaZero.jl/src/Training.jl:91
 [27] top-level scope at /home/jonathan/.julia/packages/JuliaInterpreter/MXq3U/src/construct.jl:35
 [28] include at ./boot.jl:328 [inlined]
 [29] include_relative(::Module, ::String) at ./loading.jl:1094
 [30] include(::Module, ::String) at ./Base.jl:31
 [31] include(::String) at ./client.jl:431
 [32] top-level scope at none:0
 [33] eval at ./boot.jl:330 [inlined]
 [34] repleval(::Module, ::Expr) at /home/jonathan/.julia/packages/Atom/lBERI/src/repl.jl:149
 [35] (::getfield(Atom, Symbol("##172#174")){Module})() at /home/jonathan/.julia/packages/Atom/lBERI/src/repl.jl:171
 [36] with_logstate(::getfield(Atom, Symbol("##172#174")){Module}, ::Base.CoreLogging.LogState) at ./logging.jl:395
 [37] with_logger at ./logging.jl:491 [inlined]
 [38] evalrepl(::Module, ::String) at /home/jonathan/.julia/packages/Atom/lBERI/src/repl.jl:162
 [39] top-level scope at /home/jonathan/.julia/packages/Atom/lBERI/src/repl.jl:207
 [40] eval(::Module, ::Any) at ./boot.jl:330
 [41] eval_user_input(::Any, ::REPL.REPLBackend) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.2/REPL/src/REPL.jl:86
 [42] macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.2/REPL/src/REPL.jl:118 [inlined]
 [43] (::getfield(REPL, Symbol("##26#27")){REPL.REPLBackend})() at ./task.jl:268
@maleadt (Member) commented Nov 13, 2019

That's strange: the backtrace points outside of the CuArrays allocator, which means some other operation is secretly allocating and triggering an asynchronous OOM. You are using CuArrays < 1.4, where FFT allocations weren't pool-managed; are you computing FFTs somewhere?

@jonathan-laurent (Author) commented Nov 13, 2019

@maleadt I do not think I am computing FFTs anywhere. My model is an assembly of standard layers that are available in both Flux and Knet. I suspect the bug has to do with convolutional layers, as I did not encounter the same issue with dense networks. Also, the same bug happens with both Knet and Flux.

I just set up a GitHub branch to replicate the issue. To reproduce:

git clone git@github.com:jonathan-laurent/AlphaZero.jl.git -b gpu-bug
cd AlphaZero.jl
julia --project games/mancala/test_learning.jl

If your GPU is more powerful, you may want to increase the size of the model by changing this line.
To work around the issue with explicit calls to the garbage collector, replace this line with gc_every = 2000. Finally, to switch from Flux to Knet, uncomment lines 63-64 of AlphaZero.jl and comment out lines 61-62.

To have a quick look at my ResNet model, see ResNet.jl.

Edit: if you encounter a deserialization error when trying to run the replication experiment, remove the session-mancala-bug folder and run the script again. It will take a few minutes for the training data to be generated before learning starts.

@jonathan-laurent (Author)

@maleadt I was just thinking: would it be possible that convolutional layers are implemented with an FFT in both Flux and Knet? Then, if FFT allocations are not pooled before CuArrays 1.4, this might explain my issue and also JuliaGPU/CuArrays.jl#323, FluxML/Flux.jl#736 and JuliaGPU/CuArrays.jl#273 (all involve convolutions).

maleadt changed the title from "GPU running out of memory unless explicit calls to the GC are made periodically" to "CUDNN convolution allocates outside of the memory pool" on Nov 14, 2019
@maleadt (Member) commented Nov 14, 2019

For the Knet case, it seems straightforward: the documentation for cudnnFindConvolutionForwardAlgorithm explicitly mentions it allocates.

Memory is allocated via cudaMalloc(). The performance metrics are returned in the user-allocated array of cudnnConvolutionFwdAlgoPerf_t. These metrics are written in a sorted fashion where the first element has the lowest compute time. The total number of resulting algorithms can be queried through the API cudnnGetConvolutionForwardAlgorithmMaxCount().

You should file an issue there; they should be using cudnnFindConvolutionForwardAlgorithmEx with an explicit workspace argument that uses memory from the pool.
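To illustrate the direction of that fix (this is not Knet's actual code): the workspace would be allocated through the CuArrays pool and handed to the Ex variant, instead of cuDNN calling cudaMalloc internally. A minimal sketch, where with_pool_workspace is a hypothetical helper and the cuDNN call itself is only indicated in comments:

using CuArrays

# Hypothetical helper: take a scratch buffer from the CuArrays pool and pass it
# to `f`. Because the allocation goes through the pool, it is cached and reused
# instead of triggering a raw cudaMalloc.
function with_pool_workspace(f, sz::Integer)
    workspace = CuArrays.CuArray{UInt8}(undef, sz)  # pool-managed allocation
    return f(workspace)
end

# Usage sketch (illustrative only; descriptor setup and the actual
# cudnnFindConvolutionForwardAlgorithmEx ccall are omitted):
# with_pool_workspace(64 * 1024^2) do ws
#     # pass pointer(ws) and sizeof(ws) as the workSpace / workSpaceSizeInBytes arguments
# end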

For CuArrays, something else is going on.

@jonathan-laurent (Author)

For CuArrays, something else is going on.

Do you have any idea what this might be? And do you have any reason to believe that the problem must be on the CuArrays side, rather than in the conv implementation in FluxML/NNlib?

Anyway, please let me know if there is anything I can do to help.

@maleadt (Member) commented Nov 18, 2019

Can you provide an MWE?

@jonathan-laurent (Author)

@maleadt I am working on it.
Replicating the Knet bug is easy and only requires a basic stress test.
The Flux/CuArrays bug is trickier to isolate, though.

@jonathan-laurent (Author) commented Nov 19, 2019

@maleadt I think I figured out what is happening.
The problem seems to be that the CuArrays allocator will sometimes allocate all available memory on the GPU, leaving no space to store program code or accommodate CUDA overheads.

Indeed, my original code runs two different GPU-accelerated computations in sequence. After the first computation, only 2 MiB of memory are left (according to nvidia-smi). This memory could be freed, but my hypothesis is that before the allocator is given an opportunity to do so, an attempt is made to load the code of the second computation, which fails because there isn't enough memory left.

As expected, the problem disappears if I put a limit on the total memory usage:

Env["CUARRAYS_MEMORY_LIMIT"] = 7_500_000_000
using CuArrays
...

Does this make sense?
If so, then it would be nice to have a way to clear the GPU memory pool manually. I saw there used to be a CuArrays.clearpool function but it isn't available anymore. May I ask why?
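For illustration, a manual "clear the pool" step between two GPU computations might look like this (a sketch: GC.gc() is standard Julia, while CuArrays.reclaim is an assumed API name that may or may not exist in the CuArrays version used here, hence the isdefined guard):

using CuArrays

# Sketch: free unreachable arrays, then (if such an API exists) hand cached
# pool memory back to the CUDA driver before the next GPU computation.
function clear_gpu_memory()
    GC.gc()                                # finalize unreachable CuArrays
    if isdefined(CuArrays, :reclaim)
        CuArrays.reclaim()                 # assumed name: return cached blocks to the driver
    end
end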

@maleadt (Member) commented Nov 19, 2019

I think the problem is that the CuArrays allocator will sometimes allocate all available memory on GPU, leaving no space to store program code or accommodate CUDA overheads.

Yes, I know that's what's happening, and that's why I renamed this issue. We expect all allocations to go through the CuArrays memory pool, or else we can't cache memory (which breaks the whole concept of a memory pool). So we need to figure out where those allocations come from: in the Knet case it's clear, cudnnFindConvolutionForwardAlgorithm, but that function isn't called anywhere in CuArrays/Flux.
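One way to pinpoint such a hidden allocation is to compare the driver-reported free memory before and after a suspect call: if it drops while the pool's own bookkeeping stays constant, the call allocated behind the pool's back. A rough sketch (assuming CUDAdrv.Mem.info() returns free and total device memory in bytes):

using CUDAdrv

# Rough sketch: report how much device memory a call consumes according to the
# driver, independently of the CuArrays pool.
function driver_memory_delta(f)
    free_before, _ = CUDAdrv.Mem.info()
    result = f()
    free_after, _ = CUDAdrv.Mem.info()
    @info "driver-visible memory consumed" delta_bytes = free_before - free_after
    return result
end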

maleadt transferred this issue from JuliaGPU/CuArrays.jl on May 27, 2020
maleadt added the labels bug (Something isn't working) and cuda array (Stuff about CuArray.) on May 27, 2020
@maleadt (Member) commented Mar 2, 2021

External allocations should be handled a lot better now.

maleadt closed this as completed on Mar 2, 2021