Self-contained MWE below. On my system, this will run for about 40 epochs with the following output from nvidia-smi.
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     50425      C   julia-1.1.0/bin/julia                       1713MiB |
+-----------------------------------------------------------------------------+
And then it will suddenly give the following output.
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     44870      C   julia-1.1.0/bin/julia                      11997MiB |
+-----------------------------------------------------------------------------+
I stress that the shift in memory usage is sudden: the model uses ~2 GB over many epochs before abruptly deciding it needs to fill all available memory. Note that the PIDs above differ because I killed the first process and restarted training.
This behavior occurs in other networks as well. It prevents me from training large-scale models overnight, and it also prevents other people from using the same machine for their tasks. I don't know the Flux/Tracker/CuArrays internals well enough to speculate on what might be going on, but I would love to help debug; a sketch of how I would log per-epoch memory usage follows the MWE.
using CuArrays
using Flux
using Flux: train!, onehotbatch, @epochs
using MLDatasets
function make_batches(data::T, batch_size::Integer)::Array{T} where T <: Tuple{AbstractArray,AbstractArray}
    # Shuffle along the last (observation) dimension and split into mini-batches.
    batches = Vector{T}()
    idx_shuffled = Flux.Random.randperm(size(data[1])[end])
    for idx in Iterators.partition(idx_shuffled, batch_size)
        x = selectdim(data[1], length(size(data[1])), idx)
        y = selectdim(data[2], length(size(data[2])), idx)
        push!(batches, (x, y))
    end
    batches
end
data = MNIST.traindata() |>
    x -> (reshape(Array{Float32}(x[1]), (28, 28, 1, :)), x[2]) |>
    x -> make_batches(x, 16) |>
    x -> map(y -> (y[1], onehotbatch(y[2], 0:9)), x) |>
    x -> map(y -> gpu.(y), x)
CuArrays.allowscalar(false)
network = Chain(
    Conv((3, 3), 1 => 16, relu; pad=(1, 1)),
    BatchNorm(16),
    Conv((3, 3), 16 => 16; pad=(1, 1)),
    MeanPool((3, 3); pad=(1, 1)),
    Conv((3, 3), 16 => 32, relu; pad=(1, 1)),
    BatchNorm(32),
    Conv((3, 3), 32 => 32; pad=(1, 1)),
    MeanPool((3, 3); pad=(1, 1)),
    Conv((3, 3), 32 => 64, relu; pad=(1, 1)),
    BatchNorm(64),
    Conv((3, 3), 64 => 64; pad=(1, 1)),
    MeanPool((3, 3); pad=(1, 1)),
    Conv((3, 3), 64 => 128, relu; pad=(1, 1)),
    BatchNorm(128),
    Conv((3, 3), 128 => 128; pad=(1, 1)),
    MeanPool((4, 4); pad=(1, 1)),
    x -> reshape(x, (:, size(x)[end])),
    Dense(128, 10),
    softmax
) |> gpu
function loss(x::AbstractArray, y::AbstractArray)::Real
    y_predicted = network(x)
    # Cross-entropy plus an L2 penalty on all parameters.
    R = 0.01f0 * sum(sum(w.^2) for w in params(network))
    Flux.crossentropy(y_predicted, y) + R
end
optimizer = ADAM()
@epochs 1000 train!(loss, params(network), data, optimizer)
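To pin down exactly when the jump happens, I could replace the final @epochs line with an explicit loop that logs device memory after every epoch. This is only a sketch of how I would instrument the MWE, not something I have in the report above; it reuses the loss, network, data, and optimizer defined above and assumes nvidia-smi is on the PATH.

# Explicit epoch loop so memory can be logged after each pass over the data.
for epoch in 1:1000
    train!(loss, params(network), data, optimizer)
    print("epoch $epoch: ")
    # Usage as reported by the driver (assumes nvidia-smi is available on this machine).
    run(`nvidia-smi --query-gpu=memory.used --format=csv,noheader`)
end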
Since the linked JuliaGPU/CUDA.jl#137 was closed, perhaps we should close this too? Flux leans almost exclusively on CUDA for memory management, so there's not too much actionable here without big interface changes (e.g. doing everything in-place).
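As a stopgap for anyone still hitting this, handing cached memory back to the driver between epochs can keep the numbers reported by nvidia-smi closer to live usage. The following is only a sketch against current CUDA.jl (which replaced CuArrays); reclaim_gpu_memory is a hypothetical helper name, not an API.

using CUDA

# Hypothetical helper: call between epochs to release cached device memory.
function reclaim_gpu_memory()
    GC.gc()               # drop unreferenced arrays on the Julia side
    CUDA.reclaim()        # return pooled blocks to the driver
    CUDA.memory_status()  # print pool vs. live allocation statistics
end

# e.g. inside a training loop:
# for epoch in 1:n_epochs
#     Flux.train!(loss, ps, data, opt)
#     reclaim_gpu_memory()
# end

This trades some allocation speed for a smaller reported footprint, so it is a mitigation rather than a fix for the underlying pool growth.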