
Sudden memory leak when training on GPU over many epochs #736

Closed

aterenin opened this issue Apr 15, 2019 · 1 comment

@aterenin (Contributor) commented:
Self-contained MWE below. On my system, it runs for about 40 epochs with the following output from nvidia-smi.

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     50425      C   julia-1.1.0/bin/julia                       1713MiB |
+-----------------------------------------------------------------------------+

Then it suddenly gives the following output.

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     44870      C   julia-1.1.0/bin/julia                      11997MiB |
+-----------------------------------------------------------------------------+

I stress that the shift in memory usage is sudden: the model uses ~2 GB for many epochs, then abruptly fills all available GPU memory. Note that the PIDs above differ because I killed the first process and restarted training.
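
As a side note, here is a minimal sketch of how one could log the numbers above from inside Julia by polling nvidia-smi (log_gpu_memory is a hypothetical helper, not something the MWE uses):

# Hypothetical helper: print the used memory (MiB) on GPU 0, as reported by nvidia-smi.
function log_gpu_memory()
  used = strip(read(`nvidia-smi -i 0 --query-gpu=memory.used --format=csv,noheader,nounits`, String))
  println("GPU 0 memory used: $used MiB")
end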

This behavior also occurs with other networks. It prevents me from training large-scale models overnight, and it blocks other people from using the same machine for their tasks. I don't know the Flux/Tracker/CuArrays internals well enough to speculate on the cause, but I would love to help debug.

using CuArrays
using Flux
using Flux: train!, onehotbatch, @epochs
using MLDatasets

# Shuffle the observations (last dimension) and split them into batches of size batch_size.
function make_batches(data::T, batch_size::Integer)::Array{T} where T <: Tuple{AbstractArray,AbstractArray}
  batches = Vector{T}()
  idx_shuffled = Flux.Random.randperm(size(data[1])[end])
  for idx in Iterators.partition(idx_shuffled, batch_size)
    x = selectdim(data[1], length(size(data[1])), idx)
    y = selectdim(data[2], length(size(data[2])), idx)
    push!(batches, (x,y))
  end
  batches
end

# Load MNIST, reshape to WHCN layout, batch, one-hot encode the labels, and move each batch to the GPU.
data = MNIST.traindata() |>
  x -> (reshape(Array{Float32}(x[1]),(28,28,1,:)),x[2]) |>
  x -> make_batches(x,16) |>
  x -> map(y->(y[1], onehotbatch(y[2],0:9)),x) |>
  x -> map(y->gpu.(y),x)

CuArrays.allowscalar(false)

# Small convolutional classifier for 28×28×1 MNIST images.
network = Chain(
    Conv((3,3),1=>16,relu;pad=(1,1)),
    BatchNorm(16),
    Conv((3,3),16=>16;pad=(1,1)),
    MeanPool((3,3); pad=(1,1)),
    Conv((3,3),16=>32,relu;pad=(1,1)),
    BatchNorm(32),
    Conv((3,3),32=>32;pad=(1,1)),
    MeanPool((3,3); pad=(1,1)),
    Conv((3,3),32=>64,relu;pad=(1,1)),
    BatchNorm(64),
    Conv((3,3),64=>64;pad=(1,1)),
    MeanPool((3,3); pad=(1,1)),
    Conv((3,3),64=>128,relu;pad=(1,1)),
    BatchNorm(128),
    Conv((3,3),128=>128;pad=(1,1)),
    MeanPool((4,4); pad=(1,1)),
    x -> reshape(x,(:,size(x)[end])),
    Dense(128,10),
    softmax
  ) |> gpu

# Cross-entropy loss with an explicit L2 penalty on all network parameters.
function loss(x::AbstractArray, y::AbstractArray)::Real
  y_predicted = network(x)
  R = 0.01f0 * sum(sum(w.^2) for w in params(network))
  Flux.crossentropy(y_predicted, y) + R
end

optimizer = ADAM()

@epochs 1000 train!(loss, params(network), data, optimizer)
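
As a variant (a sketch, not part of the MWE itself), the last line can also be run with train!'s cb keyword and Flux.throttle, so that the hypothetical log_gpu_memory helper above prints usage while training, at most once every 30 seconds:

# Sketch: log GPU memory periodically during training via a throttled callback.
log_cb = Flux.throttle(log_gpu_memory, 30)
@epochs 1000 train!(loss, params(network), data, optimizer, cb = log_cb)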
@ToucheSir (Member) commented:

Since the linked JuliaGPU/CUDA.jl#137 was closed, perhaps we should close this too? Flux leans almost exclusively on CUDA for memory management, so there's not too much actionable here without big interface changes (e.g. doing everything in-place).
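
For anyone hitting this, a minimal sketch (assuming the current CUDA.jl API) of the kind of manual reclamation that can serve as a stopgap; GC.gc() followed by CUDA.reclaim() returns cached pool memory to the driver, but it is a workaround rather than a fix:

using CUDA

# Sketch of a workaround: force a GC pass and return freed pool memory to the driver
# after each epoch, instead of using the @epochs macro.
for epoch in 1:1000
  train!(loss, params(network), data, optimizer)
  GC.gc()
  CUDA.reclaim()
end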
