CUDNN convolution allocates outside of the memory pool #111
That's strange, the backtrace points outside of the CuArrays allocator, so this means some other operation is secretly allocating and triggers an asynchronous OOM. You are using CuArrays <1.4, where FFT allocations weren't pool-managed; are you computing FFTs somewhere?
@maleadt I do not think I am computing FFTs anywhere. My model is an assembly of standard layers that are available in both Flux and Knet. I suspect that the bug has to do with convolutional layers, as I did not encounter the same issue with dense networks. Also, the same bug happens with both Knet and Flux. I just set up a GitHub branch to replicate the issue. To replicate:
If your GPU is more powerful, you may want to increase the size of the model by changing this line. To have a quick look at my ResNet model, see the model definition in that branch. Edit: if you encounter a deserialization error when trying to run the replication experiment, remove the old serialized file.
@maleadt I was just thinking: would it be possible that convolutional layers are implemented with an FFT in both Flux and Knet? Then, if FFT allocations are not pooled before CuArrays 1.4, this might explain my issue and also JuliaGPU/CuArrays.jl#323, FluxML/Flux.jl#736, and JuliaGPU/CuArrays.jl#273 (all involve convolutions).
For the Knet case, it seems straightforward: the documentation for cudnnFindConvolutionForwardAlgorithm states that it allocates device memory itself (via cudaMalloc, outside of any pool). You should file an issue there; they should be using the Ex variant, which works on a caller-provided workspace.
For CuArrays, something else is going on.
Do you have any idea what this might be? Do you have any reason to believe that the problem must be on the CuArrays side, rather than on the Flux side? Anyway, please let me know if there is anything I can do to help.
Can you provide an MWE?
@maleadt I am working on it.
@maleadt I think I figured out what is happening. Indeed, my original code runs two different GPU-accelerated computations in sequence. After the first computation, there is only 2 MiB of GPU memory left (according to the reported memory statistics). As expected, the problem disappears if I put a limit on the total memory usage:
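(A minimal sketch of what I mean; I am assuming the CUARRAYS_MEMORY_LIMIT environment variable read by the CuArrays pool, so both the variable name and the 7 GB figure are mine, not necessarily what you would use.)

```julia
# Sketch: cap the CuArrays memory pool below the physical 8 GB so the pool
# always leaves headroom for allocations made outside of it.
# CUARRAYS_MEMORY_LIMIT is my assumption here; check the CuArrays docs for the
# exact option in your version. It must be set before CuArrays initializes.
ENV["CUARRAYS_MEMORY_LIMIT"] = "7000000000"  # ~7 GB, in bytes

using CuArrays
```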
Does this make sense?
Yes, I know that's what's happening, and that's why I renamed this issue. We expect all allocations to go through the CuArrays memory pool, or else we can't cache memory (breaking the whole concept of a memory pool). So we need to figure out where those allocations come from -- in the Knet case it's clear, cudnnFindConvolutionForwardAlgorithm, but that function isn't called anywhere in CuArrays/Flux.
External allocations should be handled a lot better now.
When training a ResNet using either Flux or Knet, I am encountering runtime errors due to the GPU running out of memory. Typical errors include:
with Flux and
with Knet (full stack traces are available below). The only way I found to eliminate these errors was to insert explicit periodic calls to the garbage collector in my code using GC.gc(). My question is the following: why are these calls necessary, and why isn't CuArrays calling the GC automatically when running out of GPU memory?
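For concreteness, the workaround looks roughly like the sketch below; loss, model, and batches stand in for my actual training setup, and the 10-second throttle is an arbitrary choice.

```julia
using Flux

# Placeholder setup: `loss`, `model`, and `batches` stand in for my real code.
opt = ADAM()
gc_cb = Flux.throttle(() -> GC.gc(), 10)  # run a full GC sweep at most every 10 s

for epoch in 1:10
    # The explicit GC call in the callback is the one I would like to avoid.
    Flux.train!(loss, Flux.params(model), batches, opt; cb = gc_cb)
end
```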
Also, note that after inserting explicit calls to the garbage collector, I am still encountering the seemingly common issue (https://github.com/JuliaGPU/CuArrays.jl/issues/323, FluxML/Flux.jl#736, https://github.com/JuliaGPU/CuArrays.jl/issues/273) where training slows down considerably after a few training epochs (~4x performance hit in my case).
Details
The two runtime errors below arise after a few seconds, which corresponds to evaluating a ResNet (10 convolutional layers, 500K parameters) on about 40,000 samples (with minibatches of size 64). A quick back-of-the-envelope calculation shows that running inference on a single sample should allocate at most 0.2 MB, so this roughly corresponds to the time it should take to fill my 8 GB of GPU memory if the GC is never called.
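For reference, the arithmetic behind that estimate (the 0.2 MB-per-sample figure is my own rough upper bound):

```julia
nsamples         = 40_000     # samples processed before the OOM error
bytes_per_sample = 0.2e6      # ≈ 0.2 MB allocated per sample (rough upper bound)
total_gb = nsamples * bytes_per_sample / 1e9   # = 8.0, i.e. the whole 8 GB card
```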
I am currently working on providing an easy way to replicate this result. In the meantime, here are two typical stack traces.
Flux stack trace
Knet stack trace