
SyntaxNet fails with CUDA out of memory #173

Closed
orionr opened this issue Jun 2, 2016 · 16 comments
Assignees
Labels
stat:awaiting response Waiting on input from the contributor

Comments

@orionr
Contributor

orionr commented Jun 2, 2016

SyntaxNet

I'm running on Ubuntu 16.04 with TensorFlow and models both built from the git master branches. Most of the models are working for me, but SyntaxNet fails with a CUDA out-of-memory error even though the card has 8 GB total and nothing else is using those resources. Note that I'm on CUDA 8.0 RC here, but I doubt it makes a difference.

Output is as follows:

~/git/models/syntaxnet$ echo 'Bob brought the pizza to Alice.' | syntaxnet/demo.sh
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
...
I syntaxnet/term_frequency_map.cc:101] Loaded 49 terms from syntaxnet/models/parsey_mcparseface/tag-map.
I syntaxnet/term_frequency_map.cc:101] Loaded 64036 terms from syntaxnet/models/parsey_mcparseface/word-map.
I tensorflow/core/common_runtime/gpu/gpu_device.cc:783] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Graphics Device, pci bus id: 0000:01:00.0)
INFO:tensorflow:Building training network with parameters: feature_sizes: [12 20 20] domain_sizes: [   49    51 64038]
E tensorflow/stream_executor/cuda/cuda_driver.cc:965] failed to allocate 6.80G (7304685312 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
E tensorflow/stream_executor/cuda/cuda_driver.cc:965] failed to allocate 6.12G (6574216704 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
E tensorflow/stream_executor/cuda/cuda_driver.cc:965] failed to allocate 5.51G (5916794880 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
E tensorflow/stream_executor/cuda/cuda_driver.cc:965] failed to allocate 4.96G (5325115392 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
E tensorflow/stream_executor/cuda/cuda_driver.cc:965] failed to allocate 4.46G (4792603648 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
E tensorflow/stream_executor/cuda/cuda_driver.cc:965] failed to allocate 4.02G (4313342976 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
E tensorflow/stream_executor/cuda/cuda_driver.cc:965] failed to allocate 3.62G (3882008576 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
E tensorflow/stream_executor/cuda/cuda_driver.cc:965] failed to allocate 3.25G (3493807616 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
...
I syntaxnet/embedding_feature_extractor.cc:35] Features: input.digit input.hyphen; input.prefix(length="2") input(1).prefix(length="2") input(2).prefix(length="2") input(3).prefix(length="2") input(-1).prefix(length="2") input(-2).prefix(length="2") input(-3).prefix(length="2") input(-4).prefix(length="2"); input.prefix(length="3") input(1).prefix(length="3") input(2).prefix(length="3") input(3).prefix(length="3") input(-1).prefix(length="3") input(-2).prefix(length="3") input(-3).prefix(length="3") input(-4).prefix(length="3"); input.suffix(length="2") input(1).suffix(length="2") input(2).suffix(length="2") input(3).suffix(length="2") input(-1).suffix(length="2") input(-2).suffix(length="2") input(-3).suffix(length="2") input(-4).suffix(length="2"); input.suffix(length="3") input(1).suffix(length="3") input(2).suffix(length="3") input(3).suffix(length="3") input(-1).suffix(length="3") input(-2).suffix(length="3") input(-3).suffix(length="3") input(-4).suffix(length="3"); input.token.word input(1).token.word input(2).token.word input(3).token.word input(-1).token.word input(-2).token.word input(-3).token.word input(-4).token.word 
I syntaxnet/embedding_feature_extractor.cc:36] Embedding names: other;prefix2;prefix3;suffix2;suffix3;words
I syntaxnet/embedding_feature_extractor.cc:37] Embedding dims: 8;16;16;16;16;64
I syntaxnet/term_frequency_map.cc:101] Loaded 64036 terms from syntaxnet/models/parsey_mcparseface/word-map.
I syntaxnet/term_frequency_map.cc:101] Loaded 64036 terms from syntaxnet/models/parsey_mcparseface/word-map.
INFO:tensorflow:Total processed documents: 0
INFO:tensorflow:Total processed documents: 0
INFO:tensorflow:Read 0 documents

It also seems odd that SyntaxNet requires the tensorflow submodule, since I've already checked out all of that (including dependencies) and built it in a different location. It would be nice if that weren't needed, but it's not a big deal.

Any thoughts out there? Much appreciated.

@s0okiym

s0okiym commented Jun 8, 2016

I hit the same error, cuda_driver.cc:965 CUDA_ERROR_OUT_OF_MEMORY, when running the distributed MNIST code.

@orionr orionr changed the title SyntaxNet fails with CUDA out of memory on GTX 1080 SyntaxNet fails with CUDA out of memory Jun 9, 2016
@orionr
Contributor Author

orionr commented Jun 9, 2016

Removed "GTX 1080" from the title, since this might be experienced with other cards.

@calberti
Contributor

@orionr were you able to make any progress on this? I don't have much experience running SyntaxNet on different GPUs, but if you figured out a solution that might be useful to others.

@borisstock

This issue can be fixed by configuring the tf.Session with the following:

config.gpu_options.allow_growth = True

This seems to fix the problem for me!
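For context, here is a minimal sketch of where that option lives in the TF1-style graph-mode API (assuming TensorFlow 1.x; the config object has to be passed in when the session is created, setting it afterwards has no effect):

```python
import tensorflow as tf

# Build a session config whose GPU allocator grows on demand instead of
# reserving (nearly) all device memory up front at session creation.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

# The option only takes effect for sessions created with this config:
sess = tf.Session(config=config)
```

Note that allow_growth trades a one-time up-front reservation for incremental allocations, which can fragment device memory over a long run; it mainly helps when the initial near-full-device grab is what fails.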

@zheng-xq

Does the program continue in spite of the errors? I think the errors shown here are harmless.

TensorFlow has its own BFC allocator. It asks the CUDA driver for a large chunk of memory and suballocates from it, doubling the size of each successive request as it grows. When a request fails, it backpedals, asking for progressively smaller amounts, and eventually settles on the largest allocation it can successfully get.

This would only be fatal if the model actually needs more memory than the largest chunk the allocator managed to get; normally the program would terminate itself at that point.

If you are really running out of memory, you can try to reduce the batch_size. Note that many of these models were developed on GPUs with 12 GB of memory. If a model runs out of memory on a GPU with less, reducing the batch size could be the way to go.
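The back-off described above is visible in the log at the top of the thread: each failed request is retried at roughly 90% of the previous size, rounded down to a 256-byte boundary. A small sketch that reproduces the logged byte counts (my reconstruction from the log lines, not TensorFlow's actual allocator code):

```python
def backoff_sizes(first_request, attempts):
    """Reproduce the shrinking allocation requests seen in the log.

    Each retry asks for 90% of the previous request, rounded down to a
    256-byte boundary (both factors inferred from the logged byte counts).
    """
    sizes = [first_request]
    for _ in range(attempts - 1):
        smaller = sizes[-1] * 9 // 10          # back off to ~90% of the last try
        sizes.append(smaller // 256 * 256)     # align down to 256 bytes
    return sizes

# The first failed request in the log was 7304685312 bytes (6.80G):
print(backoff_sizes(7304685312, 4))
# [7304685312, 6574216704, 5916794880, 5325115392]
```

Those four values match the first four CUDA_ERROR_OUT_OF_MEMORY lines in the log (6.80G, 6.12G, 5.51G, 4.96G), which is why the errors are noisy but usually harmless: the allocator is just probing for the largest chunk it can get.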

@borisstock

borisstock commented Jun 28, 2016

In my case the program did not continue. It crashed when it tried to allocate more than the 12 GB of my Titan X. I think there is an error somewhere: it decides it has run out of memory and then tries to allocate more and more. Somehow the "allow_growth" option fixed it for me (CUDA 7.5, cuDNN 5 on OS X). And I'm pretty sure 12 GB is more than enough for simply running the "demo.sh" script of Parsey McParseface.

@orionr
Contributor Author

orionr commented Jun 28, 2016

Thanks Boris. I don't have access to the machine until next week but I'll try it then.


@aselle aselle removed the triaged label Jul 28, 2016
@gunan gunan added the stat:awaiting response Waiting on input from the contributor label Aug 15, 2016
@todtom

todtom commented Aug 21, 2016

@orionr Hi, I have built SyntaxNet successfully, but it seems to run on the CPU rather than the GPU. Could you tell me how to make it work on the GPU?

@orionr
Contributor Author

orionr commented Aug 25, 2016

As a note, after updating the tensorflow and models git repos and downgrading bazel to 0.2.2b, everything works perfectly!

~/git/models/syntaxnet$ echo 'Bob brought the pizza to Alice.' | syntaxnet/demo.sh
Input: Bob brought the pizza to Alice .
Parse:
brought VBD ROOT
 +-- Bob NNP nsubj
 +-- pizza NN dobj
 |   +-- the DT det
 +-- to IN prep
 |   +-- Alice NNP pobj
 +-- . . punct

@todtom - You'll want to run ./configure inside the models/syntaxnet/tensorflow/ directory. Also make sure you have an NVIDIA card with modern CUDA capabilities. Good luck.

@Shnurre

Shnurre commented Sep 8, 2016

I am having the same error as is described by @orionr in the thread post.

I have Ubuntu 15.10, CUDA 7.5, and cuDNN 4.0.7, and I was trying to build SyntaxNet from the up-to-date models git repo with bazel 0.2.2b, as described in #248 by @David-Ba. I also tried various other versions of bazel, as well as cuDNN 5, but got the same error.
It should also be noted that SyntaxNet without GPU support builds correctly on my machine and works as intended.

It appears I did not manage to implement the solution proposed here by @borisstock successfully. I added config.gpu_options.allow_growth = True to all the files containing other modifications of config.gpu_options: tensorflow/tensorflow/python/framework/test_util.py, tensorflow/tensorflow/python/kernel_tests/sparse_xent_op_test.py, and tensorflow/tensorflow/python/kernel_tests/sparse_tensor_dense_matmul_op_test.py. It seems, though, that I missed something essential.

Could @orionr, @borisstock, or anyone else who managed to solve this problem please specify where exactly config.gpu_options.allow_growth = True should be added?

@orionr
Contributor Author

orionr commented Sep 8, 2016

I actually didn't need to use allow_growth = True after updating all of the git repos and downgrading bazel. @Shnurre - what GPU are you using? Also make sure you do a bazel clean before the rebuild. I even removed my _python_build directory inside tensorflow and recreated it each time just to be safe.

@Shnurre

Shnurre commented Sep 8, 2016

@orionr , thank you for your quick response.
I have a GTX 970, though I don't think this error is card-specific.

Yes, I always perform bazel clean before rebuilding. I also tried removing and downloading a fresh models repo, manually removing .cache/bazel, and completely reinstalling several versions of bazel, but nothing has worked for me so far.

@Shnurre

Shnurre commented Sep 15, 2016

@borisstock, @calberti, @orionr, I am not sure if you are the right people to ask (if not, I'm sorry for disturbing you), but should I reopen this issue or perhaps open a new one?
I am having exactly the same problem @orionr described here, but changing the bazel version and updating the repos didn't help me.
I am still hoping that @borisstock, or anyone else who successfully implemented his solution, can clarify where it goes.

@utkrist

utkrist commented Mar 24, 2017

In models/syntaxnet/syntaxnet/parser_eval.py, I made this change and it worked:

gpu_opt = tf.GPUOptions(allow_growth=True)
with tf.Session(config=tf.ConfigProto(gpu_options=gpu_opt)) as sess:
    Eval(sess)

@irfan-zoefit

I'm having the same issue and don't know where to put the value:

config.gpu_options.allow_growth = True

Could you specify the file?

@zerodarkzone

Hi,
I keep getting the CUDA_OUT_OF_MEMORY error. I already tried the fix proposed here, but it doesn't work. I compiled with bazel 0.5.4.
