Protect from memory issues #27

Open
vlimant opened this issue Sep 27, 2019 · 2 comments

Comments

@vlimant
Owner

vlimant commented Sep 27, 2019

Hard to say how to solve this, or whether it has to be solved at all, but running

mpirun --prefix /opt/openmpi-3.1.0 -np 7 --tag-output singularity exec --nv -B /imdata/ -B /storage/ /storage/group/gpu/software/singularity/ibanks/edge.simg python3 OptimizationDriver.py --loss categorical_crossentropy --epochs 1000 --batch 200 --model examples/example_jedi_torch.py --early "val_loss,~<,4" --checkpoint jedi-3 --mode gem --worker-optimizer adam --cache /imdata/ --block-size 3

I get

[1,2]:Traceback (most recent call last):
[1,2]:  File "OptimizationDriver.py", line 325, in <module>
[1,2]:    block.run()
[1,2]:  File "/nfshome/vlimant/NNLO/nnlo/optimize/process_block.py", line 157, in run
[1,2]:    fom = self.train_model()
[1,2]:  File "/nfshome/vlimant/NNLO/nnlo/optimize/process_block.py", line 127, in train_model
[1,2]:    checkpoint_interval=self.checkpoint_interval)
[1,2]:  File "/nfshome/vlimant/NNLO/nnlo/mpi/manager.py", line 420, in __init__
[1,2]:    checkpoint=checkpoint, checkpoint_interval=checkpoint_interval)
[1,2]:  File "/nfshome/vlimant/NNLO/nnlo/mpi/manager.py", line 165, in __init__
[1,2]:    self.make_comms(comm)
[1,2]:  File "/nfshome/vlimant/NNLO/nnlo/mpi/manager.py", line 266, in make_comms
[1,2]:    checkpoint=self.checkpoint, checkpoint_interval=self.checkpoint_interval
[1,2]:  File "/nfshome/vlimant/NNLO/nnlo/mpi/process.py", line 493, in __init__
[1,2]:    checkpoint=checkpoint, checkpoint_interval=checkpoint_interval )
[1,2]:  File "/nfshome/vlimant/NNLO/nnlo/mpi/process.py", line 121, in __init__
[1,2]:    self.train()
[1,2]:  File "/nfshome/vlimant/NNLO/nnlo/mpi/process.py", line 556, in train
[1,2]:    train_metrics = self.model.train_on_batch( x=batch[0], y=batch[1] )
[1,2]:  File "/nfshome/vlimant/NNLO/nnlo/train/model.py", line 314, in train_on_batch
[1,2]:    pred = self.model.forward(x)
[1,2]:  File "/nfshome/vlimant/NNLO/examples/example_jedi_torch.py", line 74, in forward
[1,2]:    E = torch.transpose(E, 1, 2).contiguous()
[1,2]:RuntimeError: CUDA out of memory. Tried to allocate 1.67 GiB (GPU 0; 7.93 GiB total capacity; 4.98 GiB already allocated; 1.09 GiB free; 1.33 GiB cached)
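
One way to at least protect the worker from the crash itself (a minimal sketch only, not NNLO's actual code; the safe_train_on_batch helper and its arguments are hypothetical) would be to catch the CUDA out-of-memory RuntimeError around the training step, drop the allocator's cached blocks, and skip the offending batch instead of killing the whole MPI process:

import torch

def safe_train_on_batch(model, x, y):
    # Hypothetical wrapper around a train_on_batch-style call that survives
    # a CUDA out-of-memory error by skipping the batch instead of crashing.
    try:
        return model.train_on_batch(x=x, y=y)
    except RuntimeError as err:
        if 'out of memory' not in str(err):
            raise
        # Release unused cached blocks held by the PyTorch caching allocator
        # so the next batch has a chance to fit.
        torch.cuda.empty_cache()
        return None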

@vloncar
Collaborator

vloncar commented Sep 27, 2019

Maybe we can avoid this by making sure that no two (or more) of our processes end up on the same GPU.
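
A sketch of that idea (assuming mpi4py and PyTorch; NNLO's actual device assignment may look different): give each rank on a node its own device index and pin it before any CUDA work happens:

from mpi4py import MPI
import torch

comm = MPI.COMM_WORLD
# Ranks running on the same node get consecutive indices in this communicator.
local_comm = comm.Split_type(MPI.COMM_TYPE_SHARED)
local_rank = local_comm.Get_rank()

if torch.cuda.is_available():
    # One GPU per local rank; wrap around only if the node is oversubscribed.
    torch.cuda.set_device(local_rank % torch.cuda.device_count())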

@vlimant
Owner Author

vlimant commented Oct 7, 2019

Looks like it's the memory that is not being freed:
https://pytorch.org/docs/stable/cuda.html#torch.cuda.empty_cache
Trying ff921eb, although the previous models might also have to be explicitly deleted. Hard to debug without enough GPUs available.
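
For reference, a minimal sketch of that clean-up between blocks (the model variable below is a placeholder, not NNLO's actual attribute): empty_cache only releases cached blocks that are no longer referenced, so the old model has to be dropped first:

import gc
import torch
import torch.nn as nn

# Stand-in for the model of a finished block (placeholder, not NNLO's code).
model = nn.Linear(10, 10)
if torch.cuda.is_available():
    model = model.cuda()

# ... training of this block is done ...
del model                 # drop the last Python reference to the old model
gc.collect()              # ensure its parameter tensors are actually collected
torch.cuda.empty_cache()  # return the freed, cached blocks to the CUDA driver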
