Protect from memory issues #27

Open
vlimant opened this issue Sep 27, 2019 · 2 comments

Comments

@vlimant
Owner

vlimant commented Sep 27, 2019

Hard to say how to solve this, or whether it has to be solved at all, but running

mpirun --prefix /opt/openmpi-3.1.0 -np 7 --tag-output singularity exec --nv -B /imdata/ -B /storage/ /storage/group/gpu/software/singularity/ibanks/edge.simg python3 OptimizationDriver.py --loss categorical_crossentropy --epochs 1000 --batch 200 --model examples/example_jedi_torch.py --early "val_loss,~<,4" --checkpoint jedi-3 --mode gem --worker-optimizer adam --cache /imdata/ --block-size 3

I get

[1,2]:Traceback (most recent call last):
[1,2]:  File "OptimizationDriver.py", line 325, in <module>
[1,2]:    block.run()
[1,2]:  File "/nfshome/vlimant/NNLO/nnlo/optimize/process_block.py", line 157, in run
[1,2]:    fom = self.train_model()
[1,2]:  File "/nfshome/vlimant/NNLO/nnlo/optimize/process_block.py", line 127, in train_model
[1,2]:    checkpoint_interval=self.checkpoint_interval)
[1,2]:  File "/nfshome/vlimant/NNLO/nnlo/mpi/manager.py", line 420, in __init__
[1,2]:    checkpoint=checkpoint, checkpoint_interval=checkpoint_interval)
[1,2]:  File "/nfshome/vlimant/NNLO/nnlo/mpi/manager.py", line 165, in __init__
[1,2]:    self.make_comms(comm)
[1,2]:  File "/nfshome/vlimant/NNLO/nnlo/mpi/manager.py", line 266, in make_comms
[1,2]:    checkpoint=self.checkpoint, checkpoint_interval=self.checkpoint_interval
[1,2]:  File "/nfshome/vlimant/NNLO/nnlo/mpi/process.py", line 493, in __init__
[1,2]:    checkpoint=checkpoint, checkpoint_interval=checkpoint_interval )
[1,2]:  File "/nfshome/vlimant/NNLO/nnlo/mpi/process.py", line 121, in __init__
[1,2]:    self.train()
[1,2]:  File "/nfshome/vlimant/NNLO/nnlo/mpi/process.py", line 556, in train
[1,2]:    train_metrics = self.model.train_on_batch( x=batch[0], y=batch[1] )
[1,2]:  File "/nfshome/vlimant/NNLO/nnlo/train/model.py", line 314, in train_on_batch
[1,2]:    pred = self.model.forward(x)
[1,2]:  File "/nfshome/vlimant/NNLO/examples/example_jedi_torch.py", line 74, in forward
[1,2]:    E = torch.transpose(E, 1, 2).contiguous()
[1,2]:RuntimeError: CUDA out of memory. Tried to allocate 1.67 GiB (GPU 0; 7.93 GiB total capacity; 4.98 GiB already allocated; 1.09 GiB free; 1.33 GiB cached)
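
One way to at least protect the worker from the crash itself (a minimal sketch only, not NNLO's actual code; the safe_train_on_batch helper and its arguments are hypothetical) would be to catch the CUDA out-of-memory RuntimeError around the training step, drop the allocator's cached blocks, and skip the offending batch instead of killing the whole MPI process:

import torch

def safe_train_on_batch(model, x, y):
    # Hypothetical wrapper around a train_on_batch-style call that survives
    # a CUDA out-of-memory error by skipping the batch instead of crashing.
    try:
        return model.train_on_batch(x=x, y=y)
    except RuntimeError as err:
        if 'out of memory' not in str(err):
            raise
        # Release unused cached blocks held by the PyTorch caching allocator
        # so the next batch has a chance to fit.
        torch.cuda.empty_cache()
        return None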

@vloncar
Collaborator

vloncar commented Sep 27, 2019

Maybe we can avoid this by making sure that no two (or more) of our processes end up on the same GPU.
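
A sketch of that idea (assuming mpi4py and PyTorch; NNLO's actual device assignment may look different): give each rank on a node its own device index and pin it before any CUDA work happens:

from mpi4py import MPI
import torch

comm = MPI.COMM_WORLD
# Ranks running on the same node get consecutive indices in this communicator.
local_comm = comm.Split_type(MPI.COMM_TYPE_SHARED)
local_rank = local_comm.Get_rank()

if torch.cuda.is_available():
    # One GPU per local rank; wrap around only if the node is oversubscribed.
    torch.cuda.set_device(local_rank % torch.cuda.device_count())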

@vlimant
Owner Author

vlimant commented Oct 7, 2019

Looks like it's the memory that is not being freed:
https://pytorch.org/docs/stable/cuda.html#torch.cuda.empty_cache
Trying ff921eb, although the previous models might also have to be explicitly deleted. Hard to debug without enough GPUs available.
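
For reference, a minimal sketch of that clean-up between blocks (the model variable below is a placeholder, not NNLO's actual attribute): empty_cache only releases cached blocks that are no longer referenced, so the old model has to be dropped first:

import gc
import torch
import torch.nn as nn

# Stand-in for the model of a finished block (placeholder, not NNLO's code).
model = nn.Linear(10, 10)
if torch.cuda.is_available():
    model = model.cuda()

# ... training of this block is done ...
del model                 # drop the last Python reference to the old model
gc.collect()              # ensure its parameter tensors are actually collected
torch.cuda.empty_cache()  # return the freed, cached blocks to the CUDA driver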
