
training and loading dataset stuck #772

Closed
ifmaq1 opened this issue Feb 11, 2021 · 10 comments

ifmaq1 commented Feb 11, 2021

❓ Questions and Help

I am training a movie_mcan model and have also tried training a pythia model. I have grid features of the COCO dataset (extracted with a ResNet-50 architecture) saved on my storage; they total hundreds of GBs. With 2 GPUs, I waited 8 hours for the features to load and for something to print, but the run stayed stuck on loading the datasets and never reached training.

My environment details:

PyTorch version: 1.6.0+cu101

Is debug build: No
CUDA used to build PyTorch: 10.1

OS: Ubuntu 16.04.7 LTS
GCC version: (Ubuntu 5.5.0-12ubuntu1~16.04) 5.5.0 20171010
CMake version: version 3.18.2

Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 10.1.243
GPU models and configuration:
GPU 0: Tesla K80
GPU 1: Tesla K80

Nvidia driver version: 430.64
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.2
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.0.2
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.0.2
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.0.2
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.0.2
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.0.2
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.0.2
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.0.2
/usr/local/cuda-10.1/targets/x86_64-linux/lib/libcudnn.so.7

Versions of relevant libraries:
[pip3] numpy==1.19.2
[pip3] torch==1.6.0+cu101
[pip3] torchtext==0.5.0
[pip3] torchvision==0.7.0+cu101
[conda] blas 1.0 mkl
[conda] cudatoolkit 10.0.130 0
[conda] mkl 2020.2 256
[conda] mkl-service 2.3.0 py36he8ac12f_0
[conda] mkl_fft 1.2.0 py36h23d657b_0
[conda] mkl_random 1.1.1 py36h0573a6f_0
[conda] numpy 1.19.2 py36h54aff64_0
[conda] numpy-base 1.19.2 py36hfa32c7d_0
[conda] torch 1.4.0+cu100 pypi_0 pypi
[conda] torchtext 0.5.0 pypi_0 pypi
[conda] torchvision 0.7.0+cu101 pypi_0 pypi

With 2 GPUs:
[screenshot of the console output]

I am unable to work out how to resolve this issue. I have tried decreasing num_workers, but it didn't help.

hackgoofer (Contributor) commented

Hi! You can try setting logger_level=debug to see what is going on; it will print more useful error messages. This has also happened sporadically on my end, and I have not been able to pinpoint the issue. We will continue to keep an eye out and monitor this.

ifmaq1 (Author) commented Feb 12, 2021

@ytsheng here is the result when I interrupt the training, even with num_workers=0.

CUDA_VISIBLE_DEVICES=0,1 mmf_run config=projects/movie_mcan/configs/vqa2/defaults.yaml \
    model=movie_mcan \
    dataset=vqa2 \
    run_type=train \
    training.num_workers=0 \
    training.logger_level=debug

2021-02-13T00:38:14 | mmf.utils.configuration: Overriding option config to projects/movie_mcan/configs/vqa2/defaults.yaml
2021-02-13T00:38:14 | mmf.utils.configuration: Overriding option model to movie_mcan
2021-02-13T00:38:14 | mmf.utils.configuration: Overriding option datasets to vqa2
2021-02-13T00:38:14 | mmf.utils.configuration: Overriding option run_type to train
2021-02-13T00:38:14 | mmf.utils.configuration: Overriding option training.num_workers to 0
2021-02-13T00:38:14 | mmf.utils.configuration: Overriding option training.logger_level to debug
2021-02-13T00:38:16 | mmf.utils.distributed: Distributed Init (Rank 0): tcp://localhost:19627
2021-02-13T00:38:17 | mmf.utils.distributed: Distributed Init (Rank 1): tcp://localhost:19627
2021-02-13T00:38:17 | mmf.utils.distributed: Initialized Host 127.0.0.1localhost.localdomainlocalhost as Rank 1
2021-02-13T00:38:17 | mmf.utils.distributed: Initialized Host 127.0.0.1localhost.localdomainlocalhost as Rank 0
2021-02-13T00:38:19 | mmf: Logging to: ./save/train.log
2021-02-13T00:38:19 | mmf_cli.run: Namespace(config_override=None, local_rank=None, opts=['config=projects/movie_mcan/configs/vqa2/defaults.yaml', 'model=movie_mcan', 'dataset=vqa2', 'run_type=train', 'training.num_workers=0', 'training.logger_level=debug'])
2021-02-13T00:38:19 | mmf_cli.run: Torch version: 1.6.0+cu101
2021-02-13T00:38:19 | mmf.utils.general: CUDA Device 0 is: Tesla K80
2021-02-13T00:38:19 | mmf_cli.run: Using seed 19127618

2021-02-13T00:38:19 | mmf.trainers.mmf_trainer: Loading datasets

On interruption:

^CTraceback (most recent call last):
File "/home/anaconda3/envs/mmf/bin/mmf_run", line 33, in
sys.exit(load_entry_point('mmf', 'console_scripts', 'mmf_run')())
File "/home/new-mmf/mmf-master/mmf_cli/run.py", line 118, in run
nprocs=config.distributed.world_size,
File "/home/anaconda3/envs/mmf/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/anaconda3/envs/mmf/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
while not context.join():
File "/home/anaconda3/envs/mmf/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 78, in join
timeout=timeout,
File "/home/anaconda3/envs/mmf/lib/python3.6/multiprocessing/connection.py", line 911, in wait
ready = selector.select(timeout)
File "/home/anaconda3/envs/mmf/lib/python3.6/selectors.py", line 376, in select
fd_event_list = self._poll.poll(timeout)
KeyboardInterrupt
^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
File "/home/anaconda3/envs/mmf/lib/python3.6/multiprocessing/popen_fork.py", line 28, in poll
pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt

apsdehal (Contributor) commented

Hi @ifrahmaqsood!

We will need some more debugging information from you. Have you tried running another framework on these machines? For example, can you try running PyTorch's ImageNet example to see if it works for you? https://github.com/pytorch/examples/blob/master/imagenet/main.py
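
For reference, a single-node multi-GPU launch of that example looks roughly like this (the port and the ImageNet folder path below are placeholders; see that example's README for the authoritative flags):

python main.py -a resnet18 \
    --dist-url 'tcp://127.0.0.1:23456' \
    --dist-backend 'nccl' \
    --multiprocessing-distributed \
    --world-size 1 --rank 0 \
    /path/to/imagenet

If that also hangs with 2 GPUs, the problem is in the PyTorch/NCCL setup rather than in MMF.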

ifmaq1 (Author) commented Feb 14, 2021

@apsdehal I tried training https://github.com/facebookresearch/grid-feats-vqa with detectron2. With 2 GPUs, this model is also not working. What could be the possible reason? Is there some problem with PyTorch?

The training command:
/home/anaconda3/envs/detectron2/bin/python train_net.py --num-gpus 2 --config-file configs/R-50-grid.yaml

It just displays:
Command Line Args: Namespace(config_file='configs/R-50-grid.yaml', dist_url='tcp://127.0.0.1:50170', eval_only=False, machine_rank=0, num_gpus=2, num_machines=1, opts=[], resume=False)

and it looks like the system is stuck here, since with 1 GPU it starts training.

ifmaq1 (Author) commented Feb 16, 2021

@apsdehal Any solution?

vedanuj (Contributor) commented Feb 16, 2021

@ifrahmaqsood Which version of NCCL do you have? I suspect this is an NCCL communication error.

Since it is happening for you on other frameworks as well, we recommend raising the issue on the PyTorch forums/issue tracker, as it is not specific to MMF.
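
One thing that can help narrow it down: NCCL prints its own diagnostics if you set NCCL_DEBUG=INFO in the environment when launching, for example (same command as above, shortened here):

NCCL_DEBUG=INFO CUDA_VISIBLE_DEVICES=0,1 mmf_run config=projects/movie_mcan/configs/vqa2/defaults.yaml model=movie_mcan dataset=vqa2 run_type=train

That output usually shows whether the two ranks manage to establish communication before the hang.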

ifmaq1 (Author) commented Feb 16, 2021

@vedanuj

>>> torch.cuda.nccl.version()
2408

(i.e., NCCL 2.4.8)

apsdehal (Contributor) commented

@ifrahmaqsood as @vedanuj suggested, I also think you should raise this on the PyTorch forums/issue tracker. I would suggest running the PyTorch ImageNet example I mentioned before to confirm, and then post your findings either on PyTorch issues or the forums. This is no longer related to MMF but rather to your PyTorch setup. As far as I can tell, the dist.barrier call is getting stuck.
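
To isolate that step, here is a minimal standalone sketch (independent of MMF; the file name, port, and world size are arbitrary choices for illustration) that reproduces just the init_process_group + barrier pattern from the log above. If it hangs with backend="nccl" but passes with backend="gloo", the problem is in the NCCL setup on the machine.

# nccl_barrier_check.py -- minimal sketch, independent of MMF, to test whether
# a bare dist.barrier() hangs with the NCCL backend on a 2-GPU machine.
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank, world_size):
    # One process per GPU with a TCP rendezvous, similar to the
    # "Distributed Init ... tcp://localhost:..." lines in the log above.
    dist.init_process_group(
        backend="nccl",                       # change to "gloo" to compare
        init_method="tcp://127.0.0.1:23456",  # any free port
        world_size=world_size,
        rank=rank,
    )
    torch.cuda.set_device(rank)
    print(f"rank {rank}: entering barrier", flush=True)
    dist.barrier()  # this is the call that appears to hang in the trainer
    print(f"rank {rank}: passed barrier", flush=True)
    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2  # one process per GPU
    mp.spawn(worker, args=(world_size,), nprocs=world_size, join=True)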

ifmaq1 (Author) commented Mar 12, 2021

Thank you for the responses. Yes, I had to change the backend from nccl to gloo, and it worked.

ifmaq1 closed this as completed Mar 12, 2021
g-luo commented Apr 15, 2021

I just wanted to add that I was having the same problem, and changing the backend from nccl to gloo worked, i.e. adding to the config:

distributed:
   backend: gloo

(for anyone who takes a look at this thread)
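
Since MMF accepts dotted config overrides on the command line (as with training.num_workers above), the same setting can presumably also be passed directly, e.g.

mmf_run config=projects/movie_mcan/configs/vqa2/defaults.yaml model=movie_mcan dataset=vqa2 run_type=train distributed.backend=gloo

(the distributed.backend override key here is inferred from the YAML snippet above, not verified against the MMF docs). Note that gloo is generally slower than nccl for multi-GPU training, so this is more of a workaround than a fix.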
