
training and loading dataset stuck #772

Closed
ifmaq1 opened this issue Feb 11, 2021 · 10 comments

ifmaq1 commented Feb 11, 2021

❓ Questions and Help

I am training a movie_mcan model and have also tried training a pythia model. I have grid features of the COCO dataset (extracted with a ResNet-50 architecture) saved on my storage; they total hundreds of GBs. With 2 GPUs, I waited 8 hours for the features to load and for something to print, but the run stayed stuck on loading the datasets and never reached training.

My environment details:

PyTorch version: 1.6.0+cu101

Is debug build: No
CUDA used to build PyTorch: 10.1

OS: Ubuntu 16.04.7 LTS
GCC version: (Ubuntu 5.5.0-12ubuntu1~16.04) 5.5.0 20171010
CMake version: version 3.18.2

Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 10.1.243
GPU models and configuration:
GPU 0: Tesla K80
GPU 1: Tesla K80

Nvidia driver version: 430.64
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.2
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.0.2
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.0.2
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.0.2
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.0.2
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.0.2
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.0.2
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.0.2
/usr/local/cuda-10.1/targets/x86_64-linux/lib/libcudnn.so.7

Versions of relevant libraries:
[pip3] numpy==1.19.2
[pip3] torch==1.6.0+cu101
[pip3] torchtext==0.5.0
[pip3] torchvision==0.7.0+cu101
[conda] blas 1.0 mkl
[conda] cudatoolkit 10.0.130 0
[conda] mkl 2020.2 256
[conda] mkl-service 2.3.0 py36he8ac12f_0
[conda] mkl_fft 1.2.0 py36h23d657b_0
[conda] mkl_random 1.1.1 py36h0573a6f_0
[conda] numpy 1.19.2 py36h54aff64_0
[conda] numpy-base 1.19.2 py36hfa32c7d_0
[conda] torch 1.4.0+cu100 pypi_0 pypi
[conda] torchtext 0.5.0 pypi_0 pypi
[conda] torchvision 0.7.0+cu101 pypi_0 pypi

With 2 GPUs:
[screenshot of the console output]

I am unable to work out how to resolve this issue. I have tried decreasing num_workers, but it didn't help.

hackgoofer (Contributor) commented

Hi! You can try setting logger_level=debug to see what is going on; it will print more useful error messages. This has also happened sporadically on my end, and I have not been able to pinpoint the issue. We will continue to keep an eye out and monitor this.

ifmaq1 (Author) commented Feb 12, 2021

@ytsheng here is the result when I interrupt the training, even with num_workers=0.

CUDA_VISIBLE_DEVICES=0,1 mmf_run config=projects/movie_mcan/configs/vqa2/defaults.yaml \
    model=movie_mcan \
    dataset=vqa2 \
    run_type=train \
    training.num_workers=0 \
    training.logger_level=debug

2021-02-13T00:38:14 | mmf.utils.configuration: Overriding option config to projects/movie_mcan/configs/vqa2/defaults.yaml
2021-02-13T00:38:14 | mmf.utils.configuration: Overriding option model to movie_mcan
2021-02-13T00:38:14 | mmf.utils.configuration: Overriding option datasets to vqa2
2021-02-13T00:38:14 | mmf.utils.configuration: Overriding option run_type to train
2021-02-13T00:38:14 | mmf.utils.configuration: Overriding option training.num_workers to 0
2021-02-13T00:38:14 | mmf.utils.configuration: Overriding option training.logger_level to debug
2021-02-13T00:38:16 | mmf.utils.distributed: Distributed Init (Rank 0): tcp://localhost:19627
2021-02-13T00:38:17 | mmf.utils.distributed: Distributed Init (Rank 1): tcp://localhost:19627
2021-02-13T00:38:17 | mmf.utils.distributed: Initialized Host 127.0.0.1localhost.localdomainlocalhost as Rank 1
2021-02-13T00:38:17 | mmf.utils.distributed: Initialized Host 127.0.0.1localhost.localdomainlocalhost as Rank 0
2021-02-13T00:38:19 | mmf: Logging to: ./save/train.log
2021-02-13T00:38:19 | mmf_cli.run: Namespace(config_override=None, local_rank=None, opts=['config=projects/movie_mcan/configs/vqa2/defaults.yaml', 'model=movie_mcan', 'dataset=vqa2', 'run_type=train', 'training.num_workers=0', 'training.logger_level=debug'])
2021-02-13T00:38:19 | mmf_cli.run: Torch version: 1.6.0+cu101
2021-02-13T00:38:19 | mmf.utils.general: CUDA Device 0 is: Tesla K80
2021-02-13T00:38:19 | mmf_cli.run: Using seed 19127618

2021-02-13T00:38:19 | mmf.trainers.mmf_trainer: Loading datasets

On interruption:

^CTraceback (most recent call last):
File "/home/anaconda3/envs/mmf/bin/mmf_run", line 33, in
sys.exit(load_entry_point('mmf', 'console_scripts', 'mmf_run')())
File "/home/new-mmf/mmf-master/mmf_cli/run.py", line 118, in run
nprocs=config.distributed.world_size,
File "/home/anaconda3/envs/mmf/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/anaconda3/envs/mmf/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
while not context.join():
File "/home/anaconda3/envs/mmf/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 78, in join
timeout=timeout,
File "/home/anaconda3/envs/mmf/lib/python3.6/multiprocessing/connection.py", line 911, in wait
ready = selector.select(timeout)
File "/home/anaconda3/envs/mmf/lib/python3.6/selectors.py", line 376, in select
fd_event_list = self._poll.poll(timeout)
KeyboardInterrupt
^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
File "/home/anaconda3/envs/mmf/lib/python3.6/multiprocessing/popen_fork.py", line 28, in poll
pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt

apsdehal (Contributor) commented

Hi @ifrahmaqsood!

We will need some more debugging information from you. Have you tried running another framework on these machines? For example, can you try running PyTorch's ImageNet example to see if it works for you? https://github.com/pytorch/examples/blob/master/imagenet/main.py
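
For reference, a single-node multi-GPU launch of that example looks roughly like this (the port and the ImageNet folder path below are placeholders; see that example's README for the authoritative flags):

python main.py -a resnet18 \
    --dist-url 'tcp://127.0.0.1:23456' \
    --dist-backend 'nccl' \
    --multiprocessing-distributed \
    --world-size 1 --rank 0 \
    /path/to/imagenet

If that also hangs with 2 GPUs, the problem is in the PyTorch/NCCL setup rather than in MMF.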

ifmaq1 (Author) commented Feb 14, 2021

@apsdehal I tried training https://github.com/facebookresearch/grid-feats-vqa with detectron2. With 2 GPUs, this model is also not working. What could be the possible reason? Is there some problem with PyTorch?

The training command:
/home/anaconda3/envs/detectron2/bin/python train_net.py --num-gpus 2 --config-file configs/R-50-grid.yaml

It just displays:
Command Line Args: Namespace(config_file='configs/R-50-grid.yaml', dist_url='tcp://127.0.0.1:50170', eval_only=False, machine_rank=0, num_gpus=2, num_machines=1, opts=[], resume=False)

and it looks like the system is stuck here, since with 1 GPU it starts training.

ifmaq1 (Author) commented Feb 16, 2021

@apsdehal Any solution?

vedanuj (Contributor) commented Feb 16, 2021

@ifrahmaqsood Which version of NCCL do you have? I suspect this is an NCCL communication error.

Since it is happening for you on other frameworks as well, we recommend raising the issue on the PyTorch forums/issue tracker, as it is not specific to MMF.
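
One thing that can help narrow it down: NCCL prints its own diagnostics if you set NCCL_DEBUG=INFO in the environment when launching, for example (same command as above, shortened here):

NCCL_DEBUG=INFO CUDA_VISIBLE_DEVICES=0,1 mmf_run config=projects/movie_mcan/configs/vqa2/defaults.yaml model=movie_mcan dataset=vqa2 run_type=train

That output usually shows whether the two ranks manage to establish communication before the hang.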

ifmaq1 (Author) commented Feb 16, 2021

@vedanuj

>>> torch.cuda.nccl.version()
2408

(i.e., NCCL 2.4.8)

apsdehal (Contributor) commented

@ifrahmaqsood as @vedanuj suggested, I also think you should raise this on the PyTorch forums/issue tracker. I would suggest running the PyTorch ImageNet example I mentioned before to confirm, and then post your findings either on PyTorch issues or the forums. This is no longer related to MMF but rather to your PyTorch setup. As far as I can tell, the dist.barrier call is getting stuck.
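
To isolate that step, here is a minimal standalone sketch (independent of MMF; the file name, port, and world size are arbitrary choices for illustration) that reproduces just the init_process_group + barrier pattern from the log above. If it hangs with backend="nccl" but passes with backend="gloo", the problem is in the NCCL setup on the machine.

# nccl_barrier_check.py -- minimal sketch, independent of MMF, to test whether
# a bare dist.barrier() hangs with the NCCL backend on a 2-GPU machine.
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank, world_size):
    # One process per GPU with a TCP rendezvous, similar to the
    # "Distributed Init ... tcp://localhost:..." lines in the log above.
    dist.init_process_group(
        backend="nccl",                       # change to "gloo" to compare
        init_method="tcp://127.0.0.1:23456",  # any free port
        world_size=world_size,
        rank=rank,
    )
    torch.cuda.set_device(rank)
    print(f"rank {rank}: entering barrier", flush=True)
    dist.barrier()  # this is the call that appears to hang in the trainer
    print(f"rank {rank}: passed barrier", flush=True)
    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2  # one process per GPU
    mp.spawn(worker, args=(world_size,), nprocs=world_size, join=True)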

ifmaq1 (Author) commented Mar 12, 2021

Thank you for the responses. Yes, I had to change the backend from nccl to gloo, and it worked.

ifmaq1 closed this as completed Mar 12, 2021
g-luo commented Apr 15, 2021

I just wanted to add that I was having the same problem, and changing the backend from nccl to gloo worked, i.e. adding to the config:

distributed:
   backend: gloo

(for anyone who takes a look at this thread)
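
Since MMF accepts dotted config overrides on the command line (as with training.num_workers above), the same setting can presumably also be passed directly, e.g.

mmf_run config=projects/movie_mcan/configs/vqa2/defaults.yaml model=movie_mcan dataset=vqa2 run_type=train distributed.backend=gloo

(the distributed.backend override key here is inferred from the YAML snippet above, not verified against the MMF docs). Note that gloo is generally slower than nccl for multi-GPU training, so this is more of a workaround than a fix.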
