training and loading dataset stuck #772
Hi! You can try to set training.num_workers to 0.
@ytsheng here is my result when I interrupt the training, even with num_workers set to 0.

Command:

    CUDA_VISIBLE_DEVICES=0,1 mmf_run config=projects/movie_mcan/configs/vqa2/defaults.yaml \
        model=movie_mcan \
        dataset=vqa2 \
        run_type=train \
        training.num_workers=0 \
        training.logger_level=debug

Output:

    2021-02-13T00:38:14 | mmf.utils.configuration: Overriding option config to projects/movie_mcan/configs/vqa2/defaults.yaml
    2021-02-13T00:38:19 | mmf.trainers.mmf_trainer: Loading datasets

On interruption (^C):

    Traceback (most recent call last):
Hi @ifrahmaqsood! We will need some more debugging information from you. Have you tried running some other framework on your machines? For example, can you try running PyTorch's ImageNet example to see if it works fine for you: https://github.com/pytorch/examples/blob/master/imagenet/main.py
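For readers hitting the same hang: a minimal two-GPU sanity check along the same lines, sketched here rather than taken from the thread (the address and port are placeholders). If this script also stalls at the all_reduce, the problem is in the PyTorch/NCCL setup rather than in MMF.

```python
# Minimal two-process NCCL sanity check (sketch; not from the original thread).
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank: int, world_size: int) -> None:
    # Placeholder rendezvous settings; adjust the port if it is already in use.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # One all_reduce across both GPUs; a broken NCCL transport hangs here.
    t = torch.ones(1, device=rank)
    dist.all_reduce(t)
    print(f"rank {rank}: all_reduce result = {t.item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2  # two GPUs, as in this issue
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```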
@apsdehal I tried training https://github.com/facebookresearch/grid-feats-vqa with Detectron2. With 2 GPUs, that model is also not working: with 1 GPU it starts training, but with 2 GPUs it just prints the training command and then the system looks stuck. What could be the possible reason? Is there some problem with PyTorch?
@apsdehal Any solution?
@ifrahmaqsood Which version of NCCL do you have? I suspect this is an NCCL communication error. Since it is happening for you on other frameworks as well, we recommend you raise the issue on the PyTorch forums/issue tracker, as it is not specific to MMF.
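One way to answer the NCCL-version question from the training environment itself (a small sketch; it only reports the NCCL build bundled with the installed PyTorch wheel, not any system-wide NCCL):

```python
# Report the CUDA and NCCL versions that this PyTorch build ships with.
import torch

print("PyTorch:", torch.__version__)
print("CUDA (build):", torch.version.cuda)
print("NCCL (bundled):", torch.cuda.nccl.version())
```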
@vedanuj
@ifrahmaqsood as @vedanuj suggested, I also think you should raise this on the PyTorch forums/issues. I would suggest running the PyTorch ImageNet example I mentioned before to confirm, and then posting your findings on PyTorch issues or forums. This is no longer related to MMF but rather to your PyTorch setup. As far as I can tell, the dist.barrier call is getting stuck.
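Before filing upstream, it can also help to capture NCCL's own logs around the hang. NCCL_DEBUG is a standard NCCL environment variable; the snippet below (a sketch, not from the thread) sets it before any process group is created, which can equally be done by exporting it in the shell:

```python
# Turn on NCCL's own logging before the process group is initialized, so the
# transport negotiation (IB / P2P / SHM / socket) is visible when the hang occurs.
import os

os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_SUBSYS"] = "ALL"  # optional: even more verbose output

# ...then launch the training run (or the sanity check above) from this process.
```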
Thank you for the responses. Yes, I had to change the backend from nccl to gloo and it worked.
I just wanted to add (for anyone who takes a look at this thread) that I was having this same problem, and changing the backend from nccl to gloo by adding it to the config worked.
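For context on what that workaround does: at the torch.distributed level it amounts to initializing the process group with the gloo backend instead of nccl. A minimal sketch follows; the exact MMF config key the commenter used is not reproduced in this thread, so the snippet only shows the underlying call with placeholder rendezvous settings.

```python
# Initialize torch.distributed with the gloo backend instead of nccl (sketch).
import os
import torch.distributed as dist


def init_distributed(rank: int, world_size: int) -> None:
    # Placeholder rendezvous settings.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29501")
    # gloo is a CPU/socket-based backend: slower for large tensors than nccl,
    # but it sidesteps the NCCL transport negotiation that was hanging here.
    dist.init_process_group(backend="gloo", rank=rank, world_size=world_size)
```

The trade-off is performance: gloo does not use GPU-direct communication, so multi-GPU training will generally be slower than with a working NCCL setup.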
❓ Questions and Help
I am training a movie_mcan model and have tried training a Pythia model too. I have grid features of the COCO dataset (extracted with a ResNet-50 backbone) saved on my storage; these are hundreds of GBs in total. With 2 GPUs I waited 8 hours for the features to load so that it would at least print something, but it stayed stuck on loading datasets and never reached the start of training.
My environment details:
Is debug build: No
CUDA used to build PyTorch: 10.1
OS: Ubuntu 16.04.7 LTS
GCC version: (Ubuntu 5.5.0-12ubuntu1~16.04) 5.5.0 20171010
CMake version: version 3.18.2
Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 10.1.243
GPU models and configuration:
GPU 0: Tesla K80
GPU 1: Tesla K80
Nvidia driver version: 430.64
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.2
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.0.2
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.0.2
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.0.2
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.0.2
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.0.2
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.0.2
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.0.2
/usr/local/cuda-10.1/targets/x86_64-linux/lib/libcudnn.so.7
Versions of relevant libraries:
[pip3] numpy==1.19.2
[pip3] torch==1.6.0+cu101
[pip3] torchtext==0.5.0
[pip3] torchvision==0.7.0+cu101
[conda] blas 1.0 mkl
[conda] cudatoolkit 10.0.130 0
[conda] mkl 2020.2 256
[conda] mkl-service 2.3.0 py36he8ac12f_0
[conda] mkl_fft 1.2.0 py36h23d657b_0
[conda] mkl_random 1.1.1 py36h0573a6f_0
[conda] numpy 1.19.2 py36h54aff64_0
[conda] numpy-base 1.19.2 py36hfa32c7d_0
[conda] torch 1.4.0+cu100 pypi_0 pypi
[conda] torchtext 0.5.0 pypi_0 pypi
[conda] torchvision 0.7.0+cu101 pypi_0 pypi
With 2 GPUs: [screenshot omitted; the run stops after the "Loading datasets" log line and never starts training]
I am unable to understand how to resolve this issue. I have tried decreasing num_workers but it didn't work.