
Unpacking features gets stuck / loading datasets gets stuck #691

Closed

cyang31 opened this issue Nov 18, 2020 · 3 comments

cyang31 commented Nov 18, 2020

Instructions To Reproduce the Issue:


  1. full code you wrote or full changes you made (git diff)
    I didn't change the code.
  2. what exact command you run:
    I ran the following command inside a Singularity container:
    singularity exec --nv singularity.sif mmf_run config=/scratch/UserName/hateful_meme/mmf/projects/hateful_memes/configs/visual_bert/direct.yaml model=visual_bert dataset=hateful_memes env.data_dir=/scratch/UserName/hateful_meme/data training.num_workers=1 training.fast_read=True
  3. full logs you observed:
    WARNING: underlay of /usr/bin/nvidia-debugdump required more than 50 (375) bind mounts
    /usr/local/lib/python3.7/dist-packages/omegaconf/dictconfig.py:252: UserWarning: Keys with dot (model.bert) are deprecated and will have different semantic meaning the next major version of OmegaConf (2.1)
    See the compact keys issue for more details: Enhancement: Compact key support omry/omegaconf#152
    You can disable this warning by setting the environment variable OC_DISABLE_DOT_ACCESS_WARNING=1
    warnings.warn(message=msg, category=UserWarning)
    2020-11-18T07:22:20 | mmf.utils.configuration: Overriding option config to /scratch/UserName/hateful_meme/mmf/projects/hateful_memes/configs/visual_bert/direct.yaml
    2020-11-18T07:22:20 | mmf.utils.configuration: Overriding option model to visual_bert
    2020-11-18T07:22:20 | mmf.utils.configuration: Overriding option datasets to hateful_memes
    2020-11-18T07:22:20 | mmf.utils.configuration: Overriding option env.data_dir to /scratch/UserName/hateful_meme/data
    2020-11-18T07:22:20 | mmf: Logging to: ./save/train.log
    2020-11-18T07:22:20 | mmf_cli.run: Namespace(config_override=None, local_rank=None, opts=['config=/scratch/UserName/hateful_meme/mmf/projects/hateful_memes/configs/visual_bert/direct.yaml', 'model=visual_bert', 'dataset=hateful_memes', 'env.data_dir=/scratch/UserName/hateful_meme/data'])
    2020-11-18T07:22:20 | mmf_cli.run: Torch version: 1.6.0+cu101
    2020-11-18T07:22:20 | mmf.utils.general: CUDA Device 0 is: Tesla V100-PCIE-32GB
    2020-11-18T07:22:20 | mmf_cli.run: Using seed 21259699
    2020-11-18T07:22:20 | mmf.trainers.mmf_trainer: Loading datasets
    [ Starting checksum for features.tar.gz]
    [ Checksum successful for features.tar.gz]
    Unpacking features.tar.gz

Expected behavior:

No error is raised, but unpacking features.tar.gz takes forever, which is unexpected. To rule out slow extraction, I manually downloaded and unpacked the archive locally, and that finished quickly. However, when I reran the command afterwards, it skipped the download stage but got stuck again at "mmf.trainers.mmf_trainer: Loading datasets". I waited overnight to make sure it was not simply slow, but nothing changed.
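
For reference, this is roughly how I unpacked the archive by hand to time the extraction; the paths below are placeholders for my setup, not anything MMF prescribes:

    import tarfile
    import time

    # Placeholder paths for my setup: where the downloaded archive sits and the
    # directory I pass as env.data_dir. Adjust as needed.
    archive = "/scratch/UserName/hateful_meme/data/features.tar.gz"
    target_dir = "/scratch/UserName/hateful_meme/data"

    start = time.time()
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(path=target_dir)  # plain extraction, no MMF involved
    print(f"Extraction finished in {time.time() - start:.1f} s")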

Environment:

WARNING: underlay of /usr/bin/nvidia-debugdump required more than 50 (375) bind mounts
Collecting environment information...
PyTorch version: 1.6.0+cu101
Is debug build: No
CUDA used to build PyTorch: 10.1

OS: Ubuntu 16.04.6 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
CMake version: Could not collect

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: Tesla K20Xm
Nvidia driver version: 418.39
cuDNN version: Could not collect

Versions of relevant libraries:
[pip3] numpy==1.19.4
[pip3] torch==1.6.0+cu101
[pip3] torchtext==0.5.0
[pip3] torchvision==0.7.0+cu101
[conda] Could not collect

apsdehal (Contributor) commented

Hi,
Can you run your command as follows and see if it solves your issue:

CUDA_VISIBLE_DEVICES=0 singularity exec --nv singularity.sif mmf_run config=/scratch/UserName/hateful_meme/mmf/projects/hateful_memes/configs/visual_bert/direct.yaml model=visual_bert dataset=hateful_memes env.data_dir=/scratch/UserName/hateful_meme/data training.num_workers=0

You don't need training.fast_read; it is for something else. Specifically, note CUDA_VISIBLE_DEVICES=0 to run it on a single GPU and training.num_workers=0 to run it with only one dataset worker.
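
If you want to double-check what the process actually sees inside the container, a quick sanity check with plain PyTorch (nothing MMF-specific) looks like this:

    import os
    import torch

    # CUDA_VISIBLE_DEVICES must be set before CUDA is initialized for it to take effect.
    print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
    print("cuda available:", torch.cuda.is_available())
    print("visible devices:", torch.cuda.device_count())
    for i in range(torch.cuda.device_count()):
        print(f"  device {i}: {torch.cuda.get_device_name(i)}")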

cyang31 (Author) commented Nov 19, 2020

Hi, I tried your suggested command, but it still gets stuck at the same place. In case it helps: the only baseline I can launch smoothly is Image-Grid. A similar issue always happens when MMF tries to unpack an archive, either extras.tar.gz or features.tar.gz. The files download automatically, and during unpacking the size of the extracted files keeps growing and eventually levels off, but the main process stays stuck at "Unpacking X.tar.gz" and never moves on.
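
This is roughly the script I used to watch the extracted size grow and then level off; the data path is a placeholder for the directory I pass as env.data_dir:

    import os
    import time

    # Placeholder: the directory passed as env.data_dir in my run.
    data_dir = "/scratch/UserName/hateful_meme/data"

    def total_size(path):
        # Sum the sizes of all regular files under path.
        total = 0
        for root, _, names in os.walk(path):
            for name in names:
                fp = os.path.join(root, name)
                if os.path.isfile(fp):
                    total += os.path.getsize(fp)
        return total

    prev = -1
    while True:
        size = total_size(data_dir)
        print(f"{time.strftime('%H:%M:%S')}  {size / 1e9:.2f} GB")
        if size == prev:
            print("size stopped growing")
        prev = size
        time.sleep(60)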

apsdehal (Contributor) commented

Something must be off with Singularity, because this works fine as is. Can you try running the command outside of Singularity?

cyang31 closed this as completed Nov 26, 2020