
Unpacking features gets stuck / loading datasets gets stuck #691

Closed

cyang31 opened this issue Nov 18, 2020 · 3 comments

cyang31 commented Nov 18, 2020

Instructions To Reproduce the Issue:


  1. full code you wrote or full changes you made (git diff)
    I didn't change the code.
  2. what exact command you run:
    I ran the following command inside a Singularity container:
    singularity exec --nv singularity.sif mmf_run config=/scratch/UserName/hateful_meme/mmf/projects/hateful_memes/configs/visual_bert/direct.yaml model=visual_bert dataset=hateful_memes env.data_dir=/scratch/UserName/hateful_meme/data training.num_workers=1 training.fast_read=True
  3. full logs you observed:
    WARNING: underlay of /usr/bin/nvidia-debugdump required more than 50 (375) bind mounts
    /usr/local/lib/python3.7/dist-packages/omegaconf/dictconfig.py:252: UserWarning: Keys with dot (model.bert) are deprecated and will have different semantic meaning the next major version of OmegaConf (2.1)
    See the compact keys issue for more details: Enhancement: Compact key support omry/omegaconf#152
    You can disable this warning by setting the environment variable OC_DISABLE_DOT_ACCESS_WARNING=1
    warnings.warn(message=msg, category=UserWarning)
    2020-11-18T07:22:20 | mmf.utils.configuration: Overriding option config to /scratch/UserName/hateful_meme/mmf/projects/hateful_memes/configs/visual_bert/direct.yaml
    2020-11-18T07:22:20 | mmf.utils.configuration: Overriding option model to visual_bert
    2020-11-18T07:22:20 | mmf.utils.configuration: Overriding option datasets to hateful_memes
    2020-11-18T07:22:20 | mmf.utils.configuration: Overriding option env.data_dir to /scratch/UserName/hateful_meme/data
    2020-11-18T07:22:20 | mmf: Logging to: ./save/train.log
    2020-11-18T07:22:20 | mmf_cli.run: Namespace(config_override=None, local_rank=None, opts=['config=/scratch/UserName/hateful_meme/mmf/projects/hateful_memes/configs/visual_bert/direct.yaml', 'model=visual_bert', 'dataset=hateful_memes', 'env.data_dir=/scratch/UserName/hateful_meme/data'])
    2020-11-18T07:22:20 | mmf_cli.run: Torch version: 1.6.0+cu101
    2020-11-18T07:22:20 | mmf.utils.general: CUDA Device 0 is: Tesla V100-PCIE-32GB
    2020-11-18T07:22:20 | mmf_cli.run: Using seed 21259699
    2020-11-18T07:22:20 | mmf.trainers.mmf_trainer: Loading datasets
    [ Starting checksum for features.tar.gz]
    [ Checksum successful for features.tar.gz]
    Unpacking features.tar.gz

Expected behavior:

No error is raised, but unpacking features.tar.gz takes forever, which is unexpected. To rule out slow extraction, I manually downloaded and unpacked the archive locally, and that finished quickly. However, when I reran the command afterwards, it skipped the download stage but got stuck again at "mmf.trainers.mmf_trainer: Loading datasets". I waited overnight to make sure it was not simply slow, but nothing changed.
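
For reference, this is roughly how I unpacked the archive by hand to time the extraction; the paths below are placeholders for my setup, not anything MMF prescribes:

    import tarfile
    import time

    # Placeholder paths for my setup: where the downloaded archive sits and the
    # directory I pass as env.data_dir. Adjust as needed.
    archive = "/scratch/UserName/hateful_meme/data/features.tar.gz"
    target_dir = "/scratch/UserName/hateful_meme/data"

    start = time.time()
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(path=target_dir)  # plain extraction, no MMF involved
    print(f"Extraction finished in {time.time() - start:.1f} s")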

Environment:

WARNING: underlay of /usr/bin/nvidia-debugdump required more than 50 (375) bind mounts
Collecting environment information...
PyTorch version: 1.6.0+cu101
Is debug build: No
CUDA used to build PyTorch: 10.1

OS: Ubuntu 16.04.6 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
CMake version: Could not collect

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: Tesla K20Xm
Nvidia driver version: 418.39
cuDNN version: Could not collect

Versions of relevant libraries:
[pip3] numpy==1.19.4
[pip3] torch==1.6.0+cu101
[pip3] torchtext==0.5.0
[pip3] torchvision==0.7.0+cu101
[conda] Could not collect

apsdehal (Contributor) commented

Hi,
Can you run your command as follows and see if it solves your issue:

CUDA_VISIBLE_DEVICES=0 singularity exec --nv singularity.sif mmf_run config=/scratch/UserName/hateful_meme/mmf/projects/hateful_memes/configs/visual_bert/direct.yaml model=visual_bert dataset=hateful_memes env.data_dir=/scratch/UserName/hateful_meme/data training.num_workers=0

You don't need training.fast_read; it is for something else. Specifically, note CUDA_VISIBLE_DEVICES=0 to run it on a single GPU and training.num_workers=0 to run it with only one dataset worker.
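
If you want to double-check what the process actually sees inside the container, a quick sanity check with plain PyTorch (nothing MMF-specific) looks like this:

    import os
    import torch

    # CUDA_VISIBLE_DEVICES must be set before CUDA is initialized for it to take effect.
    print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
    print("cuda available:", torch.cuda.is_available())
    print("visible devices:", torch.cuda.device_count())
    for i in range(torch.cuda.device_count()):
        print(f"  device {i}: {torch.cuda.get_device_name(i)}")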

cyang31 (Author) commented Nov 19, 2020

Hi, I tried your suggested command, but it still gets stuck at the same place. In case it helps: the only baseline I can launch smoothly is Image-Grid. A similar issue always happens when MMF tries to unpack an archive, either extras.tar.gz or features.tar.gz. The files download automatically, and during unpacking the size of the extracted files keeps growing and eventually levels off, but the main process stays stuck at "Unpacking X.tar.gz" and never moves on.
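
This is roughly the script I used to watch the extracted size grow and then level off; the data path is a placeholder for the directory I pass as env.data_dir:

    import os
    import time

    # Placeholder: the directory passed as env.data_dir in my run.
    data_dir = "/scratch/UserName/hateful_meme/data"

    def total_size(path):
        # Sum the sizes of all regular files under path.
        total = 0
        for root, _, names in os.walk(path):
            for name in names:
                fp = os.path.join(root, name)
                if os.path.isfile(fp):
                    total += os.path.getsize(fp)
        return total

    prev = -1
    while True:
        size = total_size(data_dir)
        print(f"{time.strftime('%H:%M:%S')}  {size / 1e9:.2f} GB")
        if size == prev:
            print("size stopped growing")
        prev = size
        time.sleep(60)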

apsdehal (Contributor) commented

Something must be off with Singularity, because this works fine as is. Can you try running the command outside of Singularity?

cyang31 closed this as completed Nov 26, 2020