Unexpected bus error. This might be caused by insufficient shared memory #877

Closed
CCYChongyanChen opened this issue Apr 15, 2021 · 10 comments


@CCYChongyanChen

❓ Questions and Help

Hi there, when running Pythia and ViLBERT for testing with the pretrained models, I hit the "dataloader worker (pid(s) {}) exited unexpectedly" error.

I am not sure why this happens. I only tried to predict one batch and changed the batch size to 4, so why does it still require so much shared memory? (I am not sure how to set the test batch size specifically; I just changed the training batch size to 4 in defaults.yaml.)
[screenshot of the error]

If I set training.num_workers=0, it instead fails with a core dump or "process 0 terminated with signal SIGBUS". Should I use something like "testing.num_workers"?

The GPUs I have are two GeForce GTX 1080 Ti (11178 MiB each) and the system is Ubuntu 16.

I have already checked issue #732 as well as #441 (how can I know whether the dataset was downloaded correctly?).

Thank you a lot for your help! Really appreciate it!!!

@apsdehal
Contributor

Hi,
Can you post full logs?

@CCYChongyanChen
Author

> Hi,
> Can you post full logs?

Sure! I have attached them here. Thank you, apsdehal!!!

(MMF2) cc67459@soi-edge:~/MMF2/mmf_8_2/mmf$ mmf_predict config=projects/vilbert/configs/vqa2/defaults.yaml \
    model=vilbert \
    dataset=vqa2 \
    run_type=test \
    checkpoint.resume_zoo=vilbert.pretrained.vqa2

/home/cc67459/MMF2/lib/python3.7/site-packages/omegaconf/dictconfig.py:252: UserWarning: Keys with dot (model.bert) are deprecated and will have different semantic meaning the next major version of OmegaConf (2.1)
See the compact keys issue for more details: omry/omegaconf#152
You can disable this warning by setting the environment variable OC_DISABLE_DOT_ACCESS_WARNING=1
warnings.warn(message=msg, category=UserWarning)
2021-04-15T13:01:19 | mmf.utils.configuration: Overriding option config to projects/vilbert/configs/vqa2/defaults.yaml
2021-04-15T13:01:19 | mmf.utils.configuration: Overriding option model to vilbert
2021-04-15T13:01:19 | mmf.utils.configuration: Overriding option datasets to vqa2
2021-04-15T13:01:19 | mmf.utils.configuration: Overriding option run_type to test
2021-04-15T13:01:19 | mmf.utils.configuration: Overriding option checkpoint.resume_zoo to vilbert.pretrained.vqa2
2021-04-15T13:01:19 | mmf.utils.configuration: Overriding option evaluation.predict to true
2021-04-15T13:01:21 | mmf.utils.distributed: Distributed Init (Rank 1): tcp://localhost:18337
2021-04-15T13:01:21 | mmf.utils.distributed: Distributed Init (Rank 3): tcp://localhost:18337
2021-04-15T13:01:21 | mmf.utils.distributed: Distributed Init (Rank 2): tcp://localhost:18337
2021-04-15T13:01:21 | mmf.utils.distributed: Distributed Init (Rank 0): tcp://localhost:18337
2021-04-15T13:01:22 | mmf.utils.distributed: Initialized Host soi-edge as Rank 1
2021-04-15T13:01:22 | mmf.utils.distributed: Initialized Host soi-edge as Rank 3
2021-04-15T13:01:22 | mmf.utils.distributed: Initialized Host soi-edge as Rank 2
2021-04-15T13:01:22 | mmf.utils.distributed: Initialized Host soi-edge as Rank 0
2021-04-15T13:01:23 | mmf: Logging to: ./save/train.log
2021-04-15T13:01:23 | mmf_cli.run: Namespace(config_override=None, local_rank=None, opts=['config=projects/vilbert/configs/vqa2/defaults.yaml', 'model=vilbert', 'dataset=vqa2', 'run_type=test', 'checkpoint.resume_zoo=vilbert.pretrained.vqa2', 'evaluation.predict=true'])
2021-04-15T13:01:23 | mmf_cli.run: Torch version: 1.5.0
2021-04-15T13:01:23 | mmf.utils.general: CUDA Device 0 is: GeForce GTX 1080 Ti
2021-04-15T13:01:23 | mmf_cli.run: Using seed 23968514
2021-04-15T13:01:23 | mmf.trainers.mmf_trainer: Loading datasets
2021-04-15T13:01:28 | transformers.tokenization_utils: loading file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt from cache at /home/cc67459/.cache/torch/transformers/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
2021-04-15T13:01:35 | mmf.trainers.mmf_trainer: Loading model
2021-04-15T13:01:35 | transformers.modeling_utils: loading weights file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-pytorch_model.bin from cache at /data/vqa/mmf/distributed_-1/aa1ef1aede4482d0dbcd4d52baad8ae300e60902e88fcb0bebdec09afd232066.36ca03ab34a1a5d5fa7bc3d03d55c4fa650fed07220e2eeebc06ce58d0e9a157
2021-04-15T13:01:38 | transformers.modeling_utils: Weights of ViLBERTBase not initialized from pretrained model: ['bert.v_embeddings.image_embeddings.weight', 'bert.v_embeddings.image_embeddings.bias', 'bert.v_embeddings.image_location_embeddings.weight', 'bert.v_embeddings.image_location_embeddings.bias', 'bert.v_embeddings.LayerNorm.weight', 'bert.v_embeddings.LayerNorm.bias', 'bert.encoder.v_layer.0.attention.self.query.weight', 'bert.encoder.v_layer.0.attention.self.query.bias', 'bert.encoder.v_layer.0.attention.self.key.weight', 'bert.encoder.v_layer.0.attention.self.key.bias', 'bert.encoder.v_layer.0.attention.self.value.weight', 'bert.encoder.v_layer.0.attention.self.value.bias', 'bert.encoder.v_layer.0.attention.output.dense.weight', 'bert.encoder.v_layer.0.attention.output.dense.bias', 'bert.encoder.v_layer.0.attention.output.LayerNorm.weight', 'bert.encoder.v_layer.0.attention.output.LayerNorm.bias', 'bert.encoder.v_layer.0.intermediate.dense.weight', 'bert.encoder.v_layer.0.intermediate.dense.bias', 'bert.encoder.v_layer.0.output.dense.weight', 'bert.encoder.v_layer.0.output.dense.bias', 'bert.encoder.v_layer.0.output.LayerNorm.weight', 'bert.encoder.v_layer.0.output.LayerNorm.bias', 'bert.encoder.v_layer.1.attention.self.query.weight', 'bert.encoder.v_layer.1.attention.self.query.bias', 'bert.encoder.v_layer.1.attention.self.key.weight', 'bert.encoder.v_layer.1.attention.self.key.bias', 'bert.encoder.v_layer.1.attention.self.value.weight', 'bert.encoder.v_layer.1.attention.self.value.bias', 'bert.encoder.v_layer.1.attention.output.dense.weight', 'bert.encoder.v_layer.1.attention.output.dense.bias', 'bert.encoder.v_layer.1.attention.output.LayerNorm.weight', 'bert.encoder.v_layer.1.attention.output.LayerNorm.bias', 'bert.encoder.v_layer.1.intermediate.dense.weight', 'bert.encoder.v_layer.1.intermediate.dense.bias', 'bert.encoder.v_layer.1.output.dense.weight', 'bert.encoder.v_layer.1.output.dense.bias', 'bert.encoder.v_layer.1.output.LayerNorm.weight', 'bert.encoder.v_layer.1.output.LayerNorm.bias', 'bert.encoder.v_layer.2.attention.self.query.weight', 'bert.encoder.v_layer.2.attention.self.query.bias', 'bert.encoder.v_layer.2.attention.self.key.weight', 'bert.encoder.v_layer.2.attention.self.key.bias', 'bert.encoder.v_layer.2.attention.self.value.weight', 'bert.encoder.v_layer.2.attention.self.value.bias', 'bert.encoder.v_layer.2.attention.output.dense.weight', 'bert.encoder.v_layer.2.attention.output.dense.bias', 'bert.encoder.v_layer.2.attention.output.LayerNorm.weight', 'bert.encoder.v_layer.2.attention.output.LayerNorm.bias', 'bert.encoder.v_layer.2.intermediate.dense.weight', 'bert.encoder.v_layer.2.intermediate.dense.bias', 'bert.encoder.v_layer.2.output.dense.weight', 'bert.encoder.v_layer.2.output.dense.bias', 'bert.encoder.v_layer.2.output.LayerNorm.weight', 'bert.encoder.v_layer.2.output.LayerNorm.bias', 'bert.encoder.v_layer.3.attention.self.query.weight', 'bert.encoder.v_layer.3.attention.self.query.bias', 'bert.encoder.v_layer.3.attention.self.key.weight', 'bert.encoder.v_layer.3.attention.self.key.bias', 'bert.encoder.v_layer.3.attention.self.value.weight', 'bert.encoder.v_layer.3.attention.self.value.bias', 'bert.encoder.v_layer.3.attention.output.dense.weight', 'bert.encoder.v_layer.3.attention.output.dense.bias', 'bert.encoder.v_layer.3.attention.output.LayerNorm.weight', 'bert.encoder.v_layer.3.attention.output.LayerNorm.bias', 'bert.encoder.v_layer.3.intermediate.dense.weight', 'bert.encoder.v_layer.3.intermediate.dense.bias', 
'bert.encoder.v_layer.3.output.dense.weight', 'bert.encoder.v_layer.3.output.dense.bias', 'bert.encoder.v_layer.3.output.LayerNorm.weight', 'bert.encoder.v_layer.3.output.LayerNorm.bias', 'bert.encoder.v_layer.4.attention.self.query.weight', 'bert.encoder.v_layer.4.attention.self.query.bias', 'bert.encoder.v_layer.4.attention.self.key.weight', 'bert.encoder.v_layer.4.attention.self.key.bias', 'bert.encoder.v_layer.4.attention.self.value.weight', 'bert.encoder.v_layer.4.attention.self.value.bias', 'bert.encoder.v_layer.4.attention.output.dense.weight', 'bert.encoder.v_layer.4.attention.output.dense.bias', 'bert.encoder.v_layer.4.attention.output.LayerNorm.weight', 'bert.encoder.v_layer.4.attention.output.LayerNorm.bias', 'bert.encoder.v_layer.4.intermediate.dense.weight', 'bert.encoder.v_layer.4.intermediate.dense.bias', 'bert.encoder.v_layer.4.output.dense.weight', 'bert.encoder.v_layer.4.output.dense.bias', 'bert.encoder.v_layer.4.output.LayerNorm.weight', 'bert.encoder.v_layer.4.output.LayerNorm.bias', 'bert.encoder.v_layer.5.attention.self.query.weight', 'bert.encoder.v_layer.5.attention.self.query.bias', 'bert.encoder.v_layer.5.attention.self.key.weight', 'bert.encoder.v_layer.5.attention.self.key.bias', 'bert.encoder.v_layer.5.attention.self.value.weight', 'bert.encoder.v_layer.5.attention.self.value.bias', 'bert.encoder.v_layer.5.attention.output.dense.weight', 'bert.encoder.v_layer.5.attention.output.dense.bias', 'bert.encoder.v_layer.5.attention.output.LayerNorm.weight', 'bert.encoder.v_layer.5.attention.output.LayerNorm.bias', 'bert.encoder.v_layer.5.intermediate.dense.weight', 'bert.encoder.v_layer.5.intermediate.dense.bias', 'bert.encoder.v_layer.5.output.dense.weight', 'bert.encoder.v_layer.5.output.dense.bias', 'bert.encoder.v_layer.5.output.LayerNorm.weight', 'bert.encoder.v_layer.5.output.LayerNorm.bias', 'bert.encoder.c_layer.0.biattention.query1.weight', 'bert.encoder.c_layer.0.biattention.query1.bias', 'bert.encoder.c_layer.0.biattention.key1.weight', 'bert.encoder.c_layer.0.biattention.key1.bias', 'bert.encoder.c_layer.0.biattention.value1.weight', 'bert.encoder.c_layer.0.biattention.value1.bias', 'bert.encoder.c_layer.0.biattention.query2.weight', 'bert.encoder.c_layer.0.biattention.query2.bias', 'bert.encoder.c_layer.0.biattention.key2.weight', 'bert.encoder.c_layer.0.biattention.key2.bias', 'bert.encoder.c_layer.0.biattention.value2.weight', 'bert.encoder.c_layer.0.biattention.value2.bias', 'bert.encoder.c_layer.0.biOutput.dense1.weight', 'bert.encoder.c_layer.0.biOutput.dense1.bias', 'bert.encoder.c_layer.0.biOutput.LayerNorm1.weight', 'bert.encoder.c_layer.0.biOutput.LayerNorm1.bias', 'bert.encoder.c_layer.0.biOutput.q_dense1.weight', 'bert.encoder.c_layer.0.biOutput.q_dense1.bias', 'bert.encoder.c_layer.0.biOutput.dense2.weight', 'bert.encoder.c_layer.0.biOutput.dense2.bias', 'bert.encoder.c_layer.0.biOutput.LayerNorm2.weight', 'bert.encoder.c_layer.0.biOutput.LayerNorm2.bias', 'bert.encoder.c_layer.0.biOutput.q_dense2.weight', 'bert.encoder.c_layer.0.biOutput.q_dense2.bias', 'bert.encoder.c_layer.0.v_intermediate.dense.weight', 'bert.encoder.c_layer.0.v_intermediate.dense.bias', 'bert.encoder.c_layer.0.v_output.dense.weight', 'bert.encoder.c_layer.0.v_output.dense.bias', 'bert.encoder.c_layer.0.v_output.LayerNorm.weight', 'bert.encoder.c_layer.0.v_output.LayerNorm.bias', 'bert.encoder.c_layer.0.t_intermediate.dense.weight', 'bert.encoder.c_layer.0.t_intermediate.dense.bias', 'bert.encoder.c_layer.0.t_output.dense.weight', 
'bert.encoder.c_layer.0.t_output.dense.bias', 'bert.encoder.c_layer.0.t_output.LayerNorm.weight', 'bert.encoder.c_layer.0.t_output.LayerNorm.bias', 'bert.encoder.c_layer.1.biattention.query1.weight', 'bert.encoder.c_layer.1.biattention.query1.bias', 'bert.encoder.c_layer.1.biattention.key1.weight', 'bert.encoder.c_layer.1.biattention.key1.bias', 'bert.encoder.c_layer.1.biattention.value1.weight', 'bert.encoder.c_layer.1.biattention.value1.bias', 'bert.encoder.c_layer.1.biattention.query2.weight', 'bert.encoder.c_layer.1.biattention.query2.bias', 'bert.encoder.c_layer.1.biattention.key2.weight', 'bert.encoder.c_layer.1.biattention.key2.bias', 'bert.encoder.c_layer.1.biattention.value2.weight', 'bert.encoder.c_layer.1.biattention.value2.bias', 'bert.encoder.c_layer.1.biOutput.dense1.weight', 'bert.encoder.c_layer.1.biOutput.dense1.bias', 'bert.encoder.c_layer.1.biOutput.LayerNorm1.weight', 'bert.encoder.c_layer.1.biOutput.LayerNorm1.bias', 'bert.encoder.c_layer.1.biOutput.q_dense1.weight', 'bert.encoder.c_layer.1.biOutput.q_dense1.bias', 'bert.encoder.c_layer.1.biOutput.dense2.weight', 'bert.encoder.c_layer.1.biOutput.dense2.bias', 'bert.encoder.c_layer.1.biOutput.LayerNorm2.weight', 'bert.encoder.c_layer.1.biOutput.LayerNorm2.bias', 'bert.encoder.c_layer.1.biOutput.q_dense2.weight', 'bert.encoder.c_layer.1.biOutput.q_dense2.bias', 'bert.encoder.c_layer.1.v_intermediate.dense.weight', 'bert.encoder.c_layer.1.v_intermediate.dense.bias', 'bert.encoder.c_layer.1.v_output.dense.weight', 'bert.encoder.c_layer.1.v_output.dense.bias', 'bert.encoder.c_layer.1.v_output.LayerNorm.weight', 'bert.encoder.c_layer.1.v_output.LayerNorm.bias', 'bert.encoder.c_layer.1.t_intermediate.dense.weight', 'bert.encoder.c_layer.1.t_intermediate.dense.bias', 'bert.encoder.c_layer.1.t_output.dense.weight', 'bert.encoder.c_layer.1.t_output.dense.bias', 'bert.encoder.c_layer.1.t_output.LayerNorm.weight', 'bert.encoder.c_layer.1.t_output.LayerNorm.bias', 'bert.encoder.c_layer.2.biattention.query1.weight', 'bert.encoder.c_layer.2.biattention.query1.bias', 'bert.encoder.c_layer.2.biattention.key1.weight', 'bert.encoder.c_layer.2.biattention.key1.bias', 'bert.encoder.c_layer.2.biattention.value1.weight', 'bert.encoder.c_layer.2.biattention.value1.bias', 'bert.encoder.c_layer.2.biattention.query2.weight', 'bert.encoder.c_layer.2.biattention.query2.bias', 'bert.encoder.c_layer.2.biattention.key2.weight', 'bert.encoder.c_layer.2.biattention.key2.bias', 'bert.encoder.c_layer.2.biattention.value2.weight', 'bert.encoder.c_layer.2.biattention.value2.bias', 'bert.encoder.c_layer.2.biOutput.dense1.weight', 'bert.encoder.c_layer.2.biOutput.dense1.bias', 'bert.encoder.c_layer.2.biOutput.LayerNorm1.weight', 'bert.encoder.c_layer.2.biOutput.LayerNorm1.bias', 'bert.encoder.c_layer.2.biOutput.q_dense1.weight', 'bert.encoder.c_layer.2.biOutput.q_dense1.bias', 'bert.encoder.c_layer.2.biOutput.dense2.weight', 'bert.encoder.c_layer.2.biOutput.dense2.bias', 'bert.encoder.c_layer.2.biOutput.LayerNorm2.weight', 'bert.encoder.c_layer.2.biOutput.LayerNorm2.bias', 'bert.encoder.c_layer.2.biOutput.q_dense2.weight', 'bert.encoder.c_layer.2.biOutput.q_dense2.bias', 'bert.encoder.c_layer.2.v_intermediate.dense.weight', 'bert.encoder.c_layer.2.v_intermediate.dense.bias', 'bert.encoder.c_layer.2.v_output.dense.weight', 'bert.encoder.c_layer.2.v_output.dense.bias', 'bert.encoder.c_layer.2.v_output.LayerNorm.weight', 'bert.encoder.c_layer.2.v_output.LayerNorm.bias', 'bert.encoder.c_layer.2.t_intermediate.dense.weight', 
'bert.encoder.c_layer.2.t_intermediate.dense.bias', 'bert.encoder.c_layer.2.t_output.dense.weight', 'bert.encoder.c_layer.2.t_output.dense.bias', 'bert.encoder.c_layer.2.t_output.LayerNorm.weight', 'bert.encoder.c_layer.2.t_output.LayerNorm.bias', 'bert.encoder.c_layer.3.biattention.query1.weight', 'bert.encoder.c_layer.3.biattention.query1.bias', 'bert.encoder.c_layer.3.biattention.key1.weight', 'bert.encoder.c_layer.3.biattention.key1.bias', 'bert.encoder.c_layer.3.biattention.value1.weight', 'bert.encoder.c_layer.3.biattention.value1.bias', 'bert.encoder.c_layer.3.biattention.query2.weight', 'bert.encoder.c_layer.3.biattention.query2.bias', 'bert.encoder.c_layer.3.biattention.key2.weight', 'bert.encoder.c_layer.3.biattention.key2.bias', 'bert.encoder.c_layer.3.biattention.value2.weight', 'bert.encoder.c_layer.3.biattention.value2.bias', 'bert.encoder.c_layer.3.biOutput.dense1.weight', 'bert.encoder.c_layer.3.biOutput.dense1.bias', 'bert.encoder.c_layer.3.biOutput.LayerNorm1.weight', 'bert.encoder.c_layer.3.biOutput.LayerNorm1.bias', 'bert.encoder.c_layer.3.biOutput.q_dense1.weight', 'bert.encoder.c_layer.3.biOutput.q_dense1.bias', 'bert.encoder.c_layer.3.biOutput.dense2.weight', 'bert.encoder.c_layer.3.biOutput.dense2.bias', 'bert.encoder.c_layer.3.biOutput.LayerNorm2.weight', 'bert.encoder.c_layer.3.biOutput.LayerNorm2.bias', 'bert.encoder.c_layer.3.biOutput.q_dense2.weight', 'bert.encoder.c_layer.3.biOutput.q_dense2.bias', 'bert.encoder.c_layer.3.v_intermediate.dense.weight', 'bert.encoder.c_layer.3.v_intermediate.dense.bias', 'bert.encoder.c_layer.3.v_output.dense.weight', 'bert.encoder.c_layer.3.v_output.dense.bias', 'bert.encoder.c_layer.3.v_output.LayerNorm.weight', 'bert.encoder.c_layer.3.v_output.LayerNorm.bias', 'bert.encoder.c_layer.3.t_intermediate.dense.weight', 'bert.encoder.c_layer.3.t_intermediate.dense.bias', 'bert.encoder.c_layer.3.t_output.dense.weight', 'bert.encoder.c_layer.3.t_output.dense.bias', 'bert.encoder.c_layer.3.t_output.LayerNorm.weight', 'bert.encoder.c_layer.3.t_output.LayerNorm.bias', 'bert.encoder.c_layer.4.biattention.query1.weight', 'bert.encoder.c_layer.4.biattention.query1.bias', 'bert.encoder.c_layer.4.biattention.key1.weight', 'bert.encoder.c_layer.4.biattention.key1.bias', 'bert.encoder.c_layer.4.biattention.value1.weight', 'bert.encoder.c_layer.4.biattention.value1.bias', 'bert.encoder.c_layer.4.biattention.query2.weight', 'bert.encoder.c_layer.4.biattention.query2.bias', 'bert.encoder.c_layer.4.biattention.key2.weight', 'bert.encoder.c_layer.4.biattention.key2.bias', 'bert.encoder.c_layer.4.biattention.value2.weight', 'bert.encoder.c_layer.4.biattention.value2.bias', 'bert.encoder.c_layer.4.biOutput.dense1.weight', 'bert.encoder.c_layer.4.biOutput.dense1.bias', 'bert.encoder.c_layer.4.biOutput.LayerNorm1.weight', 'bert.encoder.c_layer.4.biOutput.LayerNorm1.bias', 'bert.encoder.c_layer.4.biOutput.q_dense1.weight', 'bert.encoder.c_layer.4.biOutput.q_dense1.bias', 'bert.encoder.c_layer.4.biOutput.dense2.weight', 'bert.encoder.c_layer.4.biOutput.dense2.bias', 'bert.encoder.c_layer.4.biOutput.LayerNorm2.weight', 'bert.encoder.c_layer.4.biOutput.LayerNorm2.bias', 'bert.encoder.c_layer.4.biOutput.q_dense2.weight', 'bert.encoder.c_layer.4.biOutput.q_dense2.bias', 'bert.encoder.c_layer.4.v_intermediate.dense.weight', 'bert.encoder.c_layer.4.v_intermediate.dense.bias', 'bert.encoder.c_layer.4.v_output.dense.weight', 'bert.encoder.c_layer.4.v_output.dense.bias', 'bert.encoder.c_layer.4.v_output.LayerNorm.weight', 
'bert.encoder.c_layer.4.v_output.LayerNorm.bias', 'bert.encoder.c_layer.4.t_intermediate.dense.weight', 'bert.encoder.c_layer.4.t_intermediate.dense.bias', 'bert.encoder.c_layer.4.t_output.dense.weight', 'bert.encoder.c_layer.4.t_output.dense.bias', 'bert.encoder.c_layer.4.t_output.LayerNorm.weight', 'bert.encoder.c_layer.4.t_output.LayerNorm.bias', 'bert.encoder.c_layer.5.biattention.query1.weight', 'bert.encoder.c_layer.5.biattention.query1.bias', 'bert.encoder.c_layer.5.biattention.key1.weight', 'bert.encoder.c_layer.5.biattention.key1.bias', 'bert.encoder.c_layer.5.biattention.value1.weight', 'bert.encoder.c_layer.5.biattention.value1.bias', 'bert.encoder.c_layer.5.biattention.query2.weight', 'bert.encoder.c_layer.5.biattention.query2.bias', 'bert.encoder.c_layer.5.biattention.key2.weight', 'bert.encoder.c_layer.5.biattention.key2.bias', 'bert.encoder.c_layer.5.biattention.value2.weight', 'bert.encoder.c_layer.5.biattention.value2.bias', 'bert.encoder.c_layer.5.biOutput.dense1.weight', 'bert.encoder.c_layer.5.biOutput.dense1.bias', 'bert.encoder.c_layer.5.biOutput.LayerNorm1.weight', 'bert.encoder.c_layer.5.biOutput.LayerNorm1.bias', 'bert.encoder.c_layer.5.biOutput.q_dense1.weight', 'bert.encoder.c_layer.5.biOutput.q_dense1.bias', 'bert.encoder.c_layer.5.biOutput.dense2.weight', 'bert.encoder.c_layer.5.biOutput.dense2.bias', 'bert.encoder.c_layer.5.biOutput.LayerNorm2.weight', 'bert.encoder.c_layer.5.biOutput.LayerNorm2.bias', 'bert.encoder.c_layer.5.biOutput.q_dense2.weight', 'bert.encoder.c_layer.5.biOutput.q_dense2.bias', 'bert.encoder.c_layer.5.v_intermediate.dense.weight', 'bert.encoder.c_layer.5.v_intermediate.dense.bias', 'bert.encoder.c_layer.5.v_output.dense.weight', 'bert.encoder.c_layer.5.v_output.dense.bias', 'bert.encoder.c_layer.5.v_output.LayerNorm.weight', 'bert.encoder.c_layer.5.v_output.LayerNorm.bias', 'bert.encoder.c_layer.5.t_intermediate.dense.weight', 'bert.encoder.c_layer.5.t_intermediate.dense.bias', 'bert.encoder.c_layer.5.t_output.dense.weight', 'bert.encoder.c_layer.5.t_output.dense.bias', 'bert.encoder.c_layer.5.t_output.LayerNorm.weight', 'bert.encoder.c_layer.5.t_output.LayerNorm.bias', 'bert.t_pooler.dense.weight', 'bert.t_pooler.dense.bias', 'bert.v_pooler.dense.weight', 'bert.v_pooler.dense.bias']
2021-04-15T13:01:38 | transformers.modeling_utils: Weights from pretrained model not used in ViLBERTBase: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
2021-04-15T13:01:39 | mmf.trainers.mmf_trainer: Loading optimizer
2021-04-15T13:01:39 | mmf.trainers.mmf_trainer: Loading metrics
WARNING 2021-04-15T13:01:39 | py.warnings: /home/cc67459/MMF2/mmf_8_2/mmf/mmf/utils/distributed.py:272: UserWarning: Keys with dot (model.bert) are deprecated and will have different semantic meaning the next major version of OmegaConf (2.1)
See the compact keys issue for more details: omry/omegaconf#152
You can disable this warning by setting the environment variable OC_DISABLE_DOT_ACCESS_WARNING=1
builtin_warn(*args, **kwargs)

WARNING 2021-04-15T13:01:39 | py.warnings: /home/cc67459/MMF2/mmf_8_2/mmf/mmf/utils/distributed.py:272: UserWarning: Keys with dot (model.bert) are deprecated and will have different semantic meaning the next major version of OmegaConf (2.1)
See the compact keys issue for more details: omry/omegaconf#152
You can disable this warning by setting the environment variable OC_DISABLE_DOT_ACCESS_WARNING=1
builtin_warn(*args, **kwargs)

2021-04-15T13:01:39 | mmf.utils.checkpoint: Loading checkpoint
WARNING 2021-04-15T13:01:40 | mmf: Key data_parallel is not present in registry, returning default value of None
WARNING 2021-04-15T13:01:40 | mmf: Key distributed is not present in registry, returning default value of None
WARNING 2021-04-15T13:01:40 | mmf: Key data_parallel is not present in registry, returning default value of None
WARNING 2021-04-15T13:01:40 | mmf: Key distributed is not present in registry, returning default value of None
WARNING 2021-04-15T13:01:41 | py.warnings: /home/cc67459/MMF2/mmf_8_2/mmf/mmf/utils/distributed.py:272: UserWarning: 'optimizer' key is not present in the checkpoint asked to be loaded. Skipping.
builtin_warn(*args, **kwargs)

WARNING 2021-04-15T13:01:41 | py.warnings: /home/cc67459/MMF2/mmf_8_2/mmf/mmf/utils/distributed.py:272: UserWarning: 'optimizer' key is not present in the checkpoint asked to be loaded. Skipping.
builtin_warn(*args, **kwargs)

2021-04-15T13:01:41 | mmf.utils.checkpoint: Checkpoint loaded
2021-04-15T13:01:41 | mmf.trainers.core.evaluation_loop: Starting test inference predictions
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Traceback (most recent call last):
  File "/home/cc67459/MMF2/bin/mmf_predict", line 33, in <module>
    sys.exit(load_entry_point('mmf', 'console_scripts', 'mmf_predict')())
  File "/home/cc67459/MMF2/mmf_8_2/mmf/mmf_cli/predict.py", line 15, in predict
    run(predict=True)
  File "/home/cc67459/MMF2/mmf_8_2/mmf/mmf_cli/run.py", line 118, in run
    nprocs=config.distributed.world_size,
  File "/home/cc67459/MMF2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/cc67459/MMF2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/home/cc67459/MMF2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 119, in join
    raise Exception(msg)
Exception:

-- Process 3 terminated with the following error:
Traceback (most recent call last):
  File "/home/cc67459/MMF2/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 761, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/cc67459/anaconda3/lib/python3.7/multiprocessing/queues.py", line 104, in get
    if not self._poll(timeout):
  File "/home/cc67459/anaconda3/lib/python3.7/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/home/cc67459/anaconda3/lib/python3.7/multiprocessing/connection.py", line 414, in _poll
    r = wait([self], timeout)
  File "/home/cc67459/anaconda3/lib/python3.7/multiprocessing/connection.py", line 920, in wait
    ready = selector.select(timeout)
  File "/home/cc67459/anaconda3/lib/python3.7/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
  File "/home/cc67459/MMF2/lib/python3.7/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 8161) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/cc67459/MMF2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/home/cc67459/MMF2/mmf_8_2/mmf/mmf_cli/run.py", line 66, in distributed_main
    main(configuration, init_distributed=True, predict=predict)
  File "/home/cc67459/MMF2/mmf_8_2/mmf/mmf_cli/run.py", line 54, in main
    trainer.inference()
  File "/home/cc67459/MMF2/mmf_8_2/mmf/mmf/trainers/mmf_trainer.py", line 123, in inference
    self.prediction_loop(dataset)
  File "/home/cc67459/MMF2/mmf_8_2/mmf/mmf/trainers/core/evaluation_loop.py", line 61, in prediction_loop
    batch=next(iter(dataloader))
  File "/home/cc67459/MMF2/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/home/cc67459/MMF2/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 841, in _next_data
    idx, data = self._get_data()
  File "/home/cc67459/MMF2/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 808, in _get_data
    success, data = self._try_get_data()
  File "/home/cc67459/MMF2/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 774, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str))
RuntimeError: DataLoader worker (pid(s) 8161) exited unexpectedly

@apsdehal
Contributor

Thanks for the logs. This is a known issue where a machine's shared memory is not enough for PyTorch's usage. See this thread for possible solutions for increasing shared memory: https://discuss.pytorch.org/t/training-crashes-due-to-insufficient-shared-memory-shm-nn-dataparallel/26396/11, especially https://discuss.pytorch.org/t/training-crashes-due-to-insufficient-shared-memory-shm-nn-dataparallel/26396/13.

I can confirm that this error has happened to me in the past, and I was able to fix it by following that suggestion and also increasing ulimit. Let me know whether this works.
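
For reference, a minimal sketch of that fix. The 64G size and the choice of ulimit flag are assumptions on my part, not values from the linked threads; pick a size that fits your machine's RAM:

```bash
# Check the current size and usage of the shared-memory tmpfs.
df -h /dev/shm

# Temporarily enlarge it (reverts on reboot); needs root.
sudo mount -o remount,size=64G /dev/shm

# Raise the max-locked-memory ulimit for the current shell before launching.
ulimit -l unlimited

# To make the size change permanent, add or edit the tmpfs line in /etc/fstab:
#   tmpfs  /dev/shm  tmpfs  defaults,size=64G  0  0
```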

@CCYChongyanChen
Author

CCYChongyanChen commented Apr 15, 2021

> Thanks for the logs. This is a known issue where a machine's shared memory is not enough for PyTorch's usage. See this thread for possible solutions for increasing shared memory: https://discuss.pytorch.org/t/training-crashes-due-to-insufficient-shared-memory-shm-nn-dataparallel/26396/11, especially https://discuss.pytorch.org/t/training-crashes-due-to-insufficient-shared-memory-shm-nn-dataparallel/26396/13.
>
> I can confirm that this error has happened to me in the past, and I was able to fix it by following that suggestion and also increasing ulimit. Let me know whether this works.

Thanks for the reply!
I checked "sysctl kernel.shmmax" and mine is already set to 18446744073692774399, so I guess that is not where the problem is...
I also tried "watch -n .3 df -h". It turns out that I have 32G for /dev/shm, but the code stops when usage reaches only 24K... @apsdehal

@CCYChongyanChen
Author

  1. I am using MMF just for testing. Could I set training.num_workers=0, or is there an option like testing.num_workers=0?
  2. How do I change the test batch size in the configs? I set the training batch size to 4 in defaults.yaml, and it seems the testing batch size is the same as the training batch size.
  3. How can I know whether the dataset was downloaded completely? I ask because in #441 the shared memory error was resolved by redownloading the dataset.
  4. I only aim to examine one batch with a batch size of 4, so I think I should not need that much shared memory, nor the whole dataset.
  5. Is there anything else I can check to solve this issue?

Thank you a lot!

@CCYChongyanChen
Author

CCYChongyanChen commented Apr 16, 2021

I also tried #772; it runs into the same complaints about "nprocs=config.distributed.world_size" and the worker "pid". I tried changing the backend from NCCL to Gloo, but it still doesn't work.

@apsdehal
Contributor

  1. training.num_workers sets num_workers for all dataloaders (see the first sketch after this list).
  2. This is currently not possible; the training batch size is equal to the testing batch size. This feature will be added sometime in the future.
  3. You can shasum the downloaded tar file for the features (LMDB) and compare it with the shasum in the zoo to make sure they are the same (see the second sketch after this list).
  4. That is not the problem. LMDBs need the full dataset to load properly; otherwise they won't work.
  5. From your error logs it is pretty clear that the error is most probably due to shared memory. See https://discuss.pytorch.org/t/training-crashes-due-to-insufficient-shared-memory-shm-nn-dataparallel/26396/26 and https://stackoverflow.com/questions/58804022/how-to-resize-dev-shm.
  6. I would also suggest testing a smaller model, maybe CNNLSTM, with some changes in the config for VQA 2.0.
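
A concrete sketch for point 1, reusing the command from the logs above with the extra override appended at the end:

```bash
mmf_predict config=projects/vilbert/configs/vqa2/defaults.yaml \
    model=vilbert \
    dataset=vqa2 \
    run_type=test \
    checkpoint.resume_zoo=vilbert.pretrained.vqa2 \
    training.num_workers=0
```

And for point 3, a sketch of the checksum comparison. The archive path here is a hypothetical example; point it at the file you actually downloaded and compare the output against the checksum published in the zoo (use plain shasum instead if the zoo lists SHA-1):

```bash
shasum -a 256 ~/.cache/torch/mmf/data/datasets/vqa2/features.tar.gz
```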

@CCYChongyanChen
Author

Thanks for the reply!!! @apsdehal
For question 5, I checked "sysctl kernel.shmmax" and mine is already set to 18446744073692774399.
I also tried "watch -n .3 df -h". It turns out that I have 32G for /dev/shm, but the code stops when usage reaches only 24K.
How large would you suggest resizing /dev/shm to?

@apsdehal
Contributor

Try doubling it if you have that much memory?

@CCYChongyanChen
Author

CCYChongyanChen commented Apr 16, 2021

Fixed, see #441.
You can print self.base_path in feature_reader.py to see which dataset is not loaded correctly.
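
For anyone who lands here later, a sketch of that check from the shell. The glob below assumes MMF's default data cache location and that the lmdb Python package is installed; adjust the path to wherever your vqa2 features actually live:

```bash
# Try to open each downloaded LMDB read-only: a truncated or partially
# downloaded one should fail here instead of crashing a DataLoader
# worker mid-run.
for db in ~/.cache/torch/mmf/data/datasets/vqa2/*.lmdb; do
    echo "== $db =="
    python -c "import os, lmdb; p = '$db'; env = lmdb.open(p, subdir=os.path.isdir(p), readonly=True, lock=False); print('entries:', env.stat()['entries'])" \
        || echo "FAILED to open $db -- likely an incomplete download"
done
```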
