Unexpected bus error. This might be caused by insufficient shared memory #877
Comments
Hi,
Sure! I attached it here. Thank you apsdehal!!!
(MMF2) cc67459@soi-edge:~/MMF2/mmf_8_2/mmf$ mmf_predict config=projects/vilbert/configs/vqa2/defaults.yaml \
/home/cc67459/MMF2/lib/python3.7/site-packages/omegaconf/dictconfig.py:252: UserWarning: Keys with dot (model.bert) are deprecated and will have different semantic meaning the next major version of OmegaConf (2.1)
WARNING 2021-04-15T13:01:39 | py.warnings: /home/cc67459/MMF2/mmf_8_2/mmf/mmf/utils/distributed.py:272: UserWarning: Keys with dot (model.bert) are deprecated and will have different semantic meaning the next major version of OmegaConf (2.1)
2021-04-15T13:01:39 | mmf.utils.checkpoint: Loading checkpoint
WARNING 2021-04-15T13:01:41 | py.warnings: /home/cc67459/MMF2/mmf_8_2/mmf/mmf/utils/distributed.py:272: UserWarning: 'optimizer' key is not present in the checkpoint asked to be loaded. Skipping.
2021-04-15T13:01:41 | mmf.utils.checkpoint: Checkpoint loaded
-- Process 3 terminated with the following error:
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
Thanks for the logs. This is a known general issue where a machine's shared memory is not enough for PyTorch's usage. See this thread for possible ways of increasing shared memory: https://discuss.pytorch.org/t/training-crashes-due-to-insufficient-shared-memory-shm-nn-dataparallel/26396/11, especially https://discuss.pytorch.org/t/training-crashes-due-to-insufficient-shared-memory-shm-nn-dataparallel/26396/13. I can confirm that this error has happened to me in the past and I was able to fix it by following that suggestion and also by increasing ulimit. Let me know whether this works.
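For anyone else hitting this, a common code-level workaround for this class of error is to change PyTorch's tensor-sharing strategy so DataLoader workers stop passing tensors through /dev/shm file descriptors. This is a standalone sketch, not an official MMF fix, so treat it accordingly:

```python
# Hypothetical snippet, not part of MMF: switch the multiprocessing sharing
# strategy before any DataLoader workers are spawned.
import torch.multiprocessing as mp

# 'file_system' shares tensors through files on disk instead of shared-memory
# file descriptors, which sidesteps a small /dev/shm at some speed cost.
mp.set_sharing_strategy("file_system")
```

The OS-level alternatives are the ones mentioned above: enlarging /dev/shm (for example by remounting it with a larger size, or passing --shm-size when running inside Docker) and raising the open-file limit with ulimit -n.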
Thanks for the reply!
Thank you a lot!
I also tried #772, which also complains about an issue with "nprocs=config.distributed.world_size" and "pid". I tried to change the backend from NCCL to Gloo, but it still doesn't work.
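For what it's worth, switching the backend in plain PyTorch boils down to the call below. This is a standalone sketch, not how MMF wires up its launcher, so the exact config key you need inside MMF may differ:

```python
# Standalone sketch (not MMF's launcher): spawning workers with the Gloo
# backend instead of NCCL.
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank: int, world_size: int) -> None:
    # Gloo runs on CPU-only machines and avoids NCCL's shared-memory transport.
    dist.init_process_group(
        backend="gloo",
        init_method="tcp://127.0.0.1:29500",
        rank=rank,
        world_size=world_size,
    )
    dist.barrier()
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```

Because Gloo also works without GPUs, it can be a useful sanity check when NCCL's shared-memory transport is the suspect.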
|
Thanks for the reply!!! @apsdehal
Try doubling it if you have that much memory? |
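Before doubling it, it may help to check how much shared memory the machine currently exposes. A quick Linux-only check (assuming /dev/shm is the tmpfs mount, as on standard Ubuntu):

```python
# Report the size of the shared-memory mount on Linux.
import shutil

total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm total={total / 2**30:.1f} GiB, free={free / 2**30:.1f} GiB")
```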
Fixed, see #441. |
❓ Questions and Help
Hi there, when running Pythia and ViLBERT for testing with the pretrained models, I hit the "DataLoader worker (pid(s) {}) exited unexpectedly" error.
I am not sure why it happens. I only tried to predict one batch and changed the batch size to 4. Why does that still require a large amount of shared memory? (I am not sure how to specify the test batch size; I just changed the training batch size to 4 in defaults.yaml.)
If I set training.num_workers=0, it complains with a core dump error or "process 0 terminated with signal SIGBUS". Should I use something like "testing.num_workers"?
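For reference, here is a minimal PyTorch-only sketch (independent of MMF; the dataset and shapes are made up) of why DataLoader workers touch shared memory even for tiny batches, and why num_workers=0 sidesteps it:

```python
# Minimal, MMF-independent sketch: with num_workers > 0, each batch is built in
# a worker process and handed back through shared memory, so even batch_size=4
# needs /dev/shm headroom; num_workers=0 builds batches in the main process.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(64, 3, 224, 224))

# Worker processes: batches cross a shared-memory boundary on their way back.
loader = DataLoader(dataset, batch_size=4, num_workers=2)

# No workers: no shared-memory hand-off (slower, but avoids SIGBUS from a full /dev/shm).
safe_loader = DataLoader(dataset, batch_size=4, num_workers=0)

for (batch,) in safe_loader:
    print(batch.shape)  # torch.Size([4, 3, 224, 224])
    break
```

Each worker serializes its batch through shared memory regardless of the batch size, which is why a small or nearly full /dev/shm can still trigger SIGBUS even at batch size 4.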
The GPUs I have are two GeForce GTX cards (P8, 11178 MB each), and the system is Ubuntu 16.
I have checked issue #732 as well as #441 (how can I know whether the dataset was downloaded correctly?).
Thank you a lot for your help! Really appreciate it!!!