
shared memory limit #441

Closed
qizhust opened this issue Aug 1, 2020 · 6 comments
Labels
needs more info (Needs more information from the user.)

Comments

qizhust commented Aug 1, 2020

Hi, has anyone else run into this error?

RuntimeError: DataLoader worker (pid 263230) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.

sysctl kernel.shmmax
kernel.shmmax = 68719476736
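
For context on the check above: kernel.shmmax limits System V shared-memory segments, whereas PyTorch's DataLoader workers typically pass tensors back through /dev/shm (a tmpfs mount), so it is usually /dev/shm that runs out. A minimal Python sketch for checking it; the 1 GiB threshold is only an illustrative assumption:

import shutil

# DataLoader worker processes hand batches back through /dev/shm (tmpfs),
# so its free space matters more than the System V kernel.shmmax limit.
total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm: total={total / 2**30:.1f} GiB, free={free / 2**30:.1f} GiB")

if free < 1 * 2**30:  # arbitrary illustrative threshold
    print("Very little shared memory left; DataLoader workers may die with bus errors.")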
vedanuj (Contributor) commented Aug 1, 2020

Can you provide more details, such as exactly which model/dataset you were using? Also, please provide your environment details as instructed in the Bugs issue template.

vedanuj added the needs more info label Aug 1, 2020
qizhust (Author) commented Aug 1, 2020

Hi, I tried Visual BERT and BUTD; both hit the same problem as reported above.

CUDA_VISIBLE_DEVICES=0,1,2,3 mmf_run config=projects/butd/configs/coco/defaults.yaml \
    model=butd \
    dataset=coco \
    run_type=train

16 GB per GPU, PyTorch 1.5, Python 3.8
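
If raising the shared-memory limit is not possible (for example on a managed cluster or inside a container), a common mitigation is to reduce the number of DataLoader workers, since it is the worker processes that return batches through shared memory; with num_workers=0 loading happens in the main process and /dev/shm is not needed for batch transfer. A generic PyTorch sketch, not the MMF coco pipeline (the tensors below are stand-ins, and the corresponding MMF config option is not shown in this thread):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset, only to illustrate the DataLoader setting.
dataset = TensorDataset(torch.randn(64, 3, 224, 224), torch.randint(0, 10, (64,)))

# num_workers=0 keeps data loading in the main process, avoiding the
# shared-memory path that worker processes rely on.
loader = DataLoader(dataset, batch_size=8, num_workers=0)

for images, labels in loader:
    pass  # training step would go here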

qizhust (Author) commented Aug 3, 2020

More info about the error:

2020-08-03T10:20:25 | mmf.utils.general: Total Parameters: 54731884. Trained Parameters: 54731884
2020-08-03T10:20:25 | mmf.trainers.core.training_loop: Starting training...
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Traceback (most recent call last):
  File "/project/RDS-FEI-qiz-RW/envs/pt15-tv06-py38-cu102/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 761, in _try_get_data                                            
    data = self._data_queue.get(timeout=timeout)
  File "/usr/local/python/3.8.2/lib/python3.8/multiprocessing/queues.py", line 107, in get
    if not self._poll(timeout):
  File "/usr/local/python/3.8.2/lib/python3.8/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/usr/local/python/3.8.2/lib/python3.8/multiprocessing/connection.py", line 424, in _poll                                                                                             
    r = wait([self], timeout)
  File "/usr/local/python/3.8.2/lib/python3.8/multiprocessing/connection.py", line 930, in wait
    ready = selector.select(timeout)
  File "/usr/local/python/3.8.2/lib/python3.8/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
  File "/project/RDS-FEI-qiz-RW/envs/pt15-tv06-py38-cu102/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler                                       
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 115340) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/project/RDS-FEI-qiz-RW/envs/pt15-tv06-py38-cu102/bin/mmf_run", line 33, in <module>
    sys.exit(load_entry_point('mmf', 'console_scripts', 'mmf_run')())
  File "/project/RDS-FEI-qiz-RW/reproduce/VQA/mmf/mmf_cli/run.py", line 122, in run
    main(configuration, predict=predict)
  File "/project/RDS-FEI-qiz-RW/reproduce/VQA/mmf/mmf_cli/run.py", line 56, in main
    trainer.train()
  File "/project/RDS-FEI-qiz-RW/reproduce/VQA/mmf/mmf/trainers/mmf_trainer.py", line 108, in train                                                                                           
    self.training_loop()
  File "/project/RDS-FEI-qiz-RW/reproduce/VQA/mmf/mmf/trainers/core/training_loop.py", line 36, in training_loop                                                                             
    self.run_training_epoch()
  File "/project/RDS-FEI-qiz-RW/reproduce/VQA/mmf/mmf/trainers/core/training_loop.py", line 67, in run_training_epoch                                                                        
    for batch in self.train_loader:
  File "/project/RDS-FEI-qiz-RW/envs/pt15-tv06-py38-cu102/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 345, in __next__                                                 
    data = self._next_data()
  File "/project/RDS-FEI-qiz-RW/envs/pt15-tv06-py38-cu102/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 841, in _next_data                                               
    idx, data = self._get_data()
  File "/project/RDS-FEI-qiz-RW/envs/pt15-tv06-py38-cu102/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 808, in _get_data                                                
    success, data = self._try_get_data()
  File "/project/RDS-FEI-qiz-RW/envs/pt15-tv06-py38-cu102/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 774, in _try_get_data                                            
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str))
RuntimeError: DataLoader worker (pid(s) 115340) exited unexpectedly

qizhust (Author) commented Aug 3, 2020

It is caused by broken data.

qizhust closed this as completed Aug 3, 2020
apsdehal (Contributor) commented Aug 3, 2020

@qizhust Can you explain in more detail what you mean by broken data? It would be helpful for future users.

qizhust (Author) commented Aug 3, 2020

Yep, the dataset download was interrupted unexpectedly, so the files were incomplete; the following operation then triggered a core dump:

with self.env.begin(write=False, buffers=True) as txn:
    self.image_ids = pickle.loads(txn.get(b"keys"))
    self.image_id_indices = {
        self.image_ids[i]: i for i in range(0, len(self.image_ids))
    }

It would be better if the download process supported resuming after an interruption, or if the integrity of downloaded files could be verified :)
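
For future readers, a rough way to sanity-check a downloaded LMDB feature file before training might look like the sketch below. This is only an illustration modelled on the snippet above: the file name is hypothetical, and the lmdb/pickle usage mirrors what the dataset code appears to do.

import os
import pickle

import lmdb  # assumption: the features are stored in an LMDB database, as the snippet above suggests

def check_lmdb(path):
    # Try to open the database and read its key index; a partially
    # downloaded file usually fails here instead of crashing a
    # DataLoader worker with a bus error later on.
    try:
        env = lmdb.open(
            path,
            subdir=os.path.isdir(path),
            readonly=True,
            lock=False,
            readahead=False,
        )
        with env.begin(write=False, buffers=True) as txn:
            keys = pickle.loads(txn.get(b"keys"))
        print(f"{path}: OK, {len(keys)} entries")
        return True
    except Exception as err:  # e.g. lmdb.Error, pickle errors, TypeError if b"keys" is missing
        print(f"{path}: looks broken ({err})")
        return False

check_lmdb("coco_train2014.lmdb")  # hypothetical file name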
