
shared memory limit #441

Closed
qizhust opened this issue Aug 1, 2020 · 6 comments
Labels
needs more info (Needs more information from the user.)

Comments

qizhust commented Aug 1, 2020

Hi, has anyone else run into this error?

RuntimeError: DataLoader worker (pid 263230) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.

sysctl kernel.shmmax
kernel.shmmax = 68719476736
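
For context on the check above: kernel.shmmax limits System V shared-memory segments, whereas PyTorch's DataLoader workers typically pass tensors back through /dev/shm (a tmpfs mount), so it is usually /dev/shm that runs out. A minimal Python sketch for checking it; the 1 GiB threshold is only an illustrative assumption:

import shutil

# DataLoader worker processes hand batches back through /dev/shm (tmpfs),
# so its free space matters more than the System V kernel.shmmax limit.
total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm: total={total / 2**30:.1f} GiB, free={free / 2**30:.1f} GiB")

if free < 1 * 2**30:  # arbitrary illustrative threshold
    print("Very little shared memory left; DataLoader workers may die with bus errors.")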
vedanuj (Contributor) commented Aug 1, 2020

Can you provide more details, such as exactly which model/dataset you were using? Also, please provide your environment details as instructed in the Bugs issue template.

vedanuj added the needs more info label Aug 1, 2020
qizhust (Author) commented Aug 1, 2020

Hi, I tried Visual BERT and BUTD; both hit the same problem as reported above.

CUDA_VISIBLE_DEVICES=0,1,2,3 mmf_run config=projects/butd/configs/coco/defaults.yaml \
    model=butd \
    dataset=coco \
    run_type=train

16 GB per GPU, PyTorch 1.5, Python 3.8
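
If raising the shared-memory limit is not possible (for example on a managed cluster or inside a container), a common mitigation is to reduce the number of DataLoader workers, since it is the worker processes that return batches through shared memory; with num_workers=0 loading happens in the main process and /dev/shm is not needed for batch transfer. A generic PyTorch sketch, not the MMF coco pipeline (the tensors below are stand-ins, and the corresponding MMF config option is not shown in this thread):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset, only to illustrate the DataLoader setting.
dataset = TensorDataset(torch.randn(64, 3, 224, 224), torch.randint(0, 10, (64,)))

# num_workers=0 keeps data loading in the main process, avoiding the
# shared-memory path that worker processes rely on.
loader = DataLoader(dataset, batch_size=8, num_workers=0)

for images, labels in loader:
    pass  # training step would go here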

qizhust (Author) commented Aug 3, 2020

More info about the error:

2020-08-03T10:20:25 | mmf.utils.general: Total Parameters: 54731884. Trained Parameters: 54731884
2020-08-03T10:20:25 | mmf.trainers.core.training_loop: Starting training...
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Traceback (most recent call last):
  File "/project/RDS-FEI-qiz-RW/envs/pt15-tv06-py38-cu102/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 761, in _try_get_data                                            
    data = self._data_queue.get(timeout=timeout)
  File "/usr/local/python/3.8.2/lib/python3.8/multiprocessing/queues.py", line 107, in get
    if not self._poll(timeout):
  File "/usr/local/python/3.8.2/lib/python3.8/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/usr/local/python/3.8.2/lib/python3.8/multiprocessing/connection.py", line 424, in _poll                                                                                             
    r = wait([self], timeout)
  File "/usr/local/python/3.8.2/lib/python3.8/multiprocessing/connection.py", line 930, in wait
    ready = selector.select(timeout)
  File "/usr/local/python/3.8.2/lib/python3.8/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
  File "/project/RDS-FEI-qiz-RW/envs/pt15-tv06-py38-cu102/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler                                       
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 115340) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/project/RDS-FEI-qiz-RW/envs/pt15-tv06-py38-cu102/bin/mmf_run", line 33, in <module>
    sys.exit(load_entry_point('mmf', 'console_scripts', 'mmf_run')())
  File "/project/RDS-FEI-qiz-RW/reproduce/VQA/mmf/mmf_cli/run.py", line 122, in run
    main(configuration, predict=predict)
  File "/project/RDS-FEI-qiz-RW/reproduce/VQA/mmf/mmf_cli/run.py", line 56, in main
    trainer.train()
  File "/project/RDS-FEI-qiz-RW/reproduce/VQA/mmf/mmf/trainers/mmf_trainer.py", line 108, in train                                                                                           
    self.training_loop()
  File "/project/RDS-FEI-qiz-RW/reproduce/VQA/mmf/mmf/trainers/core/training_loop.py", line 36, in training_loop                                                                             
    self.run_training_epoch()
  File "/project/RDS-FEI-qiz-RW/reproduce/VQA/mmf/mmf/trainers/core/training_loop.py", line 67, in run_training_epoch                                                                        
    for batch in self.train_loader:
  File "/project/RDS-FEI-qiz-RW/envs/pt15-tv06-py38-cu102/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 345, in __next__                                                 
    data = self._next_data()
  File "/project/RDS-FEI-qiz-RW/envs/pt15-tv06-py38-cu102/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 841, in _next_data                                               
    idx, data = self._get_data()
  File "/project/RDS-FEI-qiz-RW/envs/pt15-tv06-py38-cu102/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 808, in _get_data                                                
    success, data = self._try_get_data()
  File "/project/RDS-FEI-qiz-RW/envs/pt15-tv06-py38-cu102/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 774, in _try_get_data                                            
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str))
RuntimeError: DataLoader worker (pid(s) 115340) exited unexpectedly

qizhust (Author) commented Aug 3, 2020

It is caused by broken data.

qizhust closed this as completed Aug 3, 2020
apsdehal (Contributor) commented Aug 3, 2020

@qizhust Can you explain in more detail what you mean by broken data? It would be helpful for future users.

qizhust (Author) commented Aug 3, 2020

Yep, the dataset download was interrupted unexpectedly, so the files were incomplete; the following operation then triggered a core dump:

with self.env.begin(write=False, buffers=True) as txn:
    self.image_ids = pickle.loads(txn.get(b"keys"))
    self.image_id_indices = {
        self.image_ids[i]: i for i in range(0, len(self.image_ids))
    }

It would be better if the download process supported resuming after an interruption, or if the integrity of downloaded files could be verified :)
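
For future readers, a rough way to sanity-check a downloaded LMDB feature file before training might look like the sketch below. This is only an illustration modelled on the snippet above: the file name is hypothetical, and the lmdb/pickle usage mirrors what the dataset code appears to do.

import os
import pickle

import lmdb  # assumption: the features are stored in an LMDB database, as the snippet above suggests

def check_lmdb(path):
    # Try to open the database and read its key index; a partially
    # downloaded file usually fails here instead of crashing a
    # DataLoader worker with a bus error later on.
    try:
        env = lmdb.open(
            path,
            subdir=os.path.isdir(path),
            readonly=True,
            lock=False,
            readahead=False,
        )
        with env.begin(write=False, buffers=True) as txn:
            keys = pickle.loads(txn.get(b"keys"))
        print(f"{path}: OK, {len(keys)} entries")
        return True
    except Exception as err:  # e.g. lmdb.Error, pickle errors, TypeError if b"keys" is missing
        print(f"{path}: looks broken ({err})")
        return False

check_lmdb("coco_train2014.lmdb")  # hypothetical file name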
