
DataLoader worker (pid 2991): Bus error. #24

Open
Mirandl opened this issue Mar 16, 2022 · 3 comments
@Mirandl

Mirandl commented Mar 16, 2022

Hi, thank you for your great work!
When I run your code, I get this error:
```
Running TCMR on each person tracklet...
  0%|          | 0/5 [00:00<?, ?it/s]ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
  0%|          | 0/5 [00:02<?, ?it/s]
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 779, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/usr/lib/python3.6/multiprocessing/queues.py", line 104, in get
    if not self._poll(timeout):
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 414, in _poll
    r = wait([self], timeout)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 911, in wait
    ready = selector.select(timeout)
  File "/usr/lib/python3.6/selectors.py", line 376, in select
    fd_event_list = self._poll.poll(timeout)
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 2991) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/data/meilin/TCMR/demo.py", line 377, in <module>
    main(args)
  File "/root/data/meilin/TCMR/demo.py", line 157, in main
    for i, batch in enumerate(crop_dataloader):
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 363, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 974, in _next_data
    idx, data = self._get_data()
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 941, in _get_data
    success, data = self._try_get_data()
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 792, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str))
RuntimeError: DataLoader worker (pid(s) 2991) exited unexpectedly

Process finished with exit code 1
```
It seems num_workers needs to be adjusted, but changing it didn't help.
Could you give me some guidance on this? Thank you!
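The error message itself points at shared memory rather than num_workers: DataLoader workers exchange tensors through the `/dev/shm` tmpfs mount, so a first diagnostic step (a sketch, assuming a Linux system) is to check how large that mount actually is:

```shell
# DataLoader workers pass batches through /dev/shm; if this mount is
# tiny, workers die with exactly this "Bus error".
df -h /dev/shm
```

If the size shown is small (tens of MB), the bus error is almost certainly the shared-memory limit, no matter how num_workers is tuned.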

@hongsukchoi (Owner)

The shared-memory error message usually indicates that RAM (CPU memory) is insufficient.
As I remember, the experiment normally took around 50GB.

Try increasing RAM (more/bigger RAM modules), or, though not recommended, create swap space on disk.


Mirandl commented Mar 17, 2022

Hi, thank you very much for your timely reply.

I have tried this, but it still produces the same error.
My RAM is 61GB and my shared memory is 64MB. I am using 21 CPUs and 1 GPU.
Should I keep increasing them?
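The 64MB shared-memory figure is the telling detail: that is Docker's default `/dev/shm` allocation, and it is far too small for a multi-worker DataLoader regardless of how much RAM the host has. A sketch of the two common fixes (the flag and mount options follow standard Docker and Linux usage; sizes are illustrative):

```shell
# Inside Docker, recreate the container with a larger /dev/shm, e.g.:
#   docker run --shm-size=16g <image> ...
#
# On a host you control, /dev/shm can instead be remounted larger
# (requires root):
#   sudo mount -o remount,size=16g /dev/shm

# Either way, verify the new size afterwards:
df -h /dev/shm
```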

@hongsukchoi (Owner)

First check the exact required memory with htop. I guess at least 128GB is safe!
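For a non-interactive check while the demo runs, the same numbers htop shows can be read with standard tools (a sketch; run these in a second terminal during the experiment):

```shell
# Snapshot of RAM usage; the 'shared' column includes /dev/shm contents.
free -h

# How full the shared-memory mount itself is.
df -h /dev/shm
```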
