
[BUG] exits with return code = -11 #3989

Closed · zxjia2002 opened this issue Jul 19, 2023 · 6 comments
Assignees: jomayeri
Labels: bug (Something isn't working), training

Comments

zxjia2002 commented Jul 19, 2023

Describe the bug
I have run the code successfully on a machine with 4x 1080 Tis. However, when I ran the same code on a machine with 2x 3090s, DeepSpeed reported Killing subprocess right after [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect). In the end, exits with return code = -11 is printed.

ds_report output
(ds_report output attached as a screenshot; not transcribed here)

Screenshots

[2023-07-19 10:43:05,393] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-07-19 10:43:06,114] [WARNING] [runner.py:196:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-07-19 10:43:06,115] [INFO] [runner.py:555:main] cmd = /home/hitwh2021/anaconda3/envs/bit/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None src/clip_fine_tune_deepspeed.py --dataset CIRR --api-key HoWzEpTy4klumwh44YcBem6Ia --workspace keepandwin --experiment-name general --num-epoch 2 --clip-model-name RN50x4 --encoder both --learning-rate 2e-6 --batch-size 128 --transform targetpad --target-ratio 1.25 --save-training --save-best --validation-frequency 1 --deepspeed-config ./ds_config.json
[2023-07-19 10:43:06,881] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-07-19 10:43:07,262] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2023-07-19 10:43:07,262] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=2, node_rank=0
[2023-07-19 10:43:07,262] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2023-07-19 10:43:07,262] [INFO] [launch.py:163:main] dist_world_size=2
[2023-07-19 10:43:07,262] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1
[2023-07-19 10:43:08,434] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-07-19 10:43:08,442] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-07-19 10:43:09,278] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 32047
[2023-07-19 10:43:09,278] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 32048
[2023-07-19 10:43:09,286] [ERROR] [launch.py:321:sigkill_handler] ['/home/hitwh2021/anaconda3/envs/bit/bin/python', '-u', 'src/clip_fine_tune_deepspeed.py', '--local_rank=1', '--dataset', 'CIRR', '--api-key', 'HoWzEpTy4klumwh44YcBem6Ia', '--workspace', 'keepandwin', '--experiment-name', 'general', '--num-epoch', '2', '--clip-model-name', 'RN50x4', '--encoder', 'both', '--learning-rate', '2e-6', '--batch-size', '128', '--transform', 'targetpad', '--target-ratio', '1.25', '--save-training', '--save-best', '--validation-frequency', '1', '--deepspeed-config', './ds_config.json'] exits with return code = -11

System info (please complete the following information):

  • OS: Ubuntu 18.04
  • GPU count and types: a machine with 2x 3090s
  • Python version: 3.8.16
@zxjia2002 zxjia2002 added bug Something isn't working training labels Jul 19, 2023
@jomayeri jomayeri self-assigned this Jul 19, 2023
@jomayeri
Contributor

@KeepAndWin, unfortunately I cannot repro this error because I do not have access to those specific GPU types. Most likely there is something incompatible with CUDA or the built-in ops on that second device. I suggest trying basic PyTorch + CUDA first to ensure that works.
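
For reference, a return code of -11 from the launcher means the worker process was killed by signal 11 (SIGSEGV). A minimal sanity check along the lines suggested above, not taken from this thread and with arbitrary tensor sizes, might look like:

```python
# Hedged sketch: confirm plain PyTorch + CUDA works on both 3090s before involving
# DeepSpeed. Run it directly with `python`, without the deepspeed launcher.
import torch

print("torch", torch.__version__, "cuda", torch.version.cuda)
assert torch.cuda.is_available(), "CUDA is not visible to PyTorch"

for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))
    x = torch.randn(1024, 1024, device=f"cuda:{i}")
    y = x @ x                  # launch a simple kernel on each GPU
    torch.cuda.synchronize(i)  # surface any CUDA error here rather than later
    print(f"cuda:{i} matmul OK, mean={y.mean().item():.4f}")
```

If this also crashes, the problem lies below DeepSpeed (driver, CUDA, or PyTorch build); if it passes, the built-in ops or the launch setup are the more likely culprits.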

@jeffra
Collaborator

jeffra commented Jul 20, 2023

@KeepAndWin please see this thread for the latest discussion on this: #4002

@jeffra
Collaborator

jeffra commented Jul 21, 2023

It seems both -7 and -11 are related to shared-memory issues with Docker. Please see this reply, which has fixed other people's recent issues: #4002 (comment)
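
For readers who cannot follow the link: the shared-memory problem referred to here is commonly Docker's default 64 MiB /dev/shm, which NCCL and DataLoader workers can exhaust. A small hedged check, assuming the training runs inside a container, is:

```python
# Hedged sketch: report the shared-memory space visible to the training process.
import shutil

total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm: total {total / 2**30:.2f} GiB, free {free / 2**30:.2f} GiB")
# If this is only a few MiB, relaunch the container with a larger --shm-size
# (or --ipc=host) as described in the linked reply, then retry the deepspeed command.
```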

@jomayeri
Contributor

jomayeri commented Aug 4, 2023

Closing for now.

@jomayeri jomayeri closed this as completed Aug 4, 2023
@Anonymousplendid

> Closing for now.

So how do you solve the issue?

@Coderella-z

> Closing for now.

> So how do you solve the issue?

Have you solved this problem? I ran into it as well.
