
[BUG] Have "exits with return code = -7" without any error message #4002

Closed
ezekielqu opened this issue Jul 20, 2023 · 4 comments
Labels
bug (Something isn't working) · deepspeed-chat (Related to DeepSpeed-Chat)

Comments

@ezekielqu

Describe the bug
When I follow the tutorial to train an opt-1.3b model from step 1, I run into this problem.
Log output
[2023-07-20 08:31:04,936] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-07-20 08:31:05,681] [WARNING] [runner.py:196:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-07-20 08:31:05,714] [INFO] [runner.py:555:main] cmd = /root/anaconda3/envs/deepspeed/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None main.py --data_path Dahoas/rm-static Dahoas/full-hh-rlhf Dahoas/synthetic-instruct-gptj-pairwise yitingxie/rlhf-reward-datasets --data_split 2,4,4 --model_name_or_path facebook/opt-1.3b --per_device_train_batch_size 8 --per_device_eval_batch_size 8 --max_seq_len 512 --learning_rate 9.65e-6 --weight_decay 0. --num_train_epochs 16 --gradient_accumulation_steps 1 --lr_scheduler_type cosine --num_warmup_steps 0 --seed 1234 --zero_stage 2 --deepspeed --output_dir /dsDocker/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b
[2023-07-20 08:31:07,959] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-07-20 08:31:08,748] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.13.4-1+cuda11.7
[2023-07-20 08:31:08,749] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.13.4-1
[2023-07-20 08:31:08,749] [INFO] [launch.py:138:main] 0 NCCL_VERSION=2.13.4-1
[2023-07-20 08:31:08,749] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
[2023-07-20 08:31:08,749] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE=libnccl2=2.13.4-1+cuda11.7
[2023-07-20 08:31:08,749] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE_NAME=libnccl2
[2023-07-20 08:31:08,749] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE_VERSION=2.13.4-1
[2023-07-20 08:31:08,749] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2023-07-20 08:31:08,749] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=4, node_rank=0
[2023-07-20 08:31:08,749] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2023-07-20 08:31:08,749] [INFO] [launch.py:163:main] dist_world_size=4
[2023-07-20 08:31:08,749] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
[2023-07-20 08:31:11,347] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-07-20 08:31:11,353] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-07-20 08:31:11,354] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-07-20 08:31:11,366] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-07-20 08:31:12,692] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-07-20 08:31:12,693] [INFO] [comm.py:616:init_distributed] cdb=None
[2023-07-20 08:31:12,705] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-07-20 08:31:12,705] [INFO] [comm.py:616:init_distributed] cdb=None
[2023-07-20 08:31:12,737] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-07-20 08:31:12,738] [INFO] [comm.py:616:init_distributed] cdb=None
[2023-07-20 08:31:12,738] [INFO] [comm.py:643:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-07-20 08:31:12,741] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-07-20 08:31:12,741] [INFO] [comm.py:616:init_distributed] cdb=None
[2023-07-20 08:31:15,760] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1621
[2023-07-20 08:31:15,764] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1622
[2023-07-20 08:31:15,765] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1623
[2023-07-20 08:31:15,765] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1624
[2023-07-20 08:31:15,767] [ERROR] [launch.py:321:sigkill_handler] ['/root/anaconda3/envs/deepspeed/bin/python', '-u', 'main.py', '--local_rank=3', '--data_path', 'Dahoas/rm-static', 'Dahoas/full-hh-rlhf', 'Dahoas/synthetic-instruct-gptj-pairwise', 'yitingxie/rlhf-reward-datasets', '--data_split', '2,4,4', '--model_name_or_path', 'facebook/opt-1.3b', '--per_device_train_batch_size', '8', '--per_device_eval_batch_size', '8', '--max_seq_len', '512', '--learning_rate', '9.65e-6', '--weight_decay', '0.', '--num_train_epochs', '16', '--gradient_accumulation_steps', '1', '--lr_scheduler_type', 'cosine', '--num_warmup_steps', '0', '--seed', '1234', '--zero_stage', '2', '--deepspeed', '--output_dir', '/dsDocker/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b'] exits with return code = -7

To Reproduce
The command I use is "python3 train.py --step 1 --deployment-type single_node", without changing anything, and all the requirements are installed.

ds_report output
ds

System info (please complete the following information):

  • OS: Docker with Ubuntu 20.04
  • GPU count and types: one machine with 4x 2080Ti
  • DeepSpeed version: newest pull from GitHub
  • Python version: 3.10

Docker context
docker image: nvidia/cuda:11.7.1-cudnn8-devel-ubuntu20.04

@ezekielqu added the bug and deepspeed-chat labels on Jul 20, 2023
@jeffra
Collaborator

jeffra commented Jul 20, 2023

I'm seeing 3 issues that seem somewhat similar here, which is concerning. Lemme link them here: #4000, #3989

Can you all try running a simpler example that still uses DeepSpeed, such as this script that just does a few all-reduces across GPUs:

https://gist.github.com/jeffra/b5e80466b4c86be00ea3b6f130fb7a36

I am curious whether this runs successfully or gives a similar error.
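
For reference, a minimal test along these lines could look like the sketch below. This is my own illustration of the idea, not the contents of the gist; the file name all_reduce_smoke_test.py, the loop count, and the tensor size are made up. It is launched with the deepspeed launcher, one process per GPU, and performs a few NCCL all-reduces:

# all_reduce_smoke_test.py -- minimal sketch: a few NCCL all-reduces across all visible GPUs.
# Launch with: deepspeed all_reduce_smoke_test.py
import os
import torch
import deepspeed

deepspeed.init_distributed()  # sets up torch.distributed with the NCCL backend
local_rank = int(os.environ.get("LOCAL_RANK", 0))  # set per process by the deepspeed launcher
torch.cuda.set_device(local_rank)
rank = torch.distributed.get_rank()

for i in range(5):
    x = torch.ones(1024 * 1024, device="cuda") * (rank + 1)
    torch.distributed.all_reduce(x)  # default op is SUM across all ranks
    torch.cuda.synchronize()
    if rank == 0:
        print(f"all_reduce {i} ok, x[0] = {x[0].item()}")

If something this small hangs or dies the same way, the problem is in the communication/launch setup rather than in DeepSpeed-Chat itself.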

Also, another question: are you all running within Docker? It seems @ezekielqu is, but what about @opprash and @KeepAndWin?

@jeffra
Collaborator

jeffra commented Jul 20, 2023

Also, talking with @jomayeri a bit offline, it sounds like increasing Docker shared memory might help with this as well. One way to bump that up is by passing something like --shm-size="2gb" to your docker run command. The default is pretty small and can sometimes cause issues like this.

/cc @opprash and @KeepAndWin
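
For context on why this is plausible: a return code of -7 from the launcher means the worker process was killed by signal 7 (SIGBUS on Linux), which is consistent with running out of shared memory. A quick way to verify the shared-memory limit from inside the container is a small check like the one below (a sketch of my own; the file name shm_check.py is made up). Docker's default /dev/shm is 64 MiB unless --shm-size is passed:

# shm_check.py -- quick sketch: report the size of /dev/shm inside the container.
# Docker defaults /dev/shm to 64 MiB, which is easily exhausted by multi-worker
# data loaders and NCCL's shared-memory transport.
import shutil

total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm total: {total / 2**20:.0f} MiB, "
      f"used: {used / 2**20:.0f} MiB, free: {free / 2**20:.0f} MiB")

If this reports roughly 64 MiB, restarting the container with a larger --shm-size (or with --ipc=host) should help.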

@ezekielqu
Author

Thank you for your advice. I checked the default Docker shm and found it's only 64M. When I increased it to 64g, the script ran well. I also tried "deepspeed all_reduce_bench_v2.py", and it exited successfully. Thanks again for your answer.

@opprash

opprash commented Jul 21, 2023 via email
