[BUG] Have "exits with return code = -7" without any error message #4002
Comments
I'm seeing 3 issues that seem somewhat similar here, which is concerning. Lemme link them here: #4000, #3989. Can you all try running a simpler example that still uses deepspeed, such as this script that just does a few all-reduces across gpus: https://gist.github.com/jeffra/b5e80466b4c86be00ea3b6f130fb7a36 I am curious if this successfully runs or gives a similar error? Also another question, are you all running within docker? It seems @ezekielqu is, but what about @opprash and @KeepAndWin? |
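For reference, a minimal sketch of such an all-reduce smoke test, not the linked gist itself; the script name, tensor size, and iteration count below are illustrative, and it assumes the script is launched with the DeepSpeed launcher (e.g. "deepspeed allreduce_test.py") so that LOCAL_RANK is set for each process:

import os
import torch
import deepspeed
import deepspeed.comm as dist  # mirrors the torch.distributed API

# set up the NCCL process group from the launcher-provided environment variables
deepspeed.init_distributed()
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# run a few all-reduces on a 1M-element tensor, summed across all ranks
x = torch.ones(1024 * 1024, device="cuda")
for _ in range(5):
    dist.all_reduce(x)
torch.cuda.synchronize()
if dist.get_rank() == 0:
    print("all-reduce finished, x[0] =", x[0].item())

If this small test also dies with return code -7 while the GPUs themselves are healthy, that points at the communication setup (e.g. shared memory in the container) rather than the training script.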
Also, talking with @jomayeri a bit offline, it sounds like increasing docker shared memory might help with this as well. One way to bump that up is by passing a larger shared-memory size when starting the container. /cc @opprash and @KeepAndWin |
Thank you for your advice. I checked the default docker shm and found it's only 64M. When I increased it to 64g the script ran fine. I also tried "deepspeed all_reduce_bench_v2.py", and it exited successfully. Thanks for your answer. |
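For readers hitting the same failure, the check described above can be reproduced with a single command inside the container (df is standard coreutils; 64M is Docker's default size for /dev/shm):

df -h /dev/shm    # shows the size of the shared-memory mount; 64M is Docker's default

A larger size is requested with Docker's --shm-size flag when the container is started; a full launch example for the image used in this report is sketched under the Docker context section below.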
Thank you so much; this reply managed to help me out. |
Describe the bug
When I follow the tutorial and train an opt-1.3b model from step 1, I hit this problem.
Log output
[2023-07-20 08:31:04,936] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-07-20 08:31:05,681] [WARNING] [runner.py:196:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-07-20 08:31:05,714] [INFO] [runner.py:555:main] cmd = /root/anaconda3/envs/deepspeed/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None main.py --data_path Dahoas/rm-static Dahoas/full-hh-rlhf Dahoas/synthetic-instruct-gptj-pairwise yitingxie/rlhf-reward-datasets --data_split 2,4,4 --model_name_or_path facebook/opt-1.3b --per_device_train_batch_size 8 --per_device_eval_batch_size 8 --max_seq_len 512 --learning_rate 9.65e-6 --weight_decay 0. --num_train_epochs 16 --gradient_accumulation_steps 1 --lr_scheduler_type cosine --num_warmup_steps 0 --seed 1234 --zero_stage 2 --deepspeed --output_dir /dsDocker/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b
[2023-07-20 08:31:07,959] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-07-20 08:31:08,748] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.13.4-1+cuda11.7
[2023-07-20 08:31:08,749] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.13.4-1
[2023-07-20 08:31:08,749] [INFO] [launch.py:138:main] 0 NCCL_VERSION=2.13.4-1
[2023-07-20 08:31:08,749] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
[2023-07-20 08:31:08,749] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE=libnccl2=2.13.4-1+cuda11.7
[2023-07-20 08:31:08,749] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE_NAME=libnccl2
[2023-07-20 08:31:08,749] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE_VERSION=2.13.4-1
[2023-07-20 08:31:08,749] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2023-07-20 08:31:08,749] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=4, node_rank=0
[2023-07-20 08:31:08,749] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2023-07-20 08:31:08,749] [INFO] [launch.py:163:main] dist_world_size=4
[2023-07-20 08:31:08,749] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
[2023-07-20 08:31:11,347] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-07-20 08:31:11,353] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-07-20 08:31:11,354] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-07-20 08:31:11,366] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-07-20 08:31:12,692] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-07-20 08:31:12,693] [INFO] [comm.py:616:init_distributed] cdb=None
[2023-07-20 08:31:12,705] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-07-20 08:31:12,705] [INFO] [comm.py:616:init_distributed] cdb=None
[2023-07-20 08:31:12,737] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-07-20 08:31:12,738] [INFO] [comm.py:616:init_distributed] cdb=None
[2023-07-20 08:31:12,738] [INFO] [comm.py:643:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-07-20 08:31:12,741] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-07-20 08:31:12,741] [INFO] [comm.py:616:init_distributed] cdb=None
[2023-07-20 08:31:15,760] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1621
[2023-07-20 08:31:15,764] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1622
[2023-07-20 08:31:15,765] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1623
[2023-07-20 08:31:15,765] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1624
[2023-07-20 08:31:15,767] [ERROR] [launch.py:321:sigkill_handler] ['/root/anaconda3/envs/deepspeed/bin/python', '-u', 'main.py', '--local_rank=3', '--data_path', 'Dahoas/rm-static', 'Dahoas/full-hh-rlhf', 'Dahoas/synthetic-instruct-gptj-pairwise', 'yitingxie/rlhf-reward-datasets', '--data_split', '2,4,4', '--model_name_or_path', 'facebook/opt-1.3b', '--per_device_train_batch_size', '8', '--per_device_eval_batch_size', '8', '--max_seq_len', '512', '--learning_rate', '9.65e-6', '--weight_decay', '0.', '--num_train_epochs', '16', '--gradient_accumulation_steps', '1', '--lr_scheduler_type', 'cosine', '--num_warmup_steps', '0', '--seed', '1234', '--zero_stage', '2', '--deepspeed', '--output_dir', '/dsDocker/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b'] exits with return code = -7
To Reproduce
The command I use is "python3 train.py --step 1 --deployment-type single_node", without changing anything, and all the requirements are installed.
ds_report output

System info (please complete the following information):
Docker context
docker image: nvidia/cuda:11.7.1-cudnn8-devel-ubuntu20.04
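Given the fix discussed above, a hedged example of starting this image with GPU access and an enlarged shared-memory segment; the host-side source path and the 64g value are placeholders chosen to match the thread, not taken from the report:

docker run --gpus all -it --shm-size=64g \
    -v /path/to/DeepSpeedExamples:/dsDocker/DeepSpeedExamples \
    nvidia/cuda:11.7.1-cudnn8-devel-ubuntu20.04 bash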