
[BUG] Have "exits with return code = -7" without any error message #4002

Closed
ezekielqu opened this issue Jul 20, 2023 · 4 comments
Labels
bug (Something isn't working) · deepspeed-chat (Related to DeepSpeed-Chat)

Comments

@ezekielqu

Describe the bug
When I follow the tutorial to train an opt-1.3b model from step 1, I run into this problem.
Log output
[2023-07-20 08:31:04,936] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-07-20 08:31:05,681] [WARNING] [runner.py:196:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-07-20 08:31:05,714] [INFO] [runner.py:555:main] cmd = /root/anaconda3/envs/deepspeed/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None main.py --data_path Dahoas/rm-static Dahoas/full-hh-rlhf Dahoas/synthetic-instruct-gptj-pairwise yitingxie/rlhf-reward-datasets --data_split 2,4,4 --model_name_or_path facebook/opt-1.3b --per_device_train_batch_size 8 --per_device_eval_batch_size 8 --max_seq_len 512 --learning_rate 9.65e-6 --weight_decay 0. --num_train_epochs 16 --gradient_accumulation_steps 1 --lr_scheduler_type cosine --num_warmup_steps 0 --seed 1234 --zero_stage 2 --deepspeed --output_dir /dsDocker/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b
[2023-07-20 08:31:07,959] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-07-20 08:31:08,748] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.13.4-1+cuda11.7
[2023-07-20 08:31:08,749] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.13.4-1
[2023-07-20 08:31:08,749] [INFO] [launch.py:138:main] 0 NCCL_VERSION=2.13.4-1
[2023-07-20 08:31:08,749] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
[2023-07-20 08:31:08,749] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE=libnccl2=2.13.4-1+cuda11.7
[2023-07-20 08:31:08,749] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE_NAME=libnccl2
[2023-07-20 08:31:08,749] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE_VERSION=2.13.4-1
[2023-07-20 08:31:08,749] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2023-07-20 08:31:08,749] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=4, node_rank=0
[2023-07-20 08:31:08,749] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2023-07-20 08:31:08,749] [INFO] [launch.py:163:main] dist_world_size=4
[2023-07-20 08:31:08,749] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
[2023-07-20 08:31:11,347] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-07-20 08:31:11,353] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-07-20 08:31:11,354] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-07-20 08:31:11,366] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-07-20 08:31:12,692] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-07-20 08:31:12,693] [INFO] [comm.py:616:init_distributed] cdb=None
[2023-07-20 08:31:12,705] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-07-20 08:31:12,705] [INFO] [comm.py:616:init_distributed] cdb=None
[2023-07-20 08:31:12,737] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-07-20 08:31:12,738] [INFO] [comm.py:616:init_distributed] cdb=None
[2023-07-20 08:31:12,738] [INFO] [comm.py:643:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-07-20 08:31:12,741] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-07-20 08:31:12,741] [INFO] [comm.py:616:init_distributed] cdb=None
[2023-07-20 08:31:15,760] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1621
[2023-07-20 08:31:15,764] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1622
[2023-07-20 08:31:15,765] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1623
[2023-07-20 08:31:15,765] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1624
[2023-07-20 08:31:15,767] [ERROR] [launch.py:321:sigkill_handler] ['/root/anaconda3/envs/deepspeed/bin/python', '-u', 'main.py', '--local_rank=3', '--data_path', 'Dahoas/rm-static', 'Dahoas/full-hh-rlhf', 'Dahoas/synthetic-instruct-gptj-pairwise', 'yitingxie/rlhf-reward-datasets', '--data_split', '2,4,4', '--model_name_or_path', 'facebook/opt-1.3b', '--per_device_train_batch_size', '8', '--per_device_eval_batch_size', '8', '--max_seq_len', '512', '--learning_rate', '9.65e-6', '--weight_decay', '0.', '--num_train_epochs', '16', '--gradient_accumulation_steps', '1', '--lr_scheduler_type', 'cosine', '--num_warmup_steps', '0', '--seed', '1234', '--zero_stage', '2', '--deepspeed', '--output_dir', '/dsDocker/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b'] exits with return code = -7

To Reproduce
The command I use is "python3 train.py --step 1 --deployment-type single_node", without changing anything, and all the requirements are installed.

ds_report output
ds

System info (please complete the following information):

  • OS: Docker with Ubuntu 20.04
  • GPU count and types: one machine with 4x 2080Ti
  • DeepSpeed version: newest pull from GitHub
  • Python version: 3.10

Docker context
docker image: nvidia/cuda:11.7.1-cudnn8-devel-ubuntu20.04

@ezekielqu added the bug and deepspeed-chat labels on Jul 20, 2023
@jeffra
Collaborator

jeffra commented Jul 20, 2023

I'm seeing 3 issues that seem somewhat similar here, which is concerning. Lemme link them here: #4000, #3989

Can you all try running a simpler example that still uses DeepSpeed, such as this script that just does a few all-reduces across GPUs:

https://gist.github.com/jeffra/b5e80466b4c86be00ea3b6f130fb7a36

I am curious whether this runs successfully or gives a similar error.
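
For reference, a minimal test along these lines could look like the sketch below. This is my own illustration of the idea, not the contents of the gist; the file name all_reduce_smoke_test.py, the loop count, and the tensor size are made up. It is launched with the deepspeed launcher, one process per GPU, and performs a few NCCL all-reduces:

# all_reduce_smoke_test.py -- minimal sketch: a few NCCL all-reduces across all visible GPUs.
# Launch with: deepspeed all_reduce_smoke_test.py
import os
import torch
import deepspeed

deepspeed.init_distributed()  # sets up torch.distributed with the NCCL backend
local_rank = int(os.environ.get("LOCAL_RANK", 0))  # set per process by the deepspeed launcher
torch.cuda.set_device(local_rank)
rank = torch.distributed.get_rank()

for i in range(5):
    x = torch.ones(1024 * 1024, device="cuda") * (rank + 1)
    torch.distributed.all_reduce(x)  # default op is SUM across all ranks
    torch.cuda.synchronize()
    if rank == 0:
        print(f"all_reduce {i} ok, x[0] = {x[0].item()}")

If something this small hangs or dies the same way, the problem is in the communication/launch setup rather than in DeepSpeed-Chat itself.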

Also, another question: are you all running within Docker? It seems @ezekielqu is, but what about @opprash and @KeepAndWin?

@jeffra
Collaborator

jeffra commented Jul 20, 2023

Also, talking with @jomayeri a bit offline, it sounds like increasing Docker shared memory might help with this as well. One way to bump that up is by passing something like --shm-size="2gb" to your docker run command. The default is pretty small and can sometimes cause issues like this.

/cc @opprash and @KeepAndWin
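
For context on why this is plausible: a return code of -7 from the launcher means the worker process was killed by signal 7 (SIGBUS on Linux), which is consistent with running out of shared memory. A quick way to verify the shared-memory limit from inside the container is a small check like the one below (a sketch of my own; the file name shm_check.py is made up). Docker's default /dev/shm is 64 MiB unless --shm-size is passed:

# shm_check.py -- quick sketch: report the size of /dev/shm inside the container.
# Docker defaults /dev/shm to 64 MiB, which is easily exhausted by multi-worker
# data loaders and NCCL's shared-memory transport.
import shutil

total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm total: {total / 2**20:.0f} MiB, "
      f"used: {used / 2**20:.0f} MiB, free: {free / 2**20:.0f} MiB")

If this reports roughly 64 MiB, restarting the container with a larger --shm-size (or with --ipc=host) should help.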

@ezekielqu
Author

Thank you for your advice. I checked the default Docker shm and found it's only 64M. When I increased it to 64g, the script ran well. I also tried "deepspeed all_reduce_bench_v2.py", and it exited successfully. Thanks again for your answer.

@opprash

opprash commented Jul 21, 2023 via email
