
[BUG] exits with return code = -11 #3989

Closed · zxjia2002 opened this issue Jul 19, 2023 · 6 comments
Assignees: jomayeri
Labels: bug (Something isn't working), training

Comments

zxjia2002 commented Jul 19, 2023

Describe the bug
I have run the code successfully on a machine with 4x 1080 Tis. However, when I ran the same code on a machine with 2x 3090s, DeepSpeed reported Killing subprocess right after [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect). In the end, exits with return code = -11 is printed.

ds_report output
(ds_report output attached as a screenshot; not transcribed here)

Screenshots

[2023-07-19 10:43:05,393] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-07-19 10:43:06,114] [WARNING] [runner.py:196:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-07-19 10:43:06,115] [INFO] [runner.py:555:main] cmd = /home/hitwh2021/anaconda3/envs/bit/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None src/clip_fine_tune_deepspeed.py --dataset CIRR --api-key HoWzEpTy4klumwh44YcBem6Ia --workspace keepandwin --experiment-name general --num-epoch 2 --clip-model-name RN50x4 --encoder both --learning-rate 2e-6 --batch-size 128 --transform targetpad --target-ratio 1.25 --save-training --save-best --validation-frequency 1 --deepspeed-config ./ds_config.json
[2023-07-19 10:43:06,881] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-07-19 10:43:07,262] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2023-07-19 10:43:07,262] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=2, node_rank=0
[2023-07-19 10:43:07,262] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2023-07-19 10:43:07,262] [INFO] [launch.py:163:main] dist_world_size=2
[2023-07-19 10:43:07,262] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1
[2023-07-19 10:43:08,434] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-07-19 10:43:08,442] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-07-19 10:43:09,278] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 32047
[2023-07-19 10:43:09,278] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 32048
[2023-07-19 10:43:09,286] [ERROR] [launch.py:321:sigkill_handler] ['/home/hitwh2021/anaconda3/envs/bit/bin/python', '-u', 'src/clip_fine_tune_deepspeed.py', '--local_rank=1', '--dataset', 'CIRR', '--api-key', 'HoWzEpTy4klumwh44YcBem6Ia', '--workspace', 'keepandwin', '--experiment-name', 'general', '--num-epoch', '2', '--clip-model-name', 'RN50x4', '--encoder', 'both', '--learning-rate', '2e-6', '--batch-size', '128', '--transform', 'targetpad', '--target-ratio', '1.25', '--save-training', '--save-best', '--validation-frequency', '1', '--deepspeed-config', './ds_config.json'] exits with return code = -11

System info (please complete the following information):

  • OS: Ubuntu 18.04
  • GPU count and types: a machine with 2x 3090s
  • Python version: 3.8.16
@zxjia2002 zxjia2002 added bug Something isn't working training labels Jul 19, 2023
@jomayeri jomayeri self-assigned this Jul 19, 2023
@jomayeri
Contributor

@KeepAndWin, unfortunately I cannot repro this error because I do not have access to those specific GPU types. Most likely there is something incompatible with CUDA or the built-in ops on that second device. I suggest trying basic PyTorch + CUDA first to ensure that works.
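
For reference, a return code of -11 from the launcher means the worker process was killed by signal 11 (SIGSEGV). A minimal sanity check along the lines suggested above, not taken from this thread and with arbitrary tensor sizes, might look like:

```python
# Hedged sketch: confirm plain PyTorch + CUDA works on both 3090s before involving
# DeepSpeed. Run it directly with `python`, without the deepspeed launcher.
import torch

print("torch", torch.__version__, "cuda", torch.version.cuda)
assert torch.cuda.is_available(), "CUDA is not visible to PyTorch"

for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))
    x = torch.randn(1024, 1024, device=f"cuda:{i}")
    y = x @ x                  # launch a simple kernel on each GPU
    torch.cuda.synchronize(i)  # surface any CUDA error here rather than later
    print(f"cuda:{i} matmul OK, mean={y.mean().item():.4f}")
```

If this also crashes, the problem lies below DeepSpeed (driver, CUDA, or PyTorch build); if it passes, the built-in ops or the launch setup are the more likely culprits.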

@jeffra
Collaborator

jeffra commented Jul 20, 2023

@KeepAndWin please see this thread for the latest discussion on this: #4002

@jeffra
Collaborator

jeffra commented Jul 21, 2023

It seems both -7 and -11 are related to shared-memory issues with Docker. Please see this reply, which has fixed other people's recent issues: #4002 (comment)
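
For readers who cannot follow the link: the shared-memory problem referred to here is commonly Docker's default 64 MiB /dev/shm, which NCCL and DataLoader workers can exhaust. A small hedged check, assuming the training runs inside a container, is:

```python
# Hedged sketch: report the shared-memory space visible to the training process.
import shutil

total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm: total {total / 2**30:.2f} GiB, free {free / 2**30:.2f} GiB")
# If this is only a few MiB, relaunch the container with a larger --shm-size
# (or --ipc=host) as described in the linked reply, then retry the deepspeed command.
```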

@jomayeri
Contributor

jomayeri commented Aug 4, 2023

Closing for now.

@jomayeri jomayeri closed this as completed Aug 4, 2023
@Anonymousplendid

> Closing for now.

So how do you solve the issue?

@Coderella-z

> Closing for now.

> So how do you solve the issue?

Have you solved this problem? I ran into it as well.
