
llama3 single-node multi-GPU training fails, exits with return code = -7 #3695

Closed
1 task done
luolanfeixue opened this issue May 11, 2024 · 2 comments
Labels
solved This problem has been already solved

Comments

@luolanfeixue

Reminder

  • I have read the README and searched the existing issues.

Reproduction

Using the stable release v0.7.0, running the script LLaMA-Factory/examples/lora_multi_gpu/ds_zero3.sh.

The only change is setting model_name_or_path to Meta-Llama-3-8B-Instruct.

#!/bin/bash

deepspeed --num_gpus 4 ../../src/train_bash.py \
    --deepspeed ../deepspeed/ds_z3_config.json \
    --stage sft \
    --do_train \
    --model_name_or_path xxxx/Meta-Llama-3-8B-Instruct \
    --dataset alpaca_gpt4_en,glaive_toolcall \
    --dataset_dir ../../data \
    --template default \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir ../../saves/LLaMA2-7B/lora/sft \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 1024 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 2 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_steps 20 \
    --save_steps 100 \
    --eval_steps 100 \
    --evaluation_strategy steps \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --max_samples 3000 \
    --val_size 0.1 \
    --ddp_timeout 180000000 \
    --plot_loss \
    --fp16

Expected behavior

Training runs normally with num_gpus = 1, but fails once it is changed to 4.
Error:
[2024-05-11 11:50:57,413] [ERROR] [launch.py:321:sigkill_handler] ['/data/oceanus_ctr/j-xxx-jk/envs/llama/bin/python', '-u', '../../src/train_bash.py', '--local_rank=1', '--deepspeed', '../deepspeed/ds_z3_config.json', '--stage', 'sft', '--do_train', '--model_name_or_path', '../../Meta-Llama-3-8B-Instruct/', '--dataset', '话术质检_v1.1', '--dataset_dir', '../../data', '--template', 'llama3', '--finetuning_type', 'lora', '--lora_alpha', '256', '--lora_rank', '256', '--lora_dropout', '0.1', '--lora_target', 'q_proj,v_proj', '--output_dir', './saves/Meta-Llama-3-8B-Instruct/话术质检_v1.1_256_v2', '--overwrite_cache', '--overwrite_output_dir', '--cutoff_len', '2048', '--preprocessing_num_workers', '2', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '1', '--gradient_accumulation_steps', '2', '--lr_scheduler_type', 'cosine', '--logging_steps', '10', '--warmup_steps', '20', '--save_steps', '100', '--eval_steps', '100', '--evaluation_strategy', 'steps', '--learning_rate', '5e-5', '--num_train_epochs', '6.0', '--max_samples', '3000', '--val_size', '0.002', '--ddp_timeout', '180000000', '--plot_loss', '--fp16'] exits with return code = -7

I have already looked into the "exits with return code = -7" error discussed in #2061.

However, the Docker container has 600 GB of shared memory and 920 GB of CPU RAM, and memory utilization is not high.
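For context, a return code of -7 means the worker process was killed by signal 7 (SIGBUS), which in multi-GPU DeepSpeed/NCCL runs typically points at shared-memory or memory-mapping failures rather than a plain Python out-of-memory error. A minimal diagnostic sketch one could run inside the container (standard Linux commands; none of the values below come from this report):

#!/bin/bash
# Hedged diagnostics for a deepspeed launch that dies with return code = -7 (SIGBUS).
df -h /dev/shm    # shared memory actually mounted inside the container
free -g           # total vs. available RAM visible to the container, in GB
nvidia-smi        # confirm all 4 GPUs are visible and not already occupied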

System Info


  • transformers version: 4.40.1
  • Platform: Linux-5.4.0-42-generic-x86_64-with-glibc2.31
  • Python version: 3.10.11
  • Huggingface_hub version: 0.19.4
  • Safetensors version: 0.4.1
  • Accelerate version: 0.30.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.1.1+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Others

No response

@hiyouga
Owner

hiyouga commented May 11, 2024

The memory was probably not allocated correctly; -7 indicates a memory-related failure (out of memory).
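If shared-memory or locked-memory limits were the constraint (they do not appear to be here, given the 600 GB /dev/shm reported above), a common mitigation is to launch the container with a larger /dev/shm and unlimited memlock. A hedged example; the image name and flag values are illustrative, not taken from this issue:

docker run --gpus all --shm-size=64g --ulimit memlock=-1 --ulimit stack=67108864 -it my_training_image bash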

@hiyouga hiyouga added the solved This problem has been already solved label May 11, 2024
@hiyouga hiyouga closed this as completed May 11, 2024
@luolanfeixue
Author

The problem is solved: launching the script as the root account makes multi-GPU training work. The regular user account was probably missing some permissions, which caused the memory allocation to fail.
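Since switching to root resolved it, one way to confirm the root cause is to compare the per-user resource limits seen by root and by the failing account. A minimal sketch under that assumption (standard Linux commands; nothing here comes from the original report):

# Run once as root and once as the non-root training user, then compare the output.
ulimit -a                                       # all soft limits for the current shell
cat /proc/self/limits                           # effective limits of the current process
grep -E 'memlock|nofile' /etc/security/limits.conf   # persistent per-user limits, if configured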

@luolanfeixue luolanfeixue changed the title from "llama3 single-node multi-GPU training fails" to "llama3 single-node multi-GPU training fails, exits with return code = -7" May 27, 2024