The memory may not have been allocated correctly; -7 indicates an out-of-memory problem.
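For context: when the DeepSpeed launcher reports a negative return code, that is the negated signal number from Python's subprocess handling, so -7 means the worker was killed by signal 7, which is SIGBUS on Linux. SIGBUS is commonly raised when a process touches a memory-mapped region backed by an exhausted /dev/shm, which fits the out-of-memory reading above. A quick generic check, assuming a typical Linux container:

# Signal 7 on Linux is SIGBUS:
kill -l 7            # prints: BUS

# SIGBUS often means an mmap-backed write hit a full /dev/shm,
# so check the shared-memory mount actually visible to the job:
df -h /dev/shm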
Problem solved: after switching to the root account when launching the script, multi-GPU training works. The regular user account was probably missing some permissions, which caused the memory allocation to fail.
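If switching to root fixes it, one plausible culprit (an assumption, not verified in this thread) is the per-user locked-memory limit: CUDA and NCCL allocate pinned (page-locked) host memory, and a small memlock ulimit for a non-root user can make those allocations fail once several ranks start at the same time. A way to compare the two accounts:

# Locked-memory limit for the current (non-root) user:
ulimit -l

# The same check as root (ulimit is a shell builtin, so run it inside a shell):
sudo bash -c 'ulimit -l'

# If the user's limit is low, raising it when starting the container
# often avoids having to run as root, e.g.:
#   docker run --ulimit memlock=-1 --shm-size=600g ...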
Reminder
Reproduction
Running the stable v0.7.0 release with the script LLaMA-Factory/examples/lora_multi_gpu/ds_zero3.sh.
The only change was setting model_name_or_path to Meta-Llama-3-8B-Instruct.
#!/bin/bash
deepspeed --num_gpus 4 ../../src/train_bash.py \
    --deepspeed ../deepspeed/ds_z3_config.json \
    --stage sft \
    --do_train \
    --model_name_or_path xxxx/Meta-Llama-3-8B-Instruct \
    --dataset alpaca_gpt4_en,glaive_toolcall \
    --dataset_dir ../../data \
    --template default \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir ../../saves/LLaMA2-7B/lora/sft \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 1024 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 2 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_steps 20 \
    --save_steps 100 \
    --eval_steps 100 \
    --evaluation_strategy steps \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --max_samples 3000 \
    --val_size 0.1 \
    --ddp_timeout 180000000 \
    --plot_loss \
    --fp16
Expected behavior
Training runs normally with num_gpus=1, but fails after changing it to 4.
Error:
[2024-05-11 11:50:57,413] [ERROR] [launch.py:321:sigkill_handler] ['/data/oceanus_ctr/j-xxx-jk/envs/llama/bin/python', '-u', '../../src/train_bash.py', '--local_rank=1', '--deepspeed', '../deepspeed/ds_z3_config.json', '--stage', 'sft', '--do_train', '--model_name_or_path', '../../Meta-Llama-3-8B-Instruct/', '--dataset', '话术质检_v1.1', '--dataset_dir', '../../data', '--template', 'llama3', '--finetuning_type', 'lora', '--lora_alpha', '256', '--lora_rank', '256', '--lora_dropout', '0.1', '--lora_target', 'q_proj,v_proj', '--output_dir', './saves/Meta-Llama-3-8B-Instruct/话术质检_v1.1_256_v2', '--overwrite_cache', '--overwrite_output_dir', '--cutoff_len', '2048', '--preprocessing_num_workers', '2', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '1', '--gradient_accumulation_steps', '2', '--lr_scheduler_type', 'cosine', '--logging_steps', '10', '--warmup_steps', '20', '--save_steps', '100', '--eval_steps', '100', '--evaluation_strategy', 'steps', '--learning_rate', '5e-5', '--num_train_epochs', '6.0', '--max_samples', '3000', '--val_size', '0.002', '--ddp_timeout', '180000000', '--plot_loss', '--fp16'] exits with return code = -7
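Reading the log: launch.py is the DeepSpeed launcher, which spawns one train_bash.py worker per GPU and passes each one its rank via --local_rank (the killed process above is rank 1). When any worker dies, the sigkill_handler terminates the rest and reports the failing command together with its return code. Roughly, as a simplified sketch of the launcher's behavior rather than its actual code:

# deepspeed --num_gpus 4 train.py ARGS... behaves approximately like:
for rank in 0 1 2 3; do
    python -u ../../src/train_bash.py --local_rank=$rank "$@" &
done
wait   # if any rank exits abnormally, the launcher kills the others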
I already investigated "exits with return code = -7" (#2061), but the Docker shared memory is 600 GB, the CPU memory is 920 GB, and memory utilization is not high during the run.
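To substantiate that memory utilization stays low, it can help to watch both overall RAM and the shared-memory mount while the four ranks start up (a generic monitoring sketch, run in a second terminal inside the same container):

# Sample total memory and /dev/shm usage every second:
watch -n 1 'free -h; echo; df -h /dev/shm'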
System Info
transformers version: 4.40.1

Others
No response