The memory may not have been allocated correctly; -7 indicates an out-of-memory problem.
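For context: when the DeepSpeed launcher reports a negative return code, that is the negated signal number from Python's subprocess handling, so -7 means the worker was killed by signal 7, which is SIGBUS on Linux. SIGBUS is commonly raised when a process touches a memory-mapped region backed by an exhausted /dev/shm, which fits the out-of-memory reading above. A quick generic check, assuming a typical Linux container:

# Signal 7 on Linux is SIGBUS:
kill -l 7            # prints: BUS

# SIGBUS often means an mmap-backed write hit a full /dev/shm,
# so check the shared-memory mount actually visible to the job:
df -h /dev/shm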
Problem solved: after switching to the root account when launching the script, multi-GPU training works. The regular user account was probably missing some permissions, which caused the memory allocation to fail.
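If switching to root fixes it, one plausible culprit (an assumption, not verified in this thread) is the per-user locked-memory limit: CUDA and NCCL allocate pinned (page-locked) host memory, and a small memlock ulimit for a non-root user can make those allocations fail once several ranks start at the same time. A way to compare the two accounts:

# Locked-memory limit for the current (non-root) user:
ulimit -l

# The same check as root (ulimit is a shell builtin, so run it inside a shell):
sudo bash -c 'ulimit -l'

# If the user's limit is low, raising it when starting the container
# often avoids having to run as root, e.g.:
#   docker run --ulimit memlock=-1 --shm-size=600g ...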
Reminder
Reproduction
Running the stable v0.7.0 release with the script LLaMA-Factory/examples/lora_multi_gpu/ds_zero3.sh.
The only change was setting model_name_or_path to Meta-Llama-3-8B-Instruct.
#!/bin/bash
deepspeed --num_gpus 4 ../../src/train_bash.py \
    --deepspeed ../deepspeed/ds_z3_config.json \
    --stage sft \
    --do_train \
    --model_name_or_path xxxx/Meta-Llama-3-8B-Instruct \
    --dataset alpaca_gpt4_en,glaive_toolcall \
    --dataset_dir ../../data \
    --template default \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir ../../saves/LLaMA2-7B/lora/sft \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 1024 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 2 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_steps 20 \
    --save_steps 100 \
    --eval_steps 100 \
    --evaluation_strategy steps \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --max_samples 3000 \
    --val_size 0.1 \
    --ddp_timeout 180000000 \
    --plot_loss \
    --fp16
Expected behavior
Training runs normally with num_gpus=1, but fails after changing it to 4.
Error:
[2024-05-11 11:50:57,413] [ERROR] [launch.py:321:sigkill_handler] ['/data/oceanus_ctr/j-xxx-jk/envs/llama/bin/python', '-u', '../../src/train_bash.py', '--local_rank=1', '--deepspeed', '../deepspeed/ds_z3_config.json', '--stage', 'sft', '--do_train', '--model_name_or_path', '../../Meta-Llama-3-8B-Instruct/', '--dataset', '话术质检_v1.1', '--dataset_dir', '../../data', '--template', 'llama3', '--finetuning_type', 'lora', '--lora_alpha', '256', '--lora_rank', '256', '--lora_dropout', '0.1', '--lora_target', 'q_proj,v_proj', '--output_dir', './saves/Meta-Llama-3-8B-Instruct/话术质检_v1.1_256_v2', '--overwrite_cache', '--overwrite_output_dir', '--cutoff_len', '2048', '--preprocessing_num_workers', '2', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '1', '--gradient_accumulation_steps', '2', '--lr_scheduler_type', 'cosine', '--logging_steps', '10', '--warmup_steps', '20', '--save_steps', '100', '--eval_steps', '100', '--evaluation_strategy', 'steps', '--learning_rate', '5e-5', '--num_train_epochs', '6.0', '--max_samples', '3000', '--val_size', '0.002', '--ddp_timeout', '180000000', '--plot_loss', '--fp16'] exits with return code = -7
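Reading the log: launch.py is the DeepSpeed launcher, which spawns one train_bash.py worker per GPU and passes each one its rank via --local_rank (the killed process above is rank 1). When any worker dies, the sigkill_handler terminates the rest and reports the failing command together with its return code. Roughly, as a simplified sketch of the launcher's behavior rather than its actual code:

# deepspeed --num_gpus 4 train.py ARGS... behaves approximately like:
for rank in 0 1 2 3; do
    python -u ../../src/train_bash.py --local_rank=$rank "$@" &
done
wait   # if any rank exits abnormally, the launcher kills the others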
I already investigated "exits with return code = -7" (#2061), but the Docker shared memory is 600 GB, the CPU memory is 920 GB, and memory utilization is not high during the run.
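To substantiate that memory utilization stays low, it can help to watch both overall RAM and the shared-memory mount while the four ranks start up (a generic monitoring sketch, run in a second terminal inside the same container):

# Sample total memory and /dev/shm usage every second:
watch -n 1 'free -h; echo; df -h /dev/shm'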
System Info
transformers version: 4.40.1

Others
No response