Below is my training script:
torchrun --nproc_per_node=4 --master_port=28636 train.py \
    --model_name_or_path "/nas/models/llama_hf/7B/" \
    --data_path ./alpaca_data.json \
    --bf16 True \
    --output_dir ./outputs/ \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True
During training, the loss suddenly drops to 0:
{'loss': 1.0645, 'learning_rate': 1.893588419088962e-05, 'epoch': 0.52}
{'loss': 1.157, 'learning_rate': 1.892391168466452e-05, 'epoch': 0.52}
{'loss': 1.1938, 'learning_rate': 1.891187603111447e-05, 'epoch': 0.53}
{'loss': 0.0, 'learning_rate': 1.8899777315406073e-05, 'epoch': 0.53}
{'loss': 0.0, 'learning_rate': 1.8887615623152188e-05, 'epoch': 0.53}
{'loss': 0.0, 'learning_rate': 1.88753910404113e-05, 'epoch': 0.53}
I've tried a total of 4 times, but the problem persists; only the epoch at which the loss drops to 0 changes between runs.
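To pin down the exact step where this starts, here is a minimal diagnostic sketch (not part of the original script) using the transformers TrainerCallback API. It assumes train.py uses the standard Hugging Face Trainer, as in the Alpaca repo; the callback name is made up for illustration.

# zero_loss_watcher.py - minimal sketch, assuming train.py builds a transformers.Trainer
from transformers import TrainerCallback

class ZeroLossWatcher(TrainerCallback):
    def on_log(self, args, state, control, logs=None, **kwargs):
        # Trainer reports the averaged loss in the logs dict at each logging step.
        if logs is not None and logs.get("loss") == 0.0:
            print(f"[ZeroLossWatcher] loss hit 0 at global step {state.global_step}")
            # Stop training so the offending batch and optimizer state can be inspected.
            control.should_training_stop = True
        return control

# In train.py (hypothetical hook-up):
# trainer = Trainer(..., callbacks=[ZeroLossWatcher()])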
I got the same problem when using SFT: the loss goes to 0 after the first epoch. It seems fine when I use LoRA for training.
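For reference, a minimal sketch of what a LoRA setup might look like with the PEFT library; the rank, alpha, and target modules below are illustrative assumptions, not the commenter's actual configuration.

# lora_sketch.py - illustrative LoRA wrapping with peft (assumed setup)
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, TaskType

model = AutoModelForCausalLM.from_pretrained("/nas/models/llama_hf/7B/")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                  # low-rank dimension (illustrative)
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # LLaMA attention projections
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable

# The wrapped model can then be passed to the same Trainer used in train.py.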
I ran into a similar issue, and this solved it for me: shibing624/MedicalGPT#125