Loss will suddenly turn 0 during SFT #298

Closed
zhangyx0417 opened this issue Aug 13, 2023 · 2 comments
Comments

@zhangyx0417

Below is my training script:

torchrun --nproc_per_node=4 --master_port=28636 train.py \
    --model_name_or_path "/nas/models/llama_hf/7B/" \
    --data_path ./alpaca_data.json \
    --bf16 True \
    --output_dir ./outputs/ \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True

During training, the loss suddenly drops to 0:

{'loss': 1.0645, 'learning_rate': 1.893588419088962e-05, 'epoch': 0.52}                                                                         
{'loss': 1.157, 'learning_rate': 1.892391168466452e-05, 'epoch': 0.52}                                                                          
{'loss': 1.1938, 'learning_rate': 1.891187603111447e-05, 'epoch': 0.53}                                                                         
{'loss': 0.0, 'learning_rate': 1.8899777315406073e-05, 'epoch': 0.53}                                                                           
{'loss': 0.0, 'learning_rate': 1.8887615623152188e-05, 'epoch': 0.53}                                                                           
{'loss': 0.0, 'learning_rate': 1.88753910404113e-05, 'epoch': 0.53}

I've tried a total of 4 times, but the problem persists; only the epoch at which the loss drops to 0 differs between runs.
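
For anyone repeating these runs, here is a minimal diagnostic sketch (not part of the original report), assuming the Hugging Face TrainerCallback API that the Alpaca train.py's Trainer supports; the class name is hypothetical. It stops the run as soon as the logged loss becomes 0 or NaN, so a collapsed run does not burn the remaining epochs:

import math

from transformers import TrainerCallback

class StopOnZeroLossCallback(TrainerCallback):
    """Stop training as soon as the logged loss is exactly 0.0 or NaN."""

    def on_log(self, args, state, control, logs=None, **kwargs):
        loss = (logs or {}).get("loss")
        if loss is not None and (loss == 0.0 or math.isnan(loss)):
            print(f"Loss collapsed to {loss} at step {state.global_step}; stopping run.")
            control.should_training_stop = True
        return control

It could be attached with trainer.add_callback(StopOnZeroLossCallback()) before trainer.train(); it only shortens the feedback loop and does not fix the underlying collapse.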

@RManLuo

RManLuo commented Aug 28, 2023

I got the same problem when using SFT: the loss goes to 0 after the first epoch. But it seems fine when I use LoRA for training.
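
For reference, a minimal sketch of what "use LoRA for training" can look like with the PEFT library; the rank, alpha, and target modules below are illustrative values, not settings taken from this thread:

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("/nas/models/llama_hf/7B/")
lora_config = LoraConfig(
    r=8,                                   # low-rank dimension (illustrative)
    lora_alpha=16,                         # LoRA scaling factor (illustrative)
    target_modules=["q_proj", "v_proj"],   # LLaMA attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # only the adapter weights stay trainable

Since only the small adapter weights are updated here, this path may simply be sidestepping whatever the full-parameter FSDP run is hitting, rather than fixing it.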

@phosphorylation

I ran into a similar issue, and this solved it for me: shibing624/MedicalGPT#125
