
Deepspeed lr_schedule is None #25865

Closed
iMountTai opened this issue Aug 30, 2023 · 9 comments

@iMountTai

System Info

linux

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

When I use DeepSpeed with the resume_from_checkpoint parameter, I am unable to resume properly. The lr_schedule doesn't load its state correctly, causing the learning rate to start from 0.
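
A minimal sketch of the kind of setup being described (model, toy data, and the ds_config.json path are placeholders; the real script follows the transformers example scripts):

from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Placeholder model and toy data, only to make the sketch self-contained.
name = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

data = Dataset.from_dict({"sentence": ["good", "bad"] * 8, "label": [1, 0] * 8})
data = data.map(lambda ex: tokenizer(ex["sentence"], padding="max_length",
                                     truncation=True, max_length=16), batched=True)

args = TrainingArguments(
    output_dir="output",
    learning_rate=5e-5,
    num_train_epochs=3,
    lr_scheduler_type="cosine",
    save_strategy="epoch",
    deepspeed="ds_config.json",  # placeholder path to the DeepSpeed config
)

trainer = Trainer(model=model, args=args, train_dataset=data)

# Resuming is where the problem appears: the learning rate restarts from 0
# instead of continuing from the scheduler state stored in the checkpoint.
trainer.train(resume_from_checkpoint=True)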

Expected behavior

Training should resume from the last checkpoint, with the learning rate continuing from where it left off.

@amyeroberts
Collaborator

Hi @iMountTai, thanks for raising this issue.

So that we can help, could you follow the issue template and provide:

  • The running environment: run transformers-cli env in the terminal and copy-paste the output
  • A minimal code snippet that reproduces the issue

cc @pacman100

@iMountTai
Author

iMountTai commented Aug 31, 2023

My script follows the transformers example scripts; the only difference is that I used DeepSpeed.

@pacman100
Contributor

Hello, the PRs huggingface/accelerate#1909 and #25863 are meant to resolve this issue. I'm working on it.

@iMountTai
Author

@pacman100 This doesn't solve my problem; the learning rate still starts at 0.

@pacman100
Contributor

Hello, it's working just fine for me. Below is a simple reproducer:

  1. ds config file ds_config_zero3_ds_optim.json:
{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "bf16": {
        "enabled": "auto"
    },

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },

    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
  2. Run the run_glue.py example from transformers:
cd transformers
export TASK_NAME=mrpc
export CUDA_VISIBLE_DEVICES="0,1"

torchrun --nnodes 1 --nproc-per-node 2 ./examples/pytorch/text-classification/run_glue.py  --model_name_or_path bert-base-cased   --task_name $TASK_NAME   --do_train   --do_eval   --max_seq_length 128   --per_device_train_batch_size 16   --learning_rate 5e-5   --num_train_epochs 3   --output_dir /tmp/$TASK_NAME/ --overwrite_output_dir --deepspeed ~/transformers/tests/deepspeed/ds_config_zero3_ds_optim.json --lr_scheduler_type cosine --save_strategy "epoch" --evaluation_strategy "epoch" --logging_steps 1

Kill after 1st epoch.

  3. Run from the checkpoint using --resume_from_checkpoint:
torchrun --nnodes 1 --nproc-per-node 2 ./examples/pytorch/text-classification/run_glue.py  --model_name_or_path bert-base-cased   --task_name $TASK_NAME   --do_train   --do_eval   --max_seq_length 128   --per_device_train_batch_size 16   --learning_rate 5e-5   --num_train_epochs 3   --output_dir /tmp/$TASK_NAME/ --overwrite_output_dir --deepspeed ~/transformers/tests/deepspeed/ds_config_zero3_ds_optim.json --lr_scheduler_type cosine --save_strategy "epoch" --evaluation_strategy "epoch" --logging_steps 1 --resume_from_checkpoint /tmp/$TASK_NAME/checkpoint-115
  4. lr and loss plots:
[Screenshot: learning-rate and loss curves across the interrupted and resumed runs]

You can see that the learning rate resumes from the previous value as expected, and the training loss is also in line with resumption from the checkpoint. A quick way to confirm the resumed learning rate without plots is sketched after this list.

  5. Make sure to use both PRs, the one for accelerate as well as the one for transformers.
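
As a cross-check of the plots, assuming the usual checkpoint layout written by the Trainer (a trainer_state.json whose log_history records the learning rate at each logging step), the resumed learning rate can be read directly from the checkpoint; the path matches the commands above:

import json
from pathlib import Path

# Checkpoint saved after the 1st epoch in the commands above.
checkpoint = Path("/tmp/mrpc/checkpoint-115")

# trainer_state.json is written by the Trainer next to the model weights;
# with --logging_steps 1 every step has a log entry.
state = json.loads((checkpoint / "trainer_state.json").read_text())
logged = [e for e in state["log_history"] if "learning_rate" in e]

print("global_step:", state["global_step"])
print("last logged learning rate:", logged[-1]["learning_rate"])

# After resuming with --resume_from_checkpoint, the first learning rate logged
# by the new run should continue from (roughly) this value, not from 0.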

@iMountTai
Author

iMountTai commented Sep 1, 2023

My DeepSpeed config:

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 100,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": true,
        "allgather_bucket_size": 1e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 1e8,
        "contiguous_gradients": true
    },

    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}

The reason the resume is not working correctly is that no optimizer is specified in my ds_config.
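
To make the distinction concrete: "specifying the optimizer" means adding an "optimizer" block like the one in the reproducer config above, so that DeepSpeed itself creates the optimizer. A rough sketch, assuming the transformers integration that also accepts the DeepSpeed config as an already-loaded dict passed to TrainingArguments (the values simply mirror the configs above and are illustrative):

from transformers import TrainingArguments

# Illustrative ZeRO-2 config as a Python dict, with an "optimizer" block added
# so that DeepSpeed creates the optimizer instead of the Trainer.
ds_config = {
    "fp16": {"enabled": "auto"},
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": "auto", "betas": "auto", "eps": "auto", "weight_decay": "auto"},
    },
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}

# TrainingArguments accepts either a path to a JSON file or a dict here.
args = TrainingArguments(output_dir="output", learning_rate=5e-5, deepspeed=ds_config)

Without such a block, the Trainer creates its own optimizer and scheduler, which is the HF optimizer + HF scheduler combination discussed in the next comments.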

@iMountTai
Author

With my ds_config, is the HF optimizer + HF scheduler combination not supported?

@pacman100
Contributor

Hello @iMountTai, I've fixed the issue, could you please retry? You can find experiments for all 4 combinations, verifying that resume_from_checkpoint works properly, here: #25863 (comment)

@iMountTai
Author

It worked. You are so excellent!
