
Deepspeed lr_schedule is None #25865

Closed
iMountTai opened this issue Aug 30, 2023 · 9 comments

@iMountTai

System Info

linux

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

When I use DeepSpeed with the resume_from_checkpoint parameter, I am unable to resume properly. The lr_schedule doesn't load its state correctly, causing the learning rate to start from 0.
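
A minimal sketch of the kind of setup being described (model, toy data, and the ds_config.json path are placeholders; the real script follows the transformers example scripts):

from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Placeholder model and toy data, only to make the sketch self-contained.
name = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

data = Dataset.from_dict({"sentence": ["good", "bad"] * 8, "label": [1, 0] * 8})
data = data.map(lambda ex: tokenizer(ex["sentence"], padding="max_length",
                                     truncation=True, max_length=16), batched=True)

args = TrainingArguments(
    output_dir="output",
    learning_rate=5e-5,
    num_train_epochs=3,
    lr_scheduler_type="cosine",
    save_strategy="epoch",
    deepspeed="ds_config.json",  # placeholder path to the DeepSpeed config
)

trainer = Trainer(model=model, args=args, train_dataset=data)

# Resuming is where the problem appears: the learning rate restarts from 0
# instead of continuing from the scheduler state stored in the checkpoint.
trainer.train(resume_from_checkpoint=True)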

Expected behavior

Training should resume from the last checkpoint, with the learning rate continuing from where it left off.

@amyeroberts
Collaborator

Hi @iMountTai, thanks for raising this issue.

So that we can help, could you follow the issue template and provide:

  • The running environment: run transformers-cli env in the terminal and copy-paste the output
  • A minimal code snippet that reproduces the issue

cc @pacman100

@iMountTai
Author

iMountTai commented Aug 31, 2023

My script follows the transformers example scripts; the only difference is that I used DeepSpeed.

@pacman100
Contributor

Hello, the PRs huggingface/accelerate#1909 and #25863 are meant to resolve this issue. I'm working on it.

@iMountTai
Author

@pacman100 This doesn't solve my problem; the learning rate still starts at 0.

@pacman100
Contributor

Hello, it's working just fine for me. Below is a simple reproducer:

  1. ds config file ds_config_zero3_ds_optim.json:
{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "bf16": {
        "enabled": "auto"
    },

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },

    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
  2. Run the run_glue.py example from transformers:
cd transformers
export TASK_NAME=mrpc
export CUDA_VISIBLE_DEVICES="0,1"

torchrun --nnodes 1 --nproc-per-node 2 ./examples/pytorch/text-classification/run_glue.py  --model_name_or_path bert-base-cased   --task_name $TASK_NAME   --do_train   --do_eval   --max_seq_length 128   --per_device_train_batch_size 16   --learning_rate 5e-5   --num_train_epochs 3   --output_dir /tmp/$TASK_NAME/ --overwrite_output_dir --deepspeed ~/transformers/tests/deepspeed/ds_config_zero3_ds_optim.json --lr_scheduler_type cosine --save_strategy "epoch" --evaluation_strategy "epoch" --logging_steps 1

Kill after 1st epoch.

  3. Run from the checkpoint using --resume_from_checkpoint:
torchrun --nnodes 1 --nproc-per-node 2 ./examples/pytorch/text-classification/run_glue.py  --model_name_or_path bert-base-cased   --task_name $TASK_NAME   --do_train   --do_eval   --max_seq_length 128   --per_device_train_batch_size 16   --learning_rate 5e-5   --num_train_epochs 3   --output_dir /tmp/$TASK_NAME/ --overwrite_output_dir --deepspeed ~/transformers/tests/deepspeed/ds_config_zero3_ds_optim.json --lr_scheduler_type cosine --save_strategy "epoch" --evaluation_strategy "epoch" --logging_steps 1 --resume_from_checkpoint /tmp/$TASK_NAME/checkpoint-115
  4. lr and loss plots:
[Screenshot: learning-rate and loss curves across the interrupted and resumed runs]

You can see that the learning rate resumes from the previous value as expected, and the training loss is also in line with resumption from the checkpoint. A quick way to confirm the resumed learning rate without plots is sketched after this list.

  5. Make sure to use both PRs, the one for accelerate as well as the one for transformers.
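
As a cross-check of the plots, assuming the usual checkpoint layout written by the Trainer (a trainer_state.json whose log_history records the learning rate at each logging step), the resumed learning rate can be read directly from the checkpoint; the path matches the commands above:

import json
from pathlib import Path

# Checkpoint saved after the 1st epoch in the commands above.
checkpoint = Path("/tmp/mrpc/checkpoint-115")

# trainer_state.json is written by the Trainer next to the model weights;
# with --logging_steps 1 every step has a log entry.
state = json.loads((checkpoint / "trainer_state.json").read_text())
logged = [e for e in state["log_history"] if "learning_rate" in e]

print("global_step:", state["global_step"])
print("last logged learning rate:", logged[-1]["learning_rate"])

# After resuming with --resume_from_checkpoint, the first learning rate logged
# by the new run should continue from (roughly) this value, not from 0.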

@iMountTai
Author

iMountTai commented Sep 1, 2023

My DeepSpeed config:

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 100,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": true,
        "allgather_bucket_size": 1e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 1e8,
        "contiguous_gradients": true
    },

    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}

The reason the resume is not working correctly is that no optimizer is specified in my ds_config.
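
To make the distinction concrete: "specifying the optimizer" means adding an "optimizer" block like the one in the reproducer config above, so that DeepSpeed itself creates the optimizer. A rough sketch, assuming the transformers integration that also accepts the DeepSpeed config as an already-loaded dict passed to TrainingArguments (the values simply mirror the configs above and are illustrative):

from transformers import TrainingArguments

# Illustrative ZeRO-2 config as a Python dict, with an "optimizer" block added
# so that DeepSpeed creates the optimizer instead of the Trainer.
ds_config = {
    "fp16": {"enabled": "auto"},
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": "auto", "betas": "auto", "eps": "auto", "weight_decay": "auto"},
    },
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}

# TrainingArguments accepts either a path to a JSON file or a dict here.
args = TrainingArguments(output_dir="output", learning_rate=5e-5, deepspeed=ds_config)

Without such a block, the Trainer creates its own optimizer and scheduler, which is the HF optimizer + HF scheduler combination discussed in the next comments.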

@iMountTai
Author

With my ds_config, is the HF optimizer + HF scheduler combination not supported?

@pacman100
Contributor

Hello @iMountTai, I've fixed the issue, could you please retry? You can find experiments for all 4 combinations, verifying that resume_from_checkpoint works properly, here: #25863 (comment)

@iMountTai
Author

It worked. You are so excellent!
