Replies: 1 comment
- I manually copied the offloaded tensors into the NVMe path, and that fixed the problem. But the losses after resuming from the previous checkpoint are inconsistent with the losses before. Not sure where the problem is; my guess is that the trainer state was not completely restored from the saved checkpoint.
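One way to check that guess is to inspect the `trainer_state.json` that the Trainer writes into every checkpoint directory. A minimal sketch, assuming a standard checkpoint layout (the `output_dir/checkpoint-500` path is hypothetical):

```python
from transformers import TrainerState

# Load the state the Trainer saved alongside the model weights;
# the checkpoint directory name here is hypothetical.
state = TrainerState.load_from_json("output_dir/checkpoint-500/trainer_state.json")

# global_step and epoch should match the point where training stopped.
print(f"global_step={state.global_step}, epoch={state.epoch}")

# log_history holds the logged metrics; the last entries before saving
# can be compared against the first losses logged after resuming.
for entry in state.log_history[-3:]:
    print(entry)
```

If `global_step` and `log_history` look correct, the trainer state itself was probably restored, and the mismatch would have to come from elsewhere (for example, the optimizer or offloaded parameter state).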
- The ZeRO-3 offload config is:
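(The config itself did not survive in this thread. Purely as a representative example, not the poster's actual settings, a ZeRO-3 config with NVMe offload for both optimizer and parameters typically looks like this; the `nvme_path` is hypothetical:)

```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "nvme",
      "nvme_path": "/local_nvme",
      "pin_memory": true
    },
    "offload_param": {
      "device": "nvme",
      "nvme_path": "/local_nvme",
      "pin_memory": true
    }
  },
  "train_micro_batch_size_per_gpu": "auto",
  "train_batch_size": "auto"
}
```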
Training with this config works well, but when resuming from the checkpoint with the transformers Trainer, problems occur.
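For reference, resuming is presumably done through the standard Trainer API, along these lines (a sketch; the model, dataset, and file paths are assumptions, not taken from the thread):

```python
from transformers import Trainer, TrainingArguments

# model and train_dataset are assumed to already exist; the deepspeed
# argument points at the ZeRO-3 offload config (filename hypothetical).
args = TrainingArguments(
    output_dir="output_dir",
    deepspeed="ds_zero3_offload.json",
)
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)

# Resume from an explicit checkpoint directory, or pass True to let the
# Trainer pick up the latest checkpoint under output_dir.
trainer.train(resume_from_checkpoint="output_dir/checkpoint-500")
```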
The problem seems to begin at the following line: DeepSpeed/deepspeed/runtime/engine.py, line 2756 at commit 2fc702e.