Replies: 1 comment
- I manually copied the offloaded tensors into the NVMe path, and that fixed the problem. But the losses after resuming from the previous checkpoint are inconsistent with the losses before. Not sure where the problem is; my guess is that the trainer state was not completely restored from the saved checkpoint.
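One way to check that guess is to inspect the `trainer_state.json` that the Trainer writes into every checkpoint directory. A minimal sketch, assuming a standard checkpoint layout (the `output_dir/checkpoint-500` path is hypothetical):

```python
from transformers import TrainerState

# Load the state the Trainer saved alongside the model weights;
# the checkpoint directory name here is hypothetical.
state = TrainerState.load_from_json("output_dir/checkpoint-500/trainer_state.json")

# global_step and epoch should match the point where training stopped.
print(f"global_step={state.global_step}, epoch={state.epoch}")

# log_history holds the logged metrics; the last entries before saving
# can be compared against the first losses logged after resuming.
for entry in state.log_history[-3:]:
    print(entry)
```

If `global_step` and `log_history` look correct, the trainer state itself was probably restored, and the mismatch would have to come from elsewhere (for example, the optimizer or offloaded parameter state).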
- The ZeRO-3 offload config is:
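(The config itself did not survive in this thread. Purely as a representative example, not the poster's actual settings, a ZeRO-3 config with NVMe offload for both optimizer and parameters typically looks like this; the `nvme_path` is hypothetical:)

```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "nvme",
      "nvme_path": "/local_nvme",
      "pin_memory": true
    },
    "offload_param": {
      "device": "nvme",
      "nvme_path": "/local_nvme",
      "pin_memory": true
    }
  },
  "train_micro_batch_size_per_gpu": "auto",
  "train_batch_size": "auto"
}
```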
Training with this config works well, but when resuming from the checkpoint with the transformers Trainer, problems occur.
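For reference, resuming is presumably done through the standard Trainer API, along these lines (a sketch; the model, dataset, and file paths are assumptions, not taken from the thread):

```python
from transformers import Trainer, TrainingArguments

# model and train_dataset are assumed to already exist; the deepspeed
# argument points at the ZeRO-3 offload config (filename hypothetical).
args = TrainingArguments(
    output_dir="output_dir",
    deepspeed="ds_zero3_offload.json",
)
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)

# Resume from an explicit checkpoint directory, or pass True to let the
# Trainer pick up the latest checkpoint under output_dir.
trainer.train(resume_from_checkpoint="output_dir/checkpoint-500")
```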
The problem seems to begin at the following line: DeepSpeed/deepspeed/runtime/engine.py, line 2756 at commit 2fc702e.