Is there an out-of-the-box test example for ZeRO-Infinity? #1253
Comments
I was running the example https://github.com/microsoft/DeepSpeedExamples/tree/master/bing_bert with a modified configuration. For deepspeed_bsz32k_lamb_config_seq512.json: "zero_allow_untested_optimizer" : true, "fp16": { and for diff --git a/bing_bert/bert_large_lamb_nvidia_data.json b/bing_bert/bert_large_lamb_nvidia_data.json @@ -4,8 +4,8 @@ With these changes, the example runs out of memory (OOM) on an A100 node with 8 * 40GB GPUs. |
It seems that we can only offload partitioned activations to CPU when the model is parallel. |
Can you please clarify what you mean by the "model is parallel"? Do you mean when tensor slicing is enabled (MP>1)? |
Yes. The documentation at https://www.deepspeed.ai/docs/config-json/#activation-checkpointing states: “Enables partition activation when used with model parallelism”. Does this mean that CPU checkpointing can only be used in model-parallel mode? Thank you so much. |
@weberxie, yes, you are correct. The current implementation has this requirement, which I don't think is necessary. I am working on a fix to make CPU checkpointing independent of partition activation. |
@tjruwase Thanks for your confirmation! |
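For readers following the thread, here is a minimal sketch of the `activation_checkpointing` section under discussion, together with a check that encodes the restriction as described above. The key names `partition_activations` and `cpu_checkpointing` come from the DeepSpeed config documentation linked in the thread; the values are illustrative, not the ones used in this issue. The helper `cpu_checkpointing_is_effective` and its `mp_size` parameter are hypothetical and not part of DeepSpeed.

```python
import json

# Illustrative config fragment only; not the configuration used in this issue.
ds_config = {
    "zero_allow_untested_optimizer": True,
    "activation_checkpointing": {
        "partition_activations": True,
        "cpu_checkpointing": True,
    },
}


def cpu_checkpointing_is_effective(config, mp_size):
    """Hypothetical helper, not a DeepSpeed API.

    Encodes the restriction described in this thread: CPU checkpointing
    only applies to partitioned activations, and activations are only
    partitioned when model parallelism is in use (mp_size > 1).
    """
    ac = config.get("activation_checkpointing", {})
    if not ac.get("cpu_checkpointing", False):
        return False
    return bool(ac.get("partition_activations", False)) and mp_size > 1


if __name__ == "__main__":
    print(json.dumps(ds_config, indent=2))
    # Pure data parallelism: the restriction in this thread applies.
    print(cpu_checkpointing_is_effective(ds_config, mp_size=1))
    # With tensor slicing enabled (MP > 1).
    print(cpu_checkpointing_is_effective(ds_config, mp_size=2))
```

Under this reading, a data-parallel-only run (MP = 1) would not benefit from `cpu_checkpointing`, which is consistent with the OOM reported above. Once the fix mentioned by the maintainer lands, the check would no longer reflect DeepSpeed's behavior.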
Hi team, I'm going to test the ZeRO-Infinity feature, especially activation checkpointing. Is there an out-of-the-box test example for ZeRO-Infinity?
Thanks.