Is there an out-of-the-box test example for ZeRO-Infinity? #1253

Closed
weberxie opened this issue Jul 26, 2021 · 6 comments · Fixed by #1254

Comments

@weberxie

Hi team, I'm going to test the ZeRO-Infinity feature, especially activation checkpointing. Is there an out-of-the-box test example for ZeRO-Infinity?

Thanks.

@weberxie (Author)

I was running the example https://github.com/microsoft/DeepSpeedExamples/tree/master/bing_bert, modifying the configuration as follows.

For deepspeed_bsz32k_lamb_config_seq512.json:

```json
{
  "train_batch_size": 2048,
  "train_micro_batch_size_per_gpu": 16,
  "steps_per_print": 1000,
  "prescale_gradients": false,
  "gradient_clipping": 1.0,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 2e-3,
      "weight_decay": 0.01
    }
  },
  "wall_clock_breakdown": false,

  "zero_allow_untested_optimizer": true,

  "fp16": {
    "enabled": true,
    "loss_scale": 0
  },
  "zero_optimization": {
    "stage": 3,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "contiguous_gradients": true,
    "offload_optimizer": {
      "device": "cpu",
      "nvme_path": "/local_nvme",
      "pin_memory": true,
      "buffer_count": 4,
      "fast_init": false
    },
    "offload_param": {
      "device": "cpu",
      "nvme_path": "/local_nvme",
      "pin_memory": true,
      "buffer_count": 5,
      "buffer_size": 1e8,
      "max_in_cpu": 1e9
    },
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_prefetch_bucket_size": 5e8,
    "stage3_param_persistence_threshold": 1e6,
    "sub_group_size": 1e12,
    "elastic_checkpoint": true,
    "stage3_gather_fp16_weights_on_model_save": true,
    "ignore_unused_parameters": true
  },
  "activation_checkpointing": {
    "partition_activations": true,
    "cpu_checkpointing": true,
    "contiguous_memory_optimization": false,
    "number_checkpoints": null,
    "synchronize_checkpoint_boundary": false,
    "profile": false
  }
}
```

and for bert_large_lamb_nvidia_data.json:

```diff
diff --git a/bing_bert/bert_large_lamb_nvidia_data.json b/bing_bert/bert_large_lamb_nvidia_data.json
@@ -4,8 +4,8 @@
   "bert_model_file": "bert-large-uncased",
   "bert_model_config": {
     "vocab_size_or_config_json_file": 119547,
-    "hidden_size": 1024,
-    "num_hidden_layers": 24,
+    "hidden_size": 4096,
+    "num_hidden_layers": 48,
     "num_attention_heads": 16,
     "intermediate_size": 4096,
     "hidden_act": "gelu",
```

And this example runs out of memory (OOM) on an A100 node with 8 × 40GB GPUs.
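For context, the `activation_checkpointing` section of the JSON above is consumed by DeepSpeed's activation checkpointing module rather than by `deepspeed.initialize` alone. A minimal sketch of how model code typically wires it up is below; the function and variable names are illustrative assumptions, not taken from the bing_bert source.

```python
# Minimal sketch (assumed names, not the bing_bert source): consuming the
# "activation_checkpointing" section of the DeepSpeed JSON config in model code.
import deepspeed

# Reads partition_activations, cpu_checkpointing, etc. from the JSON config.
# The first argument (mpu_) is None here because no model-parallel
# (tensor-slicing) group is being used.
deepspeed.checkpointing.configure(
    None, deepspeed_config="deepspeed_bsz32k_lamb_config_seq512.json")

def run_encoder(layers, hidden_states):
    # Recompute each layer's activations during backward instead of storing
    # them, using DeepSpeed's drop-in activation checkpointing function.
    for layer in layers:
        hidden_states = deepspeed.checkpointing.checkpoint(layer, hidden_states)
    return hidden_states
```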

@weberxie (Author)

It seems that we can only offload partitioned activations to CPU when the model is parallel.

@tjruwase (Contributor)

> It seems that we can only offload partitioned activations to CPU when the model is parallel.

Can you please clarify what you mean by the "model is parallel"? Do you mean when tensor slicing is enabled (MP>1)?

@weberxie (Author)

Yes.

As stated in the documentation https://www.deepspeed.ai/docs/config-json/#activation-checkpointing, "Enables partition activation when used with model parallelism". Does this mean that CPU checkpointing can only be used in model-parallel mode?

Thank you so much.

@tjruwase (Contributor)

@weberxie, yes, you are correct. The current implementation has that requirement, which I don't think is necessary. I am working on a fix to make CPU checkpointing independent of activation partitioning.
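For anyone hitting this before the fix lands, a hedged sketch of the current behavior: `cpu_checkpointing` only takes effect together with `partition_activations`, which in turn needs a model-parallel group (MP > 1). In code that looks roughly like the following, where `mpu` and `num_layers` are assumed to come from the training framework (for example a Megatron-style model-parallel utility object); they are not defined anywhere in this thread.

```python
# Sketch only: enabling CPU checkpointing together with activation partitioning,
# which (before the fix) requires model parallelism (MP > 1).
import deepspeed

deepspeed.checkpointing.configure(
    mpu,                         # assumed model-parallel utility object (e.g. Megatron mpu)
    partition_activations=True,  # split each checkpointed activation across MP ranks
    checkpoint_in_cpu=True,      # then offload the partitioned activations to CPU
    num_checkpoints=num_layers,  # assumed: number of checkpointed layers in the model
)
```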

@weberxie (Author)

@tjruwase Thanks for your confirmation!

@tjruwase linked a pull request Jul 28, 2021 that will close this issue.