Is there an out-of-the-box test example for ZeRO-Infinity? #1253

Closed
weberxie opened this issue Jul 26, 2021 · 6 comments · Fixed by #1254

Comments

@weberxie

Hi team, I'm going to test the ZeRO-Infinity feature, especially activation checkpointing. Is there an out-of-the-box test example for ZeRO-Infinity?

Thanks.

@weberxie (Author)

I was running the example https://github.com/microsoft/DeepSpeedExamples/tree/master/bing_bert, modifying the configuration as follows.

For deepspeed_bsz32k_lamb_config_seq512.json:

```json
{
  "train_batch_size": 2048,
  "train_micro_batch_size_per_gpu": 16,
  "steps_per_print": 1000,
  "prescale_gradients": false,
  "gradient_clipping": 1.0,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 2e-3,
      "weight_decay": 0.01
    }
  },
  "wall_clock_breakdown": false,

  "zero_allow_untested_optimizer": true,

  "fp16": {
    "enabled": true,
    "loss_scale": 0
  },
  "zero_optimization": {
    "stage": 3,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "contiguous_gradients": true,
    "offload_optimizer": {
      "device": "cpu",
      "nvme_path": "/local_nvme",
      "pin_memory": true,
      "buffer_count": 4,
      "fast_init": false
    },
    "offload_param": {
      "device": "cpu",
      "nvme_path": "/local_nvme",
      "pin_memory": true,
      "buffer_count": 5,
      "buffer_size": 1e8,
      "max_in_cpu": 1e9
    },
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_prefetch_bucket_size": 5e8,
    "stage3_param_persistence_threshold": 1e6,
    "sub_group_size": 1e12,
    "elastic_checkpoint": true,
    "stage3_gather_fp16_weights_on_model_save": true,
    "ignore_unused_parameters": true
  },
  "activation_checkpointing": {
    "partition_activations": true,
    "cpu_checkpointing": true,
    "contiguous_memory_optimization": false,
    "number_checkpoints": null,
    "synchronize_checkpoint_boundary": false,
    "profile": false
  }
}
```

and for bert_large_lamb_nvidia_data.json:

```diff
diff --git a/bing_bert/bert_large_lamb_nvidia_data.json b/bing_bert/bert_large_lamb_nvidia_data.json
@@ -4,8 +4,8 @@
   "bert_model_file": "bert-large-uncased",
   "bert_model_config": {
     "vocab_size_or_config_json_file": 119547,
-    "hidden_size": 1024,
-    "num_hidden_layers": 24,
+    "hidden_size": 4096,
+    "num_hidden_layers": 48,
     "num_attention_heads": 16,
     "intermediate_size": 4096,
     "hidden_act": "gelu",
```

And this example runs out of memory (OOM) on an A100 node with 8 × 40GB GPUs.
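For context, the `activation_checkpointing` section of the JSON above is consumed by DeepSpeed's activation checkpointing module rather than by `deepspeed.initialize` alone. A minimal sketch of how model code typically wires it up is below; the function and variable names are illustrative assumptions, not taken from the bing_bert source.

```python
# Minimal sketch (assumed names, not the bing_bert source): consuming the
# "activation_checkpointing" section of the DeepSpeed JSON config in model code.
import deepspeed

# Reads partition_activations, cpu_checkpointing, etc. from the JSON config.
# The first argument (mpu_) is None here because no model-parallel
# (tensor-slicing) group is being used.
deepspeed.checkpointing.configure(
    None, deepspeed_config="deepspeed_bsz32k_lamb_config_seq512.json")

def run_encoder(layers, hidden_states):
    # Recompute each layer's activations during backward instead of storing
    # them, using DeepSpeed's drop-in activation checkpointing function.
    for layer in layers:
        hidden_states = deepspeed.checkpointing.checkpoint(layer, hidden_states)
    return hidden_states
```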

@weberxie (Author)

It seems that we can only offload partitioned activations to CPU when the model is parallel.

@tjruwase (Contributor)

> It seems that we can only offload partitioned activations to CPU when the model is parallel.

Can you please clarify what you mean by the "model is parallel"? Do you mean when tensor slicing is enabled (MP>1)?

@weberxie (Author)

Yes.

As stated in the documentation https://www.deepspeed.ai/docs/config-json/#activation-checkpointing, "Enables partition activation when used with model parallelism". Does this mean that CPU checkpointing can only be used in model-parallel mode?

Thank you so much.

@tjruwase (Contributor)

@weberxie, yes, you are correct. The current implementation has that requirement, which I don't think is necessary. I am working on a fix to make CPU checkpointing independent of activation partitioning.
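For anyone hitting this before the fix lands, a hedged sketch of the current behavior: `cpu_checkpointing` only takes effect together with `partition_activations`, which in turn needs a model-parallel group (MP > 1). In code that looks roughly like the following, where `mpu` and `num_layers` are assumed to come from the training framework (for example a Megatron-style model-parallel utility object); they are not defined anywhere in this thread.

```python
# Sketch only: enabling CPU checkpointing together with activation partitioning,
# which (before the fix) requires model parallelism (MP > 1).
import deepspeed

deepspeed.checkpointing.configure(
    mpu,                         # assumed model-parallel utility object (e.g. Megatron mpu)
    partition_activations=True,  # split each checkpointed activation across MP ranks
    checkpoint_in_cpu=True,      # then offload the partitioned activations to CPU
    num_checkpoints=num_layers,  # assumed: number of checkpointed layers in the model
)
```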

@weberxie (Author)

@tjruwase Thanks for your confirmation!

@tjruwase linked a pull request Jul 28, 2021 that will close this issue.