Is your feature request related to a problem? Please describe.
An activation-prefetch feature to enlarge the batch size for mid-size (100B–1T parameter) models.
Results from the DeepSpeedExamples repo show that GPU throughput suffers when activation checkpoints are kept in CPU memory. On A100 server pods, CPU activation checkpointing performs worse because of synchronization (the HtoD memcpy, or the all-gather when activations are partitioned).
For tensor parallelism, it would be better to save the activation checkpoint as a partitioned tensor.
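The partitioned-checkpoint idea above can be sketched in a framework-agnostic way: each tensor-parallel rank keeps only its 1/world_size shard of the checkpointed activation, and the full activation is reconstructed before recomputation. The helper names below are hypothetical; plain lists stand in for tensors and concatenation stands in for the all-gather collective.

```python
def partition_activation(activation, world_size, rank):
    # Hypothetical helper: each tensor-parallel rank stores only its
    # 1/world_size shard of the checkpointed activation, cutting the
    # per-GPU memory cost of the checkpoint.
    n = len(activation)
    assert n % world_size == 0, "activation must divide evenly across ranks"
    shard = n // world_size
    return activation[rank * shard:(rank + 1) * shard]

def gather_activation(shards):
    # Stands in for the all-gather that reassembles the full activation
    # on each rank before the backward recomputation runs.
    full = []
    for s in shards:
        full.extend(s)
    return full

# Toy usage: an 8-element "activation" sharded across 4 ranks.
act = list(range(8))
shards = [partition_activation(act, 4, r) for r in range(4)]
restored = gather_activation(shards)
```

In a real implementation the shards would live on each rank's GPU and `gather_activation` would be a `torch.distributed.all_gather` call; the point of the request is that this gather should happen asynchronously, ahead of need.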
Describe the solution you'd like
Activation prefetch during backward recomputation, enabled via configuration.
Furthermore, to increase the batch size under tensor parallelism, an asynchronous all-gather of the prefetched partitioned activations is needed.
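The requested prefetch amounts to a double-buffered pipeline: while layer i is being recomputed, the checkpoint for the next (earlier) layer is fetched asynchronously, so the HtoD copy no longer serializes the backward pass. A minimal sketch follows; all names are hypothetical, and a thread pool stands in for what would really be a `non_blocking` copy on a side CUDA stream.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_checkpoint(store, idx):
    # Stands in for the asynchronous host-to-device copy (or async
    # all-gather of a partitioned checkpoint) that should overlap
    # with the recomputation of the current layer.
    return store[idx]

def backward_with_prefetch(checkpoints, recompute):
    """Recompute layers in reverse order, prefetching the checkpoint
    of the next (earlier) layer while the current one is recomputed."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        n = len(checkpoints)
        future = pool.submit(fetch_checkpoint, checkpoints, n - 1)
        for i in range(n - 1, -1, -1):
            act = future.result()      # wait only for the prefetched activation
            if i > 0:                  # kick off the next fetch, then recompute
                future = pool.submit(fetch_checkpoint, checkpoints, i - 1)
            results.append(recompute(i, act))
    return results

# Toy usage: "recomputation" just squares the checkpointed value.
ckpts = [1, 2, 3, 4]
out = backward_with_prefetch(ckpts, lambda i, a: a * a)  # → [16, 9, 4, 1]
```

A real implementation would pin the host-side buffers and issue `tensor.to(device, non_blocking=True)` on a dedicated stream, synchronizing that stream just before the layer's recomputation begins.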
Describe alternatives you've considered
For large-scale pods with more than 128 GPUs this would not be a problem.
It would also be appreciated if recommended GPU cluster sizes were provided per model configuration (10B, 50B, 100B, 1T).
@ckddls1321, thanks for this feature request. However, it seems there are multiple requests contained here. Can you please itemize the requested features?
Also, does #1254 address your request to save activation checkpoints as partitioned tensors on GPU for tensor parallelism?
Thanks!