Is your feature request related to a problem? Please describe.
An activation-prefetch feature to enlarge the batch size for mid-size (100B–1T parameter) models.
Results from the DeepSpeedExamples repo show that GPU throughput suffers when activation checkpoints are kept in CPU memory. On A100 server pods, CPU activation checkpointing performs worse because of synchronization (the HtoD memcpy, or the all-gather when activations are partitioned).
For tensor parallelism, it would be better to save the activation checkpoint as a partitioned tensor.
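The partitioned-checkpoint idea above can be sketched in a framework-agnostic way: each tensor-parallel rank keeps only its 1/world_size shard of the checkpointed activation, and the full activation is reconstructed before recomputation. The helper names below are hypothetical; plain lists stand in for tensors and concatenation stands in for the all-gather collective.

```python
def partition_activation(activation, world_size, rank):
    # Hypothetical helper: each tensor-parallel rank stores only its
    # 1/world_size shard of the checkpointed activation, cutting the
    # per-GPU memory cost of the checkpoint.
    n = len(activation)
    assert n % world_size == 0, "activation must divide evenly across ranks"
    shard = n // world_size
    return activation[rank * shard:(rank + 1) * shard]

def gather_activation(shards):
    # Stands in for the all-gather that reassembles the full activation
    # on each rank before the backward recomputation runs.
    full = []
    for s in shards:
        full.extend(s)
    return full

# Toy usage: an 8-element "activation" sharded across 4 ranks.
act = list(range(8))
shards = [partition_activation(act, 4, r) for r in range(4)]
restored = gather_activation(shards)
```

In a real implementation the shards would live on each rank's GPU and `gather_activation` would be a `torch.distributed.all_gather` call; the point of the request is that this gather should happen asynchronously, ahead of need.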
Describe the solution you'd like
Activation prefetch during backward recomputation, enabled via configuration.
Furthermore, to increase the batch size under tensor parallelism, an asynchronous all-gather of the prefetched partitioned activations is needed.
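The requested prefetch amounts to a double-buffered pipeline: while layer i is being recomputed, the checkpoint for the next (earlier) layer is fetched asynchronously, so the HtoD copy no longer serializes the backward pass. A minimal sketch follows; all names are hypothetical, and a thread pool stands in for what would really be a `non_blocking` copy on a side CUDA stream.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_checkpoint(store, idx):
    # Stands in for the asynchronous host-to-device copy (or async
    # all-gather of a partitioned checkpoint) that should overlap
    # with the recomputation of the current layer.
    return store[idx]

def backward_with_prefetch(checkpoints, recompute):
    """Recompute layers in reverse order, prefetching the checkpoint
    of the next (earlier) layer while the current one is recomputed."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        n = len(checkpoints)
        future = pool.submit(fetch_checkpoint, checkpoints, n - 1)
        for i in range(n - 1, -1, -1):
            act = future.result()      # wait only for the prefetched activation
            if i > 0:                  # kick off the next fetch, then recompute
                future = pool.submit(fetch_checkpoint, checkpoints, i - 1)
            results.append(recompute(i, act))
    return results

# Toy usage: "recomputation" just squares the checkpointed value.
ckpts = [1, 2, 3, 4]
out = backward_with_prefetch(ckpts, lambda i, a: a * a)  # → [16, 9, 4, 1]
```

A real implementation would pin the host-side buffers and issue `tensor.to(device, non_blocking=True)` on a dedicated stream, synchronizing that stream just before the layer's recomputation begins.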
Describe alternatives you've considered
For large-scale pods with more than 128 GPUs this would not be a problem.
It would also be appreciated if recommended GPU cluster sizes were provided per model configuration (10B, 50B, 100B, 1T).
@ckddls1321, thanks for this feature request. However, it seems there are multiple requests contained here. Can you please itemize the requested features?
Also, does #1254 address your request to save activation checkpoints as partitioned tensors on GPU for tensor parallelism?
Thanks!