[REQUEST] Activation Checkpoint Prefetch #1575

Open
ckddls1321 opened this issue Nov 19, 2021 · 2 comments
Labels: enhancement (New feature or request)

Comments

@ckddls1321

Is your feature request related to a problem? Please describe.
Activation prefetching would make it possible to use larger batch sizes on mid-size models (100B to 1T parameters).

  • Based on the DeepSpeedExamples repo, GPU throughput matters when activation checkpoints are kept in CPU memory.
  • On A100 server pods, CPU activation checkpointing performs worse because of synchronization (the HtoD memcpy, or the all-gather when activations are partitioned).
  • For tensor parallelism, it would be better to save activation checkpoints as partitioned tensors.
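
For reference, the existing options these bullets touch on live under the `activation_checkpointing` section of the DeepSpeed config. A minimal sketch as a Python dict follows; the values are assumptions chosen for illustration, and none of these keys enables prefetching:

```python
# Illustrative DeepSpeed config fragment. The values are example choices,
# not a recommended setup.
ds_config = {
    "activation_checkpointing": {
        # Split each checkpoint across model-parallel ranks.
        "partition_activations": True,
        # Offload the partitioned checkpoints to host (CPU) memory.
        "cpu_checkpointing": True,
        "contiguous_memory_optimization": False,
        "synchronize_checkpoint_boundary": False,
        "profile": False,
    }
}
```

With `cpu_checkpointing` enabled, each checkpoint has to come back to the GPU before recomputation, which is the synchronization cost described above.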

Describe the solution you'd like
Prefetch activation checkpoints during backward recomputation, controlled through the configuration.
Furthermore, to increase the batch size under tensor parallelism, the partitioned activations need to be prefetched with an asynchronous all-gather.
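
To make the requested behavior concrete, here is a minimal simulation of the overlap, assuming one background worker stands in for the copy or communication stream. It is not DeepSpeed code: `fetch_checkpoint`, `recompute_and_backward`, and the sleep-based costs are hypothetical placeholders.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_checkpoint(layer, latency=0.02):
    """Hypothetical stand-in for an HtoD copy (or all-gather) of one checkpoint."""
    time.sleep(latency)
    return f"act[{layer}]"


def recompute_and_backward(layer, activation, cost=0.02):
    """Hypothetical stand-in for recomputing a layer's forward and running backward."""
    time.sleep(cost)
    return (layer, activation)


def backward_without_prefetch(num_layers):
    """Baseline: every fetch blocks the compute that depends on it."""
    return [
        recompute_and_backward(layer, fetch_checkpoint(layer))
        for layer in range(num_layers - 1, -1, -1)
    ]


def backward_with_prefetch(num_layers):
    """Walk layers in reverse, fetching layer i-1 while layer i is computing."""
    done = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch_checkpoint, num_layers - 1)
        for layer in range(num_layers - 1, -1, -1):
            activation = pending.result()  # blocks only if the prefetch is unfinished
            if layer > 0:
                pending = pool.submit(fetch_checkpoint, layer - 1)
            done.append(recompute_and_backward(layer, activation))
    return done
```

Both functions produce the same results in the same order; the prefetching version only hides fetch latency behind compute. In PyTorch this pattern would likely map to copies from pinned host memory issued on a separate `torch.cuda.Stream` with `non_blocking=True`, or to `torch.distributed.all_gather(..., async_op=True)` for partitioned activations.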

Describe alternatives you've considered
On large-scale pods with more than 128 GPUs, this would not be a problem.
It would be helpful to have recommended GPU cluster sizes for each model configuration (10B, 50B, 100B, 1T).
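
As a rough starting point for such sizing, here is a back-of-the-envelope sketch. It assumes mixed-precision Adam at 16 bytes per parameter (fp16 weights and gradients plus fp32 optimizer states), model states fully sharded across GPUs, and 80 GB of memory per GPU. It ignores activations, buffers, and fragmentation, so it gives only a lower bound; `min_gpus_for_model_states` is a hypothetical helper.

```python
# Assumption: 2 B fp16 weights + 2 B fp16 grads + 12 B fp32 Adam states per parameter.
BYTES_PER_PARAM = 16


def min_gpus_for_model_states(num_params, gpu_mem_gb=80):
    """Lower bound on GPU count needed just to hold fully sharded model states."""
    total_bytes = num_params * BYTES_PER_PARAM
    gpu_bytes = gpu_mem_gb * 10**9
    return -(-total_bytes // gpu_bytes)  # integer ceiling division


if __name__ == "__main__":
    sizes = {"10B": 10 * 10**9, "50B": 50 * 10**9, "100B": 100 * 10**9, "1T": 10**12}
    for name, num_params in sizes.items():
        print(name, min_gpus_for_model_states(num_params))
```

Any real recommendation would also need to budget for activation memory, which is the part this request is trying to reduce.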

@ckddls1321 ckddls1321 added the enhancement New feature or request label Nov 19, 2021
@tjruwase
Contributor

@ckddls1321, thanks for this feature request. However, it seems there are multiple requests contained here. Can you please itemize the requested features?
Also, does #1254 address your request to save activation checkpoints in partitioned form on GPU for tensor parallelism?
Thanks!

@tjruwase tjruwase self-assigned this Nov 20, 2021
@ckddls1321
Author

@tjruwase, sorry, my message was not clear.

  1. Can DeepSpeed also prefetch activation checkpoints?
