Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Dist][Pass] ZeRO optimization #23

Merged
merged 20 commits into from
Apr 23, 2022
Merged

Conversation

zhen-jia
Copy link
Contributor

@zhen-jia zhen-jia commented Apr 18, 2022

Description

This PR performs all ZeRO related optimizations. There are two main optimizations and two passes are impacted. May changes are listed below:

  1. Modify the partition gradient pass: after modification, the reduce_scatters are grouped and issued together for ZeRO-2. ZeRO-1 is remain the same
  2. Implement a new pass, i.e., group_allgather. This pass detect the computation pattern in ZeRO and group the corresponding ops. Including cast and allgather. The group_allgather updates the parameter in-placed. So no extra add zero will be launched.
  3. A new field is added to distributed context: group_bucket_size. This is an hyper parameter to control the group size in group communications. The default value is 5000000000, which is in consistent with DeepSpeed. It can be set to 0 or 1 so as to not group any communications at all.

Checklist

  • PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented

cc @awslabs/raf-reviewer

Copy link
Contributor

@comaniac comaniac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM. One insight I got from this PR is that we really need an ANF pattern matcher...

Other than the comments I left, there are two points

  1. Please be ware of [Dist] Refactor dist context & communicator and add inter-process communication with file #20 that changes dist_context and will result in conflicts (please also help review that PR and get it merged).
  2. @hzfan @zachzzc @whbldhwj @hgt312 @yzhliu please help review this PR as well.

@zhen-jia
Copy link
Contributor Author

Got it, Thanks! Will modify accordingly.

Overall LGTM. One insight I got from this PR is that we really need an ANF pattern matcher...

Other than the comments I left, there are two points

  1. Please be ware of [Dist] Refactor dist context & communicator and add inter-process communication with file #20 that changes dist_context and will result in conflicts (please also help review that PR and get it merged).
  2. @hzfan @zachzzc @whbldhwj @hgt312 @yzhliu please help review this PR as well.

Copy link
Contributor

@comaniac comaniac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@comaniac comaniac merged commit a14d493 into awslabs:main Apr 23, 2022
@comaniac
Copy link
Contributor

Thanks @zhen-jia

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants