Commit

DS-Ulysses formatting (#4204)
* fix indentation

* fix formatting

---------

Co-authored-by: Jeff Rasley <[email protected]>
samadejacobs and jeffra authored Aug 24, 2023
1 parent 3e82cb6 commit 961827b
Showing 2 changed files with 3 additions and 4 deletions.
README.md: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
[![License Apache 2.0](https://badgen.net/badge/license/apache2.0/blue)](https://github.com/Microsoft/DeepSpeed/blob/master/LICENSE)
[![PyPI version](https://badge.fury.io/py/deepspeed.svg)](https://pypi.org/project/deepspeed/)
-[![Downloads](https://static.pepy.tech/badge/deepspeed)](https://pepy.tech/project/deepspeed)
+[![Downloads](https://static.pepy.tech/badge/deepspeed)](https://pepy.tech/project/deepspeed)
[![Build](https://badgen.net/badge/build/check-status/blue)](#build-pipeline-status)
[![Twitter](https://img.shields.io/twitter/follow/MSFTDeepSpeed)](https://twitter.com/intent/follow?screen_name=MSFTDeepSpeed)
[![Japanese Twitter](https://img.shields.io/badge/%E6%97%A5%E6%9C%AC%E8%AA%9ETwitter-%40MSFTDeepSpeedJP-blue)](https://twitter.com/MSFTDeepSpeedJP)
blogs/deepspeed-ulysses/README.md: 2 additions & 3 deletions
@@ -149,7 +149,7 @@ match this analysis.

### Additional Highlights of DeepSpeed-Ulysses

-1) An Attention Agnostic Solution
+***An Attention Agnostic Solution***

DeepSpeed implementation of distributed attention module is general
enough to support any attention: e.g., self-attention, cross-attention,
@@ -165,8 +165,7 @@ per head but just with fewer heads, thus attention computation can be
replaced with any type of attention mechanisms, e.g., dense attention
and various forms of sparse attention.

-2) Training Bigger Models with Longer Sequences through ZeRO-3
-Integration
+***Training Bigger Models with Longer Sequences through ZeRO-3 Integration***

While DeepSpeed sequence parallelism reduces the activation memory when
training with longer sequences, it does not impact the memory consumed
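The context lines in this diff describe the mechanism that makes DeepSpeed-Ulysses attention agnostic: an all-to-all gives every worker the full sequence for a subset of attention heads, any per-head attention is applied, and a second all-to-all restores the sequence sharding. Below is a minimal, single-process PyTorch sketch of that data movement; the all-to-all exchanges are simulated with reshapes over a leading "rank" dimension, and the names (`P`, `ulysses_attention`, `attn_fn`) are illustrative assumptions rather than the DeepSpeed API.

```python
# Single-process sketch of the DeepSpeed-Ulysses idea described above: each "rank"
# holds a sequence shard, a (simulated) all-to-all regroups data so every rank sees
# the full sequence for a subset of heads, any attention runs per head, and a second
# all-to-all restores the sequence sharding. Names and shapes are hypothetical.
import torch

P = 4                        # number of sequence-parallel ranks (simulated as a tensor dim)
B, S, H, D = 2, 32, 8, 16    # batch, full sequence length, heads, head dim
assert S % P == 0 and H % P == 0

def ulysses_attention(x_sharded, attn_fn):
    # x_sharded: [P, B, S/P, H, D] -- "rank" p owns sequence shard p.
    Pn, Bn, Sp, Hn, Dn = x_sharded.shape
    # First all-to-all (simulated): each rank ends up with the full sequence
    # for H/P of the heads -> [P, B, S, H/P, D].
    x = x_sharded.reshape(Pn, Bn, Sp, Pn, Hn // Pn, Dn)       # split heads into P groups
    x = x.permute(3, 1, 0, 2, 4, 5).reshape(Pn, Bn, Sp * Pn, Hn // Pn, Dn)
    # Any attention mechanism can be plugged in here, because each rank now
    # holds complete sequences for its local heads.
    out = attn_fn(x)                                          # [P, B, S, H/P, D]
    # Second all-to-all (simulated): return to sequence sharding with all heads.
    out = out.reshape(Pn, Bn, Pn, Sp, Hn // Pn, Dn)
    out = out.permute(2, 1, 3, 0, 4, 5).reshape(Pn, Bn, Sp, Hn, Dn)
    return out

def dense_softmax_attention(x):
    # Stand-in "any attention": dense self-attention where q = k = v = x.
    q = k = v = x                                             # [P, B, S, h, D]
    scores = torch.einsum("pbshd,pbthd->pbhst", q, k) / D ** 0.5
    return torch.einsum("pbhst,pbthd->pbshd", scores.softmax(-1), v)

x = torch.randn(P, B, S // P, H, D)
y = ulysses_attention(x, dense_softmax_attention)
print(y.shape)   # torch.Size([4, 2, 8, 8, 16]) -- same sharded layout as the input
```

Because `attn_fn` only ever sees complete sequences for its local heads, the dense attention here could be swapped for any sparse or cross-attention variant without changing the communication pattern.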

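The truncated paragraph above makes the complementary point: sequence parallelism reduces activation memory, while the memory consumed by model states (parameters, gradients, optimizer states) is what ZeRO-3 addresses. A hedged sketch of the ZeRO side of such a setup follows; the keys shown are standard DeepSpeed ZeRO settings, and the sequence-parallel degree itself is assumed to be configured in the training script (for example, via a Megatron-DeepSpeed launch argument), not in this config.

```python
# Illustrative DeepSpeed config pairing ZeRO stage 3 with sequence parallelism.
# Only standard ZeRO keys are shown; batch size and precision values are placeholders.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                   # partition parameters, gradients, optimizer states
        "overlap_comm": True,         # overlap ZeRO communication with backward compute
        "contiguous_gradients": True,
    },
}

# Typical usage (model construction omitted):
# engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(), config=ds_config)
```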