Another thing to merge. (MY EYES HURT) #1

Merged
merged 199 commits on Jan 24, 2021

Commits
1f97242
only global rank 0 can log tensorboard data; avoid multi gpu/node rac…
arashashari Jul 21, 2020
871f7e6
Update setup.py (#298)
jeffra Jul 22, 2020
3cc96e1
Avoid deadlock for unsynchronized non-zero checkpointing (#297)
tjruwase Jul 22, 2020
eb74c3f
updates to amp to support grad clip and grad accumulation (#290)
jeffra Jul 23, 2020
ec94341
pass steps_per_print to tput timer (#299)
jeffra Jul 24, 2020
0f94f7e
bump DSExamples (#300)
jeffra Jul 24, 2020
7ae8f8b
DeepSpeed webinar announcement (#301)
Jul 25, 2020
67821f9
Update README.md (#302)
jeffra Jul 27, 2020
97c5427
Fixing a typo (#303)
MannyKayy Jul 28, 2020
e50b883
Fix nv_peer_mem version (#304)
jeffra Jul 28, 2020
9d07d75
NameError: name 'mpu' is not defined (#305)
MannyKayy Aug 1, 2020
c35e944
Removing () from assertion. (#307)
Aug 7, 2020
29c5fe2
Add webinar link (#309)
jeffra Aug 7, 2020
903a41a
updates website gems after kramdown alert (#311)
Aug 8, 2020
cd68e6e
Fix+tests for get_lr from lr_scheduler before training starts (#310)
jeffra Aug 10, 2020
892ece6
bumping DSE commit for pillow security fix (#312)
Aug 12, 2020
3437342
Update deepspeed_lr_schedules.py (#314)
jeffra Aug 12, 2020
6855ba1
Update fan out flag for pdsh (#315)
jeffra Aug 13, 2020
e1bea67
attach empty grad to its param to ensure it's copied after reduction …
jeffra Aug 13, 2020
de0523d
bump DSE (#317)
jeffra Aug 14, 2020
e69b1ee
Turn off multi-node launch if only 1 node (#322)
jeffra Aug 18, 2020
21d5f63
Add code owners for DeepSpeed team (#335)
jeffra Aug 27, 2020
6823db3
bump DSE
jeffra Aug 28, 2020
458c0d9
Update deepspeed_checkpointing.py (#336)
samyam Aug 31, 2020
7240abf
Samyamr/grad acc stage2 (#338)
samyam Sep 1, 2020
7a356b2
Rename ds_config_func_bs8_zero2_gas10.json to ds_config_func_bs8_zero…
samyam Sep 1, 2020
6122a74
Rename ds_config_func_bs8_zero0_gas10.json to ds_config_func_bs8_zero…
samyam Sep 1, 2020
f4726b7
Update run_func_test.py
samyam Sep 1, 2020
e8dd47d
Update .gitignore
jeffra Sep 1, 2020
838f53b
Switches BBS example to use mbsize=3 and gas=2 to fit in 16GB of memo…
Sep 1, 2020
e5bbc2e
Sparse attn + ops/runtime refactor + v0.3.0 (#343)
jeffra Sep 2, 2020
8716540
Update Dockerfile
jeffra Sep 2, 2020
5518aae
Update Dockerfile
jeffra Sep 2, 2020
1661e83
update DSE and rename SA tests
jeffra Sep 2, 2020
1ebcd6c
Update test_sparse_attention.py
jeffra Sep 3, 2020
6deac82
Adding link to Sparse Attention in Navigation page (#355)
arashashari Sep 3, 2020
ac12833
Jekyll installation instructions (#351)
Sep 4, 2020
a64b0ab
fixed a typo; this was fixed before but seems like it has been lost i…
arashashari Sep 5, 2020
4d4eafb
Move code quality tests to Azure-hosted agents. (#368)
Sep 5, 2020
9e83ef2
Update installation instructions (#362)
tjruwase Sep 6, 2020
9dadf38
Update Sparse Attention Tutorial (#357)
arashashari Sep 6, 2020
b73894d
adding sparse attention to feature index page (#377)
arashashari Sep 9, 2020
234bba0
temp disable model tests
jeffra Sep 9, 2020
01726ce
Add 1-bit Adam support to DeepSpeed (#380)
awan-10 Sep 9, 2020
161e8e6
fixing a link issue with SA tutorial (#387)
arashashari Sep 9, 2020
79093d7
Update test triggers to exclude docs
jeffra Sep 9, 2020
41db1c2
ZeRO-Offload release (#391)
jeffra Sep 10, 2020
65c2f97
Pipeline parallel training engine. (#392)
Sep 10, 2020
093f09f
Update documentation for 1-bit Adam (#388)
awan-10 Sep 10, 2020
dca0b78
Fix datatype issue with sparse attention softmax (#363)
jeffra Sep 10, 2020
c0d5424
Add openmpi to dockerfile
jeffra Sep 10, 2020
2dea61f
ZeRO tutorials (#384)
tjruwase Sep 10, 2020
b1d4bd7
fix for 16GB v100 nodes (#393)
jeffra Sep 10, 2020
be4b94b
Sparse attention: updating code tag in documentation (#394)
arashashari Sep 10, 2020
59ce90d
Minjiaz/zero offload (#382)
minjiaz Sep 10, 2020
c76769c
Adding sparse attention news index item (#376)
arashashari Sep 10, 2020
a8a8b3d
Landing page updates (#395)
jeffra Sep 10, 2020
7baf3c3
Update README.md
jeffra Sep 10, 2020
6bb5c69
Website edits (#398)
Sep 10, 2020
b29229b
update docker image and bump DSE
jeffra Sep 10, 2020
240ea97
only add 1bit adam reqs if mpi is installed, update cond build for cp…
jeffra Sep 10, 2020
4b1df25
bump DSE and doc tweak
jeffra Sep 10, 2020
9693595
Update README.md
jeffra Sep 10, 2020
ea92ed2
Update _config.yml
jeffra Sep 10, 2020
5dc4d6c
Update news site with press release link
jeffra Sep 10, 2020
d15015e
Update ZeRO-Offload blog post link (#401)
tjruwase Sep 10, 2020
15ca99c
remove old pt file
jeffra Sep 10, 2020
c82756c
readthedocs upgrade (#402)
Sep 10, 2020
e549be6
supporting different intermediate sizes other than 4 * hidden_dim (#389)
RezaYazdaniAminabadi Sep 11, 2020
4ac9bf6
Revert "supporting different intermediate sizes other than 4 * hidden…
jeffra Sep 11, 2020
473ff98
scales throughput by logging freq (#408)
Sep 13, 2020
91b4a93
pytest skips for tests requiring certain ops (#411)
jeffra Sep 15, 2020
55ed105
fix bug related to stitching reduced grads across communication parti…
jeffra Sep 15, 2020
a9e8325
add cpu-adam, reformat, add colors (#413)
jeffra Sep 15, 2020
0e942df
Add Linear warmup+decay lr schedule (#414)
tjruwase Sep 16, 2020
7d91be9
Minor doc fixes (#417)
tjruwase Sep 16, 2020
f5cce75
Overflow fix (#416)
Sep 16, 2020
4fef478
Fix a typo in comments (#415)
eric-haibin-lin Sep 16, 2020
5812e84
readthedocs yaml configuration (#410)
Sep 17, 2020
c66f388
Fix few typos in the docs (#418)
gowthamvadisetti Sep 17, 2020
5bc7d4e
Remove pip --use-feature (#419)
jeffra Sep 17, 2020
01b6e27
Activation checkpointing bugfix and unit tests (#420)
Sep 18, 2020
a74a604
Revert "Activation checkpointing bugfix and unit tests (#420)" (#422)
jeffra Sep 18, 2020
a825f99
Fix activation checkpoint unit tests for GPU systems (#421)
Sep 18, 2020
a148bd3
Add configurable intermediate size to transformer kernels (#423)
RezaYazdaniAminabadi Sep 21, 2020
71f7df3
DSE bump (#427)
Sep 21, 2020
f0f2a70
support dynamic sequence length in transformer kernels (#424)
RezaYazdaniAminabadi Sep 22, 2020
5d40f00
Fix urls in tutorial (#436)
conglongli Sep 24, 2020
192cf7c
Update azure.md (#437)
conglongli Sep 24, 2020
0ca8215
Update pipeline.md (#439)
eric-haibin-lin Sep 25, 2020
6d176c4
link fix part two :-) (#441)
Sep 25, 2020
5412a33
unit test rename (#442)
Sep 25, 2020
6f28ea3
fix typos (#446)
eric-haibin-lin Sep 28, 2020
7b8be2a
Disable default installation of CPU Adam (#450)
tjruwase Sep 29, 2020
9557557
Use parentesis around min and max to enable Windows build (#449)
bratao Oct 1, 2020
6717638
Update engine.py (#458)
Zixxy Oct 5, 2020
11cf47e
temporarily disable lr unit tests
jeffra Oct 7, 2020
679fc13
turning off different tests (temp)
jeffra Oct 7, 2020
2efea69
gan tutorial (#462)
niumanar Oct 7, 2020
c39a76f
Fix printing momentum for non-deepspeed optimizer (#464)
Zixxy Oct 7, 2020
23fc48f
Add DeepSpeed_Adam optimizer (#468)
tjruwase Oct 10, 2020
e25f2a2
fixing typo (#460)
RezaYazdaniAminabadi Oct 12, 2020
b8eb40e
add compute cap of 6.0 to transformer kernels
jeffra Oct 12, 2020
1afca8f
revert previous (accidental) change
jeffra Oct 12, 2020
7ddfda8
Add support for p100 in transformer kernels (#470)
jeffra Oct 14, 2020
d720fdb
updating website dependencies (#475)
Oct 19, 2020
f5aa254
Add CPUAdam optimizer for zero-offload in deepspeed engine (#484)
RezaYazdaniAminabadi Oct 30, 2020
4c37d70
fixing the AVX_256 compatibility (#497)
RezaYazdaniAminabadi Oct 30, 2020
7d4d742
Fixing CPU-Adam convergence issue (#503)
RezaYazdaniAminabadi Nov 5, 2020
e351090
PLD documentation (#514)
tjruwase Nov 9, 2020
41fb24b
Fix PLD news url (#515)
tjruwase Nov 9, 2020
e082d47
updating pld docs (#517)
minjiaz Nov 10, 2020
be1147c
PLD release (#513)
tjruwase Nov 10, 2020
eea1c28
fix bug on non-DLTS infra when no output path set (#523)
jeffra Nov 11, 2020
0ad4fd8
Update zero.md tutorial (#495)
samyam Nov 11, 2020
31f46fe
DeepSpeed JIT op + PyPI support (#496)
jeffra Nov 12, 2020
ca9ab12
ds_report bug fix on cpu and guard torch import in setup.py (#524)
jeffra Nov 12, 2020
d779bd5
Installation documentation updates. (#525)
Nov 12, 2020
0dc8420
Dependency pruning (#528)
jeffra Nov 14, 2020
9941ce7
bump version
jeffra Nov 14, 2020
7752dc5
Fix layout bug in ZeRO Stage 1 checkpoint logic (#531)
tjruwase Nov 18, 2020
5b09be6
append job-name if explicit output dir is given (#539)
jeffra Nov 18, 2020
fdd81c3
more fine-grained manifest file for includes/excludes (#540)
jeffra Nov 19, 2020
08c96a1
ZeRO-1 tune max-elems + bug fix (#532)
jeffra Nov 19, 2020
9de21b7
bump to v0.3.3
jeffra Nov 19, 2020
dce054d
backwards compatability w. v020 ckpts, fix issue with zero-1 ckpts (#…
jeffra Nov 19, 2020
d81cb26
Fix setup.py for cpu-only environment installation (#538)
harrydrippin Nov 19, 2020
1b45917
Discover variables for NCCL backend on AML without mpi4py (#542)
awan-10 Nov 19, 2020
6b28bc5
bump version 0.3.4
jeffra Nov 19, 2020
0178e6c
Fix unbalanced gradients bug in ZeRO-2 gradient accumulation (#545)
tjruwase Nov 20, 2020
6021b70
Support non-tensor state in checkpoint (#548)
tjruwase Nov 21, 2020
bcd56f9
Adding static_loss_scale to unfused optimizer (#546)
samyam Nov 23, 2020
00c3a25
Bug fix for norm calculation in absence of model parallel group (#551)
samyam Nov 23, 2020
c18fb0d
Create main.yml
jeffra Nov 24, 2020
3347460
Switch to CI to GitHub Actions (#556)
jeffra Nov 24, 2020
1ef5cd2
Update badges and CI name (#557)
jeffra Nov 24, 2020
6e65c2c
Deprecate client ability to disable gradient reduction (#552)
tjruwase Nov 24, 2020
0e831e2
Simplify dist init and only init if needed. (#553)
awan-10 Nov 25, 2020
eec44af
Turn back on PP tests (#558)
jeffra Nov 25, 2020
16313a9
bump to 0.3.5
jeffra Nov 23, 2020
6009713
Adds long_description to setup.py (#560)
Nov 25, 2020
73c3262
bump to 0.3.6 and fix manifest to include reqs (#561)
jeffra Nov 25, 2020
e4e2066
update manifest
jeffra Nov 25, 2020
c51fa65
bump to 0.3.7
jeffra Nov 25, 2020
17f36f1
[doc] typo fix and clarification (#563)
stas00 Nov 28, 2020
c78c29f
supporting different hidden dimensions (#559)
RezaYazdaniAminabadi Dec 1, 2020
9f52a36
tracking optimizer step in cpu-adam when loading checkpoint (#564)
RezaYazdaniAminabadi Dec 1, 2020
7a75f8b
[cifar tutorial] improve readability (#567)
stas00 Dec 2, 2020
845921b
Add 'latest' checkpoint save/load support (#569)
jeffra Dec 2, 2020
2d1f7c0
[engine] train should be able to get `mode` arg (#571)
stas00 Dec 3, 2020
be33bea
Add compute capability 8.0 if on cuda 11+ (#572)
jeffra Dec 3, 2020
ff58fa7
[build] build against installed cuda-11.1 while torch built w/ cuda-1…
stas00 Dec 3, 2020
1e44d48
Fix potential random layout inconsistency issues in sparse attention …
Justin1904 Dec 4, 2020
ce363d0
[build] make builder smarter and configurable wrt compute capabilitie…
stas00 Dec 7, 2020
e8b126d
[build] add compute_86 (#577)
stas00 Dec 7, 2020
2f62697
Pipeline warnings and checkpoint portability (#588)
Dec 8, 2020
d901a6d
Pin triton to 0.2.3 for now, 0.3.0 is broken
jeffra Dec 9, 2020
cb7c7da
bump to 0.3.8
jeffra Dec 9, 2020
19acd6c
Add papers/videos to readme/website (#592)
jeffra Dec 9, 2020
7300f3e
Add AML video link
jeffra Dec 9, 2020
0518252
add manual workflow to run tests with precompiled ops
jeffra Dec 11, 2020
8a184b6
[build] fix computer capability arch flags, add PTX, handle PTX (#591)
stas00 Dec 11, 2020
66268bd
add DeepSpeedZeroConfig repr method (#596)
stas00 Dec 11, 2020
a4763f5
Supported customizing kwargs for lr_scheduler (#584)
carefree0910 Dec 11, 2020
c5a449f
Update launcher to set local rank environ variable (#597)
jeffra Dec 11, 2020
9f8e8f3
implement missing get_last_lr (#595)
stas00 Dec 14, 2020
007466e
[doc] xref to hostfile discussion (#604)
stas00 Dec 15, 2020
6380ee3
Fixes for RTD build errors (#606)
jeffra Dec 15, 2020
fd2f970
Transformer-kernel - supporting any arbitrary sequence-length (#587)
RezaYazdaniAminabadi Dec 17, 2020
7435b2f
Ability to initialize distributed backend outside deepspeed runtime (…
jeffra Dec 18, 2020
81aeea3
Elastic training support (#602)
jeffra Dec 23, 2020
24e0739
update SA comp check to fix torch-cpu issue (#631)
jeffra Jan 4, 2021
e6ac731
Support initialization with dict configuration (#632)
tjruwase Jan 4, 2021
a9a83a6
Allow DeepSpeed models to be initialized with optimizer=None (#469)
gcooper-isi Jan 5, 2021
d38ad6a
change dist to torch.distributed to fix bug in assert. (#638)
awan-10 Jan 5, 2021
46d2e28
docs: minor spelling tweaks (#623)
brettkoonce Jan 5, 2021
5ab1279
Fix docstring format (#640)
tjruwase Jan 5, 2021
44bd538
Module replacement support (#586)
jeffra Jan 6, 2021
64461da
Update builder.py (#642)
sxjscience Jan 7, 2021
8cea96d
Bump nokogiri from 1.10.10 to 1.11.0 in /docs (#630)
dependabot[bot] Jan 7, 2021
4e2dc4e
Add deepspeed.init_distributed to RTD page (#645)
jeffra Jan 7, 2021
828d75b
document deepspeed.initialize() (#644)
stas00 Jan 8, 2021
bc046dc
add additional validation checks in elastic config (#646)
jeffra Jan 8, 2021
af212f6
Remove a very verbose print statement. (#649)
awan-10 Jan 8, 2021
c14b839
version bump to 0.3.10
jeffra Jan 8, 2021
da5563a
LR scheduler unit tests (#429)
tjruwase Jan 8, 2021
adcfd26
Handle actvitation checkpointing args that are None or non-tensors (#…
Jan 12, 2021
e2fbe4d
squash latest flops profiling changes (#1) (#664)
cli99 Jan 13, 2021
981bc7d
Move workspace memory-allocation to PyTorch (#661)
RezaYazdaniAminabadi Jan 13, 2021
f032e56
Validate consistent ckpt tags across ranks (#667)
jeffra Jan 14, 2021
865104b
Support optimizer AdamW type (#670)
tjruwase Jan 15, 2021
6217a6c
skip empty lines in hostfile (#669)
jeffra Jan 15, 2021
c5e4264
Add AdamW to the supported optimizers (#672)
stas00 Jan 15, 2021
e729a3f
add missing config menu entries (#652)
stas00 Jan 15, 2021
7b07e12
doc fix (#651)
stas00 Jan 15, 2021
82cecf6
add zero-offload paper (#680)
jeffra Jan 19, 2021
7b0bee0
[tutorials] typos (#676)
stas00 Jan 20, 2021
e59ba12
make test_pipe more stable (#683)
Jan 20, 2021
34c83a5
Fix ZeRO 2 + Pipelining (#677)
leogao2 Jan 20, 2021
2 changes: 1 addition & 1 deletion .clang-format
@@ -20,7 +20,7 @@ AllowShortLoopsOnASingleLine: true
AlwaysBreakAfterDefinitionReturnType: None
AlwaysBreakAfterReturnType: None
AlwaysBreakBeforeMultilineStrings: true
AlwaysBreakTemplateDeclarations: Yes
AlwaysBreakTemplateDeclarations: true
BinPackArguments: false
BinPackParameters: false
BraceWrapping:
51 changes: 51 additions & 0 deletions .github/workflows/main.yml
@@ -0,0 +1,51 @@
# This is a basic workflow to help you get started with Actions

name: Build

# Controls when the action will run.
on:
push:
paths-ignore:
- 'docs/**'
pull_request:
paths-ignore:
- 'docs/**'

# A workflow run is made up of one or more jobs that can run sequentially or in parallel
jobs:
# This workflow contains a single job called "build"
build:
# The type of runner that the job will run on
runs-on: self-hosted

# Steps represent a sequence of tasks that will be executed as part of the job
steps:
# Checks-out your repository under $GITHUB_WORKSPACE, so your job can access it
- uses: actions/checkout@v2

# Runs a single command using the runners shell
- name: environment
run: |
nvidia-smi
which python
python --version
which nvcc
nvcc --version
python -c "import torch; print('torch:', torch.__version__, torch)"
python -c "import torch; print('CUDA available:', torch.cuda.is_available())"

# Runs a set of commands using the runners shell
- name: Install deepspeed
run: |
pip install .[dev]
ds_report

- name: Formatting checks
run: |
pre-commit run --all-files

# Runs a set of commands using the runners shell
- name: Unit tests
run: |
if [[ -d ./torch-extensions ]]; then rm -rf ./torch-extensions; fi
TORCH_EXTENSIONS_DIR=./torch-extensions pytest --durations=0 --forked --verbose -x tests/unit/
47 changes: 47 additions & 0 deletions .github/workflows/pre-compile-ops.yml
@@ -0,0 +1,47 @@
# This is a basic workflow to help you get started with Actions

name: Tests-w-precompiled-ops

# Controls when the action will run.
on:
# Allows you to run this workflow manually from the Actions tab
workflow_dispatch:

# A workflow run is made up of one or more jobs that can run sequentially or in parallel
jobs:
# This workflow contains a single job called "build"
build:
# The type of runner that the job will run on
runs-on: self-hosted

# Steps represent a sequence of tasks that will be executed as part of the job
steps:
# Checks-out your repository under $GITHUB_WORKSPACE, so your job can access it
- uses: actions/checkout@v2

# Runs a single command using the runners shell
- name: environment
run: |
nvidia-smi
which python
python --version
which nvcc
nvcc --version
python -c "import torch; print('torch:', torch.__version__, torch)"
python -c "import torch; print('CUDA available:', torch.cuda.is_available())"

# Runs a set of commands using the runners shell
- name: Install deepspeed
run: |
DS_BUILD_OPS=1 pip install .[dev]
ds_report

- name: Formatting checks
run: |
pre-commit run --all-files

# Runs a set of commands using the runners shell
- name: Unit tests
run: |
if [[ -d ./torch-extensions ]]; then rm -rf ./torch-extensions; fi
TORCH_EXTENSIONS_DIR=./torch-extensions pytest --durations=0 --forked --verbose -x tests/unit/
11 changes: 9 additions & 2 deletions .gitignore
@@ -3,21 +3,28 @@
*~
*.swp
*.log
deepspeed/git_version_info.py
deepspeed/git_version_info_installed.py

# Build + installation data
build/
dist/
fused_lamb_*.so
*.so
deepspeed.egg-info/
build.txt

# Website
docs/_site/
docs/build
docs/code-docs/source/_build
docs/code-docs/_build
docs/code-docs/build
.sass-cache/
.jekyll-cache/
.jekyll-metadata

# Testing data
tests/unit/saved_checkpoint/

# Dev/IDE data
.vscode
.theia
3 changes: 0 additions & 3 deletions .gitmodules
@@ -1,6 +1,3 @@
[submodule "third_party/apex"]
path = third_party/apex
url = https://github.com/NVIDIA/apex.git
[submodule "DeepSpeedExamples"]
path = DeepSpeedExamples
url = https://github.com/microsoft/DeepSpeedExamples
18 changes: 18 additions & 0 deletions .readthedocs.yml
@@ -0,0 +1,18 @@

# Required
version: 2

# Build documentation in the docs/ directory with Sphinx
sphinx:
configuration: docs/code-docs/source/conf.py
fail_on_warning: false

# Optionally build your docs in additional formats such as PDF
formats:
- pdf

# Optionally set the version of Python and requirements required to build your docs
python:
version: 3.7
install:
- requirements: requirements/requirements-readthedocs.txt
1 change: 1 addition & 0 deletions CODEOWNERS
@@ -0,0 +1 @@
* @jeffra @samyam @tjruwase @ShadenSmith @conglongli @awan-10 @arashashari @cli99 @eltonzheng @minjiaz @RezaYazdaniAminabadi @niumanar
2 changes: 1 addition & 1 deletion DeepSpeedExamples
4 changes: 4 additions & 0 deletions MANIFEST.in
@@ -0,0 +1,4 @@
include *.txt README.md
recursive-include requirements *.txt
recursive-include deepspeed *.cpp *.h *.cu *.tr *.cuh *.cc
recursive-include csrc *.cpp *.h *.cu *.tr *.cuh *.cc
115 changes: 91 additions & 24 deletions README.md
@@ -1,6 +1,8 @@
[![Build Status](https://dev.azure.com/DeepSpeedMSFT/DeepSpeed/_apis/build/status/microsoft.DeepSpeed?branchName=master)](https://dev.azure.com/DeepSpeedMSFT/DeepSpeed/_build/latest?definitionId=1&branchName=master)
[![Build Status](https://github.com/microsoft/deepspeed/workflows/Build/badge.svg)](https://github.com/microsoft/DeepSpeed/actions)
[![PyPI version](https://badge.fury.io/py/deepspeed.svg)](https://pypi.org/project/deepspeed/)
[![Documentation Status](https://readthedocs.org/projects/deepspeed/badge/?version=latest)](https://deepspeed.readthedocs.io/en/latest/?badge=latest)
[![License MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://github.com/Microsoft/DeepSpeed/blob/master/LICENSE)
[![Docker Pulls](https://img.shields.io/docker/pulls/deepspeed/deepspeed)](https://hub.docker.com/r/deepspeed/deepspeed)

[DeepSpeed](https://www.deepspeed.ai/) is a deep learning optimization
library that makes distributed training easy, efficient, and effective.
@@ -9,9 +9,13 @@ library that makes distributed training easy, efficient, and effective.
<p align="center"><i><b>10x Faster Training</b></i></p>
<p align="center"><i><b>Minimal Code Change</b></i></p>

DeepSpeed can train deep learning models with over a hundred billion parameters on current
generation of GPU clusters, while achieving over 10x in system performance
compared to the state-of-art. Early adopters of DeepSpeed have already produced
DeepSpeed delivers extreme-scale model training for everyone, from data scientists training on massive supercomputers to those training on low-end clusters or even on a single GPU:
* Extreme scale: Using the current generation of GPU clusters with hundreds of devices, 3D parallelism of DeepSpeed can efficiently train deep learning models with trillions of parameters.
* Extremely memory efficient: With just a single GPU, ZeRO-Offload of DeepSpeed can train models with over 10B parameters, 10x bigger than the state of the art, democratizing multi-billion-parameter model training such that many deep learning scientists can explore bigger and better models.
* Extremely long sequence length: Sparse attention of DeepSpeed powers an order-of-magnitude longer input sequence and obtains up to 6x faster execution compared with dense transformers.
* Extremely communication efficient: 3D parallelism improves communication efficiency and allows users to train multi-billion-parameter models 2–7x faster on clusters with limited network bandwidth. 1-bit Adam reduces communication volume by up to 5x while achieving similar convergence efficiency to Adam, allowing for scaling to different types of GPU clusters and networks.

Early adopters of DeepSpeed have already produced
a language model (LM) with over 17B parameters called
[Turing-NLG](https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft),
establishing a new SOTA in the LM category.
@@ -25,25 +31,26 @@ information [here](https://innovation.microsoft.com/en-us/exploring-ai-at-scale)


# News

* [2020/05/19] [ZeRO-2 & DeepSpeed: Shattering Barriers of Deep Learning Speed & Scale](https://www.microsoft.com/en-us/research/blog/zero-2-deepspeed-shattering-barriers-of-deep-learning-speed-scale/)
<span style="color:dodgerblue">**[_NEW_]**</span>
* [2020/05/19] [An Order-of-Magnitude Larger and Faster Training with ZeRO-2](https://www.deepspeed.ai/news/2020/05/18/zero-stage2.html)
<span style="color:dodgerblue">**[_NEW_]**</span>
* [2020/05/19] [The Fastest and Most Efficient BERT Training through Optimized Transformer Kernels](https://www.deepspeed.ai/news/2020/05/18/bert-record.html)
<span style="color:dodgerblue">**[_NEW_]**</span>
* [2020/02/13] [Turing-NLG: A 17-billion-parameter language model by Microsoft](https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/)
* [2020/02/13] [ZeRO & DeepSpeed: New system optimizations enable training models with over 100 billion parameters](https://www.microsoft.com/en-us/research/blog/zero-deepspeed-new-system-optimizations-enable-training-models-with-over-100-billion-parameters/)
* [2020/11/12] [Simplified install, JIT compiled ops, PyPI releases, and reduced dependencies](#installation)
* [2020/11/10] [Efficient and robust compressed training through progressive layer dropping](https://www.deepspeed.ai/news/2020/10/28/progressive-layer-dropping-news.html)
* [2020/09/10] [DeepSpeed v0.3: Extreme-scale model training for everyone](https://www.microsoft.com/en-us/research/blog/deepspeed-extreme-scale-model-training-for-everyone/)
* [Powering 10x longer sequences and 6x faster execution through DeepSpeed Sparse Attention](https://www.deepspeed.ai/news/2020/09/08/sparse-attention-news.html)
* [Training a trillion parameters with pipeline parallelism](https://www.deepspeed.ai/news/2020/09/08/pipeline-parallelism.html)
* [Up to 5x less communication and 3.4x faster training through 1-bit Adam](https://www.deepspeed.ai/news/2020/09/08/onebit-adam-news.html)
* [10x bigger model training on a single GPU with ZeRO-Offload](https://www.deepspeed.ai/news/2020/09/08/ZeRO-Offload.html)
* [2020/08/07] [DeepSpeed Microsoft Research Webinar](https://note.microsoft.com/MSR-Webinar-DeepSpeed-Registration-On-Demand.html) is now available on-demand


# Table of Contents
| Section | Description |
| --------------------------------------- | ------------------------------------------- |
| [Why DeepSpeed?](#why-deepspeed) | DeepSpeed overview |
| [Features](#features) | DeepSpeed features |
| [Further Reading](#further-reading) | DeepSpeed documentation, tutorials, etc. |
| [Contributing](#contributing) | Instructions for contributing to DeepSpeed |
| [Publications](#publications) | DeepSpeed publications |
| [Install](#installation) | Installation details |
| [Features](#features) | Feature list and overview |
| [Further Reading](#further-reading) | Documentation, tutorials, etc. |
| [Contributing](#contributing) | Instructions for contributing |
| [Publications](#publications) | Publications related to DeepSpeed |
| [Videos](#videos) | Videos related to DeepSpeed |

# Why DeepSpeed?
Training advanced deep learning models is challenging. Beyond model design,
@@ -55,8 +62,35 @@ a large model easily runs out of memory with pure data parallelism and it is
difficult to use model parallelism. DeepSpeed addresses these challenges to
accelerate model development *and* training.

# Features
# Installation

The quickest way to get started with DeepSpeed is via pip; this will install
the latest release of DeepSpeed which is not tied to specific PyTorch or CUDA
versions. DeepSpeed includes several C++/CUDA extensions that we commonly refer
to as our 'ops'. By default, all of these extensions/ops will be built
just-in-time (JIT) using [torch's JIT C++ extension loader that relies on
ninja](https://pytorch.org/docs/stable/cpp_extension.html) to build and
dynamically link them at runtime.

**Note:** [PyTorch](https://pytorch.org/) must be installed _before_ installing
DeepSpeed.

```bash
pip install deepspeed
```

After installation, you can validate your install and see which extensions/ops
your machine is compatible with via the DeepSpeed environment report.

```bash
ds_report
```

If you would like to pre-install any of the DeepSpeed extensions/ops (instead
of JIT compiling) or install pre-compiled ops via PyPI, please see our [advanced
installation instructions](https://www.deepspeed.ai/tutorials/advanced-install/).
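
For readers skimming this diff, a minimal usage sketch may help place the install steps above in context. The snippet below is illustrative only: the model, batch size, and config values are hypothetical, and it assumes the dict-style configuration support added in this PR (#632). In practice such a script is started with the `deepspeed` launcher so that ranks and the distributed backend are set up automatically.

```python
# Hypothetical minimal training loop -- a sketch, not part of this diff.
import torch
import deepspeed

model = torch.nn.Linear(784, 10)  # stand-in model for illustration

ds_config = {  # dict config (see #632); a JSON file passed via args also works
    "train_batch_size": 8,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
}

# initialize() wraps the model in a DeepSpeed engine and builds the optimizer from the config
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config_params=ds_config,
)

for step in range(10):
    x = torch.randn(8, 784, device=model_engine.device)
    y = torch.randint(0, 10, (8,), device=model_engine.device)
    loss = torch.nn.functional.cross_entropy(model_engine(x), y)
    model_engine.backward(loss)  # the engine handles loss scaling and accumulation boundaries
    model_engine.step()
```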

# Features
Below we provide a brief feature list, see our detailed [feature
overview](https://www.deepspeed.ai/features/) for descriptions and usage.

@@ -66,10 +100,27 @@ overview](https://www.deepspeed.ai/features/) for descriptions and usage.
* [Model Parallelism](https://www.deepspeed.ai/features/#model-parallelism)
* Support for Custom Model Parallelism
* Integration with Megatron-LM
* [Memory and Bandwidth Optimizations](https://www.deepspeed.ai/features/#memory-and-bandwidth-optimizations)
* The Zero Redundancy Optimizer (ZeRO)
* Constant Buffer Optimization (CBO)
* [Pipeline Parallelism](https://www.deepspeed.ai/tutorials/pipeline/)
* 3D Parallelism
* [The Zero Redundancy Optimizer (ZeRO)](https://www.deepspeed.ai/tutorials/zero/)
* Optimizer State and Gradient Partitioning
* Activation Partitioning
* Constant Buffer Optimization
* Contiguous Memory Optimization
* [ZeRO-Offload](https://www.deepspeed.ai/tutorials/zero-offload/)
* Leverage both CPU/GPU memory for model training
* Support 10B model training on a single GPU
* [Ultra-fast dense transformer kernels](https://www.deepspeed.ai/news/2020/05/18/bert-record.html)
* [Sparse attention](https://www.deepspeed.ai/news/2020/09/08/sparse-attention.html)
* Memory- and compute-efficient sparse kernels
* Support 10x longer sequences than dense
* Flexible support to different sparse structures
* [1-bit Adam](https://www.deepspeed.ai/news/2020/09/08/onebit-adam-blog-post.html)
* Custom communication collective
* Up to 5x communication volume saving
* [Additional Memory and Bandwidth Optimizations](https://www.deepspeed.ai/features/#additional-memory-and-bandwidth-optimizations)
* Smart Gradient Accumulation
* Communication/Computation Overlap
* [Training Features](https://www.deepspeed.ai/features/#training-features)
* Simplified training API
* Gradient Clipping
@@ -79,6 +130,7 @@ overview](https://www.deepspeed.ai/features/) for descriptions and usage.
* Memory bandwidth optimized FP16 Optimizer
* Large Batch Training with LAMB Optimizer
* Memory efficient Training with ZeRO Optimizer
* CPU-Adam
* [Training Agnostic Checkpointing](https://www.deepspeed.ai/features/#training-agnostic-checkpointing)
* [Advanced Parameter Search](https://www.deepspeed.ai/features/#advanced-parameter-search)
* Learning Rate Range Test
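
As a rough illustration of how several of the features listed above are switched on, the sketch below shows a DeepSpeed configuration expressed as a Python dict. Key names follow the documentation of this release line (for example, ZeRO-Offload was enabled through `cpu_offload` at this point), and the numeric values are placeholders, not recommendations.

```python
# Illustrative configuration sketch only -- verify key names and values against the docs.
ds_config = {
    "train_batch_size": 32,
    "gradient_accumulation_steps": 2,   # smart gradient accumulation
    "gradient_clipping": 1.0,           # gradient clipping
    "fp16": {"enabled": True},          # memory-bandwidth optimized FP16 training path
    "zero_optimization": {
        "stage": 2,                     # ZeRO: optimizer state + gradient partitioning
        "cpu_offload": True,            # ZeRO-Offload: keep optimizer state in CPU memory (CPU-Adam)
    },
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}
```
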
@@ -127,8 +179,23 @@ all repos using our CLA.
This project has adopted the [Microsoft Open Source Code of
Conduct](https://opensource.microsoft.com/codeofconduct/). For more information see the
[Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or contact
[[email protected]](mailto:[email protected]) with any additional questions or
comments.
[[email protected]](mailto:[email protected]) with any additional questions or comments.

# Publications
1. Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He. (2019) ZeRO: Memory Optimization Towards Training A Trillion Parameter Models. [ArXiv:1910.02054](https://arxiv.org/abs/1910.02054)
1. Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He. (2019) ZeRO: memory optimizations toward training trillion parameter models. [arXiv:1910.02054](https://arxiv.org/abs/1910.02054) and [In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '20)](https://dl.acm.org/doi/10.5555/3433701.3433727).
2. Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. (2020) DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters. [In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD '20, Tutorial)](https://dl.acm.org/doi/10.1145/3394486.3406703).
3. Minjia Zhang, Yuxiong He. (2020) Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping. [arXiv:2010.13369](https://arxiv.org/abs/2010.13369) and [NeurIPS 2020](https://proceedings.neurips.cc/paper/2020/hash/a1140a3d0df1c81e24ae954d935e8926-Abstract.html).
4. Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, Yuxiong He. (2021) ZeRO-Offload: Democratizing Billion-Scale Model Training. [arXiv:2101.06840](https://arxiv.org/abs/2101.06840).

# Videos
1. DeepSpeed KDD 2020 Tutorial
1. [Overview](https://www.youtube.com/watch?v=CaseqC45DNc&list=PLa85ZdUjfWS21mgibJ2vCvLziprjpKoW0&index=29)
2. [ZeRO + large model training](https://www.youtube.com/watch?v=y4_bCiAsIAk&list=PLa85ZdUjfWS21mgibJ2vCvLziprjpKoW0&index=28)
3. [17B T-NLG demo](https://www.youtube.com/watch?v=9V-ZbP92drg&list=PLa85ZdUjfWS21mgibJ2vCvLziprjpKoW0&index=27)
4. [Fastest BERT training + RScan tuning](https://www.youtube.com/watch?v=o1K-ZG9F6u0&list=PLa85ZdUjfWS21mgibJ2vCvLziprjpKoW0&index=26)
5. DeepSpeed hands on deep dive: [part 1](https://www.youtube.com/watch?v=_NOk-mBwDYg&list=PLa85ZdUjfWS21mgibJ2vCvLziprjpKoW0&index=92), [part 2](https://www.youtube.com/watch?v=sG6_c4VXLww&list=PLa85ZdUjfWS21mgibJ2vCvLziprjpKoW0&index=94), [part 3](https://www.youtube.com/watch?v=k9yPkBTayos&list=PLa85ZdUjfWS21mgibJ2vCvLziprjpKoW0&index=93)
6. [FAQ](https://www.youtube.com/watch?v=nsHu6vEgPew&list=PLa85ZdUjfWS21mgibJ2vCvLziprjpKoW0&index=24)
2. Microsoft Research Webinar
* Registration is free and all videos are available on-demand.
* [ZeRO & Fastest BERT: Increasing the scale and speed of deep learning training in DeepSpeed](https://note.microsoft.com/MSR-Webinar-DeepSpeed-Registration-On-Demand.html).
3. [DeepSpeed on AzureML](https://youtu.be/yBVXR8G8Bg8)
36 changes: 0 additions & 36 deletions azure-pipelines-docker.yml

This file was deleted.
