Commit ad423f8

ZeRO-Offload (squash) (#381)
* ZeRO-Offload v1 (squash) (#345)

* update DSE to point to ZeRO-Offload staging

* ZeRO-2 enable CPU offload (#313)

* cpu-offload

* update

* deleted:    deepspeed/pt/deepspeed_zero_optimizer_cpuoffload.py
	modified:   deepspeed/pt/fp16_unfused_optimizer.py
	new file:   install_output.txt
	modified:   tests/unit/test_dynamic_loss_scale.py

* modified:   deepspeed/pt/deepspeed_zero_optimizer.py

* update

* modified:   deepspeed/pt/deepspeed_cpu_adam.py
	modified:   deepspeed/pt/deepspeed_zero_optimizer.py
	modified:   tests/unit/test_checkpointing.py
	modified:   tests/unit/test_fp16.py

* deleted:    install_output.txt

* modified:   deepspeed/pt/fp16_unfused_optimizer.py
	modified:   tests/unit/test_dynamic_loss_scale.py

* modified:   deepspeed/pt/deepspeed_cpu_adam.py

* modified:   deepspeed/pt/deepspeed_zero_optimizer.py

* modified:   deepspeed/pt/deepspeed_cpu_adam.py
	modified:   deepspeed/pt/deepspeed_zero_optimizer.py

* deleted:    deepspeed_cpu_adam.py
	modified:   deepspeed_light.py
	modified:   deepspeed_zero_optimizer.py
	../../deepspeed_zero_optimizer_cpu_offload.py

* modified:   deepspeed/pt/deepspeed_light.py

* modified:   deepspeed/pt/deepspeed_light.py
	modified:   deepspeed/pt/deepspeed_zero_optimizer.py
	modified:   deepspeed/pt/deepspeed_zero_utils.py
	modified:   tests/unit/test_fp16.py

* modified:   deepspeed/pt/deepspeed_config.py
	modified:   deepspeed/pt/deepspeed_light.py
	modified:   deepspeed/pt/deepspeed_zero_optimizer.py
	modified:   tests/unit/test_checkpointing.py
	modified:   tests/unit/test_fp16.py

* modified:   deepspeed/pt/deepspeed_checkpointing.py

* update DSE to ZeRO-Offload commit

Co-authored-by: Jeff Rasley <[email protected]>

* Enable ZeRO checkpointing for ZeRO-Offload (#337)

* Enable ZeRO checkpointing for ZeRO-Offload
Fix unit tests
Bump DSE to 33b9fb77c8cecdb49118188890f662526d8e9397

* Fix accidental revert

* Add ZeRO-Offload checkpointing model tests (#344)

* Enable ZeRO checkpointing for ZeRO-Offload
Fix unit tests
Bump DSE to 33b9fb77c8cecdb49118188890f662526d8e9397

* Fix accidental revert

* Fix ZeRO-Offload checkpointing bug when changing GPU count
Add checkpointing model tests for ZeRO-Offload
Remove optimizer key from Megatron model tests
Use different deepspeed master port for Megatron model tests

Co-authored-by: Jie <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>

* update DSE to staging for zero-dual

* Update test_sparse_attention.py

* Assert ZeRO-Offload+gradient accumulation (#347)

* Adding link to Sparse Attention in Navigation page (#355)

* adding link to Sparse Attention in Navigation page

* Correctness and perf fixes (#354)

* Update test_sparse_attention.py

* jren changes

* Merge with correctness/perf fixes

* Formatting fixes

Co-authored-by: Jeff Rasley <[email protected]>

* add cpu adam optimizer (#356)

* add cpu adam optimizer

* run precommit

* clean adam_test

* add accuracy test for adam

* make the adam unit test work with random params and grads and for more steps

* Samyamr/zero offload correctness (#359)

* fixing gradient accumulation for zero offload

* Bug fixes. ZeRO Stages 1, 2, and Offload all produce the same loss with a gradient accumulation step of 2

* Import path fixes + conditional imports (#358)

* use relative imports and add support for conditional op imports

* formatting and llvm command check change

* fix remaining absolute import

* hide the installed ops var

* fix unit tests

Co-authored-by: Reza Yazdani <[email protected]>

* Enable contiguous gradients for cpu_offload

* Allocating CPU memory directly on CPU without transferring it from GPU (#360)

* Allocating CPU memory directly on CPU without transferring it from GPU

* formatting fixes

* change gpt2 pretrain to use DeepSpeed Adam (#361)

Co-authored-by: Reza Yazdani <[email protected]>

* Jekyll installation instructions (#351)

* Generalize detection of ZeRO supported optimizers (#349)

* Improve test for ZeRO supported optimizers

* Rename test function

* Format fixes

* Add model tests that wraps client FusedAdam with fused fp16 optimizer

* Format fixes

* everything is working

* fixing the cpu_adam API and add deepspeed_adam flag in config.py (#365)

* fixing the cpu_adam API and add deepspeed_adam flag in config.py

* run precommit

* fixing adam copy fp16-param; add more compile flags for cpu_adam

* run precommit

* fix variance indexes

* fix array-sizes

* ZeRO-Offload passing model functionality tests (#366)

* cpu_offload enables overlap_comm and contiguous_gradients
Remove non-portable tensor.mul_()

* Model functionality tests now passing

* Move to perf tests folder

* move adam_test

* rename perf test

* fixing adam copy fp16-param and add more compile flags for cpu_adam (#367)

* fixing adam copy fp16-param; add more compile flags for cpu_adam

* run precommit

* fix variance indexes

* fix array-sizes

* move adam_test

* rename perf test

* Perf tests

* Bump DSE

* fixed a typo; this was fixed before but seems to have been lost in the refactor (#364)

* Move code quality tests to Azure-hosted agents. (#368)

* add casting kernel

* run precommit

* revert changes

* revert changes

* ZeRO-Offload: Integration code fixes (#370)

* Various correctness fixes

* Format fixes

* Update installation instructions (#362)

* Update Sparse Attention Tutorial (#357)

* adding Bing SQuAD e2e test

* updating the draft test; bring the final step under the try section

* finalizing test for base DeepSpeed and DeepSpeed with ZeRO

* applying the comment (thanks Jeff); fixed formatting

* update Sparse Attention Tutorial

* fixed a few issues and applied comments for better organization and readability

* updated Sparse Attention tutorial, making the how-to-use section incremental; applied more comments

Co-authored-by: arashashari <[email protected]>

* fixing corner cases (#371)

* fix adam performance (#372)

* fixing corner cases

* revert to the previous perf for adam

* adam high performance

* run precommit

* ZeRO-Offload passing model tests (#374)

* Add ZeRO-Offload model tests
Restrict optimizer update+copy to DeepSpeedCPUAdam

* Format fixes

* Increase bucket size scaler

* fix cpu adam compilation for AVX2 (#378)

* fixing the compilation error for AVX2 architecture

* running precommit

* adding cpufeature to requirements

* Update install.sh

* Update install.sh

* include cpu-adam in the features

* update features

* update features

Co-authored-by: Jeff Rasley <[email protected]>

* Move code quality tests to Azure-hosted agents. (#368)

* Bump DSE

* adding sparse attention to feature index page (#377)

* support avx2 by default (#383)

* add DS_BUILD_AVX512 flag and update the feature part accordingly

* run precommit

Co-authored-by: Jie <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Arash Ashari <[email protected]>
Co-authored-by: RezaYazdaniAminabadi <[email protected]>
Co-authored-by: Reza Yazdani <[email protected]>
Co-authored-by: Samyam Rajbhandari <[email protected]>
Co-authored-by: Shaden Smith <[email protected]>
Co-authored-by: arashashari <[email protected]>
9 people committed Sep 9, 2020
1 parent 01726ce commit ad423f8
Showing 45 changed files with 3,738 additions and 1,890 deletions.
2 changes: 1 addition & 1 deletion DeepSpeedExamples
627 changes: 627 additions & 0 deletions csrc/adam/cpu_adam.cpp

Large diffs are not rendered by default.

35 changes: 35 additions & 0 deletions csrc/adam/custom_cuda_kernel.cu
@@ -0,0 +1,35 @@


#include "custom_cuda_layers.h"

__global__ void param_update_kernel(const float* input, __half* output, int size)
{
const float4* input_cast = reinterpret_cast<const float4*>(input);
float2* output_cast = reinterpret_cast<float2*>(output);

int id = blockIdx.x * blockDim.x + threadIdx.x;

if (id < size) {
float4 data = input_cast[id];
float2 cast_data;
__half* output_h = reinterpret_cast<__half*>(&cast_data);

output_h[0] = (__half)data.x;
output_h[1] = (__half)data.y;
output_h[2] = (__half)data.z;
output_h[3] = (__half)data.w;

output_cast[id] = cast_data;
}
}

void launch_param_update(const float* input, __half* output, int size, cudaStream_t stream)
{
int threads = 512;

size /= 4;
dim3 grid_dim((size - 1) / threads + 1);
dim3 block_dim(threads);

param_update_kernel<<<grid_dim, block_dim, 0, stream>>>(input, output, size);
}
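
For orientation, a minimal PyTorch sketch of the effect this kernel achieves (the fp32 master parameters updated on the CPU path are cast back into the model's fp16 copy on the GPU). The tensor names and the side stream below are illustrative, not part of this commit:

import torch

fp32_param = torch.randn(1024, device='cuda', dtype=torch.float32)   # illustrative fp32 master copy
fp16_param = torch.empty_like(fp32_param, dtype=torch.float16)       # illustrative fp16 model copy

side_stream = torch.cuda.Stream()
with torch.cuda.stream(side_stream):
    # dtype-converting copy: the rough PyTorch equivalent of launch_param_update
    fp16_param.copy_(fp32_param, non_blocking=True)
torch.cuda.current_stream().wait_stream(side_stream)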
118 changes: 118 additions & 0 deletions csrc/includes/cpu_adam.h
@@ -0,0 +1,118 @@
#pragma once

#include <cpuid.h>
#include <cuda_fp16.h>
#include <cuda_runtime_api.h>
#include <stdio.h>
#include <x86intrin.h>
#include <cassert>
#include "context.h"
#include "cublas_v2.h"
#include "cuda.h"
#include "curand.h"

#define CUDA_CHECK(callstr) \
{ \
cudaError_t error_code = callstr; \
if (error_code != cudaSuccess) { \
std::cerr << "CUDA error " << error_code << " at " << __FILE__ << ":" << __LINE__; \
assert(0); \
} \
}

#define TILE (1024 * 1024 * 1024)

#if defined(__AVX512__)
#define SIMD_STORE(a, d) _mm512_storeu_ps(a, d)
#define SIMD_LOAD(x) _mm512_loadu_ps(x)
#define SIMD_SET(x) _mm512_set1_ps(x)
#define SIMD_MUL(x, y) _mm512_mul_ps(x, y)
#define SIMD_FMA(x, y, c) _mm512_fmadd_ps(x, y, c)
#define SIMD_SQRT(x) _mm512_sqrt_ps(x)
#define SIMD_DIV(x, y) _mm512_div_ps(x, y)
#define SIMD_WIDTH 16
#else
#if defined(__AVX256__)
#define SIMD_STORE(a, d) _mm256_storeu_ps(a, d)
#define SIMD_LOAD(x) _mm256_loadu_ps(x)
#define SIMD_SET(x) _mm256_set1_ps(x)
#define SIMD_MUL(x, y) _mm256_mul_ps(x, y)
#define SIMD_FMA(x, y, c) _mm256_fmadd_ps(x, y, c)
#define SIMD_SQRT(x) _mm256_sqrt_ps(x)
#define SIMD_DIV(x, y) _mm256_div_ps(x, y)
#define SIMD_WIDTH 8
#endif
#endif

class Adam_Optimizer {
public:
Adam_Optimizer(float alpha = 1e-3,
float betta1 = 0.9,
float betta2 = 0.999,
float eps = 1e-8,
float weight_decay = 0)
: _alpha(alpha),
_betta1(betta1),
_betta2(betta2),
_eps(eps),
_weight_decay(weight_decay),
_betta1_t(1.0),
_betta2_t(1.0),
_buf_index(false)
{
cudaMallocHost((void**)_doubled_buffer, TILE * sizeof(float));
cudaMallocHost((void**)(_doubled_buffer + 1), TILE * sizeof(float));
}
~Adam_Optimizer()
{
cudaFreeHost(_doubled_buffer[0]);
cudaFreeHost(_doubled_buffer[1]);
}
void Step(float* _params,
float* grads,
float* _exp_avg,
float* _exp_avg_sq,
size_t param_size,
__half* dev_param = nullptr);
void Step_4(float* _params,
float* grads,
float* _exp_avg,
float* _exp_avg_sq,
size_t param_size,
__half* dev_param = nullptr);
void Step_8(float* _params,
float* grads,
float* _exp_avg,
float* _exp_avg_sq,
size_t _param_size,
__half* dev_params = nullptr);
inline void IncrementStep()
{
_betta1_t *= _betta1;
_betta2_t *= _betta2;
}

private:
#if defined(__AVX512__) or defined(__AVX256__)
union AVX_Data {
#if defined(__AVX512__)
__m512 data;
#else
__m256 data;
#endif
// float data_f[16];
};
#endif

float _alpha;
float _betta1;
float _betta2;
float _eps;
float _weight_decay;

float _betta1_t;
float _betta2_t;

float* _doubled_buffer[2];
bool _buf_index;
};
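
As a reading aid, a scalar Python sketch of the per-element update that the vectorized Step/Step_4/Step_8 loops implement. This is the textbook Adam rule, with bias-correction terms corresponding to the _betta1_t/_betta2_t running products kept by the header; it is an approximation for readers, not code extracted from cpu_adam.cpp:

import math

def adam_step(param, grad, exp_avg, exp_avg_sq, step,
              lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.0):
    # One Adam update for a single parameter element.
    if weight_decay != 0.0:
        grad = grad + weight_decay * param
    exp_avg = beta1 * exp_avg + (1 - beta1) * grad               # first moment (momentum)
    exp_avg_sq = beta2 * exp_avg_sq + (1 - beta2) * grad * grad  # second moment (variance)
    bias_c1 = 1 - beta1 ** step                                  # corresponds to 1 - _betta1_t
    bias_c2 = 1 - beta2 ** step                                  # corresponds to 1 - _betta2_t
    param = param - lr * (exp_avg / bias_c1) / (math.sqrt(exp_avg_sq / bias_c2) + eps)
    return param, exp_avg, exp_avg_sq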
2 changes: 2 additions & 0 deletions csrc/includes/custom_cuda_layers.h
100755 → 100644
@@ -264,3 +264,5 @@ void launch_fuse_transpose_bias_kernel(const T* inp,
int rows,
int cols,
cudaStream_t stream);

void launch_param_update(const float* input, __half* output, int size, cudaStream_t stream);
18 changes: 10 additions & 8 deletions deepspeed/__init__.py
@@ -4,16 +4,18 @@
import sys
import types

from deepspeed.runtime.engine import DeepSpeedEngine
from deepspeed.runtime.engine import ADAM_OPTIMIZER, LAMB_OPTIMIZER
from deepspeed.runtime.lr_schedules import add_tuning_arguments
from deepspeed.runtime.config import DeepSpeedConfig
from deepspeed.runtime.activation_checkpointing import checkpointing
from deepspeed.ops.transformer import DeepSpeedTransformerLayer, DeepSpeedTransformerConfig
from deepspeed.utils import logger
from . import ops

from .runtime.engine import DeepSpeedEngine
from .runtime.engine import ADAM_OPTIMIZER, LAMB_OPTIMIZER, DEEPSPEED_ADAM
from .runtime.lr_schedules import add_tuning_arguments
from .runtime.config import DeepSpeedConfig
from .runtime.activation_checkpointing import checkpointing
from .ops.transformer import DeepSpeedTransformerLayer, DeepSpeedTransformerConfig
from .utils import logger

try:
from deepspeed.git_version_info import version, git_hash, git_branch
from .git_version_info import version, git_hash, git_branch
except ImportError:
version = "0.0.0+unknown"
git_hash = None
7 changes: 7 additions & 0 deletions deepspeed/ops/__init__.py
@@ -0,0 +1,7 @@
from ..git_version_info import installed_ops as __installed_ops__
from . import lamb
from . import transformer
if __installed_ops__['sparse-attn']:
from . import sparse_attention
if __installed_ops__['cpu-adam']:
from . import adam
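
A hedged sketch of using the same gating from client code, assuming the installed_ops dictionary exposed by deepspeed.git_version_info (the module imported above):

from deepspeed.git_version_info import installed_ops

if installed_ops.get('cpu-adam'):
    from deepspeed.ops.adam import DeepSpeedCPUAdam   # present only when the op was built
else:
    DeepSpeedCPUAdam = None                           # caller falls back to torch.optim.Adam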
1 change: 1 addition & 0 deletions deepspeed/ops/adam/__init__.py
@@ -0,0 +1 @@
from .cpu_adam import DeepSpeedCPUAdam
81 changes: 81 additions & 0 deletions deepspeed/ops/adam/cpu_adam.py
@@ -0,0 +1,81 @@
import math
import torch
import importlib

ds_opt_adam = None


class DeepSpeedCPUAdam(torch.optim.Optimizer):

optimizer_id = 0

def __init__(self,
model_params,
lr=1e-3,
betas=(0.9,
0.999),
eps=1e-8,
weight_decay=0,
amsgrad=False):

default_args = dict(lr=lr,
betas=betas,
eps=eps,
weight_decay=weight_decay,
amsgrad=amsgrad)
super(DeepSpeedCPUAdam, self).__init__(model_params, default_args)

self.opt_id = DeepSpeedCPUAdam.optimizer_id
DeepSpeedCPUAdam.optimizer_id = DeepSpeedCPUAdam.optimizer_id + 1

global ds_opt_adam
ds_opt_adam = importlib.import_module('deepspeed.ops.adam.cpu_adam_op')
ds_opt_adam.create_adam(self.opt_id, lr, betas[0], betas[1], eps, weight_decay)

def __setstate__(self, state):
super(DeepSpeedCPUAdam, self).__setstate__(state)
for group in self.param_groups:
group.setdefault('amsgrad', False)

@torch.no_grad()
def step(self, closure=None, fp16_param_groups=None):
loss = None
if closure is not None:
with torch.enable_grad():
loss = closure()

for group_id, group in enumerate(self.param_groups):
for param_id, p in enumerate(group['params']):

if p.grad is None:
continue

grad = p.grad.data
state = self.state[p]
# State initialization
if len(state) == 0:
print(f'group {group_id} param {param_id} = {p.numel()}')
state['step'] = 0
# gradient momentums
state['exp_avg'] = torch.zeros_like(p.data, device='cpu')
# gradient variances
state['exp_avg_sq'] = torch.zeros_like(p.data, device='cpu')

exp_avg, exp_avg_sq = state['exp_avg'], state['exp_avg_sq']
state['step'] += 1

if fp16_param_groups is not None:
p_fp16 = fp16_param_groups[group_id][param_id]
ds_opt_adam.adam_update_copy(self.opt_id,
p.data,
grad,
exp_avg,
exp_avg_sq,
p_fp16)
else:
ds_opt_adam.adam_update(self.opt_id,
p.data,
grad,
exp_avg,
exp_avg_sq)
return loss
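
A minimal usage sketch for DeepSpeedCPUAdam as defined above; it assumes DeepSpeed was installed with the cpu-adam op built, and the model and sizes are illustrative:

import torch
from deepspeed.ops.adam import DeepSpeedCPUAdam

model = torch.nn.Linear(512, 512)                       # parameters stay on the CPU
optimizer = DeepSpeedCPUAdam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))

x = torch.randn(8, 512)
loss = model(x).pow(2).mean()
loss.backward()
optimizer.step()                                        # Adam update runs in the C++ op
optimizer.zero_grad()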
33 changes: 33 additions & 0 deletions deepspeed/pt/deepspeed_zero_utils.py
@@ -0,0 +1,33 @@
import torch
from torch.autograd import Variable
import collections


def async_migrate_to(obj, dev, main_stream=None):
if torch.is_tensor(obj):
obj = Variable(obj)
if isinstance(obj, Variable):
v = obj.cuda(dev, non_blocking=True)
if main_stream is not None:
v.data.record_stream(main_stream)
return v
elif isinstance(obj, collections.Mapping):
return {k: async_copy_to(o, dev, main_stream) for k, o in obj.items()}
elif isinstance(obj, collections.Sequence):
return [async_copy_to(o, dev, main_stream) for o in obj]
else:
return obj


def async_copy_to(obj, dev, main_stream=None):
if torch.is_tensor(obj):
obj = Variable(obj)
if isinstance(obj, Variable):
target = torch.empty_like(obj, device=dev).copy_(obj)
if main_stream is not None:
target.data.record_stream(main_stream)
return target
elif isinstance(obj, collections.Mapping):
return {k: async_copy_to(o, dev, main_stream) for k, o in obj.items()}
elif isinstance(obj, collections.Sequence):
return [async_copy_to(o, dev, main_stream) for o in obj]
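
A hedged usage sketch for async_copy_to, which walks dicts and lists and copies each tensor to the target device; the import path follows the file location shown above, and the batch contents are illustrative:

import torch
from deepspeed.pt.deepspeed_zero_utils import async_copy_to

batch = {'input_ids': torch.randint(0, 1000, (4, 128)),
         'attention_mask': torch.ones(4, 128)}
gpu_batch = async_copy_to(batch, dev='cuda:0')          # nested tensors land on cuda:0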
38 changes: 14 additions & 24 deletions deepspeed/runtime/config.py
@@ -10,14 +10,16 @@
from deepspeed.runtime.fp16.loss_scaler import INITIAL_LOSS_SCALE, SCALE_WINDOW, DELAYED_SHIFT, MIN_LOSS_SCALE
from deepspeed.runtime.config_utils import get_scalar_param, dict_raise_error_on_duplicate_keys
from deepspeed.runtime.zero.config import DeepSpeedZeroConfig
from deepspeed.runtime.zero.constants import *
from deepspeed.runtime.activation_checkpointing.config import DeepSpeedActivationCheckpointingConfig
from deepspeed.utils import logger

TENSOR_CORE_ALIGN_SIZE = 8
ONEBIT_ADAM_OPTIMIZER = 'onebitadam'
ADAM_OPTIMIZER = 'adam'
LAMB_OPTIMIZER = 'lamb'
DEEPSPEED_OPTIMIZERS = [ADAM_OPTIMIZER, LAMB_OPTIMIZER, ONEBIT_ADAM_OPTIMIZER]
ONEBIT_ADAM_OPTIMIZER = 'onebitadam'
DEEPSPEED_ADAM = 'deepspeed_adam'
DEEPSPEED_OPTIMIZERS = [ADAM_OPTIMIZER, LAMB_OPTIMIZER, ONEBIT_ADAM_OPTIMIZER, DEEPSPEED_ADAM]


def get_amp_enabled(param_dict):
@@ -111,22 +113,9 @@ def get_zero_optimization(param_dict):


def get_zero_reduce_scatter(param_dict):
return get_scalar_param(param_dict, ZERO_REDUCE_SCATTER, ZERO_REDUCE_SCATTER_DEFAULT)


def get_zero_max_elements_per_comm(param_dict):
return get_scalar_param(param_dict,
ZERO_MAX_ELEMENTS_PER_COMM,
ZERO_MAX_ELEMENTS_PER_COMM_DEFAULT)


def get_allgather_size(param_dict):
return get_scalar_param(param_dict,
ALLGATHER_SIZE,
ALLGATHER_SIZE_DEFAULT) if get_scalar_param(
param_dict,
ALLGATHER_SIZE,
ALLGATHER_SIZE_DEFAULT) > 0 else ALLGATHER_SIZE_DEFAULT
ZERO_OPTIMIZATION_REDUCE_SCATTER,
ZERO_OPTIMIZATION_REDUCE_SCATTER_DEFAULT)


def get_allreduce_always_fp32(param_dict):
@@ -493,8 +482,6 @@ def _initialize_params(self, param_dict):
self.gradient_predivide_factor = get_gradient_predivide_factor(param_dict)
self.sparse_gradients_enabled = get_sparse_gradients_enabled(param_dict)

self.allgather_size = get_allgather_size(param_dict)

self.zero_config = DeepSpeedZeroConfig(param_dict)
self.zero_optimization_stage = self.zero_config.stage
self.zero_enabled = self.zero_optimization_stage > 0
@@ -628,15 +615,18 @@ def print(self, name):
':'))))

def _do_error_check(self):
if self.zero_enabled:
assert self.fp16_enabled, "DeepSpeedConfig: ZeRO is only supported if fp16 is enabled"
assert self.zero_optimization_stage <= MAX_STAGE_ZERO_OPTIMIZATION, "DeepSpeedConfig: Maximum supported ZeRO stage is {}".format(MAX_STAGE_ZERO_OPTIMIZATION)

assert self.train_micro_batch_size_per_gpu, "DeepSpeedConfig: {} is not defined".format(TRAIN_MICRO_BATCH_SIZE_PER_GPU)

assert self.gradient_accumulation_steps, 'DeepSpeedConfig: {} is not defined'.format(
assert self.gradient_accumulation_steps, "DeepSpeedConfig: {} is not defined".format(
GRADIENT_ACCUMULATION_STEPS)

if self.zero_enabled:
assert self.fp16_enabled, "DeepSpeedConfig: ZeRO is only supported if fp16 is enabled"
assert self.zero_optimization_stage <= MAX_STAGE_ZERO_OPTIMIZATION, "DeepSpeedConfig: Maximum supported ZeRO stage is {}".format(MAX_STAGE_ZERO_OPTIMIZATION)
if self.zero_config.cpu_offload is True:
assert self.zero_optimization_stage == ZERO_OPTIMIZATION_GRADIENTS, "DeepSpeedConfig: cpu-offload supported ZeRO stage is {}".format(ZERO_OPTIMIZATION_GRADIENTS)
#assert self.gradient_accumulation_steps == 1, "DeepSpeedConfig: {}is not supported for {}".format(GRADIENT_ACCUMULATION_STEPS, ZERO_OPTIMIZATION_CPU_OFFLOAD)

def _do_warning_check(self):
fp16_enabled = self.fp16_enabled or self.zero_enabled

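
A hedged config sketch that exercises the new checks in _do_error_check: fp16 must be enabled whenever ZeRO is on, and cpu_offload is accepted only at ZeRO stage 2 (ZERO_OPTIMIZATION_GRADIENTS). Key names follow the options referenced in this commit; batch sizes and the learning rate are illustrative:

import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 1,
    "fp16": {"enabled": True},                  # required whenever ZeRO is enabled
    "optimizer": {"type": "Adam",               # 'deepspeed_adam' is also registered by this commit
                  "params": {"lr": 1e-3}},
    "zero_optimization": {
        "stage": 2,                             # cpu_offload is asserted valid only at stage 2
        "cpu_offload": True,
        "contiguous_gradients": True,
        "overlap_comm": True
    }
}

# Given a torch.nn.Module `model`:
# model_engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(), config_params=ds_config)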
