Commit ad423f8

ZeRO-Offload (squash) (#381)
* ZeRO-Offload v1 (squash) (#345)

* update DSE to point to ZeRO-Offload staging

* ZeRO-2 enable CPU offload (#313)

* cpu-offload

* update

* deleted:    deepspeed/pt/deepspeed_zero_optimizer_cpuoffload.py
	modified:   deepspeed/pt/fp16_unfused_optimizer.py
	new file:   install_output.txt
	modified:   tests/unit/test_dynamic_loss_scale.py

* modified:   deepspeed/pt/deepspeed_zero_optimizer.py

* update

* modified:   deepspeed/pt/deepspeed_cpu_adam.py
	modified:   deepspeed/pt/deepspeed_zero_optimizer.py
	modified:   tests/unit/test_checkpointing.py
	modified:   tests/unit/test_fp16.py

* deleted:    install_output.txt

* modified:   deepspeed/pt/fp16_unfused_optimizer.py
	modified:   tests/unit/test_dynamic_loss_scale.py

* modified:   deepspeed/pt/deepspeed_cpu_adam.py

* modified:   deepspeed/pt/deepspeed_zero_optimizer.py

* modified:   deepspeed/pt/deepspeed_cpu_adam.py
	modified:   deepspeed/pt/deepspeed_zero_optimizer.py

* deleted:    deepspeed_cpu_adam.py
	modified:   deepspeed_light.py
	modified:   deepspeed_zero_optimizer.py
	../../deepspeed_zero_optimizer_cpu_offload.py

* modified:   deepspeed/pt/deepspeed_light.py

* modified:   deepspeed/pt/deepspeed_light.py
	modified:   deepspeed/pt/deepspeed_zero_optimizer.py
	modified:   deepspeed/pt/deepspeed_zero_utils.py
	modified:   tests/unit/test_fp16.py

* modified:   deepspeed/pt/deepspeed_config.py
	modified:   deepspeed/pt/deepspeed_light.py
	modified:   deepspeed/pt/deepspeed_zero_optimizer.py
	modified:   tests/unit/test_checkpointing.py
	modified:   tests/unit/test_fp16.py

* modified:   deepspeed/pt/deepspeed_checkpointing.py

* update DSE to ZeRO-Offload commit

Co-authored-by: Jeff Rasley <[email protected]>

* Enable ZeRO checkpointing for ZeRO-Offload (#337)

* Enable ZeRO checkpointing for ZeRO-Offload
Fix unit tests
Bump DSE to 33b9fb77c8cecdb49118188890f662526d8e9397

* Fix accidental revert

* Add ZeRO-Offload checkpointing model tests (#344)

* Enable ZeRO checkpointing for ZeRO-Offload
Fix unit tests
Bump DSE to 33b9fb77c8cecdb49118188890f662526d8e9397

* Fix accidental revert

* Fix ZeRO-Offload checkpointing bug when changing GPU count
Add checkpointing model tests for ZeRO-Offload
Remove optimizer key from Megatron model tests
Use different deepspeed master port for Megatron model tests

Co-authored-by: Jie <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>

* update DSE to staging for zero-dual

* Update test_sparse_attention.py

* Assert ZeRO-Offload+gradient accumulation (#347)

* Adding link to Sparse Attention in Navigation page (#355)

* adding link to Sparse Attention in Navigation page

* Correctness and perf fixes (#354)

* Update test_sparse_attention.py

* jren changes

* Merge with correctness/perf fixes

* Formatting fixes

Co-authored-by: Jeff Rasley <[email protected]>

* add cpu adam optimizer (#356)

* add cpu adam optimizer

* run precommit

* clean adam_test

* add accuracy test for adam

* make the adam unit test work with random params and grads and for more steps

* Samyamr/zero offload correctness (#359)

* fixing gradient accumulation for zero offload

* Bug fixes. ZeRO Stages 1, 2, and Offload all produce the same loss with a gradient accumulation step of 2

* Import path fixes + conditional imports (#358)

* use relative imports and add support for conditional op imports

* formatting and llvm command check change

* fix remaining absolute import

* hide the installed ops var

* fix unit tests

Co-authored-by: Reza Yazdani <[email protected]>

* Enable contiguous gradients for cpu_offload

* Allocating CPU memory directly on CPU without transferring it from GPU (#360)

* Allocating CPU memory directly on CPU without transferring it from GPU

* formatting fixes

* change gpt2 pretrain to use DeepSpeed Adam (#361)

Co-authored-by: Reza Yazdani <[email protected]>

* Jekyll installation instructions (#351)

* Generalize detection of ZeRO supported optimizers (#349)

* Improve test for ZeRO supported optimizers

* Rename test function

* Format fixes

* Add model tests that wraps client FusedAdam with fused fp16 optimizer

* Format fixes

* everything is working

* fixing the cpu_adam API and add deepspeed_adam flag in config.py (#365)

* fixing the cpu_adam API and add deepspeed_adam flag in config.py

* run precommit

* fixing adam copy fp16-param; add more compile flags for cpu_adam

* run precommit

* fix variance indexes

* fix array-sizes

* ZeRO-Offload passing model functionality tests (#366)

* cpu_offload enables overlap_comm and contiguous_gradients
Remove non-portable tensor.mul_()

* Model functionality tests now passing

* Move to perf tests folder

* move adam_test

* rename perf test

* fixing adam copy fp16-param and add more compile flags for cpu_adam (#367)

* fixing adam copy fp16-param; add more compile flags for cpu_adam

* run precommit

* fix variance indexes

* fix array-sizes

* move adam_test

* rename perf test

* Perf tests

* Bump DSE

* fixed a typo; this was fixed before but seems to have been lost in the refactor (#364)

* Move code quality tests to Azure-hosted agents. (#368)

* add casting kernel

* run precommit

* revert changes

* revert changes

* ZeRO-Offload: Integration code fixes (#370)

* Various correctness fixes

* Format fixes

* Update installation instructions (#362)

* Update Sparse Attention Tutorial (#357)

* adding Bing SQuAD e2e test

* updating the draft test; bring the final step under the try section

* finalizing test for base DeepSpeed and DeepSpeed with ZeRO

* applying the comment (thanks Jeff); fixed formatting

* update Sparse Attention Tutorial

* fixed a few issues and applied comments for better organization and readability

* updated Sparse Attention tutorial, making the how-to-use section incremental; applied more comments

Co-authored-by: arashashari <[email protected]>

* fixing corner cases (#371)

* fix adam performance (#372)

* fixing corner cases

* revert to the previous perf for adam

* adam high performance

* run precommit

* ZeRO-Offload passing model tests (#374)

* Add ZeRO-Offload model tests
Restrict optimizer update+copy to DeepSpeedCPUAdam

* Format fixes

* Increase bucket size scaler

* fix cpu adam compilation for AVX2 (#378)

* fixing the compilation error for AVX2 architecture

* running precommit

* adding cpufeature to requirements

* Update install.sh

* Update install.sh

* include cpu-adam in the features

* update features

* update features

Co-authored-by: Jeff Rasley <[email protected]>

* Move code quality tests to Azure-hosted agents. (#368)

* Bump DSE

* adding sparse attention to feature index page (#377)

* support avx2 by default (#383)

* add DS_BUILD_AVX512 flag and update the feature part accordingly

* run precommit

Co-authored-by: Jie <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Arash Ashari <[email protected]>
Co-authored-by: RezaYazdaniAminabadi <[email protected]>
Co-authored-by: Reza Yazdani <[email protected]>
Co-authored-by: Samyam Rajbhandari <[email protected]>
Co-authored-by: Shaden Smith <[email protected]>
Co-authored-by: arashashari <[email protected]>
9 people committed Sep 9, 2020
1 parent 01726ce commit ad423f8
Showing 45 changed files with 3,738 additions and 1,890 deletions.
2 changes: 1 addition & 1 deletion DeepSpeedExamples
627 changes: 627 additions & 0 deletions csrc/adam/cpu_adam.cpp

Large diffs are not rendered by default.

35 changes: 35 additions & 0 deletions csrc/adam/custom_cuda_kernel.cu
@@ -0,0 +1,35 @@


#include "custom_cuda_layers.h"

__global__ void param_update_kernel(const float* input, __half* output, int size)
{
const float4* input_cast = reinterpret_cast<const float4*>(input);
float2* output_cast = reinterpret_cast<float2*>(output);

int id = blockIdx.x * blockDim.x + threadIdx.x;

if (id < size) {
float4 data = input_cast[id];
float2 cast_data;
__half* output_h = reinterpret_cast<__half*>(&cast_data);

output_h[0] = (__half)data.x;
output_h[1] = (__half)data.y;
output_h[2] = (__half)data.z;
output_h[3] = (__half)data.w;

output_cast[id] = cast_data;
}
}

void launch_param_update(const float* input, __half* output, int size, cudaStream_t stream)
{
int threads = 512;

size /= 4;
dim3 grid_dim((size - 1) / threads + 1);
dim3 block_dim(threads);

param_update_kernel<<<grid_dim, block_dim, 0, stream>>>(input, output, size);
}
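
For orientation, a minimal PyTorch sketch of the effect this kernel achieves (the fp32 master parameters updated on the CPU path are cast back into the model's fp16 copy on the GPU). The tensor names and the side stream below are illustrative, not part of this commit:

import torch

fp32_param = torch.randn(1024, device='cuda', dtype=torch.float32)   # illustrative fp32 master copy
fp16_param = torch.empty_like(fp32_param, dtype=torch.float16)       # illustrative fp16 model copy

side_stream = torch.cuda.Stream()
with torch.cuda.stream(side_stream):
    # dtype-converting copy: the rough PyTorch equivalent of launch_param_update
    fp16_param.copy_(fp32_param, non_blocking=True)
torch.cuda.current_stream().wait_stream(side_stream)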
118 changes: 118 additions & 0 deletions csrc/includes/cpu_adam.h
@@ -0,0 +1,118 @@
#pragma once

#include <cpuid.h>
#include <cuda_fp16.h>
#include <cuda_runtime_api.h>
#include <stdio.h>
#include <x86intrin.h>
#include <cassert>
#include "context.h"
#include "cublas_v2.h"
#include "cuda.h"
#include "curand.h"

#define CUDA_CHECK(callstr) \
{ \
cudaError_t error_code = callstr; \
if (error_code != cudaSuccess) { \
std::cerr << "CUDA error " << error_code << " at " << __FILE__ << ":" << __LINE__; \
assert(0); \
} \
}

#define TILE (1024 * 1024 * 1024)

#if defined(__AVX512__)
#define SIMD_STORE(a, d) _mm512_storeu_ps(a, d)
#define SIMD_LOAD(x) _mm512_loadu_ps(x)
#define SIMD_SET(x) _mm512_set1_ps(x)
#define SIMD_MUL(x, y) _mm512_mul_ps(x, y)
#define SIMD_FMA(x, y, c) _mm512_fmadd_ps(x, y, c)
#define SIMD_SQRT(x) _mm512_sqrt_ps(x)
#define SIMD_DIV(x, y) _mm512_div_ps(x, y)
#define SIMD_WIDTH 16
#else
#if defined(__AVX256__)
#define SIMD_STORE(a, d) _mm256_storeu_ps(a, d)
#define SIMD_LOAD(x) _mm256_loadu_ps(x)
#define SIMD_SET(x) _mm256_set1_ps(x)
#define SIMD_MUL(x, y) _mm256_mul_ps(x, y)
#define SIMD_FMA(x, y, c) _mm256_fmadd_ps(x, y, c)
#define SIMD_SQRT(x) _mm256_sqrt_ps(x)
#define SIMD_DIV(x, y) _mm256_div_ps(x, y)
#define SIMD_WIDTH 8
#endif
#endif

class Adam_Optimizer {
public:
Adam_Optimizer(float alpha = 1e-3,
float betta1 = 0.9,
float betta2 = 0.999,
float eps = 1e-8,
float weight_decay = 0)
: _alpha(alpha),
_betta1(betta1),
_betta2(betta2),
_eps(eps),
_weight_decay(weight_decay),
_betta1_t(1.0),
_betta2_t(1.0),
_buf_index(false)
{
cudaMallocHost((void**)_doubled_buffer, TILE * sizeof(float));
cudaMallocHost((void**)(_doubled_buffer + 1), TILE * sizeof(float));
}
~Adam_Optimizer()
{
cudaFreeHost(_doubled_buffer[0]);
cudaFreeHost(_doubled_buffer[1]);
}
void Step(float* _params,
float* grads,
float* _exp_avg,
float* _exp_avg_sq,
size_t param_size,
__half* dev_param = nullptr);
void Step_4(float* _params,
float* grads,
float* _exp_avg,
float* _exp_avg_sq,
size_t param_size,
__half* dev_param = nullptr);
void Step_8(float* _params,
float* grads,
float* _exp_avg,
float* _exp_avg_sq,
size_t _param_size,
__half* dev_params = nullptr);
inline void IncrementStep()
{
_betta1_t *= _betta1;
_betta2_t *= _betta2;
}

private:
#if defined(__AVX512__) or defined(__AVX256__)
union AVX_Data {
#if defined(__AVX512__)
__m512 data;
#else
__m256 data;
#endif
// float data_f[16];
};
#endif

float _alpha;
float _betta1;
float _betta2;
float _eps;
float _weight_decay;

float _betta1_t;
float _betta2_t;

float* _doubled_buffer[2];
bool _buf_index;
};
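
As a reading aid, a scalar Python sketch of the per-element update that the vectorized Step/Step_4/Step_8 loops implement. This is the textbook Adam rule, with bias-correction terms corresponding to the _betta1_t/_betta2_t running products kept by the header; it is an approximation for readers, not code extracted from cpu_adam.cpp:

import math

def adam_step(param, grad, exp_avg, exp_avg_sq, step,
              lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.0):
    # One Adam update for a single parameter element.
    if weight_decay != 0.0:
        grad = grad + weight_decay * param
    exp_avg = beta1 * exp_avg + (1 - beta1) * grad               # first moment (momentum)
    exp_avg_sq = beta2 * exp_avg_sq + (1 - beta2) * grad * grad  # second moment (variance)
    bias_c1 = 1 - beta1 ** step                                  # corresponds to 1 - _betta1_t
    bias_c2 = 1 - beta2 ** step                                  # corresponds to 1 - _betta2_t
    param = param - lr * (exp_avg / bias_c1) / (math.sqrt(exp_avg_sq / bias_c2) + eps)
    return param, exp_avg, exp_avg_sq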
2 changes: 2 additions & 0 deletions csrc/includes/custom_cuda_layers.h
100755 → 100644
@@ -264,3 +264,5 @@ void launch_fuse_transpose_bias_kernel(const T* inp,
int rows,
int cols,
cudaStream_t stream);

void launch_param_update(const float* input, __half* output, int size, cudaStream_t stream);
18 changes: 10 additions & 8 deletions deepspeed/__init__.py
@@ -4,16 +4,18 @@
import sys
import types

from deepspeed.runtime.engine import DeepSpeedEngine
from deepspeed.runtime.engine import ADAM_OPTIMIZER, LAMB_OPTIMIZER
from deepspeed.runtime.lr_schedules import add_tuning_arguments
from deepspeed.runtime.config import DeepSpeedConfig
from deepspeed.runtime.activation_checkpointing import checkpointing
from deepspeed.ops.transformer import DeepSpeedTransformerLayer, DeepSpeedTransformerConfig
from deepspeed.utils import logger
from . import ops

from .runtime.engine import DeepSpeedEngine
from .runtime.engine import ADAM_OPTIMIZER, LAMB_OPTIMIZER, DEEPSPEED_ADAM
from .runtime.lr_schedules import add_tuning_arguments
from .runtime.config import DeepSpeedConfig
from .runtime.activation_checkpointing import checkpointing
from .ops.transformer import DeepSpeedTransformerLayer, DeepSpeedTransformerConfig
from .utils import logger

try:
from deepspeed.git_version_info import version, git_hash, git_branch
from .git_version_info import version, git_hash, git_branch
except ImportError:
version = "0.0.0+unknown"
git_hash = None
7 changes: 7 additions & 0 deletions deepspeed/ops/__init__.py
@@ -0,0 +1,7 @@
from ..git_version_info import installed_ops as __installed_ops__
from . import lamb
from . import transformer
if __installed_ops__['sparse-attn']:
from . import sparse_attention
if __installed_ops__['cpu-adam']:
from . import adam
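
A hedged sketch of using the same gating from client code, assuming the installed_ops dictionary exposed by deepspeed.git_version_info (the module imported above):

from deepspeed.git_version_info import installed_ops

if installed_ops.get('cpu-adam'):
    from deepspeed.ops.adam import DeepSpeedCPUAdam   # present only when the op was built
else:
    DeepSpeedCPUAdam = None                           # caller falls back to torch.optim.Adam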
1 change: 1 addition & 0 deletions deepspeed/ops/adam/__init__.py
@@ -0,0 +1 @@
from .cpu_adam import DeepSpeedCPUAdam
81 changes: 81 additions & 0 deletions deepspeed/ops/adam/cpu_adam.py
@@ -0,0 +1,81 @@
import math
import torch
import importlib

ds_opt_adam = None


class DeepSpeedCPUAdam(torch.optim.Optimizer):

optimizer_id = 0

def __init__(self,
model_params,
lr=1e-3,
betas=(0.9,
0.999),
eps=1e-8,
weight_decay=0,
amsgrad=False):

default_args = dict(lr=lr,
betas=betas,
eps=eps,
weight_decay=weight_decay,
amsgrad=amsgrad)
super(DeepSpeedCPUAdam, self).__init__(model_params, default_args)

self.opt_id = DeepSpeedCPUAdam.optimizer_id
DeepSpeedCPUAdam.optimizer_id = DeepSpeedCPUAdam.optimizer_id + 1

global ds_opt_adam
ds_opt_adam = importlib.import_module('deepspeed.ops.adam.cpu_adam_op')
ds_opt_adam.create_adam(self.opt_id, lr, betas[0], betas[1], eps, weight_decay)

def __setstate__(self, state):
super(DeepSpeedCPUAdam, self).__setstate__(state)
for group in self.param_groups:
group.setdefault('amsgrad', False)

@torch.no_grad()
def step(self, closure=None, fp16_param_groups=None):
loss = None
if closure is not None:
with torch.enable_grad():
loss = closure()

for group_id, group in enumerate(self.param_groups):
for param_id, p in enumerate(group['params']):

if p.grad is None:
continue

grad = p.grad.data
state = self.state[p]
# State initialization
if len(state) == 0:
print(f'group {group_id} param {param_id} = {p.numel()}')
state['step'] = 0
# gradient momentums
state['exp_avg'] = torch.zeros_like(p.data, device='cpu')
# gradient variances
state['exp_avg_sq'] = torch.zeros_like(p.data, device='cpu')

exp_avg, exp_avg_sq = state['exp_avg'], state['exp_avg_sq']
state['step'] += 1

if fp16_param_groups is not None:
p_fp16 = fp16_param_groups[group_id][param_id]
ds_opt_adam.adam_update_copy(self.opt_id,
p.data,
grad,
exp_avg,
exp_avg_sq,
p_fp16)
else:
ds_opt_adam.adam_update(self.opt_id,
p.data,
grad,
exp_avg,
exp_avg_sq)
return loss
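
A minimal usage sketch for DeepSpeedCPUAdam as defined above; it assumes DeepSpeed was installed with the cpu-adam op built, and the model and sizes are illustrative:

import torch
from deepspeed.ops.adam import DeepSpeedCPUAdam

model = torch.nn.Linear(512, 512)                       # parameters stay on the CPU
optimizer = DeepSpeedCPUAdam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))

x = torch.randn(8, 512)
loss = model(x).pow(2).mean()
loss.backward()
optimizer.step()                                        # Adam update runs in the C++ op
optimizer.zero_grad()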
33 changes: 33 additions & 0 deletions deepspeed/pt/deepspeed_zero_utils.py
@@ -0,0 +1,33 @@
import torch
from torch.autograd import Variable
import collections


def async_migrate_to(obj, dev, main_stream=None):
if torch.is_tensor(obj):
obj = Variable(obj)
if isinstance(obj, Variable):
v = obj.cuda(dev, non_blocking=True)
if main_stream is not None:
v.data.record_stream(main_stream)
return v
elif isinstance(obj, collections.Mapping):
return {k: async_copy_to(o, dev, main_stream) for k, o in obj.items()}
elif isinstance(obj, collections.Sequence):
return [async_copy_to(o, dev, main_stream) for o in obj]
else:
return obj


def async_copy_to(obj, dev, main_stream=None):
if torch.is_tensor(obj):
obj = Variable(obj)
if isinstance(obj, Variable):
target = torch.empty_like(obj, device=dev).copy_(obj)
if main_stream is not None:
target.data.record_stream(main_stream)
return target
elif isinstance(obj, collections.Mapping):
return {k: async_copy_to(o, dev, main_stream) for k, o in obj.items()}
elif isinstance(obj, collections.Sequence):
return [async_copy_to(o, dev, main_stream) for o in obj]
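
A hedged usage sketch for async_copy_to, which walks dicts and lists and copies each tensor to the target device; the import path follows the file location shown above, and the batch contents are illustrative:

import torch
from deepspeed.pt.deepspeed_zero_utils import async_copy_to

batch = {'input_ids': torch.randint(0, 1000, (4, 128)),
         'attention_mask': torch.ones(4, 128)}
gpu_batch = async_copy_to(batch, dev='cuda:0')          # nested tensors land on cuda:0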
38 changes: 14 additions & 24 deletions deepspeed/runtime/config.py
@@ -10,14 +10,16 @@
from deepspeed.runtime.fp16.loss_scaler import INITIAL_LOSS_SCALE, SCALE_WINDOW, DELAYED_SHIFT, MIN_LOSS_SCALE
from deepspeed.runtime.config_utils import get_scalar_param, dict_raise_error_on_duplicate_keys
from deepspeed.runtime.zero.config import DeepSpeedZeroConfig
from deepspeed.runtime.zero.constants import *
from deepspeed.runtime.activation_checkpointing.config import DeepSpeedActivationCheckpointingConfig
from deepspeed.utils import logger

TENSOR_CORE_ALIGN_SIZE = 8
ONEBIT_ADAM_OPTIMIZER = 'onebitadam'
ADAM_OPTIMIZER = 'adam'
LAMB_OPTIMIZER = 'lamb'
DEEPSPEED_OPTIMIZERS = [ADAM_OPTIMIZER, LAMB_OPTIMIZER, ONEBIT_ADAM_OPTIMIZER]
ONEBIT_ADAM_OPTIMIZER = 'onebitadam'
DEEPSPEED_ADAM = 'deepspeed_adam'
DEEPSPEED_OPTIMIZERS = [ADAM_OPTIMIZER, LAMB_OPTIMIZER, ONEBIT_ADAM_OPTIMIZER, DEEPSPEED_ADAM]


def get_amp_enabled(param_dict):
@@ -111,22 +113,9 @@ def get_zero_optimization(param_dict):


def get_zero_reduce_scatter(param_dict):
return get_scalar_param(param_dict, ZERO_REDUCE_SCATTER, ZERO_REDUCE_SCATTER_DEFAULT)


def get_zero_max_elements_per_comm(param_dict):
return get_scalar_param(param_dict,
ZERO_MAX_ELEMENTS_PER_COMM,
ZERO_MAX_ELEMENTS_PER_COMM_DEFAULT)


def get_allgather_size(param_dict):
return get_scalar_param(param_dict,
ALLGATHER_SIZE,
ALLGATHER_SIZE_DEFAULT) if get_scalar_param(
param_dict,
ALLGATHER_SIZE,
ALLGATHER_SIZE_DEFAULT) > 0 else ALLGATHER_SIZE_DEFAULT
ZERO_OPTIMIZATION_REDUCE_SCATTER,
ZERO_OPTIMIZATION_REDUCE_SCATTER_DEFAULT)


def get_allreduce_always_fp32(param_dict):
@@ -493,8 +482,6 @@ def _initialize_params(self, param_dict):
self.gradient_predivide_factor = get_gradient_predivide_factor(param_dict)
self.sparse_gradients_enabled = get_sparse_gradients_enabled(param_dict)

self.allgather_size = get_allgather_size(param_dict)

self.zero_config = DeepSpeedZeroConfig(param_dict)
self.zero_optimization_stage = self.zero_config.stage
self.zero_enabled = self.zero_optimization_stage > 0
@@ -628,15 +615,18 @@ def print(self, name):
':'))))

def _do_error_check(self):
if self.zero_enabled:
assert self.fp16_enabled, "DeepSpeedConfig: ZeRO is only supported if fp16 is enabled"
assert self.zero_optimization_stage <= MAX_STAGE_ZERO_OPTIMIZATION, "DeepSpeedConfig: Maximum supported ZeRO stage is {}".format(MAX_STAGE_ZERO_OPTIMIZATION)

assert self.train_micro_batch_size_per_gpu, "DeepSpeedConfig: {} is not defined".format(TRAIN_MICRO_BATCH_SIZE_PER_GPU)

assert self.gradient_accumulation_steps, 'DeepSpeedConfig: {} is not defined'.format(
assert self.gradient_accumulation_steps, "DeepSpeedConfig: {} is not defined".format(
GRADIENT_ACCUMULATION_STEPS)

if self.zero_enabled:
assert self.fp16_enabled, "DeepSpeedConfig: ZeRO is only supported if fp16 is enabled"
assert self.zero_optimization_stage <= MAX_STAGE_ZERO_OPTIMIZATION, "DeepSpeedConfig: Maximum supported ZeRO stage is {}".format(MAX_STAGE_ZERO_OPTIMIZATION)
if self.zero_config.cpu_offload is True:
assert self.zero_optimization_stage == ZERO_OPTIMIZATION_GRADIENTS, "DeepSpeedConfig: cpu-offload supported ZeRO stage is {}".format(ZERO_OPTIMIZATION_GRADIENTS)
#assert self.gradient_accumulation_steps == 1, "DeepSpeedConfig: {}is not supported for {}".format(GRADIENT_ACCUMULATION_STEPS, ZERO_OPTIMIZATION_CPU_OFFLOAD)

def _do_warning_check(self):
fp16_enabled = self.fp16_enabled or self.zero_enabled

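
A hedged config sketch that exercises the new checks in _do_error_check: fp16 must be enabled whenever ZeRO is on, and cpu_offload is accepted only at ZeRO stage 2 (ZERO_OPTIMIZATION_GRADIENTS). Key names follow the options referenced in this commit; batch sizes and the learning rate are illustrative:

import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 1,
    "fp16": {"enabled": True},                  # required whenever ZeRO is enabled
    "optimizer": {"type": "Adam",               # 'deepspeed_adam' is also registered by this commit
                  "params": {"lr": 1e-3}},
    "zero_optimization": {
        "stage": 2,                             # cpu_offload is asserted valid only at stage 2
        "cpu_offload": True,
        "contiguous_gradients": True,
        "overlap_comm": True
    }
}

# Given a torch.nn.Module `model`:
# model_engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(), config_params=ds_config)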
