
Minjiaz/zero offload #382

Merged: 7 commits from minjiaz/zero-offload into master on Sep 10, 2020
Conversation

@minjiaz (Contributor) commented on Sep 9, 2020

Added a feature page for ZeRO-Offload, which should appear under "What's new". My understanding is that it will be picked up automatically by the embedded script in index.md and shown on the front page, so no additional hyperlink is required. Is that correct?

@jeffra merged commit 59ce90d into master on Sep 10, 2020
stephen-youn added a commit that referenced this pull request on Jun 14, 2023
* Add residual_add triton op
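(For readers unfamiliar with Triton ops, below is a much-simplified, elementwise sketch of what a residual_add kernel can look like; the actual DeepSpeed kernel also fuses bias addition and activation per later commits, and all names here are illustrative.)

```python
import torch
import triton
import triton.language as tl

@triton.jit
def _residual_add_kernel(hidden_ptr, residual_ptr, out_ptr, n_elements,
                         BLOCK_SIZE: tl.constexpr):
    # each program instance handles one contiguous block of elements
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    hidden = tl.load(hidden_ptr + offsets, mask=mask)
    residual = tl.load(residual_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, hidden + residual, mask=mask)

def residual_add(hidden: torch.Tensor, residual: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(hidden)
    n = hidden.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    _residual_add_kernel[grid](hidden, residual, out, n, BLOCK_SIZE=1024)
    return out
```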

* add support of gptj style models to triton residual_add kernel

* fix the residual_add tests

* Add support of end-to-end runs for residual_add triton kernels

* Fix the MLP output tensor's shape

* Fix the output tensor of residual_add_func python call

* triton matmul kernels with python wrapper class added with pytests

* clean-up and make it read autotune table when importing

* fixed import problems with the naming

* enable update_autotune_table for every forward in matmul

* an int4-into-int8 weight packing function added
test parameters cover aligned shapes only (i.e. integer multiples of block_size in the matmul kernel); this will be investigated further
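(A minimal sketch of the int4-into-int8 packing idea, assuming two signed 4-bit values are stored in the low and high nibbles of one int8; function names are illustrative, not the ones added in this commit.)

```python
import torch

def pack_int4_to_int8(w_int4: torch.Tensor) -> torch.Tensor:
    """Pack pairs of signed int4 values (held in an int8 tensor, range [-8, 7])
    into single int8 values: even elements in the low nibble, odd in the high."""
    assert w_int4.numel() % 2 == 0, "expects an even number of elements"
    nibbles = (w_int4.flatten() & 0xF).to(torch.uint8)  # two's-complement nibbles
    low, high = nibbles[0::2], nibbles[1::2]
    return ((high << 4) | low).view(torch.int8)

def unpack_int8_to_int4(packed: torch.Tensor) -> torch.Tensor:
    """Inverse of pack_int4_to_int8; restores the signed int4 values as int8."""
    p = packed.view(torch.uint8).to(torch.int16)
    low, high = p & 0xF, p >> 4
    low = torch.where(low > 7, low - 16, low)    # sign-extend the nibbles
    high = torch.where(high > 7, high - 16, high)
    return torch.stack((low, high), dim=1).flatten().to(torch.int8)
```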

* lint

* quantization added
int8-packed-int4-fp16 matmul-block-deq added
illegal CUDA memory access bug in the triton matmul kernel fixed (i.e. a memory boundary problem)

* add torch block quantization
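(A minimal torch sketch of symmetric block quantization, assuming "block" means a contiguous group of values sharing one scale; names and defaults are illustrative, not the commit's actual API.)

```python
import torch

def block_quantize(x: torch.Tensor, block_size: int = 128, bits: int = 8):
    """Symmetric per-block quantization: every contiguous group of `block_size`
    values shares one scale. Assumes x.numel() is a multiple of block_size."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for int8, 7 for int4
    blocks = x.flatten().reshape(-1, block_size)
    scales = blocks.abs().amax(dim=1, keepdim=True) / qmax
    scales = scales.clamp(min=1e-8)                 # guard against all-zero blocks
    q = torch.clamp(torch.round(blocks / scales), -qmax - 1, qmax).to(torch.int8)
    return q, scales

def block_dequantize(q: torch.Tensor, scales: torch.Tensor, shape) -> torch.Tensor:
    return (q.to(scales.dtype) * scales).reshape(shape)
```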

* dual quantization matmul added

* cleanup, fix for lint

* documentation
lint fix

* README added

* typo

* updated the kernel to have fused bias addition and activation too

* Add residual_add triton op

* modified quantization to take additional bits, more than int8

* enable triton residual_add kernel in DS MLP

* Add flash attention kernel and glue code

* additional scale-norm added for weight

* a temporary example for quantization added

* comments

* use the exact same ds quantizer as reference

* added scale-norm (i.e. scale-of-scale) to both triton/torch version
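(One way to read "scale-norm (scale-of-scale)" is a second-level quantization of the per-block scales themselves; the sketch below is that interpretation only, not the kernel's actual implementation.)

```python
import torch

def quantize_scales(scales: torch.Tensor, scale_bits: int = 8):
    """Second-level ("scale-of-scale") quantization: the floating-point scales
    produced by block quantization are themselves quantized with one global scale."""
    qmax = 2 ** (scale_bits - 1) - 1
    scale_of_scale = scales.abs().amax() / qmax
    q_scales = torch.clamp(torch.round(scales / scale_of_scale), -qmax - 1, qmax).to(torch.int8)
    return q_scales, scale_of_scale
```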

* snr check with fused-deq-gemm for block_deq and dual_block_deq
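(The SNR checks compare a quantized/fused kernel's output against a full-precision reference; a minimal sketch of such a metric, with an illustrative helper name:)

```python
import torch

def snr_db(reference: torch.Tensor, approx: torch.Tensor) -> float:
    """Signal-to-noise ratio, in dB, of `approx` relative to `reference`."""
    noise = (reference - approx).float().pow(2).sum()
    signal = reference.float().pow(2).sum()
    return (10.0 * torch.log10(signal / noise)).item()
```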

* makes matmul kernels work on A6000 with smaller memory
w8a8/w4a8 with symmetric block quantization on activations and row- (or col-)wise quantization on weights works (snr test added)

* Add layer norm triton kernel

* Add gelu triton kernel

* Add softmax triton kernel

* Rename flash attn api

* add triton gemm kernels

* fix formatting of triton kernels

* Add matmul triton kernels

* Updated Triton Gelu to use non-approx computation

* Updated Triton Gemm for f16 bias-add parity

* Add DS triton encoder layer

* Updated Softmax to work around block size 1

* fix the issue caused by merge conflict

* Add triton layer norm unittests

* dual-qblock snr verified too

* Add triton gelu kernel unittests

* Add triton softmax kernel unittests

* fix flash kernels formatting (#382)

* Add triton dependency to unittests workflow (#381)

* w8a8 and w8a4 matmul with block quantization verified

* Allow Gemm & MatMul to take arbitrary dimensions

* Add triton matmul kernel unittests

* fix triton dependency in github CI workflows

* Fix matmul launching grid

* fix formatting

* Add triton gemm kernel unittests

* modified dual-qblock to support wider scale_bits with int64 acc and vec-ops, which caused perf degradation
workaround is to use "v2" kernel added with internal shift ops but not enabled yet

* fix residual in gemm_3d kernel

* Add flash attention triton kernels unit tests

* test_matmul and test_gemm pass (but with smaller coverage as mentioned in the code)
float32 can be supported later

* added 'triton_gemm_eval.py'
a temporary script to evaluate the accuracy of the triton matmul against the torch matmul

* typo

* typo

* root-caused the parity error with fused_gelu: it is not in gelu but in the residual addition.
disabled the residual addition; it still needs debugging

* location of residual addition in reference modified to be after the activation

* fixed index typo in the snr plot

* Fix triton attention kernel unit tests

* fix formatting

* added batch support in matmul
row/col-wise quantization matmul debugged

* fixed bugs in the unit tests after the batch-support change and so on
test_int8_int8_fp_matmul_dual_block_deq still fails and needs further debugging though

* weight-only quantization example and test are added to check_snr

* matmul_ext basic check added as unit test under tests/unit

* move triton ops under inference/triton

* restore triton_ops.py

* import path correction

* restore ds_mlp and ds_attention

* shaping bug with batching in matmul_ext fixed
changed the gelu computation to use libdevice.erf instead of the sigmoid approximation
(otherwise, the roberta unit test fails)
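(For context on the erf-vs-sigmoid change: exact GeLU uses erf, which is what libdevice provides, while the sigmoid form is only an approximation. A torch-level sketch of the difference; the exact constants used in the kernel may differ.)

```python
import torch

def gelu_erf(x: torch.Tensor) -> torch.Tensor:
    # exact GeLU, matching an erf-based kernel
    return 0.5 * x * (1.0 + torch.erf(x / 1.4142135623730951))

def gelu_sigmoid_approx(x: torch.Tensor) -> torch.Tensor:
    # common sigmoid-based approximation; close, but not identical,
    # which is why a strict parity test against torch can fail
    return x * torch.sigmoid(1.702 * x)
```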

* triton ops added, with a config option to use them through op_binding

* Triton transformer added: InferenceTransformerFactory, TritonTransformer, TritonSelfAttention, TritonMLP and so forth

* Triton wrapper classes added

* added simple triton eval scripts

* rename the new benchmark script for triton-bert

* added triton attention, triton layer-norm/softmax

* adds tests to measure attention perf in triton and others

* changed triton flash attn function name

* attention set to use triton non-flash by default

* enable triton for bert

* made update_autotune_table false by default because it degrades perf

* temp commit with debugging/profiling codes

* temporary debugging/profiling code lines added, need to be cleaned up later

* clean-up

* unit tests for triton inference ops are now passing

* removed unnecessary triton kernels

* test_inference passes

* removed debugging/profiling codes

* triton==2.0.0.dev20221202

* clean-up so the formatting check passes
added a layer_norm test without residual-add

* set triton version requirement

* further clean-up

* removed redundant files

* readme for triton matmul

* clean-up and add more test for triton-matmul

* typo

* removed more obsolete triton kernels and tests

* removed unnecessary TransformerInferenceFactory class

* removed obsolete test

* formatting check, cleanup

* formatting fix: added copyright to the head

* formatting: missing license added

* add pytest skip condition to test_matmul_ext

* formatting fix

* formatting

* added --forked option to inference_ops unit pytests

* Revert "added --forked option to inference_ops unit pytests"

This reverts commit 743b86d354b041172b06e4a8505f43ddd4c2544a.

* changed the pytest mark for softmax to be inference_ops

* formatting fix

* cleanup comments

* add missing import

* keep only fp16 matmuls because it's out of this PR's scope
int8-based gemm kernels will be added later

* removed the previous matmul_ext test

* triton quantization kernel removed too

* clean up comments

* added comments for license

* triton matmul always read the autotune table when imported and write the final table when closing
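(A minimal sketch of the read-on-import / write-on-exit pattern described above, using an illustrative cache path, pickle, and atexit; the real implementation's file format and location may differ.)

```python
import atexit
import os
import pickle

_TABLE_PATH = os.path.expanduser("~/.cache/triton_matmul_autotune.pkl")  # illustrative path
_autotune_table = {}

# read the cached autotune table when the module is imported
if os.path.exists(_TABLE_PATH):
    with open(_TABLE_PATH, "rb") as f:
        _autotune_table = pickle.load(f)

def _save_autotune_table():
    os.makedirs(os.path.dirname(_TABLE_PATH), exist_ok=True)
    with open(_TABLE_PATH, "wb") as f:
        pickle.dump(_autotune_table, f)

# write the (possibly updated) table back when the interpreter exits
atexit.register(_save_autotune_table)
```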

* modified triton kernels to take a new transposed_model arg

* added license note to files

* set default mlp kernel to be cuda as it's better than triton kernel with bert

* adds changes missed from the prev commit

* added license notes
increased DEEPSPEED_TEST_TIMEOUT from 600 to 900 for triton compilation

* added unit test for triton attention

* moved tests in layer_norm.py to test_layer_norm.py

* removed commented code lines

* removed triton from the main requirement as commented in PR

* follow PascalCase convention in class naming as suggested from pr review

* changes to make deepspeed work without triton
specifically, resolves an error when importing any triton ops
added code that checks the availability of triton and skips the tests if it's not available
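(A minimal sketch of the availability check and test-skip pattern; the flag and marker names are illustrative.)

```python
import pytest

try:
    import triton  # noqa: F401
    HAS_TRITON = True
except ImportError:
    HAS_TRITON = False

# placed at the top of triton-specific test modules
pytestmark = pytest.mark.skipif(not HAS_TRITON,
                                reason="triton is not installed")
```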

* added a feature to run triton autotune at initialization, i.e., at op-building phase

* fix for the lint/formatting
added " # noqa: F401"

* move triton-bert-benchmark.py to microsoft/DeepSpeedExamples

* modify the code as suggested from PR

* make DEEPSPEED_TEST_TIMEOUT in unit test back to 600s

* made an option to skip triton-autotune in config

* lint fix for formatting

* removed repeated has_triton check when importing triton
also addresses a PR comment

* removed duplicated triton_autotune arg passing

* upgrade to triton 2.0
pydantic.validator for use_triton
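(A minimal sketch of what a pydantic validator for a use_triton flag can look like, in pydantic v1 style; the class and field layout are illustrative, not DeepSpeed's actual config model.)

```python
import importlib.util
from pydantic import BaseModel, validator

class InferenceConfigSketch(BaseModel):
    use_triton: bool = False

    @validator("use_triton")
    def _requires_triton(cls, value):
        # reject use_triton=True when the triton package is not installed
        if value and importlib.util.find_spec("triton") is None:
            raise ValueError("use_triton=True requires the triton package")
        return value
```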

* move triton specific op mapping into model_implementation as commented from PR

* removed commented lines

* need to cite where the file came from, as commented from the PR review

* change for the recent merge with the master

* qkv-gemm change to make distilbert work after the merge with the master

* format fix

* fix triton attention qkv passing for non-pre-norm
requirements all use triton 2.0.0

* skip autotune in test_matmul and test_attention with triton

* formatting with pre-commit

* add config for v100 test in matmul_4d kernel (small shared mem requirement)

* inject triton kernels only in BERT and report it through log_dist
set triton to the latest version in requirements

* reduced the config and added mem check for matmul_4d

* added README.md tutorial page for triton-deepspeed

* typo in README

* refine README

* refine readme

* refine readme

* refine readme

* "Fix apex install bugs #3741"

---------

Co-authored-by: Arash Bakhtiari <[email protected]>
Co-authored-by: Stephen Youn <[email protected]>
Co-authored-by: Cheng Li <[email protected]>
Co-authored-by: Ethan Doe <[email protected]>
Co-authored-by: yidoe <[email protected]>
Co-authored-by: Jeff Rasley <[email protected]>
@mrwyattii deleted the minjiaz/zero-offload branch on July 7, 2023 at 02:41