
Minjiaz/zero offload #382

Merged: 7 commits from minjiaz/zero-offload into master on Sep 10, 2020
Conversation

@minjiaz (Contributor) commented on Sep 9, 2020

Added a feature page for ZeRO-Offload, which should appear under "What's new". My understanding is that it will be picked up automatically by the embedded script in index.md and shown on the front page, so no additional hyperlink is required. Is that correct?

@jeffra merged commit 59ce90d into master on Sep 10, 2020
stephen-youn added a commit that referenced this pull request on Jun 14, 2023
* Add residual_add triton op
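(For readers unfamiliar with Triton ops, below is a much-simplified, elementwise sketch of what a residual_add kernel can look like; the actual DeepSpeed kernel also fuses bias addition and activation per later commits, and all names here are illustrative.)

```python
import torch
import triton
import triton.language as tl

@triton.jit
def _residual_add_kernel(hidden_ptr, residual_ptr, out_ptr, n_elements,
                         BLOCK_SIZE: tl.constexpr):
    # each program instance handles one contiguous block of elements
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    hidden = tl.load(hidden_ptr + offsets, mask=mask)
    residual = tl.load(residual_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, hidden + residual, mask=mask)

def residual_add(hidden: torch.Tensor, residual: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(hidden)
    n = hidden.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    _residual_add_kernel[grid](hidden, residual, out, n, BLOCK_SIZE=1024)
    return out
```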

* add support of gptj style models to triton residual_add kernel

* fix the residual_add tests

* Add support of end-to-end runs for residual_add triton kernels

* Fix the MLP output tensor's shape

* Fix the output tensor of residual_add_func python call

* triton matmul kernels with python wrapper class added with pytests

* clean-up and make it read autotune table when importing

* fixed import problems with the naming

* enable update_autotune_table for every forward in matmul

* an int4-into-int8 weight packing function added
test parameters cover aligned shapes only (i.e. integer multiples of block_size in the matmul kernel); this will be investigated further
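(A minimal sketch of the int4-into-int8 packing idea, assuming two signed 4-bit values are stored in the low and high nibbles of one int8; function names are illustrative, not the ones added in this commit.)

```python
import torch

def pack_int4_to_int8(w_int4: torch.Tensor) -> torch.Tensor:
    """Pack pairs of signed int4 values (held in an int8 tensor, range [-8, 7])
    into single int8 values: even elements in the low nibble, odd in the high."""
    assert w_int4.numel() % 2 == 0, "expects an even number of elements"
    nibbles = (w_int4.flatten() & 0xF).to(torch.uint8)  # two's-complement nibbles
    low, high = nibbles[0::2], nibbles[1::2]
    return ((high << 4) | low).view(torch.int8)

def unpack_int8_to_int4(packed: torch.Tensor) -> torch.Tensor:
    """Inverse of pack_int4_to_int8; restores the signed int4 values as int8."""
    p = packed.view(torch.uint8).to(torch.int16)
    low, high = p & 0xF, p >> 4
    low = torch.where(low > 7, low - 16, low)    # sign-extend the nibbles
    high = torch.where(high > 7, high - 16, high)
    return torch.stack((low, high), dim=1).flatten().to(torch.int8)
```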

* lint

* quantization added
int8-packed-int4-fp16 matmul-block-deq added
illegal CUDA memory access bug in the triton matmul kernel fixed (i.e. a memory boundary problem)

* add torch block quantization
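(A minimal torch sketch of symmetric block quantization, assuming "block" means a contiguous group of values sharing one scale; names and defaults are illustrative, not the commit's actual API.)

```python
import torch

def block_quantize(x: torch.Tensor, block_size: int = 128, bits: int = 8):
    """Symmetric per-block quantization: every contiguous group of `block_size`
    values shares one scale. Assumes x.numel() is a multiple of block_size."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for int8, 7 for int4
    blocks = x.flatten().reshape(-1, block_size)
    scales = blocks.abs().amax(dim=1, keepdim=True) / qmax
    scales = scales.clamp(min=1e-8)                 # guard against all-zero blocks
    q = torch.clamp(torch.round(blocks / scales), -qmax - 1, qmax).to(torch.int8)
    return q, scales

def block_dequantize(q: torch.Tensor, scales: torch.Tensor, shape) -> torch.Tensor:
    return (q.to(scales.dtype) * scales).reshape(shape)
```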

* dual quantization matmul added

* cleanup, fix for lint

* documentation
lint fix

* README added

* typo

* updated the kernel to have fused bias addition and activation too

* Add residual_add triton op

* modified quantization to take additional bits, more than int8

* enable triton residual_add kernel in DS MLP

* Add flash attention kernel and glue code

* additional scale-norm added for weight

* a temporary example for quantization added

* comments

* use the exact same ds quantizer as reference

* added scale-norm (i.e. scale-of-scale) to both triton/torch version
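(One way to read "scale-norm (scale-of-scale)" is a second-level quantization of the per-block scales themselves; the sketch below is that interpretation only, not the kernel's actual implementation.)

```python
import torch

def quantize_scales(scales: torch.Tensor, scale_bits: int = 8):
    """Second-level ("scale-of-scale") quantization: the floating-point scales
    produced by block quantization are themselves quantized with one global scale."""
    qmax = 2 ** (scale_bits - 1) - 1
    scale_of_scale = scales.abs().amax() / qmax
    q_scales = torch.clamp(torch.round(scales / scale_of_scale), -qmax - 1, qmax).to(torch.int8)
    return q_scales, scale_of_scale
```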

* snr check with fused-deq-gemm for block_deq and dual_block_deq
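(The SNR checks compare a quantized/fused kernel's output against a full-precision reference; a minimal sketch of such a metric, with an illustrative helper name:)

```python
import torch

def snr_db(reference: torch.Tensor, approx: torch.Tensor) -> float:
    """Signal-to-noise ratio, in dB, of `approx` relative to `reference`."""
    noise = (reference - approx).float().pow(2).sum()
    signal = reference.float().pow(2).sum()
    return (10.0 * torch.log10(signal / noise)).item()
```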

* makes matmul kernels work on A6000 with smaller memory
w8a8/w4a8 with symmetric block quantization on activations and row- (or col-)wise quantization on weights works (snr test added)

* Add layer norm triton kernel

* Add gelu triton kernel

* Add softmax triton kernel

* Rename flash attn api

* add triton gemm kernels

* fix formatting of triton kernels

* Add matmul triton kernels

* Updated Triton Gelu to use non-approx computation

* Updated Triton Gemm for f16 bias-add parity

* Add DS triton encoder layer

* Updated Softmax to work around block size 1

* fix the issue caused by merge conflict

* Add triton layer norm unittests

* dual-qblock snr verified too

* Add triton gelu kernel unittests

* Add triton softmax kernel unittests

* fix flash kernels formatting (#382)

* Add triton dependency to unittests workflow (#381)

* w8a8 and w8a4 matmul with block quantization verified

* Allow Gemm & MatMul to take arbitrary dimensions

* Add triton matmul kernel unittests

* fix triton dependency in github CI workflows

* Fix matmul launching grid

* fix formatting

* Add triton gemm kernel unittests

* modified dual-qblock to support wider scale_bits with int64 acc and vec-ops, which caused perf degradation
workaround is to use "v2" kernel added with internal shift ops but not enabled yet

* fix residual in gemm_3d kernel

* Add flash attention triton kernels unit tests

* test_matmul and test_gemm pass (but with smaller coverage as mentioned in the code)
float32 can be supported later

* added 'triton_gemm_eval.py'
a temporary script to evaluate the accuracy of the triton matmul against the torch matmul

* typo

* typo

* root-caused the parity error with fused_gelu: it is not in gelu but in the residual addition.
disabled the residual addition; it still needs debugging

* location of residual addition in reference modified to be after the activation

* fixed index typo in the snr plot

* Fix triton attention kernel unit tests

* fix formatting

* added batch support in matmul
row/col-wise quantization matmul debugged

* fixed bugs in the unit tests after the batch-support change and so on
test_int8_int8_fp_matmul_dual_block_deq still fails and needs further debugging though

* weight-only quantization example and test are added to check_snr

* matmul_ext basic check added as unit test under tests/unit

* move triton ops under inference/triton

* restore triton_ops.py

* import path correction

* restore ds_mlp and ds_attention

* shaping bug with batching in matmul_ext fixed
changed the gelu computation to use libdevice.erf instead of the sigmoid approximation
(otherwise, the roberta unit test fails)
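(For context on the erf-vs-sigmoid change: exact GeLU uses erf, which is what libdevice provides, while the sigmoid form is only an approximation. A torch-level sketch of the difference; the exact constants used in the kernel may differ.)

```python
import torch

def gelu_erf(x: torch.Tensor) -> torch.Tensor:
    # exact GeLU, matching an erf-based kernel
    return 0.5 * x * (1.0 + torch.erf(x / 1.4142135623730951))

def gelu_sigmoid_approx(x: torch.Tensor) -> torch.Tensor:
    # common sigmoid-based approximation; close, but not identical,
    # which is why a strict parity test against torch can fail
    return x * torch.sigmoid(1.702 * x)
```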

* triton ops added, with a config option to use them through op_binding

* Triton transformer added: InferenceTransformerFactory, TritonTransformer, TritonSelfAttention, TritonMLP and so forth

* Triton wrapper classes added

* added simple triton eval scripts

* rename the new benchmark script for triton-bert

* added triton attention, triton layer-norm/softmax

* adds tests to measure attention perf in triton and others

* changed triton flash attn function name

* attention set to use triton non-flash by default

* enable triton for bert

* made update_autotune_table false by default because it degrades perf

* temp commit with debugging/profiling codes

* temporary debugging/profiling code lines added, need to be cleaned up later

* clean-up

* unit tests for triton inference ops are now passing

* removed unnecessary triton kernels

* test_inference passes

* removed debugging/profiling codes

* triton==2.0.0.dev20221202

* clean-up so the formatting check passes
added a layer_norm test without residual-add

* set triton version requirement

* further clean-up

* removed redundant files

* readme for triton matmul

* clean-up and add more test for triton-matmul

* typo

* removed more obsolete triton kernels and tests

* removed unnecessary TransformerInferenceFactory class

* removed obsolete test

* formatting check, cleanup

* formatting fix: added copyright to the head

* formatting: missing license added

* add pytest skip condition to test_matmul_ext

* formatting fix

* formatting

* added --forked option to inference_ops unit pytests

* Revert "added --forked option to inference_ops unit pytests"

This reverts commit 743b86d354b041172b06e4a8505f43ddd4c2544a.

* changed the pytest mark for softmax to be inference_ops

* formatting fix

* cleanup comments

* add missing import

* keep only fp16 matmuls because it's out of this PR's scope
int8-based gemm kernels will be added later

* removed the previous matmul_ext test

* triton quantization kernel removed too

* clean up comments

* added comments for license

* triton matmul always read the autotune table when imported and write the final table when closing
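(A minimal sketch of the read-on-import / write-on-exit pattern described above, using an illustrative cache path, pickle, and atexit; the real implementation's file format and location may differ.)

```python
import atexit
import os
import pickle

_TABLE_PATH = os.path.expanduser("~/.cache/triton_matmul_autotune.pkl")  # illustrative path
_autotune_table = {}

# read the cached autotune table when the module is imported
if os.path.exists(_TABLE_PATH):
    with open(_TABLE_PATH, "rb") as f:
        _autotune_table = pickle.load(f)

def _save_autotune_table():
    os.makedirs(os.path.dirname(_TABLE_PATH), exist_ok=True)
    with open(_TABLE_PATH, "wb") as f:
        pickle.dump(_autotune_table, f)

# write the (possibly updated) table back when the interpreter exits
atexit.register(_save_autotune_table)
```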

* modified triton kernels to take a new transposed_model arg

* added license note to files

* set default mlp kernel to be cuda as it's better than triton kernel with bert

* adds changes missed from the prev commit

* added license notes
increased DEEPSPEED_TEST_TIMEOUT from 600 to 900 for triton compilation

* added unit test for triton attention

* moved tests in layer_norm.py to test_layer_norm.py

* removed commented code lines

* removed triton from the main requirement as commented in PR

* follow PascalCase convention in class naming as suggested from pr review

* changes to make deepspeed work without triton
specifically, resolves an error when importing any triton ops
added code that checks the availability of triton and skips the tests if it's not available
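(A minimal sketch of the availability check and test-skip pattern; the flag and marker names are illustrative.)

```python
import pytest

try:
    import triton  # noqa: F401
    HAS_TRITON = True
except ImportError:
    HAS_TRITON = False

# placed at the top of triton-specific test modules
pytestmark = pytest.mark.skipif(not HAS_TRITON,
                                reason="triton is not installed")
```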

* added a feature to run triton autotune at initialization, i.e., at op-building phase

* fix for the lint/formatting
added " # noqa: F401"

* move triton-bert-benchmark.py to microsoft/DeepSpeedExamples

* modify the code as suggested from PR

* make DEEPSPEED_TEST_TIMEOUT in unit test back to 600s

* made an option to skip triton-autotune in config

* lint fix for formatting

* removed repeated has_triton check when importing triton
also addresses a PR comment

* removed duplicated triton_autotune arg passing

* upgrade to triton 2.0
pydantic.validator for use_triton
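(A minimal sketch of what a pydantic validator for a use_triton flag can look like, in pydantic v1 style; the class and field layout are illustrative, not DeepSpeed's actual config model.)

```python
import importlib.util
from pydantic import BaseModel, validator

class InferenceConfigSketch(BaseModel):
    use_triton: bool = False

    @validator("use_triton")
    def _requires_triton(cls, value):
        # reject use_triton=True when the triton package is not installed
        if value and importlib.util.find_spec("triton") is None:
            raise ValueError("use_triton=True requires the triton package")
        return value
```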

* move triton specific op mapping into model_implementation as commented from PR

* removed commented lines

* need to cite where the file came from, as commented from the PR review

* change for the recent merge with the master

* qkv-gemm change to make distilbert work after the merge with the master

* format fix

* fix triton attention qkv passing for non-pre-norm
requirements all use triton 2.0.0

* skip autotune in test_matmul and test_attention with triton

* formatting with pre-commit

* add config for v100 test in matmul_4d kernel (small shared mem requirement)

* inject triton kernels only in BERT and report it through log_dist
set triton to the latest version in requirements

* reduced the config and added mem check for matmul_4d

* added README.md tutorial page for triton-deepspeed

* typo in README

* refine README

* refine readme

* refine readme

* refine readme

* "Fix apex install bugs #3741"

---------

Co-authored-by: Arash Bakhtiari <[email protected]>
Co-authored-by: Stephen Youn <[email protected]>
Co-authored-by: Cheng Li <[email protected]>
Co-authored-by: Ethan Doe <[email protected]>
Co-authored-by: yidoe <[email protected]>
Co-authored-by: Jeff Rasley <[email protected]>
@mrwyattii deleted the minjiaz/zero-offload branch on July 7, 2023 at 02:41