
support dynamic sequence length #320

Closed

Conversation

RezaYazdaniAminabadi
Contributor

I made some changes to the transformer kernel code to support dynamic sequence lengths.

@conglongli
Contributor

Thanks Reza for implementing this feature so fast! Some background motivation: since the Bing BERT validation data has a 512 sequence length, we couldn't calculate the validation loss during seq128 pretraining when using the DeepSpeed transformer kernel. This is because the DeepSpeed transformer kernel remembers a fixed sequence length at initialization. Reza, Elton, and I discussed this and thought that supporting dynamic sequence lengths would be quite useful beyond just my experiments, so Reza implemented the feature and included a unit test. I will test Reza's implementation in my pre-training experiments and let you know whether it works.
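
For illustration, a minimal C++ sketch of the idea (hypothetical type and member names, not the actual DeepSpeed source): the layer caches the sequence length it was built with and updates it whenever a batch with a different length arrives.

// Hypothetical sketch only; names are illustrative, not the DeepSpeed kernel code.
struct LayerConfigSketch {
    int batch_size;
    int seq_length;
};

struct TransformerLayerSketch {
    LayerConfigSketch config;

    // Re-cache the sequence length when an incoming batch differs from the one
    // the layer was initialized with (e.g. 512-token validation batches during
    // 128-token pretraining).
    void SetSeqLength(int seq_len, int bsz)
    {
        config.seq_length = seq_len;
        config.batch_size = bsz;
        // A real implementation would also re-derive per-call buffer sizes here.
    }

    void Forward(int bsz, int seq_len /*, input/output tensors ... */)
    {
        if (seq_len != config.seq_length) { SetSeqLength(seq_len, bsz); }
        // ... launch the attention/GELU/layer-norm kernels over bsz * seq_len tokens ...
    }
};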

@RezaYazdaniAminabadi
Contributor Author

Thanks Conglong for bringing up this nice feature :-) I hope it unblocks your testing and we can add this feature to the kernel.

@conglongli marked this pull request as draft on August 16, 2020 00:57
@conglongli marked this pull request as ready for review on August 17, 2020 18:26
@@ -18,6 +18,7 @@ class Dropout {
}

float RATIO() const { return training ? ratio : 0.0; }
inline void SetDim(uint32_t d) { dim = d; }
Contributor

looks "batch" is useless in config, remove it?

Contributor Author

Yes, I will remove that.
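
As a rough sketch of that direction (illustrative names only, assuming the dropout size is supplied per call rather than baked into the config):

#include <cstdint>

// Illustrative sketch; not the actual Dropout class from the kernel.
struct DropoutSketch {
    float ratio;
    bool training;
    uint32_t dim;  // per-row width, set at call time via SetDim

    float RATIO() const { return training ? ratio : 0.0f; }
    void SetDim(uint32_t d) { dim = d; }
    // Forward would take the row count (batch * seq_len) as an argument,
    // so no batch field needs to live in the config at all.
};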

@@ -121,11 +121,17 @@ class BertTransformerLayer {

void SetIntermediateBuffers(uint8_t* attn_prob_dropout_mask_ptr,
uint8_t* attn_output_dropout_mask_ptr,
uint8_t* layer_output_dropout_mask_ptr);
uint8_t* layer_output_dropout_mask_ptr,
T*,
Contributor

Better to give each parameter a meaningful name.
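
For example, a purely hypothetical naming sketch (the real purposes of the anonymous T* arguments may differ):

#include <cstdint>

// Declaration-only sketch illustrating descriptive parameter names.
template <typename T>
struct BertTransformerLayerSketch {
    void SetIntermediateBuffers(uint8_t* attn_prob_dropout_mask_ptr,
                                uint8_t* attn_output_dropout_mask_ptr,
                                uint8_t* layer_output_dropout_mask_ptr,
                                T* attn_layer_norm_var,    // instead of an anonymous T*
                                T* attn_layer_norm_mean);  // instead of an anonymous T*
};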

@@ -28,14 +28,12 @@ class Gelu {
T* output,
cudaStream_t stream)
{
launch_bias_gelu<T>(
input_buf, bias, output, _config.intermediate_size, bsz, _config.seq_length, stream);
launch_bias_gelu<T>(input_buf, bias, output, _config.intermediate_size, bsz, stream);
Contributor

both "batch" and "seq_length" in config can be removed?


inline void SetMean(T* mean)
{
if (!mean) { throw std::runtime_error("Normalize mean is null."); }
Contributor

Should we check config.use_mean here for consistency? Or remove "use_mean" and just check whether mean is a nullptr?

Contributor Author

No, I cannot do that. These two mean different things: use_mean may be true or false depending on the layer-norm inversion, whereas this check only verifies whether the mean buffer has been allocated from the outside. So SetMean should not be called when the normalize_invertible flag is set to true.
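
In other words, something like the following sketch (flag and field names are assumptions): the null check guards the externally allocated buffer, while the invertibility flag decides whether SetMean is ever called at all.

#include <stdexcept>

// Illustrative sketch only; not the actual normalize-layer code.
template <typename T>
struct NormalizeSketch {
    bool use_mean;              // whether the layer-norm math needs the mean at all
    bool normalize_invertible;  // when true, the mean buffer is not kept externally
    T* mean = nullptr;

    void SetMean(T* m)
    {
        // Only verifies that the caller actually allocated the buffer;
        // callers skip SetMean entirely when normalize_invertible is true.
        if (!m) { throw std::runtime_error("Normalize mean is null."); }
        mean = m;
    }
};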

@@ -34,7 +34,12 @@ int cublas_gemm_ex(cublasHandle_t handle,
algo);

if (status != CUBLAS_STATUS_SUCCESS) {
fprintf(stderr, "!!!! kernel execution error.\n");
fprintf(stderr,
"!!!! kernel execution error. (m: %d, n: %d, k: %d, error : %d) \n",
Contributor

"error :" -> "error:"

Contributor

There are several similar cases below.
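
One way to keep the message format identical across those call sites (a sketch, not the existing code) is to funnel them through a single reporting helper:

#include <cstdio>
#include <cstdlib>
#include <cublas_v2.h>

// Sketch of a shared reporter so "error:" is spelled the same way everywhere.
inline int report_gemm_failure(cublasStatus_t status, int m, int n, int k)
{
    if (status != CUBLAS_STATUS_SUCCESS) {
        std::fprintf(stderr,
                     "!!!! kernel execution error. (m: %d, n: %d, k: %d, error: %d)\n",
                     m, n, k, static_cast<int>(status));
        return EXIT_FAILURE;
    }
    return 0;
}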

layer_output_dropout_mask,
norm2_var,
norm2_mean,
norm3_var,
Contributor

We have two norm layers; I suggest renaming all "norm2" and "norm3" variables to meaningful names.

Contributor Author

Great point! :-) I have been meaning to do that for a long time and kept forgetting!
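
A possible rename sketch (which index corresponds to which layer norm is an assumption here):

// Illustrative aliases instead of the positional norm2/norm3 names.
template <typename T>
struct LayerNormBuffersSketch {
    T* attn_output_norm_var;    // was norm2_var: layer norm applied after the attention output
    T* attn_output_norm_mean;   // was norm2_mean
    T* layer_output_norm_var;   // was norm3_var: layer norm applied after the FFN output
    T* layer_output_norm_mean;  // was norm3_mean
};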


inline int GetBatchSize() const { return _batch_size; }
inline int GetNumHeads() const { return _heads; }
inline int GetSeqLength() const { return _seq_length; }

void SetSeqLength(int seq_len, int bsz);
Contributor

Is this used somewhere?

Contributor Author

@RezaYazdaniAminabadi commented Aug 19, 2020

inline int GetSeqLength() const { return config_.seq_length; }
inline size_t GetSeqLength() const { return config_.seq_length; }

inline void SetSeqlen(size_t seq_len) { config_.seq_length = seq_len; }
Contributor

Suggest naming it "SetSeqLength" to be consistent with "GetSeqLength".
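
The consistent pair could look like this sketch (config_ follows the field name in the snippet above; the surrounding types are illustrative):

#include <cstddef>

// Sketch of the suggested naming; not the actual class.
struct SeqConfigSketch {
    size_t seq_length = 0;
};

struct GemmWrapperSketch {
    SeqConfigSketch config_;

    inline size_t GetSeqLength() const { return config_.seq_length; }
    inline void SetSeqLength(size_t seq_len) { config_.seq_length = seq_len; }
};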

@RezaYazdaniAminabadi force-pushed the reyazda/support_dynamic_seqlength branch from ea5841c to 643c33b on September 11, 2020 17:17
@HFadeel

HFadeel commented Sep 17, 2020

It would be great if this could be merged. I'm really excited about it.

@RezaYazdaniAminabadi
Contributor Author

Hi @HFadeel,

We are working on merging this soon.
Thanks for the interest :)

Reza

@RezaYazdaniAminabadi
Contributor Author

Hi @HFadeel

We have merged these changes into the master branch now. Please pull the latest DeepSpeed changes to use the feature.

Thank you.
Reza

jeffra added a commit that referenced this pull request Jul 18, 2022
* Staging compression library v1 (#314)

* prototype

* add sparse/row/head pruning

* add bert test examples, not testing yet

* rm moq

* add deepspeed based glue example to test compression

* add get/set attr

* tested replacement module

* Customized Linear Layer Accuracy Checked without any compression technique

* sparse pruning tested

* head pruning tested

* row pruning tested

* enable act dy quantization

* change l1 mask to buffer for better resume training

* add final model saving helper function, only for sparse pruning now

* tested sparse pruning resume training and final model saving

* row pruning resume training and final saving checked

* head pruning resuming training / final model saving

* rm bert from deepspeed

* restruct the code

* add mixed-precision quantization support

* add binary/ternary support

* add weight quantization FP16 assert

* add conv2d

* add compression function

* move config generation to deepspeed side, need elton to take a look

* add activation quantization support

* add sparse pruning support

* add row pruning

* add head pruning

* add channel pruning

* support matching patterns for module names

* update

* fix typo in fix_compression

* add compression scheduler, rm the offset scheduler from MoQ

* fix some errors in head pruning, support redundant cleaning (naive version)

* add dim-reduction redundant cleaning

* update linear layer

* make cnn example work

* add bn2d

* fix bias issue

* add static act quantization

* support mpu row/column parallel linear layer

* add skip_bias_add for mpu linear layers

* make mpu compress work, remove_redundent is not tested yet

* fix several small errors

* add conv1d to linear converter function

* add conv1d to linear converter function

* add conv1d to linear converter function

* make dy-act-quantization per-token or per-image

* cleaning part of the code; more is coming

* enable forward weight quantization which supports both FP32 and some tricky settings

* update readme

* Update README.md

* naming cleaning

* fix static activation loading issue

* update parameter

* Update utils.py

fix a typo

* fix typo

* fix typo

* replace expand_as with view

* Zheweiyao/compression library (#304)

* add forward weight quantization constraint

* add quantize_weight_in_forward warning: a lot of features are not supported

* offset 0 fixing

* add forward weight quantization constraint

* add quantize_weight_in_forward warning: a lot of features are not supported

* offset 0 fixing

* fix a small issue

* omit bias if the model does not have bias

* add contiguous to avoid memory issue

* add scale associated to weight, so people can quantize the weight after training

* add fix weight quantization, change name based on constant.py file

* disable eigen-based MoQ

* When a method is disabled (enable: false), we do not need to initialize its related parameters

* weight quantization cleaning

* fix get_quantize_enabled missing problem

* fix redundant cleaning issue, make sure we either get mask from related-module or we enable the method in config

* sort the redundant cleaning step, so we always do quantization, then sparse pruning, then others

* a lot of comment cleaning and args explanation

* add args in config-json.md

* fix format issue

* fix quantization offset step=1 with FP16 optimizer

* Zheweiyao/compression library from s1 (#305)

* add binary/ternary support for FP32 training; this is used to resolve FP16 unstable extreme compression training

* add embedding quantization support

* Xiaoxia/compression library v1 (#307)

* add layer reduction (Xiaoxia/Zhewei)

* fixing bug for sym activation and clean layer reduction (Xiaoxia)

* fixing compression initialization (Xiaoxia/Zhewei)

* fix format issue (#310)

* Xiaoxia/compression library v1 (#311)

* add layer reduction

* fixing bug for sym activation and clean layer reduction

* fixing compression initialization

* pre-commit...

* Zheweiyao/compression library from s1 (#312)

* fix format issue

* fix the accuracy mismatch after quantization cleaning

* fix clean_model bug and add layer_reduction configuration

Co-authored-by: yaozhewei <[email protected]>
Co-authored-by: Elton Zheng <[email protected]>
Co-authored-by: Jeff Rasley <[email protected]>

* switch to deepspeed comm

* dummy tutorial

* improve config json

* Zheweiyao/compression library based on s2 (#315)

* change the name and merge layer reduction to init_compression

* add conv1d to linear test unit, fix errors introduced by merging student initialization to init_compression

* Update config-json.md

* fix for cifar10 channel pruning

* fix the block_eigenvalue is None bug

* fix the block_eigenvalue is None bug

* move compression-related constants and configs to compression

* tutorial and json config

Co-authored-by: Xiaoxia (Shirley) Wu <[email protected]>
Co-authored-by: yaozhewei <[email protected]>
Co-authored-by: Elton Zheng <[email protected]>
Co-authored-by: Jeff Rasley <[email protected]>
Co-authored-by: xiaoxiawu <[email protected]>
Co-authored-by: xiaoxiawu <[email protected]>