support dynamic sequence length #320
Conversation
Thanks Reza for implementing this feature so fast! Some background motivation: since the Bing BERT validation data has a sequence length of 512, we couldn't calculate the validation loss during seq128 pretraining when using the DeepSpeed transformer kernel, because the kernel records a fixed sequence length at initialization. Reza, Elton and I discussed this and thought that supporting dynamic sequence lengths would be quite useful beyond my experiments, so Reza implemented this feature and included a unit test. I will test Reza's implementation in my pre-training experiments and let you know whether it works.
Thanks Conglong for suggesting this nice feature :-) I hope it unblocks your testing and we can add this feature to the kernel.
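To make the motivation concrete, here is a back-of-the-envelope sketch (the function name and numbers are illustrative, not DeepSpeed's actual API): the attention-probability dropout mask scales with the square of the sequence length, so a kernel constructed for seq_length=128 allocates far too little workspace for a 512-token validation batch unless the length can be changed after construction.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of mask elements the attention-probability
// dropout needs, sized at kernel construction time.
size_t attn_mask_elements(size_t batch, size_t heads, size_t seq_length)
{
    return batch * heads * seq_length * seq_length;
}
```

With batch 8 and 12 heads, a 512-token batch needs 16x the mask space of a 128-token batch (the ratio of the squared lengths), which is why a fixed-size allocation made at init time cannot serve both.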
@@ -18,6 +18,7 @@ class Dropout {
     }

     float RATIO() const { return training ? ratio : 0.0; }
+    inline void SetDim(uint32_t d) { dim = d; }
It looks like "batch" is unused in the config; remove it?
Yes, I removed it.
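A simplified sketch of the shape this change gives the Dropout helper (illustrative, not the exact DeepSpeed source): the config keeps only what the kernel needs at launch time, the ratio and the per-token dimension, and the dimension is now settable so it can track the current sequence length; the unused "batch" field is gone.

```cpp
#include <cassert>
#include <cstdint>

// Sketch of a Dropout wrapper whose dimension follows the active
// sequence length instead of being fixed at construction.
template <typename T>
class Dropout {
public:
    struct Config {
        float ratio;     // dropout probability during training
        uint32_t dim;    // per-token dimension, updated per batch
        bool training;
    };

    explicit Dropout(Config cfg) : config_(cfg) {}

    // At inference time the effective ratio is zero (no masking).
    float RATIO() const { return config_.training ? config_.ratio : 0.0f; }

    inline void SetDim(uint32_t d) { config_.dim = d; }
    inline uint32_t Dim() const { return config_.dim; }
    inline void SetTrainingMode(bool t) { config_.training = t; }

private:
    Config config_;
};
```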
csrc/includes/ds_transformer_cuda.h
Outdated
@@ -121,11 +121,17 @@ class BertTransformerLayer {

     void SetIntermediateBuffers(uint8_t* attn_prob_dropout_mask_ptr,
                                 uint8_t* attn_output_dropout_mask_ptr,
-                                uint8_t* layer_output_dropout_mask_ptr);
+                                uint8_t* layer_output_dropout_mask_ptr,
+                                T*,
Better to give each parameter a meaningful name.
@@ -28,14 +28,12 @@ class Gelu {
                  T* output,
                  cudaStream_t stream)
     {
-        launch_bias_gelu<T>(
-            input_buf, bias, output, _config.intermediate_size, bsz, _config.seq_length, stream);
+        launch_bias_gelu<T>(input_buf, bias, output, _config.intermediate_size, bsz, stream);
Can both "batch" and "seq_length" be removed from the config?
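A CPU sketch of the idea behind this signature change (a hypothetical helper, not the CUDA kernel itself): bias-add followed by GeLU is purely row-wise, so it only needs the total number of rows it is given and the intermediate size; no fixed seq_length has to live in the config.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Row-wise bias-add + GeLU over `rows` rows of width `intermediate_size`.
// The caller can pass rows = batch * current_seq_length for any batch.
void bias_gelu(const std::vector<float>& input,
               const std::vector<float>& bias,
               std::vector<float>& output,
               int intermediate_size,
               int rows)
{
    output.resize(static_cast<size_t>(rows) * intermediate_size);
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < intermediate_size; ++c) {
            float x = input[r * intermediate_size + c] + bias[c];
            // tanh approximation of GeLU, as commonly used in BERT kernels
            output[r * intermediate_size + c] =
                0.5f * x * (1.0f + std::tanh(0.7978845608f * (x + 0.044715f * x * x * x)));
        }
    }
}
```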
     inline void SetMean(T* mean)
     {
         if (!mean) { throw std::runtime_error("Normalize mean is null."); }
Should we check config.use_mean here for consistency? Or remove "use_mean" and just check whether mean is nullptr?
No, I cannot do that. These two mean different things: use_mean may be true or false depending on the layer-norm inversion, whereas this check only verifies that the mean buffer was allocated from the outside. So SetMean should not be called when the normalize_invertible flag is true.
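A simplified sketch of the distinction described above (names are illustrative): use_mean governs whether the layer-norm keeps a mean for the backward pass, while SetMean only validates that an externally allocated buffer was actually handed in. With the invertible layer-norm there is no external mean buffer, so SetMean would never be called.

```cpp
#include <cassert>
#include <stdexcept>

// Sketch of a Normalize wrapper separating the use_mean semantics
// from the null-check performed in SetMean.
template <typename T>
class Normalize {
public:
    explicit Normalize(bool use_mean) : use_mean_(use_mean), mean_(nullptr) {}

    // Rejects a null buffer; says nothing about use_mean on purpose.
    inline void SetMean(T* mean)
    {
        if (!mean) { throw std::runtime_error("Normalize mean is null."); }
        mean_ = mean;
    }

    bool HasMean() const { return mean_ != nullptr; }
    bool UsesMean() const { return use_mean_; }

private:
    bool use_mean_;
    T* mean_;
};
```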
csrc/transformer/cublas_wrappers.cu
Outdated
@@ -34,7 +34,12 @@ int cublas_gemm_ex(cublasHandle_t handle,
                    algo);

     if (status != CUBLAS_STATUS_SUCCESS) {
-        fprintf(stderr, "!!!! kernel execution error.\n");
+        fprintf(stderr,
+                "!!!! kernel execution error. (m: %d, n: %d, k: %d, error : %d) \n",
"error :" -> "error:"
There are several of the same cases below.
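A small sketch of the error-reporting idea in this hunk (the helper and status codes here are hypothetical; the real wrapper checks cublasStatus_t directly): including the GEMM shape in the failure message makes mismatches caused by a changed sequence length immediately visible in the log.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Builds the diagnostic string for a failed GEMM; returns "" on success
// (status 0, mirroring CUBLAS_STATUS_SUCCESS).
std::string gemm_error_message(int status, int m, int n, int k)
{
    if (status == 0) return "";
    char buf[128];
    std::snprintf(buf, sizeof(buf),
                  "!!!! kernel execution error. (m: %d, n: %d, k: %d, error: %d)",
                  m, n, k, status);
    return std::string(buf);
}
```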
deepspeed/pt/deepspeed_cuda.py
Outdated
     layer_output_dropout_mask,
     norm2_var,
     norm2_mean,
     norm3_var,
We have two norm layers; I suggest renaming all the "norm2"/"norm3" names to meaningful ones.
Great point! :-) I have been meaning to do that for a long time and kept forgetting!
     inline int GetBatchSize() const { return _batch_size; }
     inline int GetNumHeads() const { return _heads; }
     inline int GetSeqLength() const { return _seq_length; }

     void SetSeqLength(int seq_len, int bsz);
Is this used anywhere?
Yes, it is used in ds_transformer_cuda.cpp: https://github.com/microsoft/DeepSpeed/blob/reyazda/support_dynamic_seqlength/csrc/transformer/ds_transformer_cuda.cpp#L708
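A hedged sketch of how such a per-call update can look (this is not the actual linked ds_transformer_cuda.cpp code; names are illustrative): the layer compares the incoming batch's sequence length against the configured one and only updates its state, and reallocates workspace, when they differ.

```cpp
#include <cassert>

// Sketch of a transformer layer that adapts to each batch's sequence length.
class BertTransformerLayerSketch {
public:
    BertTransformerLayerSketch(int bsz, int seq_len)
        : _batch_size(bsz), _seq_length(seq_len), resize_count_(0) {}

    inline int GetSeqLength() const { return _seq_length; }

    void SetSeqLength(int seq_len, int bsz)
    {
        _seq_length = seq_len;
        _batch_size = bsz;
        ++resize_count_;  // stands in for resizing workspace buffers
    }

    void Forward(int input_seq_len, int bsz)
    {
        // Only pay the update cost when the length actually changes.
        if (input_seq_len != _seq_length) { SetSeqLength(input_seq_len, bsz); }
        // ... launch attention / gelu / layer-norm kernels here ...
    }

    int ResizeCount() const { return resize_count_; }

private:
    int _batch_size;
    int _seq_length;
    int resize_count_;
};
```

This is the pattern that lets seq128 pretraining and seq512 validation share one kernel instance: steady-state batches hit the fast path, and only a length change triggers the update.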
csrc/includes/softmax.h
Outdated
-    inline int GetSeqLength() const { return config_.seq_length; }
+    inline size_t GetSeqLength() const { return config_.seq_length; }

+    inline void SetSeqlen(size_t seq_len) { config_.seq_length = seq_len; }
I suggest naming it "SetSeqLength" to be consistent with "GetSeqLength".
Force-pushed from ea5841c to 643c33b.
It would be great if this could be merged. I'm really excited about it!
Hi @HFadeel, we are working on merging this soon. Reza
Hi @HFadeel, we have merged these changes into the master branch now. Please pull the new DeepSpeed changes to use the feature. Thank you.
Staging compression library v1 (#314)
* prototype
* add sparse/row/head pruning
* add bert test examples, not testing yet
* rm moq
* add deepspeed based glue example to test compression
* add get/set attr
* tested replacement module
* customized Linear layer accuracy checked without any compression technique
* sparse pruning tested
* head pruning tested
* row pruning tested
* enable dynamic activation quantization
* change l1 mask to buffer for better resume training
* add final model saving helper function, only for sparse pruning now
* tested sparse pruning resume training and final model saving
* row pruning resume training and final saving checked
* head pruning resume training / final model saving
* rm bert from deepspeed
* restructure the code
* add mixed-precision quantization support
* add binary/ternary support
* add weight quantization FP16 assert
* add conv2d
* add compression function
* move config generation to deepspeed side, need Elton to take a look
* add activation quantization support
* add sparse pruning support
* add row pruning
* add head pruning
* add channel pruning
* support matching patterns for module names
* update
* fix typo in fix_compression
* add compression scheduler, rm the offset scheduler from MoQ
* fix some errors in head pruning, support redundancy cleaning (naive version)
* add dim-reduction redundancy cleaning
* update linear layer
* make cnn example work
* add bn2d
* fix bias issue
* add static activation quantization
* support mpu row/column parallel linear layer
* add skip_bias_add for mpu linear layers
* make mpu compress work, remove_redundent is not tested yet
* fix several small errors
* add conv1d to linear converter function
* make dynamic activation quantization per-token or per-image
* cleaning part of the code; more is coming
* enable forward weight quantization which supports both FP32 and some tricky settings
* update readme
* Update README.md
* naming cleaning
* fix static activation loading issue
* update parameter
* Update utils.py: fix a typo
* fix typo
* replace expand_as with view
* Zheweiyao/compression library (#304)
  * add forward weight quantization constraint
  * add quantize_weight_in_forward warning: a lot of features are not supported
  * offset 0 fixing
  * fix a small issue
  * omit bias if the model does not have bias
  * add contiguous to avoid memory issue
  * add scale associated to weight, so people can quantize the weight after training
  * add fixed weight quantization, change name based on constant.py file
  * disable eigen-based MoQ
  * when a method is disabled (enable: false), we do not need to initialize its related parameters
  * weight quantization cleaning
  * fix get_quantize_enabled missing problem
  * fix redundancy-cleaning issue, make sure we either get the mask from the related module or we enable the method in config
  * sort the redundancy cleaning steps, so we always do quantization, then sparse pruning, then others
  * a lot of comment cleaning and args explanation
  * add args in config-json.md
  * fix format issue
  * fix quantization offset step=1 with FP16 optimizer
* Zheweiyao/compression library from s1 (#305)
  * add binary/ternary support for FP32 training; this is used to resolve FP16-unstable extreme compression training
  * add embedding quantization support
* Xiaoxia/compression library v1 (#307)
  * add layer reduction (Xiaoxia/Zhewei)
  * fix bug for sym activation and clean layer reduction (Xiaoxia)
  * fix compression initialization (Xiaoxia/Zhewei)
* fix format issue (#310)
* Xiaoxia/compression library v1 (#311)
  * add layer reduction
  * fix bug for sym activation and clean layer reduction
  * fix compression initialization
  * pre-commit...
* Zheweiyao/compression library from s1 (#312)
  * fix format issue
  * fix the accuracy mismatch after quantization cleaning
* fix clean_model bug and add layer_reduction configuration
* switch to deepspeed comm
* dummy tutorial
* improve config json
* Zheweiyao/compression library based on s2 (#315)
  * change the name and merge layer reduction into init_compression
  * add conv1d-to-linear test unit, fix errors introduced by merging student initialization into init_compression
* Update config-json.md
* fix for cifar10 channel pruning
* fix the block_eigenvalue is None bug
* move compression-related constants and configs to compression
* tutorial and json config

Co-authored-by: yaozhewei <[email protected]>
Co-authored-by: Elton Zheng <[email protected]>
Co-authored-by: Jeff Rasley <[email protected]>
Co-authored-by: Xiaoxia (Shirley) Wu <[email protected]>
Co-authored-by: xiaoxiawu <[email protected]>
I made some changes to the transformer kernel code to support dynamic sequence lengths.