Support (Bias)SkipLayerNormalization fusion in GPT2 #13988

Merged: 23 commits into main on Dec 22, 2022

Conversation

@hariharans29 (Member) commented Dec 15, 2022

Description

The GPT2 model has a slightly different SkipLayerNormalization pattern from BERT: the residual passed to the next layer is not the output of the LayerNorm (as in BERT) but rather the sum of the residual and the input feeding the LayerNorm node. Because that intermediate Add output has a consumer, the Add and LayerNorm cannot be fused. This in turn means that any Add following the MatMuls that feed into SkipLayerNormalization won't be fused either, since that fusion requires SkipLayerNormalization to be fused first. This change adds support for this variant by adjusting the schema of SkipLayerNormalization (SLN) to include an optional output that carries the sum of the residual and the input (see the sketch below).


TODO: Add kernel test
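
For illustration, here is a minimal sketch of the two graph forms, built with the `onnx` helper API. This is not the actual onnxruntime fusion code; tensor names are made up, and the exact optional-output position follows the contrib-op schema touched by this PR.

```python
from onnx import helper

# Unfused GPT-2 pattern: the Add output feeds BOTH LayerNormalization and
# the residual connection of the next layer, so a plain Add+LayerNorm
# fusion would orphan the second consumer.
add = helper.make_node("Add", ["input", "residual"], ["add_out"])
layernorm = helper.make_node(
    "LayerNormalization", ["add_out", "gamma", "beta"], ["ln_out"]
)
next_matmul = helper.make_node("MatMul", ["ln_out", "weight"], ["mm_out"])
# Second consumer of add_out: this is what blocks the fusion in GPT-2.
next_add = helper.make_node("Add", ["mm_out", "add_out"], ["next_residual"])

# Fused form enabled by this PR: SkipLayerNormalization exposes the sum as
# an optional extra output, so the downstream Add can be rewired to it.
sln = helper.make_node(
    "SkipLayerNormalization",
    inputs=["input", "residual", "gamma", "beta"],
    outputs=["ln_out", "", "", "add_out"],  # optional sum output kept
    domain="com.microsoft",
)
```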

Motivation and Context

Improve fusion coverage for GPT2 to help improve its performance and that of any language model that uses it.

@hariharans29 hariharans29 changed the title WIP: Support (Bias)SkipLayerNormalization fusion in GPT2 Support (Bias)SkipLayerNormalization fusion in GPT2 Dec 20, 2022
@@ -537,7 +537,14 @@ def fuse_gpt2(self, layernorm, add_before_layernorm, input_name_to_nodes, output
if two_gather is None:
return False

# If the add_before_layernorm node is an Add node, then the add_output output is the first index
@tianleiwu (Contributor) commented Dec 20, 2022


nit: It would be better to also update the comment at the beginning of this function; it only covers the first case, not the second.

@hariharans29 (Member, Author) replied:

I don't see any other comment at the beginning of this function. Are you referring to the args description of is_embedding_sum_needed() before this, by any chance?

@hariharans29 (Member, Author) replied:

Streamlined some logic around optional_embedding_sum_output and add_output, both of which needed the comment. The code should be easier to understand now.
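
For context, a minimal sketch of the branching the diff comment describes, assuming the non-Add case is a previously fused SkipLayerNormalization whose sum lives on its fourth, optional output; the actual fuse_gpt2 implementation differs in details:

```python
# Illustrative only: not the actual fuse_gpt2 code in onnxruntime.
def get_sum_output(add_before_layernorm):
    if add_before_layernorm.op_type == "Add":
        # Plain Add node: the sum is its first (index 0) output.
        return add_before_layernorm.output[0]
    # Assumed alternative: an already-fused SkipLayerNormalization, whose
    # optional sum output (added in this PR) sits at the fourth position.
    return add_before_layernorm.output[3]
```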

onnxruntime/test/python/transformers/test_optimizer.py: review thread resolved (outdated)
@hariharans29 hariharans29 merged commit 7ed8bd4 into main Dec 22, 2022
@hariharans29 hariharans29 deleted the hari/skip_layer_normalization branch December 22, 2022 07:04
tianleiwu added a commit that referenced this pull request Dec 22, 2022
### Description

Add support for ONNX conversion of GPT-2 in two stages (see the sketch after this description):
* Stage 1 is the initial stage that has empty past state. 
* Stage 2 has non-empty past state and sequence_length is 1.

Add a parameter --stage to specify the stage. For stage 1, we enable mask_index for Attention so that fused attention can be used in CUDA.

Other changes:
(1) use int32 inputs by default (otherwise there is an error during inference)
(2) update gpt2_parity to include SkipLayerNormalization (see #13988) and EmbedLayerNormalization
(3) get all environment variables that might impact GPT-2 latency in benchmark_gpt2

### Motivation and Context

To test fused attention for the GPT-2 model for #13953.
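
As a hedged illustration of the two stages, a sketch of the model inputs each stage consumes; shapes, names, and config values are assumptions for the example, not taken from the conversion script:

```python
import numpy as np

# GPT-2 base config (illustrative values).
batch, num_heads, head_size, num_layers = 1, 12, 64, 12

# Stage 1: empty past state; the whole prompt is fed at once.
prompt_len = 8
input_ids = np.zeros((batch, prompt_len), dtype=np.int32)  # int32 per (1)
empty_past = [
    np.zeros((2, batch, num_heads, 0, head_size), dtype=np.float32)
    for _ in range(num_layers)
]

# Stage 2: past state is populated and sequence_length is 1.
past_len = prompt_len
next_token = np.zeros((batch, 1), dtype=np.int32)
past = [
    np.zeros((2, batch, num_heads, past_len, head_size), dtype=np.float32)
    for _ in range(num_layers)
]
```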
henrywu2019 pushed a commit to henrywu2019/onnxruntime that referenced this pull request Dec 26, 2022
henrywu2019 pushed a commit to henrywu2019/onnxruntime that referenced this pull request Dec 26, 2022
wangyems pushed a commit that referenced this pull request Jan 3, 2023
### Description
1. SkipLayerNormalization has a new output (#13988), and the symbolic shape inference script needs corresponding updates (see the sketch below).

2. The greedy sampling op (#13426) shouldn't re-use the logits buffer, as its corresponding kernel doesn't seem to support it yet.

### Motivation and Context
Fix some transformer issues
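
A hedged sketch of the shape-inference rule item 1 calls for; the function name, callbacks, and output index are illustrative and do not mirror onnxruntime's symbolic shape inference script exactly:

```python
# Illustrative rule: every populated output of SkipLayerNormalization that
# carries data (the normalized result and the new optional sum) has the
# same shape as the first input.
def infer_skip_layer_normalization(node, get_shape, set_shape):
    input_shape = get_shape(node.input[0])
    # Output 0 (normalized result) matches the input shape.
    set_shape(node.output[0], input_shape)
    # The optional sum output added in #13988 also matches the input
    # shape; here it is assumed to be the fourth output.
    if len(node.output) > 3 and node.output[3]:
        set_shape(node.output[3], input_shape)
```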