
Cherry-pick 7 changes into 1.14.1 #14762

Merged
merged 8 commits on Feb 23, 2023

Conversation

@PatriceVignola (Contributor) commented Feb 22, 2023

This cherry-picks the following 7 changes into 1.14.1:

1b7f654
b539c36
12d9117
ff3aed8
3d79b1f
c0d2472
e9ec4c0

smk2007 and others added 7 commits February 21, 2023 20:48
Enable Opset11 Sequence Ops on DirectML, and make the CPU implementations agnostic to the backend EP (#14442)

Opset 11 introduced the following sequence related operators:
    - SequenceAt
    - SequenceConstruct
    - SequenceEmpty
    - SequenceLength
    - SequenceErase
    - SequenceInsert
    - ConcatFromSequence

With the exception of ConcatFromSequence, all of the above operators
were implemented with CPU kernels that a) required all of the contained
tensors to also be on CPU, and b) cloned each tensor into a new
sequence as a side effect of each operator. The implementation of
sequences is backend agnostic, since sequences don't affect the actual
tensor layout or manipulate the contents of the tensors. In addition,
with the exception of SequenceAt, these operators need not make copies
of the underlying referenced tensors.

Consequently, this change does the following:
1) Sequence* operators (except SequenceAt) no longer copy the contents
of a sequence of tensors on every kernel execution.
2) SequenceAt uses the DataTransferManager to copy tensors in a
backend-agnostic way.
3) The internal container implemented by TensorSeq has changed from
onnxruntime::Tensor to OrtValue. onnxruntime::Tensor does not support
copy or assignment construction, so it must have a single owner; if the
same tensor participated in multiple containers, it would have multiple
container "owners", which is not possible.
4) Other code that accessed values from TensorSeq has been updated to
extract Tensors from OrtValues.

In addition, DirectML execution was very slow when the above Sequence
operators were added to a graph, because MemcpyToHost and MemcpyFromHost
kernels were inserted between the graph and the sequence operators. To
optimize DirectML:
1) The CPU implementations of the Sequence* ops were registered as DML
implementations. Since the changes above also make the CPU kernel
implementations EP agnostic, the CPU kernels can be added as-is.
2) The ConcatFromSequence operator needed to be implemented on DirectML.
However, there was little DirectML EP operator framework support for
operators that accept or output sequences of tensors. This change
modifies the internal COM interfaces to add new APIs to interrogate
for sequence shapes and extract the needed tensors from TensorSeq.
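
The copy-avoidance described above can be modeled with a toy sketch (illustrative only; the real implementation holds OrtValue handles in C++, not Python lists, and `sequence_insert`/`sequence_at` here are hypothetical helpers, not ONNX Runtime APIs): insertion builds a new sequence that aliases the same tensors, while SequenceAt remains the one op that copies.

```python
import numpy as np

def sequence_insert(seq, tensor, pos=None):
    # Build a new sequence referencing the same tensors: no data copy,
    # mirroring how shared OrtValue handles let one tensor live in
    # multiple containers.
    new_seq = list(seq)
    new_seq.insert(len(new_seq) if pos is None else pos, tensor)
    return new_seq

def sequence_at(seq, index):
    # SequenceAt still returns an independent copy of the element.
    return np.copy(seq[index])

t = np.arange(4, dtype=np.float32)
seq = sequence_insert([], t)

# Insert does not clone: the sequence element aliases the original tensor.
assert seq[0] is t

# SequenceAt returns an independent copy with equal contents.
out = sequence_at(seq, 0)
assert out is not t and np.array_equal(out, t)
```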

---------

Co-authored-by: Patrice Vignola <[email protected]>
### Description
1. Fix a bug in the relative position bias kernel when seq_len > 32.
2. Rename extra_add_qk to relative_position_bias.
3. Support relative_position_bias in multihead attention (B, N, S, S*).
4. Add gru_gate support (by Lei).



---------

Co-authored-by: Ubuntu <[email protected]>
Co-authored-by: Lei Zhang <[email protected]>
Signed-off-by: Cliff Woolley <[email protected]>

### Description
Include a file to fix the build with CUDA 12.



### Motivation and Context
It should allow users to compile against CUDA 12

Signed-off-by: Cliff Woolley <[email protected]>
Co-authored-by: Cliff Woolley <[email protected]>
### Description
Create a new stream for data copy in the IOBinding input scenario.



### Motivation and Context
Previously in bindInput(), a nullptr Stream was passed when copying data
across devices. This caused the default stream to be used, which hurt
performance.
This PR fixes #14484.

---------

Co-authored-by: Lei Cao <[email protected]>
There is an accuracy regression in the GPT-2 model: the top-1 match rate
(vs. the PyTorch model) drops about 1%. The cause is that the fused
causal attention uses FP16 accumulation. Disable it by default and add
an environment variable, ORT_ENABLE_FUSED_CAUSAL_ATTENTION=1, to turn it
on manually.

This change also updates the GPT-2 parity test script to generate
left-side padding to reflect actual usage.

To test:
```
python -m onnxruntime.transformers.models.gpt2.convert_to_onnx -m gpt2 --output gpt2.onnx -o -p fp16 --use_gpu
```
The top-1 match rate in the output is on par with ORT 1.13.1.
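To opt back into the fused kernel, the environment variable named in the description must be set before the inference session is created, for example:

```python
import os

# Fused causal attention is now disabled by default; opt back in
# explicitly. The variable name comes from the change description; set
# it before creating the InferenceSession so the provider sees it.
os.environ["ORT_ENABLE_FUSED_CAUSAL_ATTENTION"] = "1"
```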
The current Sigmoid CUDA kernel uses the target data type for all
computation. For some small negative numbers, FP16 loses precision.
For example, for input [-7.8477, 7.3320, -7.8008, 6.6016], the expected
output is [3.9047e-04, 9.9935e-01, 4.0919e-04, 9.9864e-01], but the
current kernel generates [0.0000, 0.9990, 0.0000, 0.9990]. If a
sub-graph contains Sigmoid, such as BinaryCrossEntropyWithLogits, it is
likely to produce NaN as the computed result.

This PR fixes the issue by using FP32 for the kernel's internal
computation. Note that the fix does not introduce a performance
regression: CUDA's _Exp already performs a float-to-half cast, so no
extra cast is added. We move the casts to the very beginning and end of
the whole kernel so that the other parts of the computation also run in
FP32 (instead of only Exp).
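
The cast-at-the-boundaries pattern can be sketched in NumPy (a minimal illustration of the approach, not the actual CUDA kernel): widen the half-precision input to FP32 once, do all of the sigmoid math in FP32, and narrow back to FP16 only at the end.

```python
import numpy as np

def sigmoid_fp32_internal(x_half):
    x = x_half.astype(np.float32)   # cast once at kernel entry
    y = 1.0 / (1.0 + np.exp(-x))    # all internal math in FP32
    return y.astype(np.float16)     # cast once at kernel exit

x = np.array([-7.8477, 7.3320, -7.8008, 6.6016], dtype=np.float16)
y = sigmoid_fp32_internal(x)

# Matches the expected values from the description to half precision.
expected = np.array([3.9047e-04, 9.9935e-01, 4.0919e-04, 9.9864e-01])
assert np.allclose(y.astype(np.float32), expected, rtol=1e-2)
```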
@PatriceVignola PatriceVignola requested a review from a team as a code owner February 22, 2023 19:48
@PatriceVignola PatriceVignola changed the title Enable Opset11 Sequence Ops on DirectML, and make the CPU implementations agnostic to backend EP Cherry-pick 7 changes into 1.14.1 Feb 22, 2023
wangyems
wangyems previously approved these changes Feb 22, 2023
tianleiwu
tianleiwu previously approved these changes Feb 22, 2023
centwang
centwang previously approved these changes Feb 23, 2023
@PatriceVignola PatriceVignola dismissed stale reviews from centwang, tianleiwu, and wangyems via 419e514 February 23, 2023 07:30
@PatriceVignola PatriceVignola requested a review from a team as a code owner February 23, 2023 07:30
@mszhanyi (Contributor) commented:

Could we cherry-pick #b3b9be1 directly? @jchen351

@PatriceVignola PatriceVignola merged commit 3e4a471 into rel-1.14.1 Feb 23, 2023
@PatriceVignola PatriceVignola deleted the user/pavignol/enable-sequence-ops branch February 23, 2023 19:00
preetha-intel pushed a commit to intel/onnxruntime that referenced this pull request Jun 6, 2023
This cherry-picks the following 7 changes into 1.14.1:

microsoft@1b7f654
microsoft@b539c36
microsoft@12d9117
microsoft@ff3aed8
microsoft@3d79b1f
microsoft@c0d2472
microsoft@e9ec4c0
---------

Signed-off-by: Cliff Woolley <[email protected]>
Co-authored-by: Sheil Kumar <[email protected]>
Co-authored-by: Ye Wang <[email protected]>
Co-authored-by: Ubuntu <[email protected]>
Co-authored-by: Lei Zhang <[email protected]>
Co-authored-by: Misha Chornyi <[email protected]>
Co-authored-by: Cliff Woolley <[email protected]>
Co-authored-by: cao lei <[email protected]>
Co-authored-by: Lei Cao <[email protected]>
Co-authored-by: Tianlei Wu <[email protected]>
Co-authored-by: Vincent Wang <[email protected]>
Co-authored-by: Jian Chen <[email protected]>
10 participants