
ORT 1.14.0 release -- cherry pick round2 #14573

Merged
rui-ren merged 13 commits into rel-1.14.0 from ruiren/cherry-pick-round2 on Feb 7, 2023
Conversation

rui-ren
Contributor

@rui-ren rui-ren commented Feb 3, 2023

Description

Second round of cherry-picks, 13 PRs in total, listed below. Please check here for the full list.

| # | PR title | PR # | Commit |
|---|----------|------|--------|
| 1 | Fix unused variable for CUDA EP builds with USE_FLASH_ATTENTION off | 14404 | 85d7e9c |
| 2 | UNet fusion and fp16 conversion for stable diffusion | 14248 | a95fcb4 |
| 3 | upgrade protobuf to 3.20.2 and onnx to 1.13 | 14279 | 80f807c |
| 4 | Include python training apis when enable_training is enabled | 14485 | d06ad94 |
| 5 | Including support for Deepspeed 0.8.0 | 14506 | 6fa4555 |
| 6 | change deepspeed version in warning from 0.7.3 to 0.8.0 | 14527 | 3d388a1 |
| 7 | Do not fuse DQ+Node+Q if DQ produces graph output | 14509 | d9e675a |
| 8 | upgrade EsrpCodeSigning from v1 to v2 | 14531 | 0578eef |
| 9 | Fix python packaging pipeline | 14533 | 7954976 |
| 10 | Specify deps in deps.txt and manifest | 14530 | 01cafe8 |
| 11 | Fix Gather to Split optimizer | 14478 | 0bcca7a |
| 12 | Stable Diffusion CUDA Optimizations | 14428 | a6c5ba0 |
| 13 | Fix sharing scalar bug | 14544 | c6c1103 |

Motivation and Context

Second round cherry-pick for ORT 1.14.0 release.

adrianlizarraga and others added 13 commits February 3, 2023 13:50
…14404)

### Description
Fixes unused `use_memory_efficient_attention` variable in
contrib_ops/cuda/bert/attention_impl.cu.



### Motivation and Context
ORT with CUDA version < 11.6 fails to build for release configurations
due to an unused variable.

```shell
c:\...\onnxruntime\onnxruntime\contrib_ops\cuda\bert\attention_impl.cu(420): error : variable "use_memory_efficient_attention" was declared but never referenced [C:\...\onnxruntime\build\Windows\RelWithDebInfo\onnxruntime_providers_cuda.vcxproj]
            detected during instantiation of "onnxruntime::common::Status onnxruntime::contrib::cuda::QkvToContext(const cudaDeviceProp &, cublasHandle_t &, cudaStream_t, onnxruntime::contrib::AttentionParameters &, onnxruntime::contrib::cuda::AttentionData<T> &) [with T=float]"
  (923): here
```

This happens for CUDA < 11.6. Our cmake script turns off
onnxruntime_USE_FLASH_ATTENTION for CUDA < 11.6, which leaves the
aforementioned variable unused outside of asserts (which are removed in
release builds).

The USE_FLASH_ATTENTION option was added by
#14343
Add a script to fuse nodes into optimized operators for Stable Diffusion 1.5 models, and a script to convert fp32 models to fp16. Tested with Stable Diffusion 1.5.

Note that the optimized model needs onnxruntime-gpu v1.14 (a release candidate will be available soon).

Note: we will update the script to work with the latest diffusers and Stable Diffusion v2 and v2.1 models.
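A minimal sketch of how such a fusion-plus-fp16 pass can be driven from Python via `onnxruntime.transformers.optimizer`; the model path, `num_heads`, and `hidden_size` values are placeholders, and the `unet` model type is assumed to be the one registered by this PR:

```python
# Hedged sketch: fuse and convert a Stable Diffusion 1.5 UNet to fp16.
# Paths and num_heads/hidden_size values are illustrative placeholders;
# the "unet" model type is assumed to be available in onnxruntime 1.14.
from onnxruntime.transformers.optimizer import optimize_model

# Fuse attention/normalization patterns into optimized contrib operators.
opt_model = optimize_model(
    "unet/model.onnx",   # fp32 UNet exported from diffusers (placeholder path)
    model_type="unet",   # assumption: stable diffusion model type added by this PR
    num_heads=8,         # placeholders; use the values of the exported model
    hidden_size=768,
    opt_level=0,         # rely on the Python fusion pass only
)

# Convert weights to fp16, keeping graph inputs/outputs as fp32.
opt_model.convert_float_to_float16(keep_io_types=True)
opt_model.save_model_to_file("unet/model_fp16.onnx")
```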
### Description
Upgrade protobuf to 3.20.2, the same version used by onnx 1.13.0.

### Motivation and Context
Required by component governance; fixes #14060.

The unused-parameter error occurs in two cases:

1. When compiling protobuf:

`onnxruntime_src/cmake/external/protobuf/src/google/protobuf/repeated_ptr_field.h:752:66: error: unused parameter ‘prototype’ [-Werror=unused-parameter]`

2. When including onnx_pb.h:
```
2023-01-28T10:20:15.0410853Z FAILED: CMakeFiles/onnxruntime_pybind11_state.dir/onnxruntime_src/onnxruntime/python/onnxruntime_pybind_iobinding.cc.o 
......
2023-01-28T10:20:15.0466024Z                  from /build/Debug/_deps/onnx-src/onnx/onnx_pb.h:51,
2023-01-28T10:20:15.0466958Z                  from /onnxruntime_src/include/onnxruntime/core/framework/to_tensor_proto_element_type.h:10,
....
2023-01-28T10:20:15.0609678Z /build/Debug/_deps/onnx-build/onnx/onnx-operators-ml.pb.h:1178:25:   required from here
2023-01-28T10:20:15.0610895Z /onnxruntime_src/cmake/external/protobuf/src/google/protobuf/repeated_ptr_field.h:752:66: error: unused parameter ‘prototype’ [-Werror=unused-parameter]
2023-01-28T10:20:15.0611707Z cc1plus: all warnings being treated as errors

```

https://dev.azure.com/onnxruntime/2a773b67-e88b-4c7f-9fc0-87d31fea8ef2/_apis/build/builds/874605/logs/22
### Description
Include support for DeepSpeed 0.8.0.



### Motivation and Context
DeepSpeed 0.8.0 includes a bug fix and MLflow integration.
### Description
change deepspeed version in warning from 0.7.3 to 0.8.0



### Motivation and Context
DeepSpeed support in ORT was updated from 0.7.3 to 0.8.0, but the version in the warning message was not; this PR fixes that.
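For context, the warning in question compares the installed DeepSpeed version against the highest version validated with ORT; the snippet below is only an illustrative sketch of that pattern, not the actual ORT code:

```python
# Illustrative sketch of a version-gated warning (not the actual ORT code).
import warnings

from packaging.version import Version
import deepspeed

MAX_TESTED_DEEPSPEED_VERSION = "0.8.0"  # bumped from 0.7.3 by this PR

if Version(deepspeed.__version__) > Version(MAX_TESTED_DEEPSPEED_VERSION):
    warnings.warn(
        f"DeepSpeed {deepspeed.__version__} has not been validated with ORT training; "
        f"the latest tested version is {MAX_TESTED_DEEPSPEED_VERSION}."
    )
```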
### Description
This change upgrades EsrpCodeSigning from v1 to v2 in our build pipeline.
Fix onnx and protobuf inconsistencies in the Python packaging pipeline.
Specify new deps in deps.txt and update cgmanifest.json.

---------

Co-authored-by: Randy Shuai <[email protected]>
### Description
The Gather to Split optimizer fails if opset == 18. This PR fixes one bug and extends the unit tests.



### Motivation and Context
The model produced by the optimizer does not follow the ONNX specification for opset 18.
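For reference, the opset-18 wrinkle is that Split now requires either the `split` input or the new `num_outputs` attribute; below is a minimal sketch of a spec-conformant opset-18 Split node (tensor names and shapes are illustrative):

```python
# Sketch: building a Split node that is valid under opset 18, where either the
# `split` input or the `num_outputs` attribute must be given explicitly.
# Tensor names and shapes are illustrative only.
import onnx
from onnx import TensorProto, helper

split_node = helper.make_node(
    "Split",
    inputs=["gathered"],
    outputs=["part0", "part1"],
    axis=0,
    num_outputs=2,  # required in opset 18 when no `split` input is supplied
)

graph = helper.make_graph(
    [split_node],
    "split_opset18_example",
    [helper.make_tensor_value_info("gathered", TensorProto.FLOAT, [4, 3])],
    [
        helper.make_tensor_value_info("part0", TensorProto.FLOAT, [2, 3]),
        helper.make_tensor_value_info("part1", TensorProto.FLOAT, [2, 3]),
    ],
)

model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 18)])
onnx.checker.check_model(model)
```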
### Description

Add stable diffusion CUDA kernel optimizations.

The following are included:
(1) GroupNorm operator. This kernel is from TensorRT 8.5.
(2) BiasSplitGelu operator. This kernel is modified from SplitGelu of TensorRT 8.5; we added bias to the SplitGelu.
(3) NhwcConv operator. This adds support for the NHWC format (the ONNX Conv operator uses NCHW).
(4) Update MultiHeadAttention (packed kv and no bias) for cross attention. This avoids transposing kv for the TRT fused cross-attention kernel.
(5) Optimization and benchmark script.

Not included:
(1) Script to convert Conv to NhwcConv in onnx graph.
(2) Update symbolic shape inference for NhwcConv.
(3) Add SeqLen2Spatial operator
(4) Documents

Limitations: the GroupNorm, BiasSplitGelu and NhwcConv kernels are implemented based on stable diffusion usage. They may not be applicable to arbitrary input sizes or dimensions. For example, BiasSplitGelu requires the hidden size to be 2560, 5120, or 10240, and NhwcConv assumes 4D input/weight.

There is a minor increase in binary size. For SM=75 only, the Python package wheel size grows by (33757K - 33640K) = 117 KB. It is possible to move NHWC from a template parameter to the constructor to reduce binary size (at a slight cost in performance).

Note: for RTX 4090/4080/4070 Ti, build with CUDA 11.8 and the latest cuDNN to get the best performance.
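As a rough usage sketch (the model path is a placeholder and an optimized fp16 UNet from the earlier fusion step is assumed), the optimized model can be loaded with the CUDA execution provider to confirm it is actually running on GPU:

```python
# Rough usage sketch: load an optimized fp16 UNet on the CUDA execution provider.
# "unet/model_fp16.onnx" is a placeholder path for a model produced by the fusion script.
import onnxruntime as ort

sess = ort.InferenceSession("unet/model_fp16.onnx", providers=["CUDAExecutionProvider"])

# Inspect the model inputs before binding real tensors.
for model_input in sess.get_inputs():
    print(model_input.name, model_input.shape, model_input.type)

# Confirm the CUDA EP was actually selected (ORT falls back to CPU otherwise).
print(sess.get_providers())
```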
If an initializer is used as a graph output, we should keep its name instead of renaming it, as the constant-sharing transformer currently does.

Fixes #14488.
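As an illustration only (not the actual C++ transformer code), the guard amounts to skipping any initializer whose name also appears among the graph outputs before sharing and renaming it; a minimal sketch in Python with onnx:

```python
# Illustrative sketch (not the actual ORT C++ transformer): an initializer that is
# also a graph output must keep its original name when constants are shared.
import onnx


def initializers_safe_to_rename(graph: onnx.GraphProto) -> list:
    """Return names of initializers that are NOT graph outputs and may be renamed."""
    output_names = {output.name for output in graph.output}
    return [init.name for init in graph.initializer if init.name not in output_names]
```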
@rui-ren rui-ren requested review from a team as code owners February 3, 2023 22:07
@rui-ren rui-ren requested review from hanbitmyths and fdwr February 3, 2023 22:08
@rui-ren rui-ren merged commit 5ae597d into rel-1.14.0 Feb 7, 2023
@rui-ren rui-ren deleted the ruiren/cherry-pick-round2 branch February 7, 2023 01:00