
ORT 1.14.0 release -- cherry pick round2 #14573

Merged
rui-ren merged 13 commits into rel-1.14.0 from ruiren/cherry-pick-round2 on Feb 7, 2023
Conversation

rui-ren
Contributor

@rui-ren rui-ren commented Feb 3, 2023

Description

Second round of cherry-picks, 13 PRs in total, listed below. Please check here for the full list.

| # | PR title | PR # | Commit |
|---|----------|------|--------|
| 1 | Fix unused variable for CUDA EP builds with USE_FLASH_ATTENTION off | 14404 | 85d7e9c |
| 2 | UNet fusion and fp16 conversion for stable diffusion | 14248 | a95fcb4 |
| 3 | upgrade protobuf to 3.20.2 and onnx to 1.13 | 14279 | 80f807c |
| 4 | Include python training apis when enable_training is enabled | 14485 | d06ad94 |
| 5 | Including support for Deepspeed 0.8.0 | 14506 | 6fa4555 |
| 6 | change deepspeed version in warning from 0.7.3 to 0.8.0 | 14527 | 3d388a1 |
| 7 | Do not fuse DQ+Node+Q if DQ produces graph output | 14509 | d9e675a |
| 8 | upgrade EsrpCodeSigning from v1 to v2 | 14531 | 0578eef |
| 9 | Fix python packaging pipeline | 14533 | 7954976 |
| 10 | Specify deps in deps.txt and manifest | 14530 | 01cafe8 |
| 11 | Fix Gather to Split optimizer | 14478 | 0bcca7a |
| 12 | Stable Diffusion CUDA Optimizations | 14428 | a6c5ba0 |
| 13 | Fix sharing scalar bug | 14544 | c6c1103 |

Motivation and Context

Second round cherry-pick for ORT 1.14.0 release.

adrianlizarraga and others added 13 commits February 3, 2023 13:50
…14404)

### Description
Fixes unused `use_memory_efficient_attention` variable in
contrib_ops/cuda/bert/attention_impl.cu.



### Motivation and Context
ORT with CUDA version < 11.6 fails to build for release configurations
due to an unused variable.

```shell
c:\...\onnxruntime\onnxruntime\contrib_ops\cuda\bert\attention_impl.cu(420): error : variable "use_memory_efficient_attention" was declared but never referenced [C:\...\onnxruntime\build\Windows\RelWithDebInfo\onnxruntime_providers_cuda.vcxproj]
            detected during instantiation of "onnxruntime::common::Status onnxruntime::contrib::cuda::QkvToContext(const cudaDeviceProp &, cublasHandle_t &, cudaStream_t, onnxruntime::contrib::AttentionParameters &, onnxruntime::contrib::cuda::AttentionData<T> &) [with T=float]"
  (923): here
```

This happens for CUDA < 11.6. Our cmake script turns off
onnxruntime_USE_FLASH_ATTENTION for CUDA < 11.6, which leaves the
aforementioned variable unused outside of asserts (which are removed in
release builds).

The USE_FLASH_ATTENTION option was added by
#14343
Add a script to fuse nodes into optimized operators for Stable Diffusion 1.5 models, and a script to convert fp32 models to fp16. Tested with Stable Diffusion 1.5.

Note that the optimized model needs onnxruntime-gpu v1.14 (a release candidate will be available soon).

Note: we will update the script to work with the latest diffusers and Stable Diffusion v2 and v2.1 models.
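A minimal sketch of how such a fusion-plus-fp16 pass can be driven from Python via `onnxruntime.transformers.optimizer`; the model path, `num_heads`, and `hidden_size` values are placeholders, and the `unet` model type is assumed to be the one registered by this PR:

```python
# Hedged sketch: fuse and convert a Stable Diffusion 1.5 UNet to fp16.
# Paths and num_heads/hidden_size values are illustrative placeholders;
# the "unet" model type is assumed to be available in onnxruntime 1.14.
from onnxruntime.transformers.optimizer import optimize_model

# Fuse attention/normalization patterns into optimized contrib operators.
opt_model = optimize_model(
    "unet/model.onnx",   # fp32 UNet exported from diffusers (placeholder path)
    model_type="unet",   # assumption: stable diffusion model type added by this PR
    num_heads=8,         # placeholders; use the values of the exported model
    hidden_size=768,
    opt_level=0,         # rely on the Python fusion pass only
)

# Convert weights to fp16, keeping graph inputs/outputs as fp32.
opt_model.convert_float_to_float16(keep_io_types=True)
opt_model.save_model_to_file("unet/model_fp16.onnx")
```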
### Description
Upgrade protobuf to 3.20.2, the same version used by onnx 1.13.0.

### Motivation and Context
Required by component governance; fixes #14060.

The unused-parameter error occurs in two cases:

1. When compiling protobuf:

`onnxruntime_src/cmake/external/protobuf/src/google/protobuf/repeated_ptr_field.h:752:66: error: unused parameter ‘prototype’ [-Werror=unused-parameter]`

2. When including onnx_pb.h:
```
2023-01-28T10:20:15.0410853Z FAILED: CMakeFiles/onnxruntime_pybind11_state.dir/onnxruntime_src/onnxruntime/python/onnxruntime_pybind_iobinding.cc.o 
......
2023-01-28T10:20:15.0466024Z                  from /build/Debug/_deps/onnx-src/onnx/onnx_pb.h:51,
2023-01-28T10:20:15.0466958Z                  from /onnxruntime_src/include/onnxruntime/core/framework/to_tensor_proto_element_type.h:10,
....
2023-01-28T10:20:15.0609678Z /build/Debug/_deps/onnx-build/onnx/onnx-operators-ml.pb.h:1178:25:   required from here
2023-01-28T10:20:15.0610895Z /onnxruntime_src/cmake/external/protobuf/src/google/protobuf/repeated_ptr_field.h:752:66: error: unused parameter ‘prototype’ [-Werror=unused-parameter]
2023-01-28T10:20:15.0611707Z cc1plus: all warnings being treated as errors

```

https://dev.azure.com/onnxruntime/2a773b67-e88b-4c7f-9fc0-87d31fea8ef2/_apis/build/builds/874605/logs/22
### Description
Include support for DeepSpeed 0.8.0.



### Motivation and Context
DeepSpeed 0.8.0 includes a bug fix and MLflow integration.
### Description
change deepspeed version in warning from 0.7.3 to 0.8.0



### Motivation and Context
DeepSpeed support in ORT was updated from 0.7.3 to 0.8.0, but the version in the warning message was not; this PR fixes that.
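For context, the warning in question compares the installed DeepSpeed version against the highest version validated with ORT; the snippet below is only an illustrative sketch of that pattern, not the actual ORT code:

```python
# Illustrative sketch of a version-gated warning (not the actual ORT code).
import warnings

from packaging.version import Version
import deepspeed

MAX_TESTED_DEEPSPEED_VERSION = "0.8.0"  # bumped from 0.7.3 by this PR

if Version(deepspeed.__version__) > Version(MAX_TESTED_DEEPSPEED_VERSION):
    warnings.warn(
        f"DeepSpeed {deepspeed.__version__} has not been validated with ORT training; "
        f"the latest tested version is {MAX_TESTED_DEEPSPEED_VERSION}."
    )
```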
### Description
This change upgrades EsrpCodeSigning from v1 to v2 in our build pipeline.
Fix onnx and protobuf inconsistencies in the Python packaging pipeline.
Specify new deps in deps.txt and update cgmanifest.json.

---------

Co-authored-by: Randy Shuai <[email protected]>
### Description
The Gather to Split optimizer fails if opset == 18. This PR fixes one bug and extends the unit tests.



### Motivation and Context
The model produced by the optimizer does not follow the ONNX specification for opset 18.
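For reference, the opset-18 wrinkle is that Split now requires either the `split` input or the new `num_outputs` attribute; below is a minimal sketch of a spec-conformant opset-18 Split node (tensor names and shapes are illustrative):

```python
# Sketch: building a Split node that is valid under opset 18, where either the
# `split` input or the `num_outputs` attribute must be given explicitly.
# Tensor names and shapes are illustrative only.
import onnx
from onnx import TensorProto, helper

split_node = helper.make_node(
    "Split",
    inputs=["gathered"],
    outputs=["part0", "part1"],
    axis=0,
    num_outputs=2,  # required in opset 18 when no `split` input is supplied
)

graph = helper.make_graph(
    [split_node],
    "split_opset18_example",
    [helper.make_tensor_value_info("gathered", TensorProto.FLOAT, [4, 3])],
    [
        helper.make_tensor_value_info("part0", TensorProto.FLOAT, [2, 3]),
        helper.make_tensor_value_info("part1", TensorProto.FLOAT, [2, 3]),
    ],
)

model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 18)])
onnx.checker.check_model(model)
```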
### Description

Add stable diffusion CUDA kernel optimizations.

The following are included:
(1) GroupNorm operator. This kernel is from TensorRT 8.5.
(2) BiasSplitGelu operator. This kernel is modified from SplitGelu of TensorRT 8.5; we added bias to the SplitGelu.
(3) NhwcConv operator. This adds support for the NHWC format (the ONNX Conv operator uses NCHW).
(4) Update MultiHeadAttention (packed kv and no bias) for cross attention. This avoids transposing kv for the TRT fused cross-attention kernel.
(5) Optimization and benchmark script.

Not included:
(1) Script to convert Conv to NhwcConv in onnx graph.
(2) Update symbolic shape inference for NhwcConv.
(3) Add SeqLen2Spatial operator
(4) Documents

Limitations: the GroupNorm, BiasSplitGelu and NhwcConv kernels are implemented based on stable diffusion usage. They may not be applicable to arbitrary input sizes or dimensions. For example, BiasSplitGelu requires the hidden size to be 2560, 5120, or 10240, and NhwcConv assumes 4D input/weight.

There is a minor increase in binary size. For SM=75 only, the Python package wheel size grows by (33757K - 33640K) = 117 KB. It is possible to move NHWC from a template parameter to the constructor to reduce binary size (at a slight cost in performance).

Note: for RTX 4090/4080/4070 Ti, build with CUDA 11.8 and the latest cuDNN to get the best performance.
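As a rough usage sketch (the model path is a placeholder and an optimized fp16 UNet from the earlier fusion step is assumed), the optimized model can be loaded with the CUDA execution provider to confirm it is actually running on GPU:

```python
# Rough usage sketch: load an optimized fp16 UNet on the CUDA execution provider.
# "unet/model_fp16.onnx" is a placeholder path for a model produced by the fusion script.
import onnxruntime as ort

sess = ort.InferenceSession("unet/model_fp16.onnx", providers=["CUDAExecutionProvider"])

# Inspect the model inputs before binding real tensors.
for model_input in sess.get_inputs():
    print(model_input.name, model_input.shape, model_input.type)

# Confirm the CUDA EP was actually selected (ORT falls back to CPU otherwise).
print(sess.get_providers())
```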
If an initializer is used as a graph output, we should keep its name instead of renaming it, as the constant-sharing transformer currently does.

Fixes #14488.
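As an illustration only (not the actual C++ transformer code), the guard amounts to skipping any initializer whose name also appears among the graph outputs before sharing and renaming it; a minimal sketch in Python with onnx:

```python
# Illustrative sketch (not the actual ORT C++ transformer): an initializer that is
# also a graph output must keep its original name when constants are shared.
import onnx


def initializers_safe_to_rename(graph: onnx.GraphProto) -> list:
    """Return names of initializers that are NOT graph outputs and may be renamed."""
    output_names = {output.name for output in graph.output}
    return [init.name for init in graph.initializer if init.name not in output_names]
```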
@rui-ren rui-ren requested review from a team as code owners February 3, 2023 22:07
@rui-ren rui-ren requested review from hanbitmyths and fdwr February 3, 2023 22:08
@rui-ren rui-ren merged commit 5ae597d into rel-1.14.0 Feb 7, 2023
@rui-ren rui-ren deleted the ruiren/cherry-pick-round2 branch February 7, 2023 01:00