Add rust docs #54

Merged: 68 commits into main, Feb 16, 2023
Conversation

@natke (Owner) commented Feb 16, 2023

No description provided.

wejoncy and others added 30 commits February 3, 2023 11:01
…ns flags (microsoft#14524)

### Description
If we build ORT with the 'disable_exceptions' flag,
`onnxruntime/contrib_ops/cpu/quantization/qlinear_global_average_pool.cc.o`
doesn't generate the following symbols, which are used by qlinear_pool.c:
```
0000000000000000 W _ZN11onnxruntime7contrib27ComputeQLinearGlobalAvgPoolIaEENS_6common6StatusEPKT_fS4_PS4_fS4_lllbPNS_11concurrency10ThreadPoolE
0000000000000000 W _ZN11onnxruntime7contrib27ComputeQLinearGlobalAvgPoolIhEENS_6common6StatusEPKT_fS4_PS4_fS4_lllbPNS_11concurrency10ThreadPoolE
```
so we get an error of undefined symbols for
`ComputeQLinearGlobalAvgPool<uint8_t>` and
`ComputeQLinearGlobalAvgPool<int8_t>`.
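For illustration only, here is a self-contained sketch (simplified signature, not the actual ORT code) of how a function template defined in one translation unit emits linkable symbols only for the instantiations compiled in that unit; compiling the instantiations out reproduces exactly this class of undefined-symbol link error:

```
// explicit_instantiation_sketch.cc -- simplified stand-in for the template.
#include <cstdint>
#include <iostream>

template <typename T8Bits>
int ComputeQLinearGlobalAvgPool(const T8Bits* x, T8Bits* y, int64_t n) {
  int64_t sum = 0;
  for (int64_t i = 0; i < n; ++i) sum += x[i];
  y[0] = static_cast<T8Bits>(sum / n);
  return 0;
}

// In a multi-TU build, these explicit instantiations are what emit the weak
// (W) symbols seen in the nm output above; if a build flag compiles them
// out, every other object file that calls the template fails to link with
// "undefined symbol" errors like the ones this commit fixes.
template int ComputeQLinearGlobalAvgPool<int8_t>(const int8_t*, int8_t*, int64_t);
template int ComputeQLinearGlobalAvgPool<uint8_t>(const uint8_t*, uint8_t*, int64_t);

int main() {
  const uint8_t x[4] = {1, 2, 3, 4};
  uint8_t y = 0;
  ComputeQLinearGlobalAvgPool(x, &y, 4);
  std::cout << static_cast<int>(y) << "\n";  // prints 2
}
```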


…rosoft#14534)

### Description
Fix REMOVE_ITEM, which was not working.

`onnxruntime/contrib_ops/rocm/aten_ops/aten_op.cc` is hipified from
`onnxruntime/contrib_ops/cuda/aten_ops/aten_op.cc`.
The file's correct path is
`${CMAKE_CURRENT_BINARY_DIR}/amdgpu/onnxruntime/contrib_ops/rocm/aten_ops/aten_op.cc`,
and it exists in the hipified source file list
`onnxruntime_rocm_generated_contrib_ops_cc_srcs`.

A better way to fix it: if we don't want to build a file, add it to the
hipify excluded files so it will not be hipified.
### Description

Add stable diffusion CUDA kernel optimizations.

The following are included:
(1) GroupNorm operator. This kernel is from TensorRT 8.5.
(2) BiasSplitGelu operator. This kernel is modified from the SplitGelu of
TensorRT 8.5; we added bias to SplitGelu.
(3) NhwcConv operator. This adds support for the NHWC format (the ONNX Conv
operator uses the NCHW format).
(4) Update MultiHeadAttention (packed KV and no bias) for cross
attention. This avoids transposing KV for the TRT fused cross
attention kernel.
(5) Optimization and benchmark script.

Not included:
(1) Script to convert Conv to NhwcConv in the onnx graph.
(2) Update of symbolic shape inference for NhwcConv.
(3) SeqLen2Spatial operator.
(4) Documentation.

Limitations: the GroupNorm, BiasSplitGelu and NhwcConv kernels are
implemented based on stable diffusion usage. They might not be
applicable to arbitrary input sizes or dimensions. For example, BiasSplitGelu
requires the hidden size to be 2560, 5120, or 10240, and NhwcConv assumes 4D
input/weight.

There is a minor increase in binary size. For SM=75 only, the Python
package wheel size grows by (33757 KB - 33640 KB) = 117 KB. It is possible to
move NHWC from a template parameter to the constructor to reduce binary size
(at a slight performance cost).

Note: for RTX 4090/4080/4070 Ti, build with CUDA 11.8 and the latest
cuDNN to get the best performance.
If an initializer is used as a graph output, we should keep its name
instead of renaming it, as the constant sharing transformer currently does.

To fix microsoft#14488
### Only link MPI when either use_mpi or use_nccl is enabled

To fix the issue microsoft#14278. 

Talked with @askhade; we think that if users want to enable NCCL/MPI but MPI
is not found, it should be a failure instead of a warning.
So this PR made that change. As a result, to make CIs pass, we would need to
disable NCCL/MPI explicitly in the build command. This PR takes an
alternative approach: since NCCL and MPI are not used by customers, NCCL is
disabled by default if "--disable_nccl" is not specified, and MPI is disabled
by default if "--use_mpi" is not specified.


---------

Co-authored-by: Ubuntu <[email protected]>
Adds back in proper build support for the Navi gen cards (gfx1030) 

Co-authored-by: Ted Themistokleous <[email protected]>
### Description
Change the return type of the Softmax functions
(`dispatch_warpwise_softmax_forward` and
`dispatch_blockwise_softmax_forward`) from `void` to `Status`.

### Motivation and Context
The Softmax function calls TunableOp, which returns a Status. It's necessary
to pass the `Status` from the inner function to the outer function.
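As a minimal, self-contained sketch of the pattern (simplified stand-ins, not the real `onnxruntime::common::Status`), returning `Status` and propagating it with an early-return macro looks like this:

```
// status_propagation_sketch.cc -- why void return types swallow errors.
#include <iostream>
#include <string>

struct Status {
  bool ok = true;
  std::string msg;
  static Status OK() { return {}; }
  static Status Error(std::string m) { return {false, std::move(m)}; }
};

#define RETURN_IF_ERROR(expr)  \
  do {                         \
    Status _s = (expr);        \
    if (!_s.ok) return _s;     \
  } while (0)

// Inner layer: a TunableOp-style call that can fail.
Status TunableSoftmaxOp() { return Status::Error("no usable kernel candidate"); }

// Previously `void`: any failure inside the TunableOp was silently dropped.
// Returning Status lets the error flow out to the caller.
Status dispatch_blockwise_softmax_forward() {
  RETURN_IF_ERROR(TunableSoftmaxOp());
  return Status::OK();
}

int main() {
  Status s = dispatch_blockwise_softmax_forward();
  std::cout << (s.ok ? "ok" : s.msg) << "\n";
}
```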
### Description
Do not use a raw pointer for the CpuBuffersInfo::buffers object.



### Motivation and Context
This PR fixes bug 11159:
https://dev.azure.com/aiinfra/ONNX%20Runtime/_workitems/edit/11159/
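For illustration, a generic sketch (hypothetical struct layout, not the actual fix) of why an owning raw pointer in such a struct is fragile and how `std::unique_ptr` makes cleanup automatic on every path:

```
// cpu_buffers_sketch.cc -- owning raw arrays leak or double-free on early
// returns; unique_ptr releases exactly once, automatically.
#include <cstddef>
#include <memory>

struct CpuBuffersInfoSketch {
  // Before: void** buffers = new void*[n];  // who deletes this, and when?
  // After: ownership is explicit and early-return safe.
  std::unique_ptr<void*[]> buffers;
  size_t count = 0;
};

CpuBuffersInfoSketch MakeInfo(size_t n) {
  CpuBuffersInfoSketch info;
  info.buffers = std::make_unique<void*[]>(n);  // freed automatically
  info.count = n;
  return info;
}

int main() {
  auto info = MakeInfo(8);
  return info.buffers ? 0 : 1;
}  // no manual delete[] needed on any path
```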
…("####") should append for each input_def, not only on the last one
The else branch of this `if` should return `ignore_identity`:

https://github.com/microsoft/onnxruntime/blob/3d7518762ace6929be98e1203174c2dbf1ac094e/onnxruntime/core/optimizer/identical_children_consolidation.cc#L66

`identity.append("####")` should be called for each input_def, not only on
the last one.
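A minimal, self-contained reproduction of the bug shape (hypothetical helper names) showing why the separator must be appended per element:

```
// append_delimiter_sketch.cc -- without a per-element separator, two
// different input lists can produce the same identity string.
#include <iostream>
#include <string>
#include <vector>

std::string IdentityBuggy(const std::vector<std::string>& input_defs) {
  std::string identity;
  for (const auto& d : input_defs) identity += d;
  identity.append("####");  // only once, after the last element
  return identity;
}

std::string IdentityFixed(const std::vector<std::string>& input_defs) {
  std::string identity;
  for (const auto& d : input_defs) {
    identity += d;
    identity.append("####");  // per input_def: {"ab","c"} != {"a","bc"}
  }
  return identity;
}

int main() {
  std::cout << (IdentityBuggy({"ab", "c"}) == IdentityBuggy({"a", "bc"}))  // 1: collision
            << (IdentityFixed({"ab", "c"}) == IdentityFixed({"a", "bc"}))  // 0: distinct
            << "\n";
}
```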
Making a basic porting effort to run the Sampling UT on the ROCm EP, based on
these commits:

microsoft#13426
microsoft#14218

1. enabling the EmbedLayerNorm op
2. enabling the Sampling op
3. enabling helpers to copy data from CPU->GPU for subgraphs

This task is the first checkpoint. There could be other missing ops when
testing a real model.
We will migrate more code onto ROCm as needed.

Co-authored-by: Ubuntu <ettao@ettao-amd-dev1.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>
…icrosoft#14606)

### Description
To unblock the global pipeline failure faster, disable these real-model
tests from the onnx repo for now. Meanwhile, we are trying to move these
models to Azure.


### Motivation and Context
Per onnx/onnx#4857, these models in the onnx repo are broken. They were set
up 4 years ago, and the owner of the AWS instances cannot be found.
### Description
Update VS 2019 to VS 2022 in
onnxruntime-Nuget-WindowsAI-Pipeline-Official.


### Description
This is a follow-up of
microsoft#14428 for the Stable Diffusion
CUDA optimizations:
(1) use NhwcConv to replace Conv in the onnx graph and add Transpose nodes
accordingly;
(2) reduce sequential Transpose nodes to at most one;
(3) symbolic shape inference for NhwcConv;
(4) fix the add-bias transpose, which caused a CUDA error (launching more
than 1024 threads per block) when inferencing an fp32 model;
(5) add models (bert, bart, stable_diffusion subdirectories) to the package;
(6) remove the --disable_channels_last option.

Note that:
(1) We could add a few more graph transformations to reduce Transpose nodes
further. That is not done in this PR due to time limits.
(2) The Stable Diffusion 2.1 model outputs black images. It seems that
forcing Attention to float32 could avoid the issue; however, float32
Attention is much slower.

### Description
Reduce the CUDA library size by:
1. Refactoring beam_search_top_k to reduce template instantiation; this
saves ~56 MB.
2. Opting TopK out of the uint*, int8_t and int16_t types; this saves ~50 MB.
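As a hedged sketch of technique 1 (a generic top-k, not the actual beam_search_top_k code): hoisting a compile-time parameter that multiplies the number of instantiations into a runtime argument means the compiler emits one body per data type instead of one per (type, parameter) combination:

```
// topk_instantiation_sketch.cc -- fewer template instantiations, same result.
#include <algorithm>
#include <functional>
#include <iostream>
#include <vector>

// Before: one instantiation per (T, K) pair -- K variants multiply code size.
template <typename T, int K>
std::vector<T> TopKFixed(std::vector<T> v) {
  std::partial_sort(v.begin(), v.begin() + K, v.end(), std::greater<T>());
  v.resize(K);
  return v;
}

// After: k is a runtime argument -- one instantiation per T only.
template <typename T>
std::vector<T> TopKRuntime(std::vector<T> v, int k) {
  std::partial_sort(v.begin(), v.begin() + k, v.end(), std::greater<T>());
  v.resize(k);
  return v;
}

int main() {
  for (float x : TopKRuntime<float>({3.f, 1.f, 4.f, 1.f, 5.f}, 2))
    std::cout << x << " ";  // prints: 5 4
}
```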


…l. (microsoft#14602)


Co-authored-by: Scott McKay <[email protected]>
### Description
1. Fix a bug in the relative position bias kernel when seq_len > 32.
2. Rename extra_add_qk to relative_position_bias.
3. Support relative_position_bias in multihead attention (B, N, S, S*).
4. `gru_gate` support by Lei.



---------

Co-authored-by: Ubuntu <[email protected]>
Co-authored-by: Lei Zhang <[email protected]>
### Description
Introduce collective ops into the onnxruntime inference build, including:
1) AllReduce and AllGather schemas in contrib ops, controlled by the USE_MPI
flag;
2) AllReduce and AllGather kernels in the CUDA EP, controlled by the
ORT_USE_NCCL flag.


### Motivation and Context
Enable the collective ops in the onnxruntime inference build so we have the
ability to run distributed inference with multiple GPUs.
The original ncclAllReduce ops in the training build require quite complex
configuration, which is not suitable for the inference case, and they are
already broken, so we introduce a new implementation.

---------

Co-authored-by: Cheng Tang <[email protected]@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
### Description
Fix SNPE build issue caused by cmake dependency refactor

### Motivation and Context
Fix issue microsoft#14547.
…4088)

### Description
This exposes the already existing asynchronous work interface of all
CUDA-based EPs (CUDA + TensorRT).


### Motivation and Context
This was requested in microsoft#12216. It will enable users to build an
efficient data pipeline with ONNX Runtime and CUDA pre-/post-processing.
PCI traffic to the CUDA device can run during inference as soon as
the post-processing has consumed the input buffer and it can be overwritten.
To do this, work has to be submitted asynchronously to the device. Please see
the screenshots below illustrating this using NSight Systems.

Async: 
<img width="1401" alt="image"
src="https://user-images.githubusercontent.com/44298237/209894303-706460ed-cbdb-4be2-a2e4-0c111ec875dd.png">

Synchronous:
<img width="1302" alt="image"
src="https://user-images.githubusercontent.com/44298237/209894630-1ce40925-bbd5-470d-b888-46553ab75fb9.png">

Note the gap between the two inference runs, due to the PCI traffic issued
in between and the CPU overhead of active synchronization.
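For a rough standalone illustration of the pattern (plain CUDA, not the ONNX Runtime interface this PR exposes), asynchronous submission on a stream lets the host enqueue the PCI transfer and the compute without blocking:

```
// async_pipeline_sketch.cu -- submit copy + compute async, sync only when
// the result is actually needed; the host stays free in between.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void Inference(const float* in, float* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = in[i] * 2.0f;  // stand-in for the model
}

int main() {
  const int n = 1 << 20;
  float *h_in, *d_in, *d_out;
  cudaMallocHost((void**)&h_in, n * sizeof(float));  // pinned memory: required for true async copies
  cudaMalloc((void**)&d_in, n * sizeof(float));
  cudaMalloc((void**)&d_out, n * sizeof(float));
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // Enqueue copy + compute without blocking the CPU; the host can prepare
  // the next batch while the device works.
  cudaMemcpyAsync(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice, stream);
  Inference<<<(n + 255) / 256, 256, 0, stream>>>(d_in, d_out, n);

  cudaStreamSynchronize(stream);  // synchronize only when the result is needed
  printf("done\n");
  cudaFree(d_out); cudaFree(d_in); cudaFreeHost(h_in);
  return 0;
}
```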

---------

Co-authored-by: Chi Lo <[email protected]>
…ft#14619)

This reverts commit f88a464.

### Motivation and Context
For the release, the WinAI packaging pipeline's container image is reverted
to the old image, so we should revert VS to 2019.
### Description
Update the list of pipelines to remove obsolete pipelines and reformat.
Optional pipelines are not included, except for Android and iOS.


![image](https://user-images.githubusercontent.com/20780999/217395702-f08f1252-e1aa-4fec-ac34-1c0b9859ec20.png)
…tput tensor (microsoft#12593)

**Description**:
Support a new feature of the TVM Virtual Machine (the `set_outputs` method)
on the TVM Execution Provider side. It avoids excess copying from the TVM EP
output tensor to the ONNX Runtime one.

**Motivation and Context**
Tests with multiple-output topologies and big output tensors show that there
is overhead spent on copying from TVM EP to ONNX Runtime. Returning output(s)
in preallocated memory for the VirtualMachine was implemented on the TVM side.

**Details**
The `set_output_zero_copy` provider option for the TVM EP switches this
feature on/off. It is true by default.
The feature works for both the GraphExecutor and the VirtualMachine from TVM.

---------

Co-authored-by: Valery Chernov <[email protected]>
jstoecker and others added 29 commits February 9, 2023 10:27
In microsoft#14461 I added a private interface to MLOperatorAuthorPrivate.h to
pipe ORT node names through to the debug name of DML operators/graphs.
The wrong SAL annotation was used on the `Get*Name` methods, which
confused static analysis tools into thinking there is a potential buffer
overrun.
…icrosoft#14638)

### Description
Switch to a static local variable to fix the warning.

Comments were added in the code so it's clear that it's intentional.

### Motivation and Context
Prefast warning: [prefast:Warning]: C26426 (in
'onnxruntime::cuda::`dynamic initializer for 'castOpTypeConstraints''')
Global initializer calls a non-constexpr function
'onnxruntime::DataTypeImpl::GetTensorType<onnxruntime::MLFloat16>'
(i.22).
This makes the TunableOp tuning results state-free and will allow us to
dump and load offline tuning results.
Signed-off-by: Cliff Woolley <[email protected]>

### Description
Including file to fix build w/CUDA 12



### Motivation and Context
It should allow users to compile against CUDA 12

Signed-off-by: Cliff Woolley <[email protected]>
Co-authored-by: Cliff Woolley <[email protected]>
### Description
Add the Softmax tunable op; only the blockwise vectorized implementation and
the composable kernel implementation are included.
Related PR: microsoft#14475,
microsoft#14612

---------

Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>
…icrosoft#14580)

### Description
Remove the device_id parameter from the ExecutionProvider::GetAllocator()
function.



### Motivation and Context
The device_id parameter is not necessary. We can fully rely on the second
parameter, OrtMemType mem_type, to determine the device_id when getting an
allocator from the ExecutionProvider.
### Description
Tell more about the `OrtEnv` class.

### Motivation and Context
Need to mention the importance of creating `OrtEnv` first.
### Description
Fixes the DML release build for 1.14.1. This was initially fixed by
microsoft#13417 for 1.13.1, but the
changes didn't make their way back to the main branch.
MIOpen created a new API to get the spatial dimensions.
This has been available since July 25th, 2022 in MIGraphX. It appeared to be
missing from the supported ops list.

ROCm/AMDMIGraphX#1302

### Description
Add the node name for the Mod operator to be supported by MIGraphX.


### Motivation and Context
Expand the functionality available to ONNX Runtime via the MIGraphX EP.

Co-authored-by: Ted Themistokleous <[email protected]>
…ft#14645)

### Description
1. Fuse rel_pos_bias in T5.
2. Remove extended masks in the T5 decoder and decoder_init, since they
generate all zeros.
3. Fix a bug in onnx_model.py.



---------

Co-authored-by: Ubuntu <[email protected]>
### Description
The custom thread entry point was declared `__stdcall` even though the
API dictated a different type. Casting caused improper cleanup of the
stack and a crash that manifested only in 32-bit Debug builds.

### Motivation and Context
This addresses microsoft#14613
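As a sketch of the bug class (hypothetical typedef, not the actual ORT API signature): on 32-bit Windows, `__stdcall` makes the callee pop its arguments while the default `__cdecl` makes the caller pop them, so a cast that silences the type mismatch leaves the stack mis-adjusted at return:

```
// calling_convention_sketch.cc -- declare the entry point with exactly the
// convention the API's function-pointer type specifies.
#if defined(_WIN32)
typedef void (*ThreadEntry)(void* param);  // API expects the default (__cdecl)

void CorrectEntry(void* /*param*/) {}          // matches: OK
void __stdcall WrongEntry(void* /*param*/) {}  // mismatched convention

void Register(ThreadEntry fn, void* param) { fn(param); }

int main() {
  Register(&CorrectEntry, nullptr);
  // Register(reinterpret_cast<ThreadEntry>(&WrongEntry), nullptr);
  // ^ compiles only via a cast, and in 32-bit builds leaves the stack
  //   mis-adjusted after the call -- the crash class this PR fixes.
  return 0;
}
#else
int main() { return 0; }  // __stdcall is a Windows-specific keyword
#endif
```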
The third part of the stable diffusion CUDA optimizations:
(1) Add a BiasAdd operator to replace two Adds (bias and residual); add
fusion for BiasAdd.
(2) Add Attention fusion for the VAE decoder.
(3) Update float16 conversion to handle Resize and GroupNorm. This could
eliminate two Cast nodes for each Resize op in the fp16 model.
(4) Force inputs and outputs to be float16 to avoid data casts in the
pipeline.
(5) Add options --force_fp32_ops, --inspect etc. to the optimize script so
that the user can force some operators to run in float32 to potentially get
better image quality (at a cost in performance).

Performance tests show a slight improvement on T4: average latency reduced by
0.1 seconds (from 5.35s to 5.25s) for 512x512 in 50 steps.
Add the ability to get and set the tuning results of an inference session.
Also add a tool to manipulate the onnx file to embed the results into the
model file and automatically load them on session initialization.
When inferencing a real gpt2-based model, we found some gaps between the CUDA
and ROCm codebases.

The fixes include:

1. a minimal code change to fix the tensor shape of the Attention op
2. support for an optional output tensor in SkipLayerNorm
3. a fix for a build error found on MI200

---------

Co-authored-by: Ubuntu <ettao@ettao-amd-dev1.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>
### Description
Skip the col2im_pads test in the model tests.

### Motivation and Context
The failing test blocks updating the new image.
### Description
FP16 GEMM, including hardware-agnostic driver code, a slow C++ kernel,
and an ARM64 NEON kernel.


### Motivation and Context
The first step in creating native support for fp16 model inferencing on
ARM64 and AMD64 platforms.

---------

Co-authored-by: Chen Fu <[email protected]>
(1) Support the packed QKV format in MultiHeadAttention. This format avoids
the add-bias transpose when the TRT fused kernel is used.
(2) Add a cache for the cumulative sequence length computation. For SD, it
only needs to be computed once since the sequence length is fixed.
(3) Do not allocate a qkv workspace, to save memory for packed KV or QKV.
(4) Add unit tests for the packed KV and packed QKV formats in
MultiHeadAttention.
(5) Mark some fusion options as SD-only.

Performance tests show a slight improvement on T4: average latency reduced by
0.15 seconds (from 5.25s to 5.10s) for 512x512 in 50 steps for SD 1.5
models. Memory usage drops from 5.1GB to 4.8GB.
### Description

* add noexcept to `InitApi()` and `GetApi()`

### Motivation and Context

* fixes microsoft#12581
…t order pr… (microsoft#14700)

The current configuration has CPU as the highest priority, per the
specification found at:
https://onnxruntime.ai/docs/api/python/api_summary.html#inferencesession

providers – Optional sequence of providers in order of decreasing precedence.
Values can either be provider names or tuples of (provider name, options
dict). If not provided, then all available providers are used with the
default precedence.

Sets the correct operator precedence for the EPs in the parity utilities for
test runs, ruling out any odd out-of-order issues when setting up tests for
multiple EPs.

Co-authored-by: Ted Themistokleous <[email protected]>
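For reference, a minimal sketch of the precedence rule using the ONNX Runtime C++ API (the model path is a placeholder, and a non-Windows build is assumed for the char* path): providers are consulted in the order they are appended, with the CPU EP as the implicit final fallback.

```
// ep_order_sketch.cc -- providers are tried in registration order; the CPU
// EP is always available as the implicit last fallback.
#include <onnxruntime_cxx_api.h>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "ep-order");
  Ort::SessionOptions opts;

  // Highest precedence first: nodes go to CUDA when supported, else CPU.
  OrtCUDAProviderOptions cuda_options{};
  opts.AppendExecutionProvider_CUDA(cuda_options);

  Ort::Session session(env, "model.onnx", opts);  // placeholder model path
  return 0;
}
```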
1. Add Softmax warpwise_forward into SoftmaxTunableOp.
2. Make the Softmax op's use of TunableOp optional, and use the original
implementation by default.
3. Some other operators use `dispatch_warpwise_softmax_forward`/
`dispatch_blockwise_softmax_forward`/`SoftMaxComputeHelper` directly, but
they only have files under the cuda directory, and adding
`RocmTuningContext` to these files requires copying and modifying hipified
files. For now, `RocmTuningContext` is set to nullptr by default and the
other operators are not hipified.
Related PR: microsoft#14541

---------

Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>
Add a fusion to remove the Transpose in a subgraph like:
```
--> Gemm --> Unsqueeze(axes=[2]) --> Unsqueeze(axes=[3]) --> Add --> Transpose([0,2,3,1]) --> GroupNorm
```
With this fusion, we can remove 22 Transpose nodes in UNet and reduce
latency by 0.1 seconds per image on T4.
@natke natke merged commit 0a27a4b into main Feb 16, 2023