Sync update-labeling-workflow with main #21658

Merged · 76 commits · Aug 7, 2024

Conversation

sophies927
Contributor

Description

Motivation and Context

adrianlizarraga and others added 30 commits July 24, 2024 16:39
### Description
- Extends the QDQPropagationTransformer to propagate DQs (forward)
across operators with multiple consumers (previously only supported 1
consumer).
- Adds Slice to the list of operators that the QDQPropagationTransformer
can propagate DQ/Q ops across.
- Supports QDQ propagation for opset 21.
- Correctly copies Q or DQ attributes when creating new nodes.


### Motivation and Context
The QDQPropagationTransformer fixes up QDQ node units for certain "data
movement" ops (e.g., Transpose) by inserting Q -> DQ sequences where
necessary. For example, the sequence `DQ -> Transpose -> Sigmoid` is
transformed to `DQ -> Transpose -> Q -> DQ -> Sigmoid`.

However, this fix-up does not currently support data movement ops with
multiple consumers, as in:
```
DQ -> Transpose --+--> Sigmoid ->
                  |
                  +--> Relu ->
                  |
                  +--> graph_output
```

With the updates in this PR, the above model can be transformed to:
```
DQ -> Transpose -> Q --+--> DQ -> Sigmoid ->
                       |
                       +--> DQ -> Relu ->
                       |
                       +--> DQ -> graph_output
```

This update allows QNN EP to support quantized models created with tools
that do not wrap data movement ops in Q/DQ ops.
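
For reference, a minimal Python sketch (not from this PR) of dumping the optimized graph so the inserted Q -> DQ sequences can be inspected; the model and output paths are placeholders:

```python
import onnxruntime as ort

so = ort.SessionOptions()
# Assumption: the QDQ propagation runs as part of ORT's standard optimizer passes.
so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_EXTENDED
# Save the optimized graph so the inserted Q -> DQ sequences can be inspected.
so.optimized_model_filepath = "model_qdq_optimized.onnx"  # hypothetical output path
sess = ort.InferenceSession("model_qdq.onnx", so, providers=["CPUExecutionProvider"])  # hypothetical model
```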

---------

Co-authored-by: Edward Chen <[email protected]>
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Allow importing camelcase names in lowercase
### Description
Add OVEP features for 1.19.

The PR has:
- Added support for EpCtx with ORT session options for optimized
performance.
- Bug fixes.
- Support for OV 2024.3.

---------

Co-authored-by: ubuntu <[email protected]>
Co-authored-by: vthaniel <[email protected]>
Co-authored-by: sfatimar <[email protected]>
Co-authored-by: saurabhkale17 <[email protected]>
Co-authored-by: Maheshkar <[email protected]>
### Description
<!-- Describe your changes. -->
We found that the text format could cause errors.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Because the OS could change the string, we decided to save it as a
binary file.
Updating Performance issue template so "performance" label is
automatically applied

### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
* Swap CUDA versions 11.8/12.2 in the GPU CIs
* Set CUDA 12 as the default version in the yamls for publishing nuget/python/java
GPU packages
* Suppress warnings-as-errors for flash_api.cc during the ORT Windows build
#21485)

### Description
Right now our "Zip-Nuget-Java-Nodejs Packaging Pipeline" is too big.
The OnDevice training part is independent of the others, so it can be
split out. Then our NPM packaging pipeline will not depend on the
training stuff.

### Motivation and Context
Similar to #21235 

Also, this PR fixed a problem: the "NuGet_Test_Linux_Training_CPU" job
downloads artifacts from "onnxruntime-linux-x64" to get the custom-op
shared libs, but it forgot to declare that it depends on
"Linux_C_API_Packaging_CPU_x64", which produces that artifact. Such
problems can be hard to find when a pipeline grows large.
### Description
QNN BatchNorm: support inputs with rank 2.
Update the quantization script to quantize the BatchNorm bias using int32.

---------

Co-authored-by: Justin Chu <[email protected]>
### Description
<!-- Describe your changes. -->
Current failure is due to a version mismatch.

Use llvm-cov from the Android NDK instead of the system gcov so that the
version is correct.

Also comment out publishing to the Azure dashboard to simplify the
setup. The CI prints out the stats for review by developers.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Fix CI pipeline
…#21472)

### Description
<!-- Describe your changes. -->
Add these changes to one PR to simplify checkin
- Add Concat (#21423)
- Add DepthToSpace (#21426)
- Add LeakyRelu (#21453)
- Add test scripts (#21427)
- Add ability to set coreml flags from python (#21434)


Other changes
- updated the partitioning utils to support dropping constant initializers
from a ComputeCapability's inputs.
- noticed that the list of inputs to the CoreML model was unexpectedly
long due to this.
- we copy constant initializers into the CoreML model so we don't need the
originals, and if they remain as inputs ORT can't free them as they
appear to be in use.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Allow cpplint to always be green since it is optional. Also changed the
workflow name to reflect that.
…ns (#21502)

### Description
<!-- Describe your changes. -->
Current behavior forces all L2 optimizers to loop until they hit the max
number of iterations.

Only update `modified` if the graph was modified.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Fix unnecessary loops of L2 optimizers during model loading.
### Description
This PR registers the ReduceMin-20 operator to the DML EP.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
The WebNN spec recently changed the definition of argMax/argMin:
- Removed the selectLastIndex option, letting backends decide whether to
select the last index.
- Moved the axes option to an axis input.
### Description
<!-- Describe your changes. -->

`enable_windows_arm64_qnn` and `enable_windows_x64_qnn` are true by
default but unnecessary for training. This change explicitly sets these
parameters to false for training pipeline.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

ORT 1.19 Release Preparation
The only place I fixed manually was one where I had forgotten a format string.
### Description
Separating all GPU stages into different Pipelines
… prefix (#21236)

### Description
Add QNN EP option context_node_name_prefix to set EPContext node name prefix

### Motivation and Context
To work around the QNN context PD memory limit, the user needs to split the model into pieces and generate the QNN context models separately. The EPContext nodes generated in the separate graphs can end up with the same node name, which causes issues when gluing those EPContext nodes together into a single model.
To avoid this, the user can set context_node_name_prefix for each split piece to make the node names unique.
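
For illustration, a minimal Python sketch of setting the option per split piece (model path, backend library name, and prefix value are placeholders):

```python
import onnxruntime as ort

qnn_options = {
    "backend_path": "QnnHtp.dll",          # assumed HTP backend library
    "context_node_name_prefix": "part1_",  # the option added in this PR
}
sess = ort.InferenceSession(
    "model_part1.onnx",
    providers=[("QNNExecutionProvider", qnn_options), "CPUExecutionProvider"],
)
```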
### Description
* Update benchmark_mha.py to compare with PyTorch SDPA api.
* Write results to csv file.
* Use sdpa_kernel cuda provider option instead of environment variables
for better control.
* Add arguments (`--use_gpu`, `--causal`, etc.) to allow testing different
scenarios.
* Update benchmark_mha.sh to add cpu benchmarks

For the Q,K,V format, torch uses the BNSH layout while ORT uses BSNH, so
the comparison is not apples-to-apples. However, if the latency difference is
large, that could be a warning sign.
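
As a rough sketch of the layout difference (not the benchmark script itself): torch SDPA takes BNSH tensors, so BSNH data has to be transposed first.

```python
import torch

batch, seq_len, num_heads, head_size = 4, 2048, 32, 128
# BSNH tensors, as used by ORT MultiHeadAttention.
q_bsnh = torch.randn(batch, seq_len, num_heads, head_size)
k_bsnh = torch.randn_like(q_bsnh)
v_bsnh = torch.randn_like(q_bsnh)

# torch SDPA expects BNSH, so swap the sequence and head dimensions first.
q, k, v = (t.transpose(1, 2) for t in (q_bsnh, k_bsnh, v_bsnh))
out_bnsh = torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=False)
out_bsnh = out_bnsh.transpose(1, 2)  # back to the BSNH layout
```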

#### Example GPU results

Example results on A100-SXM4-80GB with settings (use_gpu=TRUE,
enable_cuda_graph=FALSE, causal=FALSE, past_sequence_length=0,
intra_op_num_threads=0) in Azure Linux. ORT: built from source with CUDA
12.5; PyTorch 2.3.1 for CUDA 12.1.

format | batch_size | sequence_length | num_heads | head_size | latency (s) | tflops | kernel
-- | -- | -- | -- | -- | -- | -- | --
Q,KV | 4 | 2048 | 32 | 128 | 0.0015 | 179.5 | ort:flash
Q,KV | 4 | 2048 | 32 | 128 | 0.0015 | 179.0 | ort:default
Q,K,V | 4 | 2048 | 32 | 128 | 0.0016 | 170.0 | ort:default
Q,K,V | 4 | 2048 | 32 | 128 | 0.0016 | 169.5 | ort:flash
QKV | 4 | 2048 | 32 | 128 | 0.0016 | 168.5 | ort:default
QKV | 4 | 2048 | 32 | 128 | 0.0016 | 167.4 | ort:flash
Q,K,V | 4 | 2048 | 32 | 128 | 0.0017 | 159.4 | torch:default
Q,K,V | 4 | 2048 | 32 | 128 | 0.0018 | 155.0 | torch:flash
Q,KV | 4 | 2048 | 32 | 128 | 0.0030 | 92.7 | ort:efficient
Q,K,V | 4 | 2048 | 32 | 128 | 0.0030 | 90.9 | ort:efficient
QKV | 4 | 2048 | 32 | 128 | 0.0031 | 89.9 | ort:efficient
Q,K,V | 4 | 2048 | 32 | 128 | 0.0031 | 89.0 | torch:efficient
Q,K,V | 4 | 2048 | 32 | 128 | 0.0054 | 51.3 | torch:math
Q,KV | 4 | 4096 | 32 | 128 | 0.0058 | 191.0 | ort:default
Q,KV | 4 | 4096 | 32 | 128 | 0.0058 | 190.6 | ort:flash
Q,K,V | 4 | 4096 | 32 | 128 | 0.0059 | 187.8 | ort:default
Q,K,V | 4 | 4096 | 32 | 128 | 0.0059 | 186.7 | ort:flash
QKV | 4 | 4096 | 32 | 128 | 0.0059 | 185.9 | ort:flash
QKV | 4 | 4096 | 32 | 128 | 0.0059 | 185.8 | ort:default
Q,K,V | 4 | 4096 | 32 | 128 | 0.0067 | 163.4 | torch:default
Q,K,V | 4 | 4096 | 32 | 128 | 0.0070 | 157.2 | torch:flash
Q,KV | 4 | 4096 | 32 | 128 | 0.0113 | 97.6 | ort:efficient
Q,K,V | 4 | 4096 | 32 | 128 | 0.0114 | 96.4 | ort:efficient
QKV | 4 | 4096 | 32 | 128 | 0.0114 | 96.2 | ort:efficient
Q,K,V | 4 | 4096 | 32 | 128 | 0.0127 | 86.3 | torch:efficient
Q,KV | 8 | 2048 | 32 | 128 | 0.0031 | 177.8 | ort:flash
Q,KV | 8 | 2048 | 32 | 128 | 0.0031 | 177.7 | ort:default
Q,K,V | 8 | 2048 | 32 | 128 | 0.0032 | 170.8 | ort:default
Q,K,V | 8 | 2048 | 32 | 128 | 0.0032 | 170.3 | ort:flash
QKV | 8 | 2048 | 32 | 128 | 0.0032 | 169.2 | ort:default
QKV | 8 | 2048 | 32 | 128 | 0.0033 | 169.0 | ort:flash
Q,K,V | 8 | 2048 | 32 | 128 | 0.0034 | 161.9 | torch:default
Q,K,V | 8 | 2048 | 32 | 128 | 0.0036 | 152.9 | torch:flash
Q,KV | 8 | 2048 | 32 | 128 | 0.0059 | 93.5 | ort:efficient
Q,K,V | 8 | 2048 | 32 | 128 | 0.0060 | 91.3 | ort:efficient
QKV | 8 | 2048 | 32 | 128 | 0.0060 | 91.0 | ort:efficient
Q,K,V | 8 | 2048 | 32 | 128 | 0.0064 | 86.0 | torch:efficient
Q,KV | 8 | 4096 | 32 | 128 | 0.0115 | 190.8 | ort:flash
Q,KV | 8 | 4096 | 32 | 128 | 0.0115 | 190.7 | ort:default
Q,K,V | 8 | 4096 | 32 | 128 | 0.0118 | 187.1 | ort:default
Q,K,V | 8 | 4096 | 32 | 128 | 0.0118 | 187.0 | ort:flash
QKV | 8 | 4096 | 32 | 128 | 0.0118 | 185.6 | ort:default
QKV | 8 | 4096 | 32 | 128 | 0.0118 | 185.6 | ort:flash
Q,K,V | 8 | 4096 | 32 | 128 | 0.0139 | 158.7 | torch:default
Q,K,V | 8 | 4096 | 32 | 128 | 0.0139 | 158.3 | torch:flash
Q,KV | 8 | 4096 | 32 | 128 | 0.0225 | 97.7 | ort:efficient
Q,K,V | 8 | 4096 | 32 | 128 | 0.0227 | 96.8 | ort:efficient
QKV | 8 | 4096 | 32 | 128 | 0.0228 | 96.3 | ort:efficient
Q,K,V | 8 | 4096 | 32 | 128 | 0.0260 | 84.5 | torch:efficient

#### Example CPU results

Dell XPS 8960 with i9-13900 CPU (use_gpu=FALSE, causal=FALSE,
past_sequence_length=0) in Windows. ORT: built from source with CUDA
12.5; PyTorch 2.3.1 for CUDA 12.1.

format | causal | batch_size | seq_len | num_heads | head_size | threads | latency (s) | kernel
-- | -- | -- | -- | -- | -- | -- | -- | --
Q,K,V | FALSE | 1 | 128 | 32 | 128 | 8 | 0.0005 | ort:flash
Q,K,V | FALSE | 1 | 128 | 32 | 128 | 0 | 0.0009 | ort:flash
Q,K,V | FALSE | 1 | 128 | 32 | 128 | 0 | 0.0009 | ort:math
Q,K,V | FALSE | 1 | 128 | 32 | 128 | 4 | 0.0009 | ort:flash
Q,K,V | FALSE | 1 | 128 | 32 | 128 | 2 | 0.0014 | ort:flash
Q,K,V | FALSE | 1 | 128 | 32 | 128 | 1 | 0.0025 | ort:flash
Q,K,V | FALSE | 1 | 128 | 32 | 128 | 2 | 0.0045 | torch:default
Q,K,V | FALSE | 1 | 128 | 32 | 128 | 24 | 0.0046 | torch:default
Q,K,V | FALSE | 1 | 128 | 32 | 128 | 8 | 0.0046 | torch:default
Q,K,V | FALSE | 1 | 128 | 32 | 128 | 4 | 0.0046 | torch:default
Q,K,V | FALSE | 1 | 128 | 32 | 128 | 1 | 0.0047 | torch:default
Q,K,V | FALSE | 1 | 256 | 32 | 128 | 0 | 0.0019 | ort:flash
Q,K,V | FALSE | 1 | 256 | 32 | 128 | 8 | 0.0019 | ort:flash
Q,K,V | FALSE | 1 | 256 | 32 | 128 | 0 | 0.0022 | ort:math
Q,K,V | FALSE | 1 | 256 | 32 | 128 | 4 | 0.0030 | ort:flash
Q,K,V | FALSE | 1 | 256 | 32 | 128 | 2 | 0.0047 | ort:flash
Q,K,V | FALSE | 1 | 256 | 32 | 128 | 1 | 0.0086 | ort:flash
Q,K,V | FALSE | 1 | 256 | 32 | 128 | 2 | 0.0161 | torch:default
Q,K,V | FALSE | 1 | 256 | 32 | 128 | 4 | 0.0162 | torch:default
Q,K,V | FALSE | 1 | 256 | 32 | 128 | 8 | 0.0162 | torch:default
Q,K,V | FALSE | 1 | 256 | 32 | 128 | 24 | 0.0165 | torch:default
Q,K,V | FALSE | 1 | 256 | 32 | 128 | 1 | 0.0166 | torch:default
Q,K,V | FALSE | 1 | 512 | 32 | 128 | 8 | 0.0077 | ort:flash
Q,K,V | FALSE | 1 | 512 | 32 | 128 | 0 | 0.0091 | ort:flash
Q,K,V | FALSE | 1 | 512 | 32 | 128 | 0 | 0.0099 | ort:math
Q,K,V | FALSE | 1 | 512 | 32 | 128 | 4 | 0.0103 | ort:flash
Q,K,V | FALSE | 1 | 512 | 32 | 128 | 2 | 0.0177 | ort:flash
Q,K,V | FALSE | 1 | 512 | 32 | 128 | 1 | 0.0328 | ort:flash
Q,K,V | FALSE | 1 | 512 | 32 | 128 | 2 | 0.0624 | torch:default
Q,K,V | FALSE | 1 | 512 | 32 | 128 | 4 | 0.0624 | torch:default
Q,K,V | FALSE | 1 | 512 | 32 | 128 | 8 | 0.0625 | torch:default
Q,K,V | FALSE | 1 | 512 | 32 | 128 | 24 | 0.0626 | torch:default
Q,K,V | FALSE | 1 | 512 | 32 | 128 | 1 | 0.0640 | torch:default
Q,K,V | FALSE | 1 | 1024 | 32 | 128 | 8 | 0.0286 | ort:flash
Q,K,V | FALSE | 1 | 1024 | 32 | 128 | 0 | 0.0317 | ort:flash
Q,K,V | FALSE | 1 | 1024 | 32 | 128 | 4 | 0.0367 | ort:flash
Q,K,V | FALSE | 1 | 1024 | 32 | 128 | 0 | 0.0391 | ort:math
Q,K,V | FALSE | 1 | 1024 | 32 | 128 | 2 | 0.0656 | ort:flash
Q,K,V | FALSE | 1 | 1024 | 32 | 128 | 1 | 0.1235 | ort:flash
Q,K,V | FALSE | 1 | 1024 | 32 | 128 | 24 | 0.2482 | torch:default
Q,K,V | FALSE | 1 | 1024 | 32 | 128 | 2 | 0.2483 | torch:default
Q,K,V | FALSE | 1 | 1024 | 32 | 128 | 4 | 0.2483 | torch:default
Q,K,V | FALSE | 1 | 1024 | 32 | 128 | 8 | 0.2486 | torch:default
Q,K,V | FALSE | 1 | 1024 | 32 | 128 | 1 | 0.2538 | torch:default
Q,K,V | FALSE | 1 | 2048 | 32 | 128 | 0 | 0.1038 | ort:flash
Q,K,V | FALSE | 1 | 2048 | 32 | 128 | 8 | 0.1050 | ort:flash
Q,K,V | FALSE | 1 | 2048 | 32 | 128 | 0 | 0.1368 | ort:math
Q,K,V | FALSE | 1 | 2048 | 32 | 128 | 4 | 0.1535 | ort:flash
Q,K,V | FALSE | 1 | 2048 | 32 | 128 | 2 | 0.2461 | ort:flash
Q,K,V | FALSE | 1 | 2048 | 32 | 128 | 1 | 0.4724 | ort:flash
Q,K,V | FALSE | 1 | 2048 | 32 | 128 | 8 | 0.9835 | torch:default
Q,K,V | FALSE | 1 | 2048 | 32 | 128 | 4 | 0.9841 | torch:default
Q,K,V | FALSE | 1 | 2048 | 32 | 128 | 24 | 0.9841 | torch:default
Q,K,V | FALSE | 1 | 2048 | 32 | 128 | 2 | 0.9873 | torch:default
Q,K,V | FALSE | 1 | 2048 | 32 | 128 | 1 | 0.9985 | torch:default


### Motivation and Context
To compare with PyTorch SDPA on CPU and CUDA latency.
### Description
<!-- Describe your changes. -->
Set version and other info in the Microsoft.ML.OnnxRuntime C# dll by
setting GenerateAssemblyInfo to true and passing in ORT version in the
CI.

Minor re-org of the order of properties so related things are grouped a
little better.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
#21475
### Description
<!-- Describe your changes. -->

[VitisAI] 1. KernelDef supports StartVersion and EndVersion
2. CapabilityOps checks domain

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Co-authored-by: Zhenze Wang <[email protected]>
### Description
<!-- Describe your changes. -->
1. We decided to move the context node creation back to our own repo because it is more flexible to modify.
2. We found a bug related to the context node: it would change the inference order. So, we fixed it in this PR as well.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
This is crucial for Microsoft Release next month.

---------

Co-authored-by: Yueqing Zhang <[email protected]>
The change in #21005 works for directly building wheels with `build.py`,
but ort-nightly-directml wheels, as well as the 1.18.1 release of the
onnxruntime-directml python wheel, still do not work with conda since
they're built from the `py-win-gpu.yml` pipeline, which uses
`install_third_party_deps.ps1` to set compile flags.
…soft.ML.OnnxRuntime.ResNet50v2Sample (#21444)

Bumps [Sixlabors.ImageSharp](https://github.com/SixLabors/ImageSharp)
from 2.1.8 to 2.1.9.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/SixLabors/ImageSharp/releases">Sixlabors.ImageSharp's
releases</a>.</em></p>
<blockquote>
<h2>v2.1.9</h2>
<h2>What's Changed</h2>
<ul>
<li>[2.1] Fix overflow in MemoryAllocator.Create(options) by <a
href="https://github.com/antonfirsov"><code>@​antonfirsov</code></a> in
<a
href="https://redirect.github.com/SixLabors/ImageSharp/pull/2732">SixLabors/ImageSharp#2732</a></li>
<li>Backport GIF LZW fix to 2.1 by <a
href="https://github.com/antonfirsov"><code>@​antonfirsov</code></a> in
<a
href="https://redirect.github.com/SixLabors/ImageSharp/pull/2756">SixLabors/ImageSharp#2756</a></li>
<li>Backport 2759 to 2.1.x by <a
href="https://github.com/antonfirsov"><code>@​antonfirsov</code></a> in
<a
href="https://redirect.github.com/SixLabors/ImageSharp/pull/2770">SixLabors/ImageSharp#2770</a></li>
</ul>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/SixLabors/ImageSharp/compare/v2.1.8...v2.1.9">https://github.com/SixLabors/ImageSharp/compare/v2.1.8...v2.1.9</a></p>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="https://github.com/SixLabors/ImageSharp/commit/9816ca45016c5d3859986f3c600e8934bc450a56"><code>9816ca4</code></a>
Merge pull request <a
href="https://redirect.github.com/SixLabors/ImageSharp/issues/2770">#2770</a>
from SixLabors/af/backport-2759-2.1.x</li>
<li><a
href="https://github.com/SixLabors/ImageSharp/commit/b33d666ab725c8ae14f38c98ee5dfc4645753b16"><code>b33d666</code></a>
handle DecodingMode</li>
<li><a
href="https://github.com/SixLabors/ImageSharp/commit/6b2030b54927b09ed65f782c994d5c9faa7cef27"><code>6b2030b</code></a>
Merge branch 'release/2.1.x' into af/backport-2759-2.1.x</li>
<li><a
href="https://github.com/SixLabors/ImageSharp/commit/8ffad3f480ebe8c5b432bb24fe8377096eeb733b"><code>8ffad3f</code></a>
Issue2012BadMinCode should decode now</li>
<li><a
href="https://github.com/SixLabors/ImageSharp/commit/1f5bf23b9e81f2dbeb51ed54f13cb3da94e67b6f"><code>1f5bf23</code></a>
skip Issue2758_DecodeWorks</li>
<li><a
href="https://github.com/SixLabors/ImageSharp/commit/3bf8c572a0d82f18e005bf9882106552218a2c37"><code>3bf8c57</code></a>
manual port of 3.1 gif decoder</li>
<li><a
href="https://github.com/SixLabors/ImageSharp/commit/28c20ded87e2d81477a08a48e0d3a0717b3c4d5a"><code>28c20de</code></a>
Clamp JPEG quality estimation results.</li>
<li><a
href="https://github.com/SixLabors/ImageSharp/commit/4b910e7f8400d89f1845761650cf64df687e73d5"><code>4b910e7</code></a>
Decode LZW row by row</li>
<li><a
href="https://github.com/SixLabors/ImageSharp/commit/a1f287977139109a987065643b8172c748abdadb"><code>a1f2879</code></a>
Merge pull request <a
href="https://redirect.github.com/SixLabors/ImageSharp/issues/2756">#2756</a>
from SixLabors/af/git-av-2.1</li>
<li><a
href="https://github.com/SixLabors/ImageSharp/commit/898df7f8ca51b2163cff0d697e2be44682266f0c"><code>898df7f</code></a>
backport <a
href="https://redirect.github.com/SixLabors/ImageSharp/issues/2749">#2749</a>
to 2.1</li>
<li>Additional commits viewable in <a
href="https://github.com/SixLabors/ImageSharp/compare/v2.1.8...v2.1.9">compare
view</a></li>
</ul>
</details>
<br />


[![Dependabot compatibility
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=Sixlabors.ImageSharp&package-manager=nuget&previous-version=2.1.8&new-version=2.1.9)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---

<details>
<summary>Dependabot commands and options</summary>
<br />

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
You can disable automated security fix PRs for this repo from the
[Security Alerts
page](https://github.com/microsoft/onnxruntime/network/alerts).

</details>

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
### Description
Security fuzz test with address sanitizer found several bugs
### Description
The local CI setup for AIX reported test failures after the gtest 1.15.0
upgrade.

### Motivation and Context
The test failures below were observed after the gtest upgrade.

The following tests FAILED:
	  1 - onnxruntime_test_all (ILLEGAL)
	  7 - onnxruntime_logging_apis_test (Subprocess aborted)

To fix this, I am enabling pthread support under gtest. It was
disabled with the previous version of gtest for some reason.
With pthread support enabled, the above tests pass with gtest 1.15.0.
…21529)

### Description
Delete tools/ci_build/github/azure-pipelines/win-gpu-ci-pipeline.yml


### Motivation and Context
This CI pipeline has been divided into 4 different pipelines.
…:convPoolShapeInference (#21507)

### Description
ONNX 1.16.2 is not available before the ORT 1.19.0 code freeze, so pick up
the needed change as a patch.
### Description
Enable float16 support for the Node.js binding.

The data of a float16 tensor uses `Uint16Array`.
fajin-corp and others added 24 commits July 31, 2024 15:30
### Description
The original argument accepts the enum QuantFormat.QOperator or QuantFormat.QDQ,
but the default value is QOperator.

Change the argument to str to accept "QOperator" or "QDQ" and convert it to
QuantFormat after parsing.
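
A minimal sketch of the string-to-enum conversion described above (the actual script and argument name may differ):

```python
from onnxruntime.quantization import QuantFormat

def parse_quant_format(value: str) -> QuantFormat:
    # Enum lookup by member name: "QOperator" -> QuantFormat.QOperator, "QDQ" -> QuantFormat.QDQ.
    try:
        return QuantFormat[value]
    except KeyError as e:
        raise ValueError(f"quant_format must be 'QOperator' or 'QDQ', got {value!r}") from e

quant_format = parse_quant_format("QDQ")
```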

### Motivation and Context
Bug fix
### Description
Masks off the top 4 bits of INT4 weights, improving accuracy.



### Motivation and Context
This is a workaround as the QNN docs state masking is not required.
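
A generic illustration of the bit operation (not the QNN EP code path), assuming each 4-bit value occupies its own byte after unpacking:

```python
import numpy as np

# Hypothetical unpacked INT4 weights; the high nibble may contain stray bits.
unpacked = np.array([0xF7, 0x09, 0xA3], dtype=np.uint8)
masked = unpacked & 0x0F  # keep only the low 4 bits of each value
print(masked)             # [7 9 3]
```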
### Description
The header files were added in PR #16454.
Then, recently, PR #21464 changed how we pack Linux
tarballs, and the new tarball is missing the custom op header files.
Therefore I need to make this change.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
…er (#21505)

Bumps [torch](https://github.com/pytorch/pytorch) from 1.13.1 to 2.2.0.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/pytorch/pytorch/releases">torch's
releases</a>.</em></p>
<blockquote>
<h2>PyTorch 2.2: FlashAttention-v2, AOTInductor</h2>
<h1>PyTorch 2.2 Release Notes</h1>
<ul>
<li>Highlights</li>
<li>Backwards Incompatible Changes</li>
<li>Deprecations</li>
<li>New Features</li>
<li>Improvements</li>
<li>Bug fixes</li>
<li>Performance</li>
<li>Documentation</li>
</ul>
<h1>Highlights</h1>
<p>We are excited to announce the release of PyTorch® 2.2! PyTorch 2.2
offers ~2x performance improvements to
<code>scaled_dot_product_attention</code> via FlashAttention-v2
integration, as well as AOTInductor, a new ahead-of-time compilation and
deployment tool built for non-python server-side deployments.</p>
<p>This release also includes improved torch.compile support for
Optimizers, a number of new inductor optimizations, and a new logging
mechanism called TORCH_LOGS.</p>
<p><strong>Please note that we are <a
href="https://redirect.github.com/pytorch/pytorch/issues/114602">deprecating
macOS x86 support</a>, and PyTorch 2.2.x will be the last version that
supports macOS x64.</strong></p>
<p>Along with 2.2, we are also releasing a series of updates to the
PyTorch domain libraries. More details can be found in the library
updates blog.</p>
<p>This release is composed of 3,628 commits and 521 contributors since
PyTorch 2.1. We want to sincerely thank our dedicated community for your
contributions. As always, we encourage you to try these out and report
any issues as we improve 2.2. More information about how to get started
with the PyTorch 2-series can be found at our <a
href="https://pytorch.org/get-started/pytorch-2.0/">Getting Started</a>
page.</p>
<p>Summary:</p>
<ul>
<li><code>scaled_dot_product_attention</code> (SDPA) now supports
FlashAttention-2, yielding around 2x speedups compared to previous
versions.</li>
<li>PyTorch 2.2 introduces a new ahead-of-time extension of
TorchInductor called AOTInductor, designed to compile and deploy PyTorch
programs for non-python server-side.</li>
<li><code>torch.distributed</code> supports a new abstraction for
initializing and representing ProcessGroups called device_mesh.</li>
<li>PyTorch 2.2 ships a standardized, configurable logging mechanism
called TORCH_LOGS.</li>
<li>A number of torch.compile improvements are included in PyTorch 2.2,
including improved support for compiling Optimizers and improved
TorchInductor fusion and layout optimizations.</li>
<li>Please note that we are deprecating macOS x86 support, and PyTorch
2.2.x will be the last version that supports macOS x64.</li>
<li><code>torch.ao.quantization</code> now offers a prototype
<code>torch.export</code> based flow</li>
</ul>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="https://github.com/pytorch/pytorch/commit/8ac9b20d4b090c213799e81acf48a55ea8d437d6"><code>8ac9b20</code></a>
Run docker release build on final tag (<a
href="https://redirect.github.com/pytorch/pytorch/issues/117131">#117131</a>)
(<a
href="https://redirect.github.com/pytorch/pytorch/issues/117182">#117182</a>)</li>
<li><a
href="https://github.com/pytorch/pytorch/commit/2490352430a19c42cf2a51a043f63c33df1280d6"><code>2490352</code></a>
Fix cuInit test on Windows (<a
href="https://redirect.github.com/pytorch/pytorch/issues/117095">#117095</a>)</li>
<li><a
href="https://github.com/pytorch/pytorch/commit/3a44bb713f528880482f56d9523a9cf2628d0534"><code>3a44bb7</code></a>
[CI] Test that cuInit is not called during import (<a
href="https://redirect.github.com/pytorch/pytorch/issues/117043">#117043</a>)</li>
<li><a
href="https://github.com/pytorch/pytorch/commit/1c8ba3847d47aa17727e98eee58a606e4a763a58"><code>1c8ba38</code></a>
[CI] Use jemalloc for CUDA builds (<a
href="https://redirect.github.com/pytorch/pytorch/issues/116900">#116900</a>)
(<a
href="https://redirect.github.com/pytorch/pytorch/issues/116988">#116988</a>)</li>
<li><a
href="https://github.com/pytorch/pytorch/commit/96d2ddbafe3a054ed2f8de5b192045e02f2dfd0f"><code>96d2ddb</code></a>
Store user model to simplify
ONNXProgram.{adapt_torch_*,<strong>call</strong>} APIs (<a
href="https://redirect.github.com/pytorch/pytorch/issues/1152">#1152</a>...</li>
<li><a
href="https://github.com/pytorch/pytorch/commit/738b4a560a25e1ff5b9f551072b14247fbd8a15b"><code>738b4a5</code></a>
Update ONNX's IO Adapter to support FakeTensor with ExportedProgram (<a
href="https://redirect.github.com/pytorch/pytorch/issues/114407">#114407</a>)...</li>
<li><a
href="https://github.com/pytorch/pytorch/commit/4cf10bf4dc0f94fa5f556e5ac68e829d870c26cd"><code>4cf10bf</code></a>
[Cherry-pick] [Quant] [PT2] Enable batchnorm in
_move_exported_model_to_eval ...</li>
<li><a
href="https://github.com/pytorch/pytorch/commit/7e97e4b4b6dff5932ea7aa9e22640c3e4c3dadcb"><code>7e97e4b</code></a>
[AARCH64] Fall back to GEMM if mkldnn_matmul fails (<a
href="https://redirect.github.com/pytorch/pytorch/issues/115936">#115936</a>)
(<a
href="https://redirect.github.com/pytorch/pytorch/issues/116666">#116666</a>)</li>
<li><a
href="https://github.com/pytorch/pytorch/commit/1a3e3c7cffac8b985ef8bd8dff9891c63c51e830"><code>1a3e3c7</code></a>
[CUDA] baddmm should fall back to addmm for batch=1 (<a
href="https://redirect.github.com/pytorch/pytorch/issues/114992">#114992</a>)
(<a
href="https://redirect.github.com/pytorch/pytorch/issues/116518">#116518</a>)</li>
<li><a
href="https://github.com/pytorch/pytorch/commit/ab7505f78c678445d628f0f84b7c1717ca3929a0"><code>ab7505f</code></a>
Fix broken PyYAML 6.0 on MacOS x86 (<a
href="https://redirect.github.com/pytorch/pytorch/issues/115956">#115956</a>)
(<a
href="https://redirect.github.com/pytorch/pytorch/issues/116551">#116551</a>)</li>
<li>Additional commits viewable in <a
href="https://github.com/pytorch/pytorch/compare/v1.13.1...v2.2.0">compare
view</a></li>
</ul>
</details>
<br />


[![Dependabot compatibility
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=torch&package-manager=pip&previous-version=1.13.1&new-version=2.2.0)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---

<details>
<summary>Dependabot commands and options</summary>
<br />

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
You can disable automated security fix PRs for this repo from the
[Security Alerts
page](https://github.com/microsoft/onnxruntime/network/alerts).

</details>

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Currently the WebNN spec only allows MLGraphBuilder.build() to be called
once, so we need to create a new builder for every subgraph in the WebNN EP.
Spec change: webmachinelearning/webnn#717
### Description
model: phi-3-mini-4k-instruct
avx2 symmetric
blklen | updated prompt tps | baseline prompt tps | prompt tps change% | updated token gen tps | baseline token gen tps | token gen change%
-|-|-|-|-|-|-
16 |49.5|70.0|-29.2%|9.6|10.8|-34.2%
32 |76.8|52.4|9.7%|15.2|14.6|4.1%
64 |78.2|71.4|9.5%|16.6|16.3|1.8%
128 |72.9|70.6|3.2%|17.1|16.8|1.7%
256 |83.7|63.6|31.6%|18.1|17.4|4%

avx2 asymmetric
blklen | updated prompt tps | baseline prompt tps | prompt tps change% | updated token gen tps | baseline token gen tps | token gen change%
-|-|-|-|-|-|-
16 |50.7|61.5|-17.5%|9.6|9.2|4.3%
32 |77.4|52.4|47.7%|14.6|13.9|5.0%
64 |78.7|63.0|24.9%|16.2|15.9|1.8%
128 |80.0|61.9|29.2%|17.2|16.9|1.7%
256 |81.5|63.3|28.7%|17.9|17.3|3.4%

avx2vnni symmetric
blklen | updated prompt tps | baseline prompt tps | prompt tps change% | updated token gen tps | baseline token gen tps | token gen change%
-|-|-|-|-|-|-
16 |82.9|117.0|-29.0%|15.9|19.3|-17.6%
32 |133.0|100.4|32.4%|26.1|24.5|6.5%
64 |166.9|118.8|40.4%|28.3|27.1|4.4%
128 |165.9|119.6|38.7%|29.3|28.5|2.8%
256 |165.2|119.6|38.1%|30.2|29.0|4.1%

avx2vnni asymmetric
blklen | updated prompt tps | baseline prompt tps | prompt tps change% | updated token gen tps | baseline token gen tps | token gen change%
-|-|-|-|-|-|-
16 |80.2|118.9|-32.5%|15.1|16.7|-9.5%
32 |130.7|99.7|31.0%|25.0|23.8|5.0%
64 |168.7|124.9|35.0%|27.3|26.8|1.8%
128 |169.6|123.8|36.9%|29.2|27.9|4.6%
256 |175.0|125.7|39.0%|30.0|29.7|1.0%

avx512 symmetric
blklen | updated prompt tps | baseline prompt tps | prompt tps change% | updated token gen tps | baseline token gen tps | token gen change%
-|-|-|-|-|-|-
16 |135.2|156.5|-13.6|25.5|23.8|7.1
32 |150.0|159.5|-5.9|34.9|29.6|17.9
64 |167.5|157.5|6.3|39.7|34.4|15.4
128 |177.8|158.0|12.5|40.3|35.4|13.8
256 |182.6|157.3|16.0|41.7|37.7|10.6

avx512 asymmetric
blklen | updated prompt tps | baseline prompt tps | prompt tps change% | updated token gen tps | baseline token gen tps | token gen change%
-|-|-|-|-|-|-
16 |136.1|151.4|-10.1%|26.1|19.9|31.1%
32 |150.0|157.8|-4.9%|34.3|29.3|17.0%
64 |165.7|156.6|5.8%|38.7|30.7|26.0%
128 |180.4|156.6|15.1%|40.2|34.7|15.8%
256 |181.3|158.0|14.7%|41.6|36.6|13.6%

avx512vnni symmetric
blklen | updated prompt tps | baseline prompt tps | prompt tps change% | updated token gen tps | baseline token gen tps | token gen change%
-|-|-|-|-|-|-
16 |143.4|155.4|-7.7%|25.6|23.3|9.8%
32 |159.2|157.0|1.4%|34.1|29.8|14.4%
64 |182.0|159.5|14.1%|38.4|34.8|10.3%
128 |221.2|160.8|37.5%|41.0|36.4|12.6%
256 |250.5|162.4|54.2%|41.6|37.7|10.3%

avx512vnni asymmetric
blklen | updated prompt tps | baseline prompt tps | prompt tps change% | updated token gen tps | baseline token gen tps | token gen change%
-|-|-|-|-|-|-
16 |142.5|152.3|-6.4%|26.3|19.7|33.5%
32 |158.2|155.0|2.0%|34.3|29.2|17.4%
64 |184.1|156.6|17.5%|38.3|30.9|23.9%
128 |215.8|156.1|17.5%|41.3|35.0|17.9%
256 |249.2|155.9|59.8%|41.1|36.3|13.2%


4-bit GEMM implementation with AVX using tiles.

1.
The tile size is 2 blk by 4. When the size is less than a tile, it reduces to
1 blk by 4, 2 blk by 1 and lastly 1 blk by 1.
In the internal kernel, weights and activations are loaded based on the SIMD
register width and blk length:
avx2 (256-bit registers): 64 weights and activations are loaded.
   blklen16: 4 blks are computed by the internal kernel
   blklen32: 2 blks are computed by the internal kernel
   blklen64: 1 blk is computed by the internal kernel
   blklen128: 1 blk is computed 2 times by the internal kernel
   blklen256: 1 blk is computed 4 times by the internal kernel

avx512 (512-bit registers): 128 weights and activations are loaded.
   blklen16: 8 blks are computed by the internal kernel
   blklen32: 4 blks are computed by the internal kernel
   blklen64: 2 blks are computed by the internal kernel
   blklen128: 1 blk is computed by the internal kernel
   blklen256: 1 blk is computed 2 times by the internal kernel

2.
The blksum is precomputed during prepacking.
The computation is reformed as:
Sum1(scale_a * scale_b * Sum_blk(a_i * b_i)) + Sum2(blksum_a * blksum_b)
  Sum_blk is over one blk
  Sum1 is over all blks for one output
  Sum2 is over all blks for one output
The Sum is computed with sgemm in the current implementation. Further
improvement is possible.
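
One reading of the reformulated accumulation above, in the same notation (Sum_blk over one block; Sum1 and Sum2 over all blocks contributing to one output element):

$$
\underbrace{\sum_{\text{blk}} \text{scale}_a\,\text{scale}_b \sum_{i \in \text{blk}} a_i b_i}_{\text{Sum1}}
\;+\;
\underbrace{\sum_{\text{blk}} \text{blksum}_a \cdot \text{blksum}_b}_{\text{Sum2}}
$$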

 

---------

Signed-off-by: Liqun Fu <[email protected]>
Signed-off-by: liqunfu <[email protected]>
Signed-off-by: Liqun Fu <[email protected]>
### Description
- Supports quantized Conv + Activation on the HTP backend:
- Translates `DQs -> Conv -> Relu/Clip -> Q` into a single QNN Conv
operator if the Relu (or Clip) is redundant.



### Motivation and Context
Expands support for QDQ models created with tools that do not wrap Relu
or Clip with QDQ nodes.

This PR introduces the `IQnnNodeGroup` class. In the same way that a
`NodeUnit` represents a collection of `Nodes`, a `IQnnNodeGroup` can
represent one or more `NodeUnits` that are translated into a QNN
operator. QNN EP parses the ONNX graph to create a list of
`IQnnNodeGroup` objects, each representing a single `NodeUnit` or a
fusion of multiple `NodeUnits`.
### Description
Added the cuDNN frontend and used it for NHWC convolutions, optionally
fusing the activation.

#### Backward compatibility
- Models that already contain FusedConv can still run.
- If ORT is built with cuDNN 8, the cuDNN frontend will not be built into
the binary; the old kernels (using cuDNN backend APIs) are used.

#### Major Changes
- For cuDNN 9, the cuDNN frontend is used to fuse convolution and
bias when the provider option `fuse_conv_bias=1` is set.
- Remove the FusedConv fusion from the graph transformer for the CUDA
provider, so FusedConv will no longer be added to the graph for the CUDA EP
in the future.
- Update the cmake files for the cuDNN settings. The search order for the
cuDNN installation during the build is as follows:
  * environment variable `CUDNN_PATH`
  * `onnxruntime_CUDNN_HOME` cmake extra define. If a build starts from
build.py/build.sh, the user can pass it through the `--cudnn_home` parameter,
or via the environment variable `CUDNN_HOME` if `--cudnn_home` is not used.
  * the cudnn python package installation directory, like
python3.xx/site-packages/nvidia/cudnn
  * the CUDA installation path

#### Potential Issues

- If ORT is built with cuDNN 8, the FusedConv fusion is no longer done
automatically, so some models might see a performance regression. If users
still want the FusedConv operator for performance reasons, there are several
workarounds: use an older version of onnxruntime, or use an older version of
ORT to save the optimized ONNX model and then run it with the latest version
of ORT. We believe the majority of users will have moved to cuDNN 9 by the
1.20 release (since the default in ORT and PyTorch will have been cuDNN 9 for
3 months by then), so the impact is small.
- The cuDNN graph uses TF32 by default, and the user cannot disable TF32
through the use_tf32 CUDA provider option. If a user encounters an accuracy
issue (like in testing), they have to set the environment variable
`NVIDIA_TF32_OVERRIDE=0` to disable TF32. The use_tf32 documentation needs to
be updated later.

#### Follow ups
This is one of the PRs that aim to enable NHWC convolution in the CUDA EP by
default when the device supports it. Other changes will follow to make that
possible.
(1) Enable `prefer_nhwc` by default for device with sm >= 70. 
(2) Change `fuse_conv_bias=1` by default after more testing.
(3) Add other NHWC operators (like Resize or UpSample).

### Motivation and Context

The new CUDNN Frontend library provides the functionality to fuse
operations and provides new heuristics for kernel selection. Here it
fuses the convolution with the pointwise bias operation. On the [NVIDIA
ResNet50](https://pytorch.org/hub/nvidia_deeplearningexamples_resnet50/)
we get a performance boost from 49.1144 ms to 42.4643 ms per inference
on a 2560x1440 input (`onnxruntime_perf_test -e cuda -I -q -r 100 -d 1 -i
'prefer_nhwc|1' resnet50.onnx`).
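
For reference, a minimal Python sketch of enabling these options (option names taken from this PR and the perf test command above; the model path is a placeholder):

```python
import onnxruntime as ort

providers = [
    ("CUDAExecutionProvider", {
        "prefer_nhwc": "1",     # use the NHWC convolution path
        "fuse_conv_bias": "1",  # let the cuDNN frontend fuse Conv + bias (cuDNN 9 builds)
    }),
    "CPUExecutionProvider",
]
sess = ort.InferenceSession("resnet50.onnx", providers=providers)
```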

---------

Co-authored-by: Tianlei Wu <[email protected]>
Co-authored-by: Maximilian Mueller <[email protected]>
Add a check of node.InputDefs()[2]->Exists() for Layernorm bias (Follow up https://github.com/microsoft/onnxruntime/pull/21528/files#r1694026327)

Format the file: break long lines to stay within the 120-char limit.
### Description
<!-- Describe your changes. -->
Changes to add setting the external data path for model weight files.
Additional fixes to ensure this compiles against the latest v1.19
onnxruntime.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Separate weight files used for larger models (like Stable Diffusion) are the
motivation for this change set.

---------

Co-authored-by: Jeff Daily <[email protected]>
Co-authored-by: Artur Wojcik <[email protected]>
Co-authored-by: Ted Themistokleous <[email protected]>
### Description
WebNN only supports test mode, so we don't care about the other inputs or
attributes related to training mode; use WebNN's identity op to implement the
Dropout op directly.
### Description
Several tests result in segfaults during the minimal cuda build.
Although test failures are expected due to the limitation of the minimal
cuda EP, failing gracefully would be much preferred.



### Motivation and Context
To reproduce:
1. Build ORT with:
```bash
./build.sh --build_shared_lib --use_full_protobuf --cuda_home /usr/local/cuda --cudnn_home /usr/lib/x86_64-linux-gnu/ --tensorrt_home /TensorRT-10.0.1.6 --parallel --skip_tests --skip_submodule_sync --allow_running_as_root --use_tensorrt --cmake_extra_defines onnxruntime_CUDA_MINIMAL=1
```
2. Run `onnxruntime_test_all`
```bash
...
[----------] 1 test from AllocationPlannerTest
[ RUN      ] AllocationPlannerTest.ReusedInputCrossDifferentStreams
Segmentation fault (core dumped)
```
…#21536)

### Description

Refactor framework directory structure for MacOS packages

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Apple started enforcing specific [framework
structure](https://developer.apple.com/library/archive/documentation/MacOSX/Conceptual/BPFrameworks/Concepts/FrameworkAnatomy.html)
for MacOS packages. We need to change how we package for MacOS to follow
the guidelines

Fixes following issue: [Malformed
Framework](microsoft/onnxruntime-swift-package-manager#19
)
Bump up version in main from 1.19.0 to 1.20.0 since the release branch
has been cut.
### Description
<!-- Describe your changes. -->
Add ability to test packaging without rebuilding every time.
Add ability to comment out some platforms/architectures without the
scripts to assemble the c/obj-c packages breaking.
Update a couple of commands to preserve symlinks.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Make debugging packaging issues faster.
Creates correct package for mac-catalyst and doesn't require setting
symlinks via bash script.
### Description
<!-- Describe your changes. -->
update script with cmake 3.30 to unblock EP Perf


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
)

### Description
Fix 2 typos in mlas avx 4bit gemm implementation to call correct vnni
functions under vnni condition



### Motivation and Context
needed for 1.19.0 release

Signed-off-by: liqunfu <[email protected]>
… transient connection exceptions. (#21612)

### Description
Improve the docker commands so that docker image layer caching works.
This can make docker builds faster and more stable.
So far, the A100 pool's system disk is too small to use the docker cache.
We won't use the pipeline cache for docker images, and we remove some legacy
code.

### Motivation and Context
There is often an exception like:
```
64.58 + curl https://nodejs.org/dist/v18.17.1/node-v18.17.1-linux-x64.tar.gz -sSL --retry 5 --retry-delay 30 --create-dirs -o /tmp/src/node-v18.17.1-linux-x64.tar.gz --fail
286.4 curl: (92) HTTP/2 stream 0 was not closed cleanly: INTERNAL_ERROR (err 2)
```
This happens because the onnxruntime pipelines have been sending too many
requests to download Node.js during docker builds, which is the major reason
pipelines are failing now.

In fact, docker image layer caching never worked.
We can always see that the scripts are still running:
```
#9 [3/5] RUN cd /tmp/scripts && /tmp/scripts/install_centos.sh && /tmp/scripts/install_deps.sh && rm -rf /tmp/scripts
#9 0.234 /bin/sh: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
#9 0.235 /bin/sh: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
#9 0.235 /tmp/scripts/install_centos.sh: line 1: !/bin/bash: No such file or directory
#9 0.235 ++ '[' '!' -f /etc/yum.repos.d/microsoft-prod.repo ']'
#9 0.236 +++ tr -dc 0-9.
#9 0.236 +++ cut -d . -f1
#9 0.238 ++ os_major_version=8
....
#9 60.41 + curl https://nodejs.org/dist/v18.17.1/node-v18.17.1-linux-x64.tar.gz -sSL --retry 5 --retry-delay 30 --create-dirs -o /tmp/src/node-v18.17.1-linux-x64.tar.gz --fail
#9 60.59 + return 0
...
```

This PR improves the docker commands to make image layer caching
work. Thus, CI won't send so many redundant requests to download Node.js.
```
#9 [2/5] ADD scripts /tmp/scripts
#9 CACHED

#10 [3/5] RUN cd /tmp/scripts && /tmp/scripts/install_centos.sh && /tmp/scripts/install_deps.sh && rm -rf /tmp/scripts
#10 CACHED

#11 [4/5] RUN adduser --uid 1000 onnxruntimedev
#11 CACHED

#12 [5/5] WORKDIR /home/onnxruntimedev
#12 CACHED
```

### Reference
https://docs.docker.com/build/drivers/

---------

Co-authored-by: Yi Zhang <[email protected]>
### Description
- Update pipelines to use QNN SDK 2.25 by default
- Update ifdef condition to apply workaround for QNN LayerNorm
validation bug to QNN SDK 2.25 (as well as 2.24)



### Motivation and Context
Use the latest QNN SDK
Fix usability checker CoreML config file path. The files got renamed but one place was still referring to the old name.
### Description
<!-- Describe your changes. -->
Improve the speed of combining `per-channel` data by using a single
`np.concatenate` instead of multiple `np.concatenate` calls within a for
loop.
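
The pattern, as a small sketch with stand-in data:

```python
import numpy as np

per_channel_chunks = [np.random.rand(8).astype(np.float32) for _ in range(1024)]  # stand-in data

# Before: concatenating inside the loop copies the growing array on every iteration.
slow = np.empty((0,), dtype=np.float32)
for chunk in per_channel_chunks:
    slow = np.concatenate([slow, chunk])

# After: a single np.concatenate over all chunks.
fast = np.concatenate(per_channel_chunks)
assert np.array_equal(slow, fast)
```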

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Fix the issue #21562

Signed-off-by: duansheng.liu <[email protected]>
### Description
<!-- Describe your changes. -->



### Motivation and Context
To fix whisper test failure
@sophies927 sophies927 requested review from a team as code owners August 7, 2024 20:55
@github-advanced-security

This pull request sets up GitHub code scanning for this repository. Once the scans have completed and the checks have passed, the analysis results for this pull request branch will appear on this overview. Once you merge this pull request, the 'Security' tab will show more code scanning analysis results (for example, for the default branch). Depending on your configuration and choice of analysis tool, future pull requests will be annotated with code scanning analysis results. For more information about GitHub code scanning, check out the documentation.


@github-advanced-security github-advanced-security bot left a comment


PREfast found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

@sophies927 sophies927 merged commit 82f3d3a into update-labeling-workflow Aug 7, 2024
267 of 312 checks passed