Cherry-pick for 1.17.1 patch release #19477

Merged
YUNQIUGUO merged 23 commits into rel-1.17.1 from yguo/cherry-pick-1.17.1 on Feb 21, 2024

Conversation

YUNQIUGUO
Contributor

Description

As title.

Motivation and Context

petermcaughan and others added 6 commits February 8, 2024 15:28
### Description
There is a current bug in the BeamSearch implementation of T5, GPT, and
Whisper due to an interaction between two PRs merged in the past 7
months.

The first PR/code change added a GPU implementation of BeamSearchScorer. It
accelerates some operations by executing them on the GPU instead of the CPU.
That change did not use a cudaStream when copying one particular variable from
GPU to CPU (see the nullptr value here:
[[link](https://github.com/microsoft/onnxruntime/blob/b65d3d0a5374daa3bc9272c2c02763a8428660db/onnxruntime/contrib_ops/cpu/transformers/beam_search_impl_t5.h#L213)]).

The second PR/code change was the alteration to utilize a cudaStream to
initialize various memory buffers in BeamSearch (see `stream` included
as the last argument in these allocations
[[link](https://github.com/microsoft/onnxruntime/blob/d1431e1b78fb81bf90fdc58c9118cb011171f387/onnxruntime/contrib_ops/cpu/transformers/beam_search_impl_base.h#L25)]).

Between these two PRs, I believe neither operation used a stream, so they
remained synchronized. Once the latter PR was merged, the copy became
desynchronized from the initialization because the two ran on different streams.

The fix for this is to reintroduce the same stream into the copy
operation added in the first PR.



### Motivation and Context
Because this is a race condition, it does not reproduce reliably on every piece
of hardware or with every script, but when it hits, the bug completely breaks
ORT execution with a BeamSearch model.

---------

Co-authored-by: Peter McAughan <[email protected]>
### Description
- When converting ONNX split sizes to QNN split indices, do not include
the split at index 0. QNN 2.19 assumes index 0 is implicit and throws a
validation error if it is provided (see the sketch after this list).
- Fix bug when using an ONNX Split operator with a `num_outputs`
attribute that does not evenly divide into `shape[axis]`. The ONNX spec
states that the last chunk should be smaller, but QNN EP made the last
chunk larger.
- Fix bug when using an ONNX Split operator with a `split` input. QNN EP
was incorrectly passing the split sizes as split indices without
conversion.
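
To make the two conversions concrete, here is a hypothetical Python sketch of the logic described above (the actual change lives in the C++ QNN EP; the helper names are made up):

```python
def split_sizes_to_qnn_indices(split_sizes):
    """Convert ONNX split sizes to QNN split indices, omitting the implicit index 0."""
    indices = []
    offset = 0
    for size in split_sizes[:-1]:  # the final size is implied by the axis length
        offset += size
        indices.append(offset)
    return indices  # e.g. [3, 3, 4] -> [3, 6]


def num_outputs_to_split_sizes(axis_len, num_outputs):
    """ONNX num_outputs semantics: equal chunks, with the last chunk smaller if uneven."""
    chunk = -(-axis_len // num_outputs)  # ceiling division
    return [chunk] * (num_outputs - 1) + [axis_len - chunk * (num_outputs - 1)]  # e.g. (7, 3) -> [3, 3, 1]
```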

### Motivation and Context
QNN SDK 2.19 updated the validation criteria for Split operators. QNN EP was
previously passing a split index that should have been implicit.

While fixing that, bugs were also discovered in the handling of the
`num_outputs` attribute and the `split` input.
### Description
This change
55a6694
didn't take external data into account when unpacking initializers, and
therefore crashes when trying to unpack them.
### Description
Adds type/shape inferencing support for MSFT domain QuantizeLinear and
DequantizeLinear operators to symbolic_shape_infer.py


### Motivation and Context
Need a way to infer the types and shapes of Q/DQ ops in models that use
the MSFT domain versions (e.g., int16 quantization).
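
For reference, a minimal usage sketch (assuming the script's Python API as exposed in current ORT releases; the model paths are placeholders):

```python
import onnx
from onnxruntime.tools.symbolic_shape_infer import SymbolicShapeInference

# Run symbolic shape inference on a model containing com.microsoft
# QuantizeLinear/DequantizeLinear nodes (e.g. int16 Q/DQ).
model = onnx.load("model_int16_qdq.onnx")
inferred = SymbolicShapeInference.infer_shapes(model, auto_merge=True)
onnx.save(inferred, "model_int16_qdq.inferred.onnx")
```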
### Description
Updates qdq quantization to ensure the final model has the
`com.microsoft` opset import if the model uses Q/DQ ops with the
`com.microsoft` domain (e.g., for int16 quantization)


### Motivation and Context
Need to ensure the MSFT domain is correctly set for all relevant cases.
Otherwise, shape inferencing tools will raise an exception.
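
A rough sketch of the idea (not the quantizer's actual code; the helper name is made up):

```python
import onnx

def ensure_msft_opset_import(model: onnx.ModelProto, version: int = 1) -> None:
    """Add the com.microsoft opset import if any node uses that domain."""
    uses_msft = any(node.domain == "com.microsoft" for node in model.graph.node)
    declared = any(imp.domain == "com.microsoft" for imp in model.opset_import)
    if uses_msft and not declared:
        model.opset_import.append(onnx.helper.make_opsetid("com.microsoft", version))
```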
### Description
Only set thread affinity automatically on Server SKUs. Auto affinity applies
when the API user does not specify thread settings or affinity themselves.

### Motivation and Context
On client systems it is best to let the OS scheduler handle thread placement. On
big (P-Core) / little (E-Core) CPU designs, affinity overrides Win32 Quality of
Service (QoS) and increases power usage. Specifically, for background workloads
whose process is tagged QoS Utility (Background), this affinity setting
overrides the OS scheduler, which wants to schedule only on the E-Cores. The
P-Cores therefore wake up and use more energy than intended on client devices,
and users get less battery life.

Foreground AI workloads would be tagged QoS High and would run the ORT
threads on all cores.
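
For context, a minimal sketch of what explicitly specifying thread settings looks like from the Python API, which opts out of the automatic affinity path described above. The affinity config key and its format are assumptions based on ORT's documented session options and may differ across versions:

```python
import onnxruntime as ort

so = ort.SessionOptions()
so.intra_op_num_threads = 4  # explicit thread count => no longer "auto"
# Optionally pin the worker threads yourself (key/format assumed from ORT docs).
so.add_session_config_entry("session.intra_op_thread_affinities", "1;2;3")
sess = ort.InferenceSession("model.onnx", sess_options=so)  # model path is a placeholder
```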
PatriceVignola
PatriceVignola previously approved these changes Feb 8, 2024
@fs-eire
Contributor

fs-eire commented Feb 8, 2024

Could you please help to cherry-pick this one: #19458

@YUNQIUGUO
Contributor Author

Could you please help to cherry-pick this one: #19458

Yes, that one had not been merged yet, so it was not included. Will cherry-pick it now.

### Description

Since TypeScript v4.7, types need to be specified inside the "exports" field
when it is available. This PR appends "types" just before each "default" entry
(which the spec requires to be the last item).

Fixes #19403.
@ivberg
Contributor

ivberg commented Feb 9, 2024

Mind grabbing #19397 ?

Add capturestate / rundown ETW support logging for session and provider options (#19397)

### Description
Add capturestate / rundown ETW support logging for session and provider
options.

### Motivation and Context
Follow-up to #16259 and #18882

This is very useful when you have longer-running ONNX sessions, which will be
the case for a lot of AI workloads. That means ETW tracing may start minutes or
hours after a process and session have been established. When a trace is
captured, you want to know the state of ONNX Runtime at that time. For ORT, that
state consists of the session and config options, so these are now logged and
show up in the trace.

Tested with xperf and ORT:

```
xperf -start ort -on 3a26b1ff-7484-7484-7484-15261f42614d
xperf -capturestate ort 3a26b1ff-7484-7484-7484-15261f42614d   <-- run this after the session has been up for some time
xperf -stop ort -d .\ort.etl                                   <-- the trace will now also have rundown events
```

These events also show up if you use WPR with [CaptureStateOnSave
](https://learn.microsoft.com/en-us/windows-hardware/test/wpt/capturestateonsave).
ivberg
ivberg previously approved these changes Feb 9, 2024
@PatriceVignola
Contributor

@YUNQIUGUO When is the deadline for 1.17.1 changes? I have a blocking regression that I may have a fix for by tomorrow.

adrianlizarraga
adrianlizarraga previously approved these changes Feb 9, 2024
petermcaughan
petermcaughan previously approved these changes Feb 9, 2024
@YUNQIUGUO
Contributor Author

YUNQIUGUO commented Feb 9, 2024

@YUNQIUGUO When is the deadline for 1.17.1 changes? I have a blocking regression that I may have a fix for by tomorrow.

Tomorrow by EOD should be fine.

snnn
snnn previously approved these changes Feb 9, 2024
@YUNQIUGUO
Contributor Author

YUNQIUGUO commented Feb 10, 2024

@PatriceVignola it's approaching EOD. Just checking in - is the PR ready and merged yet?

If not, I will go ahead and merge this cherry-pick PR for now, and maybe open a second round of cherry-picks if necessary. I think we are waiting on another two PRs too.

BTW, please share the PR link and tag it with release 1.17.1 once it's ready. Thanks!

@PatriceVignola
Contributor

@PatriceVignola it's approaching EOD. Just checking in - is the PR ready and merged yet?

If not, I will go ahead and merge this cherry-pick PR for now, and maybe open a second round of cherry-picks if necessary. I think we are waiting on another two PRs too.

BTW, please share the PR link and tag it with release 1.17.1 once it's ready. Thanks!

The PR is here: #19481

It was sporadically failing pipelines unrelated to the change and I had to restart them.

@PatriceVignola
Contributor

@YUNQIUGUO The PR has been merged into main and can be cherry-picked.

There's currently a bug in the allocation planner when buffers are reused and
more than one stream is in use, which makes it possible (although rare) to reach
a reference count of 0 for a buffer that is still being used. Since DML doesn't
benefit from multiple streams, disabling them is the safest option for now.

This is a high priority issue that we need to fix for 1.17.1 since it
breaks stable diffusion. Identifying the perfect fix and fixing the
underlying issue would be too risky for a patch release, especially
given the limited time that we have.

#19480
As per title, fixes
#19418

ONNX Runtime 1.17 broke the quantization of ONNX models with subgraphs in which
initializers are placed on the top-level graph while different subgraphs use the
same initializer.
@smk2007
Member

smk2007 commented Feb 14, 2024

Please add #19483

smk2007 and others added 3 commits February 14, 2024 10:47
### Description
Limit SoC core detection via the 2-level cache core logic to Intel and hybrid
processors.

### Motivation and Context
This code was added to support a new class of CPU cores present in Intel's
next-generation Intel Core Ultra mobile processors. It is essential for avoiding
thread placement on low-performing SoC cores that don't have L3 cache. SoC cores
are meant to specialize in system bring-up and to help improve responsiveness
and power usage; in other words, they are not meant to run compute-heavy AI
workloads. To avoid broad exposure of this logic, it is currently restricted to
Intel platforms that have hybrid enabled.

---------

Co-authored-by: Sheil Kumar <[email protected]>
### Description

Add ATen fallback support for bicubic interpolation algorithm.

### Motivation and Context

Required for facebook/dinov2 model architecture as part of ONNX Runtime
integration with AML Vision models.
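
For reference, a small PyTorch snippet showing the interpolation mode this change covers; it is illustrative only and not taken from the PR:

```python
import torch
import torch.nn.functional as F

# Bicubic interpolation, the op that ORTModule training can now export
# through the ATen fallback. Shapes here are arbitrary.
x = torch.randn(1, 3, 16, 16)
y = F.interpolate(x, scale_factor=2.0, mode="bicubic", align_corners=False)
print(y.shape)  # torch.Size([1, 3, 32, 32])
```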
@smk2007
Member

smk2007 commented Feb 16, 2024

Please add #19475. TY

smk2007 and others added 3 commits February 16, 2024 10:06
**Description**
1) During session initialization, KahnsTopologicalSort is a major cause of perf
degradation. The main cause of the slowdown is that the topological sort needs
to keep track of nodes to visit in order and reorder them by priority (as
informed by a comparator). The existing implementation uses a
std::priority_queue backed by a std::vector container, but vectors are poor at
insertion and reordering. The appropriate data structure for this operation is a
linked list; however, linked lists such as std::list cannot back a
std::priority_queue, because std::priority_queue requires random access, which
linked lists do not provide. For this simple case we can instead use a std::list
directly and perform insertions manually with std::upper_bound (see the sketch
after this list). This drastically reduces the time taken by the method, which
previously caused numerous re-copies and a lot of element movement inside the
"nodes to visit" list.

2) In the comparator, I hide forward and backward attribute checking
behind the #ifdef ENABLE_TRAINING macro, as I believe it should only be
valid in the training scenario.

3) In the NoopElimination transformer, I avoid creating an Initializer (which
unpacks TensorProto data) for every node and only create initializers when
Add/Sub/Mul/Div op nodes are detected.
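
As a rough illustration of point 1 (the actual change is in ORT's C++ graph code, and all names below are made up), here is a Python sketch of Kahn's algorithm that keeps the ready list ordered by priority through sorted insertion instead of re-heapifying a vector-backed priority queue:

```python
import bisect

def kahns_sort(nodes, edges, priority):
    """Topological sort; ready nodes are kept ordered by `priority` via sorted insertion."""
    indegree = {n: 0 for n in nodes}
    successors = {n: [] for n in nodes}
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1

    ready = sorted((n for n in nodes if indegree[n] == 0), key=priority)
    order = []
    while ready:
        node = ready.pop(0)  # node with the smallest priority key
        order.append(node)
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                keys = [priority(r) for r in ready]
                ready.insert(bisect.bisect_right(keys, priority(nxt)), nxt)
    return order
```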

**Motivation and Context**
Session creation time of many models is quite slow.

---------

Co-authored-by: Sheil Kumar <[email protected]>
Build from source and run the command below.

Example, converting whisper-base:

```
python -m onnxruntime.transformers.models.whisper.convert_to_onnx -m openai/whisper-base --model_impl openai -e -o -w --chain_model --output ./demo
```
This PR updates exporting and running the Whisper model with beam search
by adding the following.

- Adds temperature as a graph input to the exported model
- Fixes the token ids by adding them as attributes to
`WhisperBeamSearch`
- Fixes the timestamps test cases so they pass now
- Fixes a bug with invoking `torch.onnx.export`
- Cleans up the Whisper scripts and groups the arguments in
`convert_to_onnx.py`
- Adds a `requirements.txt` file to specify package dependencies
- Adds `whisper-large-v3` to list of pretrained models
- Fixes a bug with missing cross-attention KV cache inputs in the
decoder subgraph

- This is a follow-up to [this
PR](#19188).
- The incorrect token ids in the timestamps processor were first noticed
during [this PR
review](#17500 (comment)).
When they were originally added in [this
PR](#15853), the offsets
were previously constant across the Whisper model sizes. When comparing
the new `whisper-large-v3` variant, the English-only variants (e.g.
`whisper-tiny.en`), and the original variants (e.g. `whisper-tiny`),
both the values and the offsets differ. Therefore, it is easier to set
the token ids as attributes to `WhisperBeamSearch` when exporting to
ensure the right values are used in the timestamps processor.
- The Hugging Face API for returning timestamps and the expected outputs
from the PyTorch model have both changed.
- The fix for `torch.onnx.export` is a follow-up to [this PR
review](#17179 (comment)).
- The argument grouping is a follow-up to [this PR
review](#17500 (comment)).
- Specific package versions are needed to run the Whisper scripts and
the `requirements.txt` file ensures that these versions are installed.
- The `whisper-large-v3` variant is released and should be in the list
of official pretrained models.
- After the changes from [this
PR](#17316), the exported
model is not loading in an ORT inference session because the
cross-attention KV cache inputs are missing in the decoder subgraph.
@YUNQIUGUO
Contributor Author

So far, the cherry-picks for the patch should all be in here. I had to resolve some merge conflicts when merging #19509, so please review that it is still correct with what's included in this PR (specifically the two files whisper_chain.py and whisper_helper.py). Thanks! cc @kunal-vaishnavi

@fs-eire
Contributor

fs-eire commented Feb 20, 2024

Seems #19274 is marked for 1.17.1 but still waiting for approval.

This pull request includes modifications to the `c-api-cpu.yml` Azure
Pipelines configuration file. The changes mainly revolve around the
Node.js packaging stage and the handling of Node.js artifacts. The most
significant changes include renaming the Node.js packaging stage, adding
a new dependency to the stage, changing artifact names, adding a new
script to list Node.js artifacts, and updating the source folder for
copying NuGet binaries.

Changes in Node.js packaging:

*
[`tools/ci_build/github/azure-pipelines/templates/c-api-cpu.yml`](diffhunk://#diff-00815920cc190d10fdebceac0c3a4b8a59e408684ae38177dfe7f96cae276c59L503-R508):
Renamed the Node.js packaging stage from `Nodejs_Packaging_CPU` to
`Nodejs_Packaging` and added `Windows_CI_GPU_DML_Dev` as a new
dependency to the stage.

Changes in handling of Node.js artifacts:

*
[`tools/ci_build/github/azure-pipelines/templates/c-api-cpu.yml`](diffhunk://#diff-00815920cc190d10fdebceac0c3a4b8a59e408684ae38177dfe7f96cae276c59L568-R569):
Changed the artifact name from `drop-onnxruntime-nodejs-win-x64` to
`drop-onnxruntime-nodejs-win-x64-dml` in the task to download pipeline
artifacts for Windows x64.
*
[`tools/ci_build/github/azure-pipelines/templates/c-api-cpu.yml`](diffhunk://#diff-00815920cc190d10fdebceac0c3a4b8a59e408684ae38177dfe7f96cae276c59R595-R598):
Added a new script to list Node.js artifacts from the directory
`$(Build.BinariesDirectory)/nodejs-artifacts/win32/x64/`.
*
[`tools/ci_build/github/azure-pipelines/templates/c-api-cpu.yml`](diffhunk://#diff-00815920cc190d10fdebceac0c3a4b8a59e408684ae38177dfe7f96cae276c59L635-R640):
Updated the source folder from
`$(Build.BinariesDirectory)\RelWithDebInfo\RelWithDebInfo\nuget-artifacts\onnxruntime-win-x64\lib`
to `$(Build.BinariesDirectory)\nodejs-artifacts\win32\x64` in the task
to copy NuGet binaries to the directory
`$(Build.SourcesDirectory)\js\node\bin\napi-v3\win32\x64`.

---------

Co-authored-by: Yulong Wang <[email protected]>
@YUNQIUGUO
Contributor Author

Seems #19274 is marked for 1.17.1 but still waiting for approval.

Thanks for mentioning this. Added the commit.

…antize (#19455)

### Description
The current quantization tool relies on shape inference to provide the type of
every intermediate tensor, so that the tool knows which type it must dequantize
into (float32 or float16). However, this information is not available if shape
inference fails, which happens every time the model includes an operator from a
custom domain such as com.microsoft.

This PR introduces an extra option `DefaultTensorType` as a fallback when the
quantizer cannot find the type it needs.

### Motivation and Context
This fixes issue #19409.
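
A hedged usage sketch, assuming the new option is passed through `extra_options` like other quantizer settings; the model paths, input name, and calibration reader below are placeholders:

```python
import numpy as np
import onnx
from onnxruntime.quantization import CalibrationDataReader, QuantFormat, QuantType, quantize_static

class RandomReader(CalibrationDataReader):
    """Placeholder calibration reader; feed real calibration data in practice."""
    def __init__(self, n=8):
        self._data = iter([{"input": np.random.rand(1, 3, 224, 224).astype(np.float32)} for _ in range(n)])

    def get_next(self):
        return next(self._data, None)

quantize_static(
    "model_fp32.onnx",
    "model_int8.onnx",
    calibration_data_reader=RandomReader(),
    quant_format=QuantFormat.QDQ,
    activation_type=QuantType.QUInt8,
    weight_type=QuantType.QInt8,
    # Fall back to float32 wherever shape inference cannot determine the tensor type.
    extra_options={"DefaultTensorType": onnx.TensorProto.FLOAT},
)
```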
@smk2007
Member

smk2007 commented Feb 20, 2024

Please add #19574, TY!

…19574)

Disable the __cpuid check on arm64 builds, as the intrinsic is not available there.

Motivation
The __cpuid check was breaking the arm64 build.

Co-authored-by: Sheil Kumar <[email protected]>
@YUNQIUGUO YUNQIUGUO requested review from fs-eire and removed request for fs-eire February 21, 2024 03:17
…#19577)

### Description
Accept the command-line option --symmetric and its optional value correctly. If
the optional value matches 'True' case-insensitively, set symmetric to True;
otherwise set symmetric to False. Asymmetric quantization will generate a
zero_point input.
```
usage: matmul_4bits_quantizer.py [-h] --input_model INPUT_MODEL --output_model OUTPUT_MODEL [--block_size BLOCK_SIZE] [--symmetric [{True,False}]] [--accuracy_level ACCURACY_LEVEL] [-v]
                                 [--nodes_to_exclude NODES_TO_EXCLUDE [NODES_TO_EXCLUDE ...]]
```
@YUNQIUGUO YUNQIUGUO merged commit 75968b9 into rel-1.17.1 Feb 21, 2024
107 of 172 checks passed
@YUNQIUGUO YUNQIUGUO deleted the yguo/cherry-pick-1.17.1 branch February 21, 2024 20:51