
Enable ClipQuantFusion exclusively on CPU EP #20627

Merged (1 commit) May 10, 2024

Conversation

@yihonglyu (Contributor) commented May 9, 2024

Motivation and Context

The Intel NPU does not support 16-bit integer quantized operators. Consequently, the execution provider removes the QuantizeLinear/DeQuantizeLinear (Q/DQ) operators from node units and executes the operation as FP16 in the backend. However, if a Clip operator was fused into a Q operator in a node unit, removing the Q/DQ operators introduces inaccuracies because the effect of the original Clip operator is lost.

Consider the following example:

  • FP32 model: -> Op_FP32 -> Clip ->
  • QDQ model: -> (DQ -> Op_FP32 -> Q) -> (DQ' -> Clip -> Q') ->
  • After ClipQuantFusion: -> (DQ -> Op_FP32 -> Q) -> (DQ' -> Q') ->
  • Intel Execution Provider strips Q/DQ: -> Op_FP16 ->

To solve this issue, we have enabled ClipQuantFusion exclusively on the CPU execution provider.
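The reason the fusion is valid in the first place (as long as the Q/DQ pair survives) can be sketched with standard affine quantization. This is a minimal illustration, not ONNX Runtime's actual implementation: when the Clip bounds are implied by the quantization range, the Q node's saturation reproduces the Clip exactly, so the Clip can be folded away. If the Q/DQ pair is later stripped, that saturation disappears too, which is precisely the accuracy bug described above.

```python
import numpy as np

def quantize(x, scale, zero_point, qmin=0, qmax=255):
    """Standard affine quantization: scale, shift, then saturate."""
    q = np.round(x / scale) + zero_point
    return np.clip(q, qmin, qmax).astype(np.uint8)

def dequantize(q, scale, zero_point):
    return (np.float32(q) - zero_point) * scale

# Clip -> Q can be folded when the representable real range
# [dequantize(qmin), dequantize(qmax)] lies within [clip_min, clip_max].
# Example: Clip to [0, 6] (Relu6) followed by Q with scale = 6/255 and
# zero_point = 0, whose representable range is exactly [0, 6].
scale, zp = 6.0 / 255.0, 0
assert dequantize(0, scale, zp) >= 0.0 and dequantize(255, scale, zp) <= 6.0

x = np.linspace(-2.0, 8.0, 101, dtype=np.float32)
fused   = quantize(x, scale, zp)                      # Clip folded away
unfused = quantize(np.clip(x, 0.0, 6.0), scale, zp)   # original graph
assert np.array_equal(fused, unfused)
```

The equivalence holds only because `quantize` saturates; executing the fused graph as plain FP16 with the Q/DQ pair removed would leave inputs outside [0, 6] unclipped.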

@yihonglyu yihonglyu changed the title Enable ClipQuantFusion on cpu only Enable ClipQuantFusion exclusively on CPU execution provider May 9, 2024
@yihonglyu yihonglyu changed the title Enable ClipQuantFusion exclusively on CPU execution provider Enable ClipQuantFusion exclusively on CPU EP May 9, 2024
@yihonglyu yihonglyu marked this pull request as ready for review May 9, 2024 22:56
@yihonglyu yihonglyu merged commit 49d197a into main May 10, 2024
95 checks passed
@yihonglyu yihonglyu deleted the yilyu/clip-quant-fusion-on-cpu-only branch May 10, 2024 23:07
poweiw pushed a commit to poweiw/onnxruntime that referenced this pull request Jun 25, 2024
adrianlizarraga added a commit that referenced this pull request Jul 19, 2024
### Description
Moves the `Relu -> QuantizeLinear` fusion to Level2 optimizations for
CPU EP only.

### Motivation and Context
See the related PR for motivation and context:
#20627
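The `Relu -> QuantizeLinear` fusion moved by the follow-up PR relies on an analogous argument. As a hedged sketch (again using textbook affine quantization, not the actual ONNX Runtime code): folding Relu into Q is exact when the zero point equals `qmin`, because every negative input then saturates to `qmin`, which is the same value Relu followed by Q would produce.

```python
import numpy as np

def quantize(x, scale, zero_point, qmin=0, qmax=255):
    """Standard affine quantization with saturation."""
    q = np.round(x / scale) + zero_point
    return np.clip(q, qmin, qmax).astype(np.uint8)

# With zero_point == qmin, negative inputs saturate to qmin either way,
# so Relu -> Q collapses to a single Q node.
scale, zp = 0.1, 0
x = np.linspace(-3.0, 3.0, 61, dtype=np.float32)
assert np.array_equal(quantize(x, scale, zp),
                      quantize(np.maximum(x, 0.0), scale, zp))
```

As with ClipQuantFusion, the fold is only safe while the Q node (and its saturation) remains in the graph, which is why the fusion was restricted to the CPU EP.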
cloudhan added a commit that referenced this pull request Oct 24, 2024
cloudhan added a commit that referenced this pull request Oct 28, 2024
cloudhan added a commit that referenced this pull request Oct 29, 2024