
Support block-wise quantization #779

Open
huningxin opened this issue Nov 6, 2024 · 2 comments · May be fixed by #805

Comments

@huningxin
Contributor

Block-wise quantization divides input tensors into smaller blocks that are independently quantized, resulting in faster optimization and high-precision quantization. It is used for popular language models, such as the phi-3 mini int4 quantized model.

Native ML APIs' support

DML: DML_OPERATOR_QUANTIZE and DML_OPERATOR_DEQUANTIZE, introduced in Feature Level 6.3
CoreML: constexpr_blockwise_shift_scale
TFLite: ?

Proposal

No API signature changes are needed relative to @fdwr 's proposal of the dequantizeLinear and quantizeLinear ops.

MLOperand dequantizeLinear(MLOperand input, MLOperand scale, MLOperand zeroPoint, optional MLOperatorOptions options = {});
MLOperand quantizeLinear(MLOperand input, MLOperand scale, MLOperand zeroPoint, optional MLOperatorOptions options = {});

The block_size is an integer implied per dimension by block_size = input_size / scale_size, where input_size % scale_size == 0. zeroPoint and scale should have the same shape.
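
For illustration only (a hedged sketch; the builder variable, operand names, and shapes below are hypothetical, not part of the proposal), block-wise usage would look like:

    // Hypothetical shapes: input is uint8 [1024, 4096]; scale is float32 [1024, 128]
    // and zeroPoint is uint8 [1024, 128], implying block_size = 4096 / 128 = 32 along
    // axis 1 (and block_size = 1 along axis 0, since 1024 / 1024 == 1).
    const dequantized = builder.dequantizeLinear(quantizedWeights, scale, zeroPoint);
    // dequantized has the input's shape [1024, 4096] and the scale's dataType (float32).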

@fdwr
Collaborator

fdwr commented Nov 7, 2024

Thanks for the paper link. I'd be surprised if TFLite didn't have some blockwise support somewhere, but if not, it might need decomposition (e.g. scale and zeroPoint blockwise-expanded up to the input shape via tf.tile, tf.repeat, tf.image.resize, or some other similar function, then dq = (input - zeroPoint) * scale).
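
As a rough sketch of that decomposition, written here against the proposed WebNN builder API rather than TFLite ops, and assuming expandedScale and expandedZeroPoint have already been blockwise-expanded to input.shape (e.g. with the reshape + expand helper shown in the later comment below):

    // dq = (input - zeroPoint) * scale, element-wise, after blockwise expansion.
    const floatInput = builder.cast(input, expandedScale.dataType);
    const floatZeroPoint = builder.cast(expandedZeroPoint, expandedScale.dataType);
    const dq = builder.mul(builder.sub(floatInput, floatZeroPoint), expandedScale);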

aarongable pushed a commit to chromium/chromium that referenced this issue Nov 9, 2024
Block-wise quantization divides input tensors into smaller blocks that
are independently quantized, resulting in faster optimization and
high-precision quantization [1]. It is used for popular language models,
such as the phi-3 mini int4 quantized model [2]. Related WG issue [3]
has been opened for discussion.

This CL first validates the scale and zero point tensors for block-wise
quantization. It also implements block-wise quantization in the
DirectML backend using DML_OPERATOR_QUANTIZE and
DML_OPERATOR_DEQUANTIZE, which are available in FL >= 6.3.

More validation and conformance tests are added to verify the
implementation.

[1]: https://arxiv.org/abs/2110.02861
[2]: https://huggingface.co/microsoft/Phi-3-mini-4k-instruct
[3]: webmachinelearning/webnn#779

Bug: 40206287
Change-Id: I977b0be57deebd7afcae216edc3ddc3818b8c09f
Cq-Include-Trybots: luci.chromium.try:mac14.arm64-blink-rel, mac14-blink-rel, mac15.arm64-blink-rel, mac15-blink-rel, linux-blink-rel
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/5964816
Reviewed-by: Rafael Cintron <[email protected]>
Reviewed-by: ningxin hu <[email protected]>
Commit-Queue: ningxin hu <[email protected]>
Cr-Commit-Position: refs/heads/main@{#1380767}
@inexorabletash linked a pull request Jan 17, 2025 that will close this issue
@fdwr
Collaborator

fdwr commented Feb 27, 2025

One realization while decomposing the emulation for dequantizeLinear in #805 is that it's quite verbose (see below). The entire thing could be reduced to just the dequantizeLinear function by using expand, if expand were augmented to accept any from-shape whose dimensions are integer divisors of the to-shape (or alternatively via nearest-neighbor upsampling, if resample were augmented from 2D to ND, like ONNX Resize). Blockwise expansion seems useful as a standalone concept (independent of quantization), at least for clean decomposability. Will create a separate issue...
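
For comparison, a hypothetical sketch of that reduced form, assuming expand were relaxed to repeat elements along any axis where the target size is an integer multiple of the source size (not current WebNN behavior):

    // Hypothetical: relies on an augmented expand() that repeats each element
    // targetSize / sourceSize times along every axis where the target size is
    // an integer multiple of the source size.
    function dequantizeLinearViaBlockwiseExpand(builder, input, scale, zeroPoint) {
      const floatInput = builder.cast(input, scale.dataType);
      const floatZeroPoint = builder.cast(zeroPoint, scale.dataType);
      const expandedScale = builder.expand(scale, input.shape);
      const expandedZeroPoint = builder.expand(floatZeroPoint, input.shape);
      return builder.mul(builder.sub(floatInput, expandedZeroPoint), expandedScale);
    }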

Current emulation code 🥲:

    function dequantizeLinear(builder, input, scale, zeroPoint, options) {
      // output = (input - zeroPoint) * scale
      const floatInput = builder.cast(input, scale.dataType);
      const floatZeroPoint = builder.cast(zeroPoint, scale.dataType);
      const upsampledScale = blockwiseExpand(builder, scale, input.shape);
      const upsampledZeroPoint = blockwiseExpand(builder, floatZeroPoint, input.shape);
      return builder.mul(builder.sub(floatInput, upsampledZeroPoint), upsampledScale);
    }

    function blockwiseExpand(builder, input, targetShape) {
      // This expands each axis by repeating the block the number of times per that axis, given the
      // original input shape and target shape. However, backend implementations might have much more
      // efficient upsampling operators that can accept multiple dimensions to upsample all
      // dimensions at once by integer multiples (like tile) using nearest neighbor resampling:
      // output = resample(scale, {sizes: input.shape})

      let expandedInput = input;

      for (let axis = 0; axis < input.shape.length; ++axis) {
        const inputShape = expandedInput.shape;
        const oldDimensionLength = inputShape[axis];
        const newDimensionLength = targetShape[axis];

        if (newDimensionLength != oldDimensionLength) {
          // Since tile/expand can only accept repetitions of entire dimension slices (not repeating
          // individual elements along an axis), temporarily reshape the tensor to enable them to broadcast
          // the elements up to the full block size, utilizing an inserted dimension of size 1.
          const elementRepeatCount = newDimensionLength / oldDimensionLength;
          const flattenedShape = getFlattenedShapeAroundAxis(inputShape, axis);
          const unexpandedShape = [flattenedShape[0], flattenedShape[1], 1, flattenedShape[2]];
          const expandedShape = [flattenedShape[0], flattenedShape[1], elementRepeatCount, flattenedShape[2]];
          const reshapedInput = builder.reshape(expandedInput, unexpandedShape);
          expandedInput = builder.expand(reshapedInput, expandedShape);
        }

        let newInputShape = [...inputShape];
        newInputShape[axis] = newDimensionLength;
        expandedInput = builder.reshape(expandedInput, newInputShape);
      }

      return expandedInput;
    }

    // Compute the flattened shape before and after the given axis, yielding a 3-element list.
    // e.g. inputShape = [2,3,4,5,6] with axis = 2 yields shape [6,4,30].
    // e.g. inputShape = [4] with axis = 0 yields shape [1,4,1].
    function getFlattenedShapeAroundAxis(inputShape, axis) {
      axis = Math.max(Math.min(axis, inputShape.length - 1), 0);
      const countBefore = inputShape.slice(0, axis).reduce((a, b) => a * b, 1);
      const countAfter = inputShape.slice(axis + 1).reduce((a, b) => a * b, 1);
      return [countBefore, inputShape[axis], countAfter];
    }
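
A minimal usage sketch of the helpers above (hedged: the operand names, shapes, and descriptor fields are illustrative only; block size 32 along the last axis):

    // Illustrative only: a [16, 64] uint8 tensor dequantized with per-block
    // float32 scales of shape [16, 2], i.e. block_size = 64 / 2 = 32 on axis 1.
    const quantizedInput = builder.input('input', {dataType: 'uint8', shape: [16, 64]});
    const scale = builder.input('scale', {dataType: 'float32', shape: [16, 2]});
    const zeroPoint = builder.input('zeroPoint', {dataType: 'uint8', shape: [16, 2]});
    const output = dequantizeLinear(builder, quantizedInput, scale, zeroPoint);
    // output has shape [16, 64] and dataType 'float32'.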
