
Support block-wise quantization #779

Open
huningxin opened this issue Nov 6, 2024 · 2 comments · May be fixed by #805

Comments

@huningxin
Contributor

Block-wise quantization divides input tensors into smaller blocks that are independently quantized, resulting in faster optimization and high-precision quantization. It is used for popular language models, such as the phi-3 mini int4 quantized model.

Native ML APIs' support

DML: DML_OPERATOR_QUANTIZE and DML_OPERATOR_DEQUANTIZE, introduced in Feature Level 6.3
CoreML: constexpr_blockwise_shift_scale
TFLite: ?

Proposal

No API signature changes are needed relative to @fdwr 's proposal of the dequantizeLinear and quantizeLinear ops.

MLOperand dequantizeLinear(MLOperand input, MLOperand scale, MLOperand zeroPoint, optional MLOperatorOptions options = {});
MLOperand quantizeLinear(MLOperand input, MLOperand scale, MLOperand zeroPoint, optional MLOperatorOptions options = {});

The block_size is an integer implied per dimension by block_size = input_size / scale_size, where input_size % scale_size == 0. zeroPoint and scale should have the same shape.
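
For illustration only (a hedged sketch; the builder variable, operand names, and shapes below are hypothetical, not part of the proposal), block-wise usage would look like:

    // Hypothetical shapes: input is uint8 [1024, 4096]; scale is float32 [1024, 128]
    // and zeroPoint is uint8 [1024, 128], implying block_size = 4096 / 128 = 32 along
    // axis 1 (and block_size = 1 along axis 0, since 1024 / 1024 == 1).
    const dequantized = builder.dequantizeLinear(quantizedWeights, scale, zeroPoint);
    // dequantized has the input's shape [1024, 4096] and the scale's dataType (float32).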

@fdwr
Collaborator

fdwr commented Nov 7, 2024

Thanks for the paper link. I'd be surprised if TFLite didn't have some blockwise support somewhere, but if not, it might need decomposition (e.g. scale and zeroPoint blockwise-expanded up to the input shape via tf.tile, tf.repeat, tf.image.resize, or some other similar function, then dq = (input - zeroPoint) * scale).
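
As a rough sketch of that decomposition, written here against the proposed WebNN builder API rather than TFLite ops, and assuming expandedScale and expandedZeroPoint have already been blockwise-expanded to input.shape (e.g. with the reshape + expand helper shown in the later comment below):

    // dq = (input - zeroPoint) * scale, element-wise, after blockwise expansion.
    const floatInput = builder.cast(input, expandedScale.dataType);
    const floatZeroPoint = builder.cast(expandedZeroPoint, expandedScale.dataType);
    const dq = builder.mul(builder.sub(floatInput, floatZeroPoint), expandedScale);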

aarongable pushed a commit to chromium/chromium that referenced this issue Nov 9, 2024
Block-wise quantization divides input tensors into smaller blocks that
are independently quantized, resulting in faster optimization and
high-precision quantization [1]. It is used for popular language models,
such as the phi-3 mini int4 quantized model [2]. Related WG issue [3]
has been opened for discussion.

This CL first validates the scale and zero point tensors for block-wise
quantization. It also implements block-wise quantization in the
DirectML backend using DML_OPERATOR_QUANTIZE and
DML_OPERATOR_DEQUANTIZE, which are available in FL >= 6.3.

More validation and conformance tests are added to verify the
implementation.

[1]: https://arxiv.org/abs/2110.02861
[2]: https://huggingface.co/microsoft/Phi-3-mini-4k-instruct
[3]: webmachinelearning/webnn#779

Bug: 40206287
Change-Id: I977b0be57deebd7afcae216edc3ddc3818b8c09f
Cq-Include-Trybots: luci.chromium.try:mac14.arm64-blink-rel, mac14-blink-rel, mac15.arm64-blink-rel, mac15-blink-rel, linux-blink-rel
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/5964816
Reviewed-by: Rafael Cintron <[email protected]>
Reviewed-by: ningxin hu <[email protected]>
Commit-Queue: ningxin hu <[email protected]>
Cr-Commit-Position: refs/heads/main@{#1380767}
@inexorabletash linked a pull request Jan 17, 2025 that will close this issue
@fdwr
Collaborator

fdwr commented Feb 27, 2025

One realization while decomposing the emulation for dequantizeLinear in #805 is that it's quite verbose (see below). The entire thing could be reduced to just the dequantizeLinear function by using expand, if expand were augmented to accept any from-shape whose dimensions are integer divisors of the to-shape (or alternatively via nearest-neighbor upsampling, if resample were augmented from 2D to ND, like ONNX Resize). Blockwise expansion seems useful as a standalone concept (independent of quantization), at least for clean decomposability. Will create a separate issue...
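
For comparison, a hypothetical sketch of that reduced form, assuming expand were relaxed to repeat elements along any axis where the target size is an integer multiple of the source size (not current WebNN behavior):

    // Hypothetical: relies on an augmented expand() that repeats each element
    // targetSize / sourceSize times along every axis where the target size is
    // an integer multiple of the source size.
    function dequantizeLinearViaBlockwiseExpand(builder, input, scale, zeroPoint) {
      const floatInput = builder.cast(input, scale.dataType);
      const floatZeroPoint = builder.cast(zeroPoint, scale.dataType);
      const expandedScale = builder.expand(scale, input.shape);
      const expandedZeroPoint = builder.expand(floatZeroPoint, input.shape);
      return builder.mul(builder.sub(floatInput, expandedZeroPoint), expandedScale);
    }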

Current emulation code 🥲:

    function dequantizeLinear(builder, input, scale, zeroPoint, options) {
      // output = (input - zeroPoint) * scale
      const floatInput = builder.cast(input, scale.dataType);
      const floatZeroPoint = builder.cast(zeroPoint, scale.dataType);
      const upsampledScale = blockwiseExpand(builder, scale, input.shape);
      const upsampledZeroPoint = blockwiseExpand(builder, floatZeroPoint, input.shape);
      return builder.mul(builder.sub(floatInput, upsampledZeroPoint), upsampledScale);
    }

    function blockwiseExpand(builder, input, targetShape) {
      // This expands each axis by repeating the block the number of times per that axis, given the
      // original input shape and target shape. However, backend implementations might have much more
      // efficient upsampling operators that can accept multiple dimensions to upsample all
      // dimensions at once by integer multiples (like tile) using nearest neighbor resampling:
      // output = resample(scale, {sizes: input.shape})

      let expandedInput = input;

      for (let axis = 0; axis < input.shape.length; ++axis) {
        const inputShape = expandedInput.shape;
        const oldDimensionLength = inputShape[axis];
        const newDimensionLength = targetShape[axis];

        if (newDimensionLength != oldDimensionLength) {
          // Since tile/expand can only accept repetitions of entire dimension slices (not repeating
          // individual elements along an axis), temporarily reshape the tensor to enable them to broadcast
          // the elements up to the full block size, utilizing an inserted dimension of size 1.
          const elementRepeatCount = newDimensionLength / oldDimensionLength;
          const flattenedShape = getFlattenedShapeAroundAxis(inputShape, axis);
          const unexpandedShape = [flattenedShape[0], flattenedShape[1], 1, flattenedShape[2]];
          const expandedShape = [flattenedShape[0], flattenedShape[1], elementRepeatCount, flattenedShape[2]];
          const reshapedInput = builder.reshape(expandedInput, unexpandedShape);
          expandedInput = builder.expand(reshapedInput, expandedShape);
        }

        let newInputShape = [...inputShape];
        newInputShape[axis] = newDimensionLength;
        expandedInput = builder.reshape(expandedInput, newInputShape);
      }

      return expandedInput;
    }

    // Compute the flattened shape before and after the given axis, yielding a 3-element list.
    // e.g. inputShape = [2,3,4,5,6] with axis = 2 yields shape [6,4,30].
    // e.g. inputShape = [4] with axis = 0 yields shape [1,4,1].
    function getFlattenedShapeAroundAxis(inputShape, axis) {
      axis = Math.max(Math.min(axis, inputShape.length - 1), 0);
      const countBefore = inputShape.slice(0, axis).reduce((a, b) => a * b, 1);
      const countAfter = inputShape.slice(axis + 1).reduce((a, b) => a * b, 1);
      return [countBefore, inputShape[axis], countAfter];
    }
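
A minimal usage sketch of the helpers above (hedged: the operand names, shapes, and descriptor fields are illustrative only; block size 32 along the last axis):

    // Illustrative only: a [16, 64] uint8 tensor dequantized with per-block
    // float32 scales of shape [16, 2], i.e. block_size = 64 / 2 = 32 on axis 1.
    const quantizedInput = builder.input('input', {dataType: 'uint8', shape: [16, 64]});
    const scale = builder.input('scale', {dataType: 'float32', shape: [16, 2]});
    const zeroPoint = builder.input('zeroPoint', {dataType: 'uint8', shape: [16, 2]});
    const output = dequantizeLinear(builder, quantizedInput, scale, zeroPoint);
    // output has shape [16, 64] and dataType 'float32'.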
