v3.3-rc

Pre-release
@harrymao2022 released this 19 Sep 18:21
· 105 commits to rls-v3.3 since this release

Performance Optimizations

  • Intel Architecture Processors:
    • Improved performance for 4th generation Intel Xeon Scalable processors (formerly Sapphire Rapids).
    • Improved int8 convolution performance with zero points on processors with Intel AMX instruction set support.
    • Improved performance for future Intel Xeon Scalable processors (code-named Sierra Forest and Granite Rapids). This functionality is disabled by default and can be enabled via CPU dispatcher control.
    • Improved fp32 and int8 convolution performance for cases with small numbers of input channels for processors with Intel AVX-512 and/or Intel AMX instruction set support.
    • Improved s32 binary primitive performance.
    • Improved fp16, fp32, and int8 convolution performance for processors with Intel AVX2 instruction set support.
    • Improved performance of subgraphs with convolution, matmul, avgpool, maxpool, and softmax operations followed by unary or binary operations with Graph API.
    • Improved performance of convolution for depthwise cases with Graph API.
    • [experimental] Improved performance of LLAMA2 MLP block with Graph Compiler.
  • Intel Graphics Products:
    • Improved performance for the Intel Data Center GPU Max Series (formerly Ponte Vecchio).
    • Improved performance for Intel Arc graphics (formerly Alchemist and DG2) and the Intel Data Center GPU Flex Series (formerly Arctic Sound-M).
    • Reduced RNN primitive initialization time on Intel GPUs.
  • AArch64-based Processors:
    • Improved fp32 to bf16 reorder performance.
    • Improved max pooling performance with Arm Compute Library (ACL).
    • Improved dilated convolution performance for depthwise cases with ACL.
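
The CPU dispatcher control mentioned above for Sierra Forest and Granite Rapids is usually driven by the ONEDNN_MAX_CPU_ISA environment variable, which the library reads once, on first use. The sketch below sets it from C++ before any oneDNN call is made. The helper name request_max_cpu_isa is hypothetical, and the value AVX512_CORE_AMX_FP16 used in the usage note is an assumption taken from the dispatcher documentation; check the list of accepted values for your build. setenv is POSIX, so this would need adapting on Windows.

```cpp
#include <stdlib.h> // POSIX setenv
#include <cstdlib>
#include <cstring>

// Hypothetical helper, not part of the oneDNN API.
// Sets ONEDNN_MAX_CPU_ISA for the current process. It only has an effect if
// it runs before the first oneDNN call, because the library caches the value.
bool request_max_cpu_isa(const char *isa) {
    if (isa == nullptr || std::strlen(isa) == 0) return false;
    // The last argument (1) overwrites any value inherited from the shell.
    return setenv("ONEDNN_MAX_CPU_ISA", isa, 1) == 0;
}
```

Calling request_max_cpu_isa("AVX512_CORE_AMX_FP16") at the very start of main() should be roughly equivalent to exporting the variable in the shell before launching the application.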

Functionality

  • Introduced group normalization primitive support. The functionality is currently available on CPUs.
  • Intel CPUs:
    • Introduced support for zero points in int8 convolution with groups and 3D spatial dimensions.
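
For readers new to the operation, group normalization splits the channels of each sample into groups and normalizes every group by its own mean and variance. The following is a plain C++ reference sketch of that arithmetic, not the oneDNN primitive or its API. The function name group_norm_ref, the [N][C][S] layout, and the omission of scale and shift are assumptions made for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical reference helper, not part of the oneDNN API.
// src is laid out as [N][C][S], where S is the flattened spatial size.
// Returns an empty vector if the arguments are inconsistent.
std::vector<float> group_norm_ref(const std::vector<float> &src,
        std::size_t N, std::size_t C, std::size_t S, std::size_t G,
        float eps) {
    if (G == 0 || C % G != 0 || src.size() != N * C * S) return {};

    std::vector<float> dst(src.size());
    const std::size_t channels_per_group = C / G;
    const std::size_t group_size = channels_per_group * S;
    if (group_size == 0) return dst;

    for (std::size_t n = 0; n < N; ++n) {
        for (std::size_t g = 0; g < G; ++g) {
            // Channels of one group are contiguous in this layout.
            const std::size_t base = (n * C + g * channels_per_group) * S;

            double mean = 0.0;
            for (std::size_t i = 0; i < group_size; ++i)
                mean += src[base + i];
            mean /= static_cast<double>(group_size);

            // Biased variance (divide by the number of elements).
            double var = 0.0;
            for (std::size_t i = 0; i < group_size; ++i) {
                const double d = src[base + i] - mean;
                var += d * d;
            }
            var /= static_cast<double>(group_size);

            const double inv_std = 1.0 / std::sqrt(var + eps);
            for (std::size_t i = 0; i < group_size; ++i)
                dst[base + i]
                        = static_cast<float>((src[base + i] - mean) * inv_std);
        }
    }
    return dst;
}
```

Because each group is normalized independently, two groups whose values differ only by a constant factor produce the same output.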

Usability

  • Extended verbose mode output:
    • Improved diagnostics on engine creation errors.
    • Added information on Graph API calls.
    • Added information on strides for non-dense memory objects.
    • Added values of runtime dimensions.
    • Added an indication that a primitive descriptor was created with the "any" memory format tag.
  • Introduced examples for Graph API.
  • The Graph API constant tensor cache is now disabled by default and must be enabled explicitly by calling dnnl::graph::set_constant_tensor_cache().
  • Reduced oneDNN Graph API memory consumption in certain scenarios.
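
The verbose item on strides applies to memory objects that are not densely packed. As a rough illustration of that distinction, and not of how oneDNN decides what to print, the sketch below compares a set of strides with the ones a dense row-major tensor of the same dimensions would have. Both helper names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helpers, not part of the oneDNN API.

// Strides (in elements) of a densely packed row-major tensor.
std::vector<std::int64_t> dense_strides(const std::vector<std::int64_t> &dims) {
    std::vector<std::int64_t> strides(dims.size());
    std::int64_t running = 1;
    // Walk from the innermost dimension outwards.
    for (std::size_t i = dims.size(); i-- > 0;) {
        strides[i] = running;
        running *= dims[i];
    }
    return strides;
}

// True when the given strides match the dense row-major ones exactly.
// Permuted but otherwise dense layouts are reported as non-dense here.
bool is_dense_row_major(const std::vector<std::int64_t> &dims,
        const std::vector<std::int64_t> &strides) {
    return strides == dense_strides(dims);
}
```

For dimensions 2x3x4, a view whose innermost rows are padded from 4 to 5 elements would have strides 15, 5, 1 and would be reported as non-dense by this check.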

Validation

  • Extended benchdnn performance reporting with primitive creation time.
  • Introduced cold cache mode in benchdnn.

Thanks to these Contributors

This release contains contributions from the project core team as well as Amy Wignall @AmyWignall-arm, @baibeta, Benjamin Taylor @bentaylorhk-arm, Ilya Lavrenov @ilya-lavrenov, Kentaro Kawakami @kawakami-k, Milos Puzovic @milpuz01, Renato Barros Arantes @renato-arantes, @snadampal, @sparkyrider, and Thomas Köppe @tkoeppe. We would also like to thank everyone who asked questions and reported issues.