
Releases: oneapi-src/oneDNN

v3.5-rc

28 May 19:36
Pre-release

This is a release candidate for oneDNN v3.5. Please provide feedback and submit defect reports via GitHub issues.

Performance Optimizations

  • Intel Architecture Processors:

    • Improved performance for 4th generation Intel Xeon Scalable processors (formerly Sapphire Rapids).
    • Improved performance for the future Intel Xeon Scalable processors (code-named Sierra Forest and Granite Rapids).
    • Improved performance of group normalization primitive.
    • Improved performance of matmul primitive with sum post-op for batched cases on processors with Intel AMX instruction set support.
    • Improved performance of the following subgraphs with Graph API:
      • Multi-Query Attention (MQA).
      • Scaled Dot Product Attention (SDPA), including the variant with select operation.
      • LayerNorm + Multiply + Quantize produced by SmoothQuant algorithm.
      • Convolution + Sigmoid + Multiply with mixed precisions.
  • Intel Graphics Products:

    • Improved performance for Processor Graphics based on Xe2 architecture.
    • Improved performance for the Intel Data Center GPU Max Series (formerly Ponte Vecchio).
    • Improved performance for Intel Arc graphics (formerly Alchemist and DG2) and the Intel Data Center GPU Flex Series (formerly Arctic Sound).
    • Improved RNN primitive performance for LSTM cell case.
    • Improved performance of f8_e4m3 data type emulation on Intel Data Center GPU Max Series (formerly Ponte Vecchio).
  • AArch64-based Processors:

    • Improved convolution forward propagation, matmul, and softmax performance for processors with SVE support.
    • Improved bf16 matmul performance with Arm Compute Library (ACL).
    • Improved eltwise primitive performance with gelu_erf algorithm with ACL.

Functionality

  • Introduced sum and binary post-ops support for layer normalization primitive. This functionality is currently implemented on CPUs only.
  • Introduced support for int4 data type and extended quantization model with support for grouped scales and zero points.
  • Introduced fp64 matmul support. This functionality is currently implemented on Intel GPUs only.
  • Extended floating point math mode API to support weight decompression scenarios. See the matmul weights decompression example to get started. The new floating point math mode is supported in the following configurations:
    • bfloat16 matmul with int8 weights on Intel CPUs.
    • float16 and bfloat16 matmul with int8 or int4 weights on Intel GPUs.
  • [experimental] Introduced microkernel API for Intel Architecture Processors. This API exposes internal mechanisms used in matmul and convolution implementations to expert users.
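The weight decompression math mode above can be requested through primitive attributes. The following C++ sketch shows the general shape of such a setup, assuming the `dnnl.hpp` API; the tensor shapes and the matmul configuration are illustrative, not taken from the release notes:

```cpp
// Minimal sketch: bf16 matmul with int8 weights decompressed on the fly.
// Assumes oneDNN headers/library are available; shapes are made up.
#include "dnnl.hpp"

int main() {
    using namespace dnnl;
    engine eng(engine::kind::cpu, 0);

    // Ask the library to apply the bf16 math mode to integer tensors too,
    // i.e. to up-convert (decompress) the int8 weights during execution.
    primitive_attr attr;
    attr.set_fpmath_mode(fpmath_mode::bf16, /*apply_to_int=*/true);

    memory::desc src_md({8, 64}, memory::data_type::bf16, memory::format_tag::ab);
    memory::desc wei_md({64, 32}, memory::data_type::s8, memory::format_tag::ab);
    memory::desc dst_md({8, 32}, memory::data_type::bf16, memory::format_tag::ab);

    matmul::primitive_desc pd(eng, src_md, wei_md, dst_md, attr);
    matmul prim(pd);  // created with on-the-fly weight decompression
    return 0;
}
```

The attribute-based approach keeps the weights stored in int8 while accumulation happens in the floating point type, which is the decompression scenario this release targets.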

Usability

  • Extended error messages for engine and memory object creation errors.
  • Extended verbose mode diagnostics with information on dispatching decisions for all primitives.
  • Introduced support for clang++ host compiler in SYCL builds.
  • Introduced API for tensor serialization and deserialization.
  • Extended verbose mode diagnostics for Graph API with information on pattern matcher decisions.
  • Introduced OpenCL runtime support for Graph API.
  • Added support for building oneDNN with installed Arm Compute Library (ACL).
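The extended verbose diagnostics are enabled through an environment variable rather than code changes. A typical invocation looks like the following; the application name is a placeholder, and the exact filter values are documented in the oneDNN verbose mode guide:

```shell
# Enable all verbose diagnostics, including dispatching decisions,
# for a single run of a oneDNN-based application (app name is a placeholder).
ONEDNN_VERBOSE=all ./my_onednn_app

# The output lists, per primitive, which implementations were considered
# and why lower-priority ones were rejected, which helps explain why a
# particular (possibly slower) kernel was picked.
```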

Validation

  • Extended benchdnn with support for tensor tags in RNN primitive validation.

Thanks to these Contributors

This release contains contributions from the project core team as well as @AngryLoki, Crefeda Rodrigues @cfRod, Daniel Richard G. @iskunk, @deepeshfujitsu, Dylan Angus @dylan-angus-codeplay, Emanuele Rocca @ema, Hernan Martinez @hmartinez82, John Osorio @kala855, Jonathan Deakin @jondea, @kasturedeeksha, Kentaro Kawakami @kawakami-k, Nikita Shulga @malfet, Radu Salavat @Radu2k, Renato Barros Arantes @renato-arantes, Roman Zhukov @rozhukov, Shreyas-fuj @Shreyas-fuj, Sunita Nadampalli @snadampal, Tadej Ciglarič @t4c1, Vineel Abhinav @vineelabhinav, @vishwascm. We would also like to thank everyone who asked questions and reported issues.

v3.4.2

10 May 22:02

This is a patch release containing the following changes to v3.4.1:

  • Fixed performance regression in deconvolution on processors with Intel AVX-512 instruction set (307b35b, f46fffb)
  • Improved performance of batched matmul with binary post-op on processors with Intel AVX-512 instruction set (d39e1b7)
  • Fixed performance regression in softmax with destination memory format set to any on processors with Intel AVX-512 instruction set (756d3cf)
  • Fixed incorrect results in int8 deconvolution with source zero points on processors with Intel AMX instruction set (d5ddbc8)
  • Fixed performance regression in convolution on processors with Intel AVX2 instruction set (2968c89)
  • Improved f8_e4m3 matmul performance on Intel Data Center GPU Max Series (068f850, 668abae, c3972ef, ad94382)
  • Fixed sporadic accuracy issues in bf16 depthwise convolution backpropagation on processors with Intel AVX-512 instruction set (0184044)
  • Fixed primitive creation issue for fp16 pooling backpropagation on Intel GPUs (e4737d9)
  • Fixed failure for subgraphs with int8 matmul operation with experimental Graph Compiler on processors with Intel AMX instruction set (5ebde2e)
  • Fixed assert in experimental Graph Compiler on Windows (f53fbd1, fd903ae)
  • Fixed incorrect results for subgraphs with shuffle operation with experimental Graph Compiler (aef5023)
  • Improved performance of subgraphs involving int8 matmul with experimental Graph Compiler on processors with Intel AMX support (0ca5bc5)
  • Fixed page fault in fp16 matmul primitive on Intel Data Center GPU Max Series (5587f08)
  • Fixed incorrect results in fp32 deconvolution with Arm Compute Library on AArch64 processors (b7694a0)
  • Fixed performance regression in deconvolution on processors with Intel AVX2 instruction set (6f452e2)

v3.4.1

29 Mar 22:27

This is a patch release containing the following changes to v3.4:

  • Fixed an issue with caching and serialization of primitives in deterministic mode (7ed604a)
  • Introduced memory descriptor serialization API (4cad420, 929a27a, 9b848c8)
  • Fixed incorrect results in fp64 convolution and deconvolution on Intel GPUs based on Xe-LPG architecture (ebe77b5, 0b399ac, d748d64, 9f4f3d5, 21a8cae)
  • Fixed incorrect results in reorder with large sizes on Intel CPUs and GPUs (69a111e, 4b72361, 74a343b)
  • Reduced creation time for deconvolution primitive on Intel CPUs (bec487e, 1eab005)
  • Fixed performance regression in deconvolution on Intel CPUs (fbe5b97, 1dd3c6a)
  • Removed dangling symbols from static builds (e92c404, 6f5621a)
  • Fixed crash during platform detection on some AArch64-based systems (406a079)
  • Fixed performance regression in int8 deconvolution on Intel CPUs (7e50e15)
  • Fixed handling of zero points for matmul in verbose logs converter (15c7916)

v3.3.6

18 Mar 15:42

This is a patch release containing the following changes to v3.3.5:

  • Fixed crash during platform detection on some AArch64-based systems (3e0e69b)
  • Improved inner product performance with Arm Compute Library (ACL) (e7abee2, 214fb9e, 8aacc8f)
  • Fixed incorrect results in int8 depthwise convolution with post-ops on processors with Intel AVX2 instruction set support (0c922e0)
  • Fixed performance regression in fp32 convolution on processors with Intel AVX2 instruction set support (4efc0ad)

v3.4

01 Mar 02:11

Performance Optimizations

  • Intel Architecture Processors:

    • Improved performance for 4th generation Intel Xeon Scalable processors (formerly Sapphire Rapids).
    • Improved performance for the future Intel Xeon Scalable processors (code-named Sierra Forest and Granite Rapids). These optimizations are now included by default on compatible processors.
    • Improved RNN primitive performance with LBR_GRU cell.
    • Improved softmax performance on processors with Intel AVX2 or Intel AVX-512 instruction set support.
    • Improved fp32 inner product performance on processors with Intel AVX2 instruction set support.
    • Improved fp32, fp16, bf16 matmul primitive performance on processors with Intel AVX-512 and Intel AMX instruction set support.
    • Improved int8 matmul performance with transposed A tensor.
    • Improved performance of resampling primitive on processors with Intel AVX2 instruction set support.
    • Improved performance of int8 convolution with post-ops.
    • Optimized batch matmul with binary post-op and broadcast mask 1 and 14.
    • Improved the Scaled Dot Product Attention (SDPA) subgraph performance with Graph API.
    • Improved performance of subgraphs including matmul and add operations and mixed int8 and bfloat16 data types with Graph API.
    • [experimental] Improved performance of reduction, softmax and layernorm operations with experimental Graph Compiler backend.
    • [experimental] Improved performance for llama2 MLP subgraph with experimental Graph Compiler backend.
  • Intel Graphics Products:

    • Introduced initial optimizations for Processor Graphics based on Xe2 architecture.
    • Improved performance for the Intel Data Center GPU Max Series (formerly Ponte Vecchio).
    • Improved performance for Intel Arc graphics (formerly Alchemist and DG2) and the Intel Data Center GPU Flex Series (formerly Arctic Sound).
    • Improved matmul performance for cases relevant to Large Language Models (LLMs) and Transformer-like models.
    • Improved convolution performance for cases relevant to the Stable Diffusion model.
    • Improved RNN primitive performance.
    • Improved pooling forward propagation performance.
    • Improved batched matmul performance for cases with 5 dimensions or more.
  • AArch64-based Processors:

    • Added an option to build oneDNN with macOS Accelerate library to improve performance on Apple silicon.
    • Improved reorder primitive performance with Compute Library for the Arm architecture (ACL).
    • Improved bf16 inner product primitive performance with ACL.

Functionality

  • Introduced GPT-Q support to improve Large Language Models (LLMs) performance with compressed weights. An optimized implementation is available for Intel Graphics Products and supports matmul with int8 weight compression.
  • Introduced fp8 data type support in primitives and Graph API. Optimized implementation is available for Intel Data Center GPU Max Series (formerly Ponte Vecchio).
  • Introduced support for fp16 and bf16 scale and shift arguments for layer normalization. Optimized implementation is available for Intel Graphics Products.
  • [experimental] Introduced unstructured sparsity support for processors with Intel AMX support relying on VCOMPRESS/VPEXPAND instructions.
  • Intel Graphics Products:
    • Introduced support for Intel Data Center GPU Max 1550VG.
    • Introduced PReLU post-op support for inner product and matmul primitives.

Usability

  • Added opt-in deterministic mode support. Deterministic mode guarantees that results are bitwise identical between runs in a fixed environment.
  • Introduced accumulation mode control.
  • Extended oneDNN verbose diagnostics with information on dispatching decisions in convolution and matmul implementations.
  • Extended verbose diagnostics for Graph API with information for operation schema check results and pattern matching results.
  • Reduced RNN primitive memory consumption on GPUs.
  • Added examples demonstrating use of oneDNN Graph API in eager mode use cases.
  • Extended tensor constructor in Graph API to support memory allocation and management by the library.
  • Introduced new API and environment variable to manage Graph API constant tensor cache capacity.
  • Improved the efficiency of pattern matching in Graph API by optimizing pattern registration, reducing the number of patterns, and skipping inapplicable patterns earlier.
  • Changed default optimization flags for AArch64 builds to -mcpu=generic to improve portability.
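Both of the new opt-in controls above are exposed through primitive attributes. A minimal C++ sketch, assuming the `dnnl.hpp` API names and omitting the surrounding primitive setup:

```cpp
// Sketch of opt-in deterministic mode and accumulation mode control.
// Assumes oneDNN headers/library are available; no primitive is created here.
#include "dnnl.hpp"

int main() {
    dnnl::primitive_attr attr;

    // Deterministic mode: results are bitwise identical between runs in a
    // fixed environment, possibly at some performance cost.
    attr.set_deterministic(true);

    // Accumulation mode control: e.g. allow relaxed accumulation where a
    // lower-precision accumulator is an acceptable trade-off for speed.
    attr.set_accumulation_mode(dnnl::accumulation_mode::relaxed);

    // The attribute object is then passed to any primitive descriptor
    // constructor in the usual way.
    return 0;
}
```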

Validation

  • Improved benchdnn performance by optimizing bottlenecks in validation code.
  • Introduced --num-streams knob in benchdnn to support benchmarking in multi-stream scenarios.
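A hypothetical multi-stream benchmarking run with the new knob might look like the following; the problem descriptor and mode flag are illustrative, and the exact syntax is described in the benchdnn documentation:

```shell
# Benchmark a matmul problem (MxK:KxN descriptor) in performance mode
# across 4 parallel streams. The problem size here is made up.
./benchdnn --matmul --mode=P --num-streams=4 64x512:512x256
```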

Known Limitations

  • Intel Data Center GPU Flex Series driver for Windows has an issue resulting in program hangs or crashes when oneDNN primitives are created concurrently.
  • int8 concat primitive may produce incorrect results on integrated GPUs with current GPU driver.
  • fp32 pooling primitive may produce incorrect results in rare conditions on Intel Data Center GPU Max Series with current GPU driver.
  • reorder primitive causes segmentation fault for prime sizes exceeding 2^31 on Intel CPUs.
  • fp64 convolution and deconvolution produce incorrect results on integrated graphics in future Intel Core processors (code-named Arrow Lake).
  • int8 matmul primitive creation with fp32 bias fails on Intel GPU Flex Series and Intel Arc Graphics.

Breaking Changes

  • Updated minimal supported ACL version to 23.11 (was 23.02.1).

Thanks to these Contributors

This release contains contributions from the project core team as well as Alexander Grund @Flamefire, David Svantesson @davsva01, Fadi Arafeh @fadara01, Hugh Delaney @hdelan, Ilya Lavrenov @ilya-lavrenov, Jacob Kahn @jacobkahn, Nathan John Sircombe @nSircombe, Renato Barros Arantes @renato-arantes, Sergey Shalnov @shssf, Sunita Nadampalli @snadampal, and Svetlozar Georgiev @sgeor255. We would also like to thank everyone who asked questions and reported issues.

v3.3.5

28 Feb 21:16

This is a patch release containing the following changes to v3.3.4:

  • Fixed undefined behavior in 3D depthwise convolution on Intel CPUs (bbaec14)
  • Added warning for ACL versions newer than maximum supported (7473012)
  • Added citation file (fea9f88)
  • Fixed SEGFAULT in int8 convolution on processors with Intel AMX support (2a8e122)

v3.4-rc

13 Feb 22:08
Pre-release

Performance Optimizations

  • Intel Architecture Processors:

    • Improved performance for 4th generation Intel Xeon Scalable processors (formerly Sapphire Rapids).
    • Improved performance for the future Intel Xeon Scalable processors (code-named Sierra Forest and Granite Rapids). These optimizations are now included by default on compatible processors.
    • Improved RNN primitive performance with LBR_GRU cell.
    • Improved softmax performance on processors with Intel AVX2 or Intel AVX-512 instruction set support.
    • Improved fp32 inner product performance on processors with Intel AVX2 instruction set support.
    • Improved fp32, fp16, bf16 matmul primitive performance on processors with Intel AVX-512 and Intel AMX instruction set support.
    • Improved int8 matmul performance with transposed A tensor.
    • Improved performance of resampling primitive on processors with Intel AVX2 instruction set support.
    • Improved performance of int8 convolution with post-ops.
    • Optimized batch matmul with binary post-op and broadcast mask 1 and 14.
    • Improved the Scaled Dot Product Attention (SDPA) subgraph performance with Graph API.
    • Improved performance of subgraphs including matmul and add operations and mixed int8 and bfloat16 data types with Graph API.
    • [experimental] Improved performance of reduction, softmax and layernorm operations with experimental Graph Compiler backend.
    • [experimental] Improved performance for llama2 MLP subgraph with experimental Graph Compiler backend.
  • Intel Graphics Products:

    • Introduced initial optimizations for Processor Graphics based on Xe2 architecture.
    • Improved performance for the Intel Data Center GPU Max Series (formerly Ponte Vecchio).
    • Improved performance for Intel Arc graphics (formerly Alchemist and DG2) and the Intel Data Center GPU Flex Series (formerly Arctic Sound).
    • Improved matmul performance for cases relevant to Large Language Models (LLMs) and Transformer-like models.
    • Improved convolution performance for cases relevant to the Stable Diffusion model.
    • Improved RNN primitive performance.
    • Improved pooling forward propagation performance.
    • Improved batched matmul performance for cases with 5 dimensions or more.
  • AArch64-based Processors:

    • Added an option to build oneDNN with macOS Accelerate library to improve performance on Apple silicon.
    • Improved reorder primitive performance with Compute Library for the Arm architecture (ACL).
    • Improved bf16 inner product primitive performance with ACL.

Functionality

  • Introduced GPT-Q support to improve Large Language Models (LLMs) performance with compressed weights. An optimized implementation is available for Intel Graphics Products and supports matmul with int8 weight compression.
  • Introduced fp8 data type support in primitives and Graph API. Optimized implementation is available for Intel Data Center GPU Max Series (formerly Ponte Vecchio).
  • Introduced support for fp16 and bf16 scale and shift arguments for layer normalization. Optimized implementation is available for Intel Graphics Products.
  • [experimental] Introduced unstructured sparsity support for processors with Intel AMX support relying on VCOMPRESS/VPEXPAND instructions.
  • Intel Graphics Products:
    • Introduced PReLU post-op support for inner product and matmul primitives.

Usability

  • Added opt-in deterministic mode support. Deterministic mode guarantees that results are bitwise identical between runs in a fixed environment.
  • Introduced accumulation mode control.
  • Extended oneDNN verbose diagnostics with information on dispatching decisions in convolution and matmul implementations.
  • Extended verbose diagnostics for Graph API with information for operation schema check results and pattern matching results.
  • Reduced RNN primitive memory consumption on GPUs.
  • Added examples demonstrating use of oneDNN Graph API in eager mode use cases.
  • Extended tensor constructor in Graph API to support memory allocation and management by the library.
  • Introduced new API and environment variable to manage Graph API constant tensor cache capacity.
  • Improved the efficiency of pattern matching in Graph API by optimizing pattern registration, reducing the number of patterns, and skipping inapplicable patterns earlier.
  • Changed default optimization flags for AArch64 builds to -mcpu=generic to improve portability.

Validation

  • Improved benchdnn performance by optimizing bottlenecks in validation code.
  • Introduced --num-streams knob in benchdnn to support benchmarking in multi-stream scenarios.

Breaking Changes

  • Updated minimal supported ACL version to 23.11 (was 23.02.1).

Thanks to these Contributors

This release contains contributions from the project core team as well as Alexander Grund @Flamefire, David Svantesson @davsva01, Fadi Arafeh @fadara01, Hugh Delaney @hdelan, Ilya Lavrenov @ilya-lavrenov, Jacob Kahn @jacobkahn, Nathan John Sircombe @nSircombe, Renato Barros Arantes @renato-arantes, Sergey Shalnov @shssf, Sunita Nadampalli @snadampal, and Svetlozar Georgiev @sgeor255. We would also like to thank everyone who asked questions and reported issues.

v3.3.4

08 Jan 23:49

This is a patch release containing the following changes to v3.3.3:

  • Fixed performance regression in convolution, matmul and inner product primitives with post-ops on Intel CPUs (2e3c94c)
  • Fixed performance regression in bfloat16 matmul on processors with Intel AMX instruction set support (c0ae38c, fa43640)
  • Fixed SEGFAULT in 3D convolutions with different h and w parameters on Intel CPUs (b5f916e)
  • Fixed performance regression in fp32 convolution backpropagation on Intel CPUs (ee3b12d)
  • Reduced benchdnn memory consumption on Intel GPUs (84a8f57)

v3.3.3

14 Dec 19:40

This is a patch release containing the following changes to v3.3.2:

  • Fixed performance regression in int8 convolutions on processors with Intel AVX-512 and Intel DL Boost support (a00661f)
  • Fixed race condition during library initialization on Intel Data Center GPU Max Series (7dfcd11)
  • Fixed accuracy issue in experimental Graph Compiler with LLVM code generator (8892e7e)
  • Disabled int8 RNN implementation for cases with non-trivial strides (2195e4b)
  • Fixed incorrect results in bfloat16 convolution implementation on processors with Intel AMX support (9f00af9)
  • Fixed incorrect results in fp16 and int8 convolution on Intel Core Ultra integrated GPUs (69cef84, 79bc6cc, c9c0b09)

v3.3.2

30 Nov 15:55

This is a patch release containing the following changes to v3.3.1:

  • Fixed incorrect results in bfloat16 reorder on Intel Core Ultra integrated GPUs (9025980, ed9de2a, 0c6bda1)
  • Fixed incorrect results in matmul, inner product, and RNN primitives on Intel Core Ultra integrated GPUs (6edab9f)
  • Updated compiler optimization flags for AArch64 processors to make build portable (8829c24)
  • Fixed segmentation fault during library initialization on AArch64 processors (3e15c61)