28 May 19:36

vpirogov

b9d1b4c

v3.5-rc Pre-release

This is a release candidate for oneDNN v3.5. Please provide feedback and submit defect reports via GitHub issues.

Performance Optimizations

Intel Architecture Processors:
- Improved performance for 4th generation Intel Xeon Scalable processors (formerly Sapphire Rapids).
- Improved performance for the future Intel Xeon Scalable processors (code-named Sierra Forest and Granite Rapids).
- Improved performance of group normalization primitive.
- Improved performance of matmul primitive with sum post-op for batched cases on processors with Intel AMX instruction set support.
- Improved performance of the following subgraphs with Graph API:
  - Multi-Query Attention (MQA).
  - Scaled Dot Product Attention (SDPA), including the variant with select operation.
  - LayerNorm + Multiply + Quantize produced by SmoothQuant algorithm.
  - Convolution + Sigmoid + Multiply with mixed precisions.
Intel Graphics Products:
- Improved performance for Processor Graphics based on Xe2 architecture.
- Improved performance for the Intel Data Center GPU Max Series (formerly Ponte Vecchio).
- Improved performance for Intel Arc graphics (formerly Alchemist and DG2) and the Intel Data Center GPU Flex Series (formerly Arctic Sound).
- Improved RNN primitive performance for LSTM cell case.
- Improved performance of f8_e4m3 data type emulation on Intel Data Center GPU Max Series (formerly Ponte Vecchio).
AArch64-based Processors:
- Improved convolution forward propagation, matmul, and softmax performance for processors with SVE support.
- Improved bf16 matmul performance with Arm Compute Library (ACL).
- Improved eltwise primitive performance with gelu_erf algorithm with ACL.
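
For readers unfamiliar with the Scaled Dot Product Attention (SDPA) subgraph named above, the following is a minimal pure-Python sketch of the computation it represents, softmax(Q K^T / sqrt(d)) V. It is not oneDNN code and does not use the Graph API; the function names and the boolean `mask` argument (a stand-in for the variant with a select operation) are assumptions made for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def sdpa(q, k, v, mask=None):
    """Compute softmax(Q K^T / sqrt(d)) V on row-major lists of lists.

    mask is a hypothetical boolean matrix: positions marked False are
    replaced with -inf before softmax. Each row is assumed to keep at
    least one position.
    """
    d = len(q[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for i, q_row in enumerate(q):
        # Scaled dot product of one query row with every key row.
        scores = [scale * sum(a * b for a, b in zip(q_row, k_row)) for k_row in k]
        if mask is not None:
            scores = [s if keep else float("-inf") for s, keep in zip(scores, mask[i])]
        probs = softmax(scores)
        # Weighted sum of value rows.
        out.append([sum(p * v_row[j] for p, v_row in zip(probs, v))
                    for j in range(len(v[0]))])
    return out
```

A fused implementation would avoid materializing the full score matrix; this sketch only fixes the arithmetic being optimized.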

Functionality

Introduced sum and binary post-ops support for layer normalization primitive. This functionality is currently implemented on CPUs only.
Introduced support for int4 data type and extended quantization model with support for grouped scales and zero points.
Introduced fp64 matmul support. This functionality is currently implemented on Intel GPUs only.
Extended floating point math mode API to support weight decompression scenarios. See the matmul weights decompression example to get started. The new floating point math mode is supported in the following configurations:
- bfloat16 matmul with int8 weights on Intel CPUs.
- float16 and bfloat16 matmul with int8 or int4 weights on Intel GPUs.
[experimental] Introduced microkernel API for Intel Architecture Processors. This API exposes internal mechanisms used in matmul and convolution implementation to expert users.
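
The int4 data type, grouped scales and zero points, and weight decompression items above can be illustrated together with a short pure-Python sketch. It is not oneDNN code: the helper names, the nibble order (low nibble first), and the group layout along a single weight column are assumptions chosen for illustration. The dequantization formula used is w = scale * (q - zero_point), applied per group.

```python
def unpack_int4(packed):
    """Unpack bytes into signed 4-bit integers, low nibble first (assumed layout)."""
    out = []
    for byte in packed:
        for nibble in (byte & 0x0F, byte >> 4):
            # Map 0..15 to the two's-complement range -8..7.
            out.append(nibble - 16 if nibble >= 8 else nibble)
    return out


def dequantize_grouped(q, scales, zero_points, group_size):
    """Apply w[i] = scales[g] * (q[i] - zero_points[g]) with g = i // group_size."""
    return [scales[i // group_size] * (x - zero_points[i // group_size])
            for i, x in enumerate(q)]


def dot_decompressed(x, packed_col, scales, zero_points, group_size):
    """Dot product of float activations with one int4 weight column,
    decompressing the weights on the fly."""
    w = dequantize_grouped(unpack_int4(packed_col), scales, zero_points, group_size)
    return sum(a * b for a, b in zip(x, w))
```

With a group size of 2, every pair of consecutive weights shares one scale and one zero point, which is the sense of "grouped" assumed here.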

Usability

Extended error messages for engine and memory objects creation errors.
Extended verbose mode diagnostics with information on dispatching decisions for all primitives.
Introduced support for clang++ host compiler in SYCL builds.
Introduced API for tensor serialization and deserialization.
Extended verbose mode diagnostics for Graph API with information on pattern matcher decisions.
Introduced OpenCL runtime support for Graph API.
Added support for building oneDNN with installed Arm Compute Library (ACL).

Validation

Extended benchdnn with support for tensor tags in RNN primitive validation.

Thanks to these Contributors

This release contains contributions from the project core team as well as @AngryLoki, Crefeda Rodrigues @cfRod, Daniel Richard G. @iskunk, @deepeshfujitsu, Dylan Angus @dylan-angus-codeplay, Emanuele Rocca @ema, Hernan Martinez @hmartinez82, John Osorio @kala855, Jonathan Deakin @jondea, @kasturedeeksha, Kentaro Kawakami @kawakami-k, Nikita Shulga @malfet, Radu Salavat @Radu2k, Renato Barros Arantes @renato-arantes, Roman Zhukov @rozhukov, Shreyas-fuj @Shreyas-fuj, Sunita Nadampalli @snadampal, Tadej Ciglarič @t4c1, Vineel Abhinav @vineelabhinav, @vishwascm. We would also like to thank everyone who asked questions and reported issues.

10 May 22:02

vpirogov

v3.4.2

1137e04

v3.4.2

This is a patch release containing the following changes to v3.4.1:

Fixed performance regression in deconvolution on processors with Intel AVX-512 instruction set (307b35b, f46fffb)
Improved performance of batched matmul with binary post-op on processors with Intel AVX-512 instruction set (d39e1b7)
Fixed performance regression in softmax with destination memory format set to any on processors with Intel AVX-512 instruction set (756d3cf)
Fixed incorrect results in int8 deconvolution with source zero points on processors with Intel AMX instruction set (d5ddbc8)
Fixed performance regression in convolution on processors with Intel AVX2 instruction set (2968c89)
Improved f8_e4m3 matmul performance on Intel Data Center GPU Max Series (068f850, 668abae, c3972ef, ad94382)
Fixed sporadic accuracy issues in bf16 depthwise convolution backpropagation on processors with Intel AVX-512 instruction set (0184044)
Fixed primitive creation issue for fp16 pooling backpropagation on Intel GPUs (e4737d9)
Fixed failure for subgraphs with int8 matmul operation with experimental Graph Compiler on processors with Intel AMX instruction set (5ebde2e)
Fixed assert in experimental Graph Compiler on Windows (f53fbd1, fd903ae)
Fixed incorrect results for subgraphs with shuffle operation with experimental Graph Compiler (aef5023)
Improved performance of subgraphs involving int8 matmul with experimental Graph Compiler on processors with Intel AMX support (0ca5bc5)
Fixed page fault in fp16 matmul primitive on Intel Data Center GPU Max Series (5587f08)
Fixed incorrect results in fp32 deconvolution with Arm Compute Library on AArch64 processors (b7694a0)
Fixed performance regression in deconvolution on processors with Intel AVX2 instruction set (6f452e2)

29 Mar 22:27

vpirogov

v3.4.1

f5ff0a6

v3.4.1

This is a patch release containing the following changes to v3.4:

Fixed an issue with caching and serialization of primitives in deterministic mode (7ed604a)
Introduced memory descriptor serialization API (4cad420, 929a27a, 9b848c8)
Fixed incorrect results in fp64 convolution and deconvolution on Intel GPUs based on Xe-LPG architecture (ebe77b5, 0b399ac, d748d64, 9f4f3d5, 21a8cae)
Fixed incorrect results in reorder with large sizes on Intel CPUs and GPUs (69a111e, 4b72361, 74a343b)
Reduced creation time for deconvolution primitive on Intel CPUs (bec487e, 1eab005)
Fixed performance regression in deconvolution on Intel CPUs (fbe5b97, 1dd3c6a)
Removed dangling symbols from static builds (e92c404, 6f5621a)
Fixed crash during platform detection on some AArch64-based systems (406a079)
Fixed performance regression in int8 deconvolution on Intel CPUs (7e50e15)
Fixed handling of zero points for matmul in verbose logs converter (15c7916)

18 Mar 15:42

vpirogov

v3.3.6

86e6af5

v3.3.6

This is a patch release containing the following changes to v3.3.5:

Fixed crash during platform detection on some AArch64-based systems (3e0e69b)
Improved inner product performance with Arm Compute Library (ACL) (e7abee2, 214fb9e, 8aacc8f)
Fixed incorrect results in int8 depthwise convolution with post-ops on processors with Intel AVX2 instruction set support (0c922e0)
Fixed performance regression in fp32 convolution on processors with Intel AVX2 instruction set support (4efc0ad)

01 Mar 02:11

vpirogov

v3.4

ecd7fb6

v3.4

Performance Optimizations

Intel Architecture Processors:
- Improved performance for 4th generation Intel Xeon Scalable processors (formerly Sapphire Rapids).
- Improved performance for the future Intel Xeon Scalable processors (code-named Sierra Forest and Granite Rapids). These optimizations are now included by default on compatible processors.
- Improved RNN primitive performance with LBR_GRU cell.
- Improved softmax performance on processors with Intel AVX2 or Intel AVX-512 instruction set support.
- Improved fp32 inner product performance on processors with Intel AVX2 instruction set support.
- Improved fp32, fp16, bf16 matmul primitive performance on processors with Intel AVX-512 and Intel AMX instruction set support.
- Improved int8 matmul performance with transposed A tensor.
- Improved performance of resampling primitive on processors with Intel AVX2 instruction set support.
- Improved performance of int8 convolution with post-ops.
- Optimized batched matmul with binary post-op and broadcast masks 1 and 14.
- Improved the Scaled Dot Product Attention (SDPA) subgraph performance with Graph API.
- Improved performance of subgraphs including matmul and add operations and mixed int8 and bfloat16 data types with Graph API.
- [experimental] Improved performance of reduction, softmax and layernorm operations with experimental Graph Compiler backend.
- [experimental] Improved performance for llama2 MLP subgraph with experimental Graph Compiler backend.
Intel Graphics Products:
- Introduced initial optimizations for Processor Graphics based on Xe2 architecture.
- Improved performance for the Intel Data Center GPU Max Series (formerly Ponte Vecchio).
- Improved performance for Intel Arc graphics (formerly Alchemist and DG2) and the Intel Data Center GPU Flex Series (formerly Arctic Sound).
- Improved matmul performance for cases relevant to Large Language Models (LLMs) and Transformer-like models.
- Improved convolution performance for cases relevant to the Stable Diffusion model.
- Improved RNN primitive performance.
- Improved pooling forward propagation performance.
- Improved batched matmul performance for cases with 5 dimensions or more.
AArch64-based Processors:
- Added an option to build oneDNN with macOS Accelerate library to improve performance on Apple silicon.
- Improved reorder primitive performance with Compute Library for the Arm architecture (ACL).
- Improved bf16 inner product primitive performance with ACL.
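
The broadcast masks mentioned in the batched matmul item above can be read as bit fields over the destination dimensions. The sketch below assumes the usual convention that bit i of the mask is set when the second operand of the binary post-op has its own extent along dimension i, and is broadcast (extent 1) otherwise. The function name and the 4D example shape are hypothetical.

```python
def shape_from_mask(dst_dims, mask):
    """Return the assumed operand shape: keep dst_dims[i] when bit i of mask
    is set, otherwise use 1 (broadcast along that dimension)."""
    return [d if (mask >> i) & 1 else 1 for i, d in enumerate(dst_dims)]


# Hypothetical batched matmul destination: two batch dims, then M and N.
dst = [2, 3, 4, 5]
per_first_batch_dim = shape_from_mask(dst, 1)    # only bit 0 set
all_but_first_dim = shape_from_mask(dst, 14)     # bits 1, 2, 3 set (0b1110)
```

Under this reading, mask 1 describes an operand that varies only along the first batch dimension, and mask 14 an operand that is shared across the first batch dimension.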

Functionality

Introduced GPT-Q support to improve Large Language Models (LLMs) performance with compressed weights. Optimized implementation is available for Intel Graphics Products and supports matmul with int8 weight compression.
Introduced fp8 data type support in primitives and Graph API. Optimized implementation is available for Intel Data Center GPU Max Series (formerly Ponte Vecchio).
Introduced support for fp16 and bf16 scale and shift arguments for layer normalization. Optimized implementation is available for Intel Graphics Products.
[experimental] Introduced unstructured sparsity support for processors with Intel AMX support relying on VCOMPRESS/VPEXPAND instructions.
Intel Graphics Products:
- Introduced support for Intel Data Center GPU Max 1550VG.
- Introduced PReLU post-op support for inner product and matmul primitives.
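
The unstructured sparsity item above relies on compress and expand style instructions. As a rough illustration of that idea only, the sketch below stores a sparse vector as a bitmask plus the packed nonzero values and then restores it. This is not the storage format used by oneDNN; the function names and layout are assumptions.

```python
def compress(dense):
    """Return (bitmask, packed): bit i of bitmask is set when dense[i] != 0,
    and packed holds the nonzero values in their original order."""
    mask = 0
    packed = []
    for i, x in enumerate(dense):
        if x != 0:
            mask |= 1 << i
            packed.append(x)
    return mask, packed


def expand(mask, packed, length):
    """Inverse of compress: place packed values back at the positions
    marked in mask and fill the remaining positions with zeros."""
    it = iter(packed)
    return [next(it) if (mask >> i) & 1 else 0 for i in range(length)]
```

The packed form saves memory traffic when most weights are zero, at the cost of an expand step before the dense computation.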

Usability

Added opt-in deterministic mode support. Deterministic mode guarantees that results are bitwise identical between runs in a fixed environment.
Introduced accumulation mode control.
Extended oneDNN verbose diagnostics with information on dispatching decisions in convolution and matmul implementations.
Extended verbose diagnostics for Graph API with information for operation schema check results and pattern matching results.
Reduced RNN primitive memory consumption on GPUs.
Added examples demonstrating use of oneDNN Graph API in eager mode use cases.
Extended tensor constructor in Graph API to support memory allocation and management by the library.
Introduced new API and environment variable to manage Graph API constant tensor cache capacity.
Improved the efficiency of pattern matching in Graph API by optimizing pattern registration, reducing pattern numbers, and skipping patterns more wisely.
Changed default optimization flags for AArch64 builds to -mcpu=generic to improve portability.
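
To see why an opt-in deterministic mode is useful, recall that floating point addition is not associative, so a reduction whose work split changes from run to run can return different bits. The pure-Python sketch below is not oneDNN code; it mimics a reduction split into a varying number of partial sums, and the function names are hypothetical. Explicit loops are used instead of the built-in `sum` so that the summation order is exactly what the code shows.

```python
def sequential_sum(values):
    """Add values left to right in double precision."""
    total = 0.0
    for v in values:
        total += v
    return total


def chunked_sum(values, num_chunks):
    """Split values into num_chunks contiguous chunks, sum each chunk,
    then sum the partial results (similar to a per-thread reduction)."""
    size = -(-len(values) // num_chunks)  # ceiling division
    partials = [sequential_sum(values[i:i + size])
                for i in range(0, len(values), size)]
    return sequential_sum(partials)


values = [1e16, 1.0, -1e16, 1.0]
one_chunk = chunked_sum(values, 1)
two_chunks = chunked_sum(values, 2)
# one_chunk and two_chunks differ even though the inputs are identical.
```

Fixing the reduction order, as a deterministic mode must, removes this source of run-to-run variation, possibly at some cost in performance.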

Validation

Improved benchdnn performance by optimizing bottlenecks in validation code.
Introduced --num-streams knob in benchdnn to support benchmarking in multi-stream scenarios.

Known Limitations

Intel Data Center GPU Flex Series driver for Windows has an issue resulting in program hangs or crashes when oneDNN primitives are created concurrently.
int8 concat primitive may produce incorrect results on integrated GPUs with current GPU driver.
fp32 pooling primitive may produce incorrect results in rare conditions on Intel Data Center GPU Max Series with current GPU driver.
reorder primitive causes segmentation fault for prime sizes exceeding 2^31 on Intel CPUs.
fp64 convolution and deconvolution produce incorrect results on integrated graphics in future Intel Core processors (code name Arrow Lake).
int8 matmul primitive creation with fp32 bias fails on Intel Data Center GPU Flex Series and Intel Arc graphics.

Breaking Changes

Updated minimal supported ACL version to 23.11 (was 23.02.1).

Thanks to these Contributors

This release contains contributions from the project core team as well as Alexander Grund @Flamefire, David Svantesson @davsva01, Fadi Arafeh @fadara01, Hugh Delaney @hdelan, Ilya Lavrenov @ilya-lavrenov, Jacob Kahn @jacobkahn, Nathan John Sircombe @nSircombe, Renato Barros Arantes @renato-arantes, Sergey Shalnov @shssf, Sunita Nadampalli @snadampal, and Svetlozar Georgiev @sgeor255. We would also like to thank everyone who asked questions and reported issues.

28 Feb 21:16

vpirogov

v3.3.5

03c2a02

v3.3.5

This is a patch release containing the following changes to v3.3.4:

Fixed undefined behavior in 3D depthwise convolution on Intel CPUs (bbaec14)
Added warning for ACL versions newer than maximum supported (7473012)
Added citation file (fea9f88)
Fixed SEGFAULT in int8 convolution on processors with Intel AMX support (2a8e122)

13 Feb 22:08

harrymao2022

v3.4-rc

8ad500e

v3.4-rc Pre-release

Performance Optimizations

Intel Architecture Processors:
- Improved performance for 4th generation Intel Xeon Scalable processors (formerly Sapphire Rapids).
- Improved performance for the future Intel Xeon Scalable processors (code-named Sierra Forest and Granite Rapids). These optimizations are now included by default on compatible processors.
- Improved RNN primitive performance with LBR_GRU cell.
- Improved softmax performance on processors with Intel AVX2 or Intel AVX-512 instruction set support.
- Improved fp32 inner product performance on processors with Intel AVX2 instruction set support.
- Improved fp32, fp16, bf16 matmul primitive performance on processors with Intel AVX-512 and Intel AMX instruction set support.
- Improved int8 matmul performance with transposed A tensor.
- Improved performance of resampling primitive on processors with Intel AVX2 instruction set support.
- Improved performance of int8 convolution with post-ops.
- Optimized batched matmul with binary post-op and broadcast masks 1 and 14.
- Improved the Scaled Dot Product Attention (SDPA) subgraph performance with Graph API.
- Improved performance of subgraphs including matmul and add operations and mixed int8 and bfloat16 data types with Graph API.
- [experimental] Improved performance of reduction, softmax and layernorm operations with experimental Graph Compiler backend.
- [experimental] Improved performance for llama2 MLP subgraph with experimental Graph Compiler backend.
Intel Graphics Products:
- Introduced initial optimizations for Processor Graphics based on Xe2 architecture.
- Improved performance for the Intel Data Center GPU Max Series (formerly Ponte Vecchio).
- Improved performance for Intel Arc graphics (formerly Alchemist and DG2) and the Intel Data Center GPU Flex Series (formerly Arctic Sound).
- Improved matmul performance for cases relevant to Large Language Models (LLMs) and Transformer-like models.
- Improved convolution performance for cases relevant to the Stable Diffusion model.
- Improved RNN primitive performance.
- Improved pooling forward propagation performance.
- Improved batched matmul performance for cases with 5 dimensions or more.
AArch64-based Processors:
- Added an option to build oneDNN with macOS Accelerate library to improve performance on Apple silicon.
- Improved reorder primitive performance with Compute Library for the Arm architecture (ACL).
- Improved bf16 inner product primitive performance with ACL.

Functionality

Introduced GPT-Q support to improve Large Language Models (LLMs) performance with compressed weights. Optimized implementation is available for Intel Graphics Products and supports matmul with int8 weight compression.
Introduced fp8 data type support in primitives and Graph API. Optimized implementation is available for Intel Data Center GPU Max Series (formerly Ponte Vecchio).
Introduced support for fp16 and bf16 scale and shift arguments for layer normalization. Optimized implementation is available for Intel Graphics Products.
[experimental] Introduced unstructured sparsity support for processors with Intel AMX support relying on VCOMPRESS/VPEXPAND instructions.
Intel Graphics Products:
- Introduced PReLU post-op support for inner product and matmul primitives.

Usability

Added opt-in deterministic mode support. Deterministic mode guarantees that results are bitwise identical between runs in a fixed environment.
Introduced accumulation mode control.
Extended oneDNN verbose diagnostics with information on dispatching decisions in convolution and matmul implementations.
Extended verbose diagnostics for Graph API with information for operation schema check results and pattern matching results.
Reduced RNN primitive memory consumption on GPUs.
Added examples demonstrating use of oneDNN Graph API in eager mode use cases.
Extended tensor constructor in Graph API to support memory allocation and management by the library.
Introduced new API and environment variable to manage Graph API constant tensor cache capacity.
Improved the efficiency of pattern matching in Graph API by optimizing pattern registration, reducing pattern numbers, and skipping patterns more wisely.
Changed default optimization flags for AArch64 builds to -mcpu=generic to improve portability.

Validation

Improved benchdnn performance by optimizing bottlenecks in validation code.
Introduced --num-streams knob in benchdnn to support benchmarking in multi-stream scenarios.

Breaking Changes

Updated minimal supported ACL version to 23.11 (was 23.02.1).

Thanks to these Contributors

08 Jan 23:49

vpirogov

v3.3.4

f240e12

v3.3.4

This is a patch release containing the following changes to v3.3.3:

Fixed performance regression in convolution, matmul and inner product primitives with post-ops on Intel CPUs (2e3c94c)
Fixed performance regression in bfloat16 matmul on processors with Intel AMX instruction set support (c0ae38c, fa43640)
Fixed SEGFAULT in 3D convolutions with different h and w parameters on Intel CPUs (b5f916e)
Fixed performance regression in fp32 convolution backpropagation on Intel CPUs (ee3b12d)
Reduced benchdnn memory consumption on Intel GPUs (84a8f57)

14 Dec 19:40

vpirogov

v3.3.3

16720ea

v3.3.3

This is a patch release containing the following changes to v3.3.2:

Fixed performance regression in int8 convolutions on processors with Intel AVX-512 and Intel DL Boost support (a00661f)
Fixed race condition during library initialization on Intel Data Center GPU Max Series (7dfcd11)
Fixed accuracy issue in experimental Graph Compiler with LLVM code generator (8892e7e)
Disabled int8 RNN implementation for cases with non-trivial strides (2195e4b)
Fixed incorrect results in bfloat16 convolution implementation on processors with Intel AMX support (9f00af9)
Fixed incorrect results in fp16 and int8 convolution on Intel Core Ultra integrated GPUs (69cef84, 79bc6cc, c9c0b09)

30 Nov 15:55

vpirogov

v3.3.2

2dc95a2

v3.3.2

This is a patch release containing the following changes to v3.3.1:

Fixed incorrect results in bfloat16 reorder on Intel Core Ultra integrated GPUs (9025980, ed9de2a, 0c6bda1)
Fixed incorrect results in matmul, inner product, and RNN primitives on Intel Core Ultra integrated GPUs (6edab9f)
Updated compiler optimization flags for AArch64 processors to make build portable (8829c24)
Fixed segmentation fault during library initialization on AArch64 processors (3e15c61)

Releases: oneapi-src/oneDNN