
cpu: aarch64: allow sbgemm config for matmul primitive #2068

Open

snadampal wants to merge 2 commits into main from matmul_sbgemm_blocked_weights

Conversation

@snadampal (Contributor) commented Sep 1, 2024

Description


This is required to support precompiled graphs, where the primitive is created with already-reordered weight tensors, so their formats are blocked and more custom.

  1. Allow additional blocked layout formats.
  2. Use bfloat16 fast-math kernels from OpenXLA.
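For context on what a blocked weight format means here: a tag like Ab8a stores a logically (a, b)-shaped matrix with blocks of 8 along the a (row) dimension innermost, i.e. physically [ceil(a/8)][b][8]. The reorder can be sketched in NumPy (illustrative only, not oneDNN code; the function name is made up for this example):

```python
import numpy as np

def reorder_ab_to_Ab8a(w: np.ndarray, block: int = 8) -> np.ndarray:
    """Reorder a plain row-major (a, b) matrix into an Ab8a-style blocked
    layout: groups of `block` rows become the innermost dimension."""
    a, b = w.shape
    pad = (-a) % block                       # zero-pad rows up to a multiple of `block`
    w_p = np.pad(w, ((0, pad), (0, 0)))
    # (a_blocks, block, b) -> (a_blocks, b, block): row block becomes innermost
    return w_p.reshape(-1, block, b).transpose(0, 2, 1).copy()

# The 20x4 weight shape from the benchdnn tests below:
w = np.arange(20 * 4, dtype=np.float32).reshape(20, 4)
blocked = reorder_ab_to_Ab8a(w)
print(blocked.shape)  # (3, 4, 8): ceil(20/8)=3 row blocks, 4 cols, 8 rows per block
```

A primitive that only accepts plain tags (ab, ba, ...) rejects tensors already stored this way, which is why the dispatch logic needs to admit blocked tags.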

Fixes # (github issue)

Checklist

General

  • Do all unit and benchdnn tests (make test and make test_benchdnn_*) pass locally for each commit?
    ran make test and
    ./benchdnn --matmul --mode=P --engine=cpu --allow-enum-tags-only=0 --batch=inputs/matmul/test_matmul_ci
  • [x] Have you formatted the code using clang-format?

Performance improvements

  • Have you submitted performance data that demonstrates performance improvements?

New features

  • Have you published an RFC for the new feature?
  • Was the RFC approved?
  • Have you added relevant tests?

Bug fixes

  • Have you included information on how to reproduce the issue (either in a github issue or in this PR)?
  • Have you added relevant regression tests?

RFC PR

  • Does RFC document follow the template?
  • Have you added a link to the rendered document?

@github-actions github-actions bot added the platform:cpu-aarch64 Codeowner: @oneapi-src/onednn-cpu-aarch64 label Sep 1, 2024
@snadampal snadampal force-pushed the matmul_sbgemm_blocked_weights branch from 8587737 to 460b037 Compare December 16, 2024 01:24
@snadampal snadampal marked this pull request as ready for review December 16, 2024 01:26
@snadampal snadampal requested review from a team as code owners December 16, 2024 01:26
@snadampal snadampal force-pushed the matmul_sbgemm_blocked_weights branch from 460b037 to ca3407d Compare December 16, 2024 01:35
@@ -84,11 +84,8 @@ status_t init_conf_matmul(acl_matmul_conf_t &amp, memory_desc_t &src_md,
} else {
auto src_tag = memory_desc_matches_one_of_tag(
src_md, acdb, abcd, abdc, abc, acb, ab, ba);
auto wei_tag = memory_desc_matches_one_of_tag(
@cfRod (Contributor) commented:
How is this different from the previous if block (Lines 77 to 83), which deals with the FixedFormat case? I guess the check for plain formats was removed to support passing the blocked format Ab8a via wtag?

What happens when you choose a blocked format other than --wtag=Ab8a, e.g. --wtag=aBdc8b or ABcd8b8a?

@snadampal (Contributor, Author) replied:

Hi @cfRod, this is slightly different from the regular fixed format scenario. It is required to support precompiled graphs, where the primitive is created with already-reordered weight tensors, so their formats are blocked and more custom.

Btw, it's not just the Ab8a format; even ab, acbd, etc. arrive as blocked formats, which is why I combined the logic with the else part (Lines 84 to 87) instead of format::any.

I added similar support to the inner_product primitive a while back; here is the PR, please check it for more context. It was more straightforward there because everything was handled as a fixed format:
#1768

@@ -211,3 +211,11 @@

# fp4
--batch=test_matmul_fp4

@cfRod (Contributor) commented:
Can you add a test case covering the change that enables the f32:bf16:f32 data types, i.e. with --dt=f32:bf16:f32? Could you also paste the verbose output before and after the change?

@snadampal (Contributor, Author) replied:

Hi @cfRod , I have added the unit tests for SBGEMM (f32:bf16:f32) scenario for both fixed format (format any) and also blocked layout.
Here is the dnnl verbose for the new tests:

./benchdnn --matmul --mode=P --engine=cpu --allow-enum-tags-only=0 --batch=inputs/matmul/test_matmul_ci
Output template: perf,%engine%,%impl%,%name%,%prb%,%Gops%,%+ctime%,%-time%,%-Gflops%,%0time%,%0Gflops%
perf,cpu,gemm:acl,"postops+runtime_dims_2d",--mode=P --matmul --allow-enum-tags-only=false --stag=ab --wtag=Ab8a --dtag=ab --attr-post-ops=mul:f32 3x20:20x4_n"postops+runtime_dims_2d",4.8e-07,0.427246,0.0219727,0.0218453,0.0242582,0.0197871
perf,cpu,gemm:acl,"postops+runtime_dims_2d",--mode=P --matmul --allow-enum-tags-only=false --stag=ab --wtag=Ab8a --dtag=ab --attr-post-ops=relu 3x20:20x4_n"postops+runtime_dims_2d",4.8e-07,0.0617676,0.0214844,0.0223418,0.0227168,0.0211298
perf,cpu,gemm:acl,"postops+runtime_dims_2d",--mode=P --matmul --allow-enum-tags-only=false --stag=ab --wtag=Ab8a --dtag=ab --attr-post-ops=sum 3x20:20x4_n"postops+runtime_dims_2d",4.8e-07,0.043457,0.0212402,0.0225986,0.0223903,0.0214378
perf,cpu,gemm:acl,"postops+runtime_dims_2d",--mode=P --matmul --allow-enum-tags-only=false --dt=f32:bf16:f32 --stag=ab --dtag=ab --attr-post-ops=mul:f32 --attr-fpmath=bf16 3x20:20x4_n"postops+runtime_dims_2d",4.8e-07,0.0732422,0.00244141,0.196608,0.00275341,0.174329
perf,cpu,gemm:acl,"postops+runtime_dims_2d",--mode=P --matmul --allow-enum-tags-only=false --dt=f32:bf16:f32 --stag=ab --dtag=ab --attr-post-ops=relu --attr-fpmath=bf16 3x20:20x4_n"postops+runtime_dims_2d",4.8e-07,0.0361328,0.00195312,0.24576,0.00227654,0.210846
perf,cpu,gemm:acl,"postops+runtime_dims_2d",--mode=P --matmul --allow-enum-tags-only=false --dt=f32:bf16:f32 --stag=ab --dtag=ab --attr-post-ops=sum --attr-fpmath=bf16 3x20:20x4_n"postops+runtime_dims_2d",4.8e-07,0.0354004,0.00195312,0.24576,0.00230451,0.208287
perf,cpu,gemm:acl,"postops+runtime_dims_2d",--mode=P --matmul --allow-enum-tags-only=false --dt=f32:bf16:f32 --stag=ab --wtag=Ba4b --dtag=ab --attr-post-ops=mul:f32 --attr-fpmath=bf16 3x20:20x4_n"postops+runtime_dims_2d",4.8e-07,0.0439453,0.0236816,0.0202689,0.0257452,0.0186443
perf,cpu,gemm:acl,"postops+runtime_dims_2d",--mode=P --matmul --allow-enum-tags-only=false --dt=f32:bf16:f32 --stag=ab --wtag=Ba4b --dtag=ab --attr-post-ops=relu --attr-fpmath=bf16 3x20:20x4_n"postops+runtime_dims_2d",4.8e-07,0.0483398,0.0217285,0.0220908,0.0227154,0.021131
perf,cpu,gemm:acl,"postops+runtime_dims_2d",--mode=P --matmul --allow-enum-tags-only=false --dt=f32:bf16:f32 --stag=ab --wtag=Ba4b --dtag=ab --attr-post-ops=sum --attr-fpmath=bf16 3x20:20x4_n"postops+runtime_dims_2d",4.8e-07,0.0407715,0.0214844,0.0223418,0.0226315,0.0212094
tests:9 passed:9 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0
total perf: min(ms):0.137939 avg(ms):0.147792
total: 27.22s; fill: 0.00s (0%);

To support the sbgemm primitive with blocked (pre-reordered) weights, I had to extend the Arm Compute Library Gemm operator. Here is the PR for that change; please review it as well:
https://review.mlplatform.org/c/ml/ComputeLibrary/+/13341
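The f32:bf16:f32 fast-math path exercised above (f32 activations and outputs, weights consumed as bf16) can be emulated in NumPy by truncating the weights' mantissa before the matmul. A rough sketch under the assumption of simple truncation (real kernels typically round to nearest even; this is not the ACL kernel):

```python
import numpy as np

def f32_to_bf16_trunc(x: np.ndarray) -> np.ndarray:
    """Emulate bf16 by zeroing the low 16 bits of each f32 value.
    bf16 keeps f32's sign and 8-bit exponent but only 7 mantissa bits."""
    u = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    return (u & np.uint32(0xFFFF0000)).view(np.float32)

rng = np.random.default_rng(0)
src = rng.standard_normal((3, 20)).astype(np.float32)   # f32 activations (3x20)
wei = rng.standard_normal((20, 4)).astype(np.float32)   # f32 weights (20x4)

dst_fast = src @ f32_to_bf16_trunc(wei)   # fast-math: weights downconverted to bf16
dst_ref = src @ wei                       # reference full-f32 path
# dst_fast differs from dst_ref only at roughly bf16 precision (~2-3 decimal digits)
```

This is why fast-math mode is an opt-in attribute: the caller accepts a small, bounded precision loss on the weights in exchange for the faster bf16 kernels.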

When fast-math mode is enabled and the graph is pre-compiled, the weight tensors come in reordered bfloat16 format.
@snadampal snadampal force-pushed the matmul_sbgemm_blocked_weights branch from ca3407d to 76784ba Compare December 17, 2024 19:57
@github-actions github-actions bot added the component:tests Codeowner: @oneapi-src/onednn-arch label Dec 17, 2024
@theComputeKid (Member) left a comment:

Looks good to me. As Crefeda said, we will merge once the corresponding commit in ACL is merged and included in a tagged release that we can take. Either as part of this change or separately, we will need to upgrade ACL in CI to test it when the time comes.

Labels
component:tests Codeowner: @oneapi-src/onednn-arch platform:cpu-aarch64 Codeowner: @oneapi-src/onednn-cpu-aarch64