
cpu: aarch64: allow sbgemm config for matmul primitive #2068

Open

snadampal wants to merge 2 commits into main from matmul_sbgemm_blocked_weights

Conversation

@snadampal (Contributor) commented Sep 1, 2024

Description


This is required to support precompiled graphs, where the primitive is created with already-reordered weight tensors, so their formats are blocked and more custom.

  1. Allow additional blocked layout formats.
  2. Use bfloat16 fast-math kernels from OpenXLA.
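For context on what a blocked weight format means here: a tag like Ab8a stores a logically (a, b)-shaped matrix with blocks of 8 along the a (row) dimension innermost, i.e. physically [ceil(a/8)][b][8]. The reorder can be sketched in NumPy (illustrative only, not oneDNN code; the function name is made up for this example):

```python
import numpy as np

def reorder_ab_to_Ab8a(w: np.ndarray, block: int = 8) -> np.ndarray:
    """Reorder a plain row-major (a, b) matrix into an Ab8a-style blocked
    layout: groups of `block` rows become the innermost dimension."""
    a, b = w.shape
    pad = (-a) % block                       # zero-pad rows up to a multiple of `block`
    w_p = np.pad(w, ((0, pad), (0, 0)))
    # (a_blocks, block, b) -> (a_blocks, b, block): row block becomes innermost
    return w_p.reshape(-1, block, b).transpose(0, 2, 1).copy()

# The 20x4 weight shape from the benchdnn tests below:
w = np.arange(20 * 4, dtype=np.float32).reshape(20, 4)
blocked = reorder_ab_to_Ab8a(w)
print(blocked.shape)  # (3, 4, 8): ceil(20/8)=3 row blocks, 4 cols, 8 rows per block
```

A primitive that only accepts plain tags (ab, ba, ...) rejects tensors already stored this way, which is why the dispatch logic needs to admit blocked tags.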

Fixes # (github issue)

Checklist

General

  • Do all unit and benchdnn tests (make test and make test_benchdnn_*) pass locally for each commit?
    ran make test and
    ./benchdnn --matmul --mode=P --engine=cpu --allow-enum-tags-only=0 --batch=inputs/matmul/test_matmul_ci
  • [x] Have you formatted the code using clang-format?

Performance improvements

  • Have you submitted performance data that demonstrates performance improvements?

New features

  • Have you published an RFC for the new feature?
  • Was the RFC approved?
  • Have you added relevant tests?

Bug fixes

  • Have you included information on how to reproduce the issue (either in a github issue or in this PR)?
  • Have you added relevant regression tests?

RFC PR

  • Does RFC document follow the template?
  • Have you added a link to the rendered document?

@github-actions github-actions bot added the platform:cpu-aarch64 Codeowner: @oneapi-src/onednn-cpu-aarch64 label Sep 1, 2024
@snadampal snadampal force-pushed the matmul_sbgemm_blocked_weights branch from 8587737 to 460b037 Compare December 16, 2024 01:24
@snadampal snadampal marked this pull request as ready for review December 16, 2024 01:26
@snadampal snadampal requested review from a team as code owners December 16, 2024 01:26
@snadampal snadampal force-pushed the matmul_sbgemm_blocked_weights branch from 460b037 to ca3407d Compare December 16, 2024 01:35
@@ -84,11 +84,8 @@ status_t init_conf_matmul(acl_matmul_conf_t &amp, memory_desc_t &src_md,
} else {
auto src_tag = memory_desc_matches_one_of_tag(
src_md, acdb, abcd, abdc, abc, acb, ab, ba);
auto wei_tag = memory_desc_matches_one_of_tag(
@cfRod (Contributor) commented:
How is this different from the previous if block (Lines 77 to 83), which deals with the FixedFormat case? I guess the check for plain formats was removed to support passing the blocked format Ab8a via wtag?

What happens when you choose a blocked format other than --wtag=Ab8a, e.g. --wtag=aBdc8b or ABcd8b8a?

@snadampal (Contributor, Author) replied:

Hi @cfRod, this is slightly different from the regular fixed format scenario. It is required to support precompiled graphs, where the primitive is created with already-reordered weight tensors, so their formats are blocked and more custom.

Btw, it's not just the Ab8a format; even ab, acbd, etc. arrive as blocked formats, which is why I combined the logic with the else part (Lines 84 to 87) instead of format::any.

I added similar support to the inner_product primitive a while back; here is the PR, please check it for more context. It was more straightforward there because everything was handled as a fixed format:
#1768

@@ -211,3 +211,11 @@

# fp4
--batch=test_matmul_fp4

@cfRod (Contributor) commented:
Can you add a test case covering the change that enables the f32:bf16:f32 data types, i.e. with --dt=f32:bf16:f32? Could you also paste the verbose output before and after the change?

@snadampal (Contributor, Author) replied:

Hi @cfRod , I have added the unit tests for SBGEMM (f32:bf16:f32) scenario for both fixed format (format any) and also blocked layout.
Here is the dnnl verbose for the new tests:

./benchdnn --matmul --mode=P --engine=cpu --allow-enum-tags-only=0 --batch=inputs/matmul/test_matmul_ci
Output template: perf,%engine%,%impl%,%name%,%prb%,%Gops%,%+ctime%,%-time%,%-Gflops%,%0time%,%0Gflops%
perf,cpu,gemm:acl,"postops+runtime_dims_2d",--mode=P --matmul --allow-enum-tags-only=false --stag=ab --wtag=Ab8a --dtag=ab --attr-post-ops=mul:f32 3x20:20x4_n"postops+runtime_dims_2d",4.8e-07,0.427246,0.0219727,0.0218453,0.0242582,0.0197871
perf,cpu,gemm:acl,"postops+runtime_dims_2d",--mode=P --matmul --allow-enum-tags-only=false --stag=ab --wtag=Ab8a --dtag=ab --attr-post-ops=relu 3x20:20x4_n"postops+runtime_dims_2d",4.8e-07,0.0617676,0.0214844,0.0223418,0.0227168,0.0211298
perf,cpu,gemm:acl,"postops+runtime_dims_2d",--mode=P --matmul --allow-enum-tags-only=false --stag=ab --wtag=Ab8a --dtag=ab --attr-post-ops=sum 3x20:20x4_n"postops+runtime_dims_2d",4.8e-07,0.043457,0.0212402,0.0225986,0.0223903,0.0214378
perf,cpu,gemm:acl,"postops+runtime_dims_2d",--mode=P --matmul --allow-enum-tags-only=false --dt=f32:bf16:f32 --stag=ab --dtag=ab --attr-post-ops=mul:f32 --attr-fpmath=bf16 3x20:20x4_n"postops+runtime_dims_2d",4.8e-07,0.0732422,0.00244141,0.196608,0.00275341,0.174329
perf,cpu,gemm:acl,"postops+runtime_dims_2d",--mode=P --matmul --allow-enum-tags-only=false --dt=f32:bf16:f32 --stag=ab --dtag=ab --attr-post-ops=relu --attr-fpmath=bf16 3x20:20x4_n"postops+runtime_dims_2d",4.8e-07,0.0361328,0.00195312,0.24576,0.00227654,0.210846
perf,cpu,gemm:acl,"postops+runtime_dims_2d",--mode=P --matmul --allow-enum-tags-only=false --dt=f32:bf16:f32 --stag=ab --dtag=ab --attr-post-ops=sum --attr-fpmath=bf16 3x20:20x4_n"postops+runtime_dims_2d",4.8e-07,0.0354004,0.00195312,0.24576,0.00230451,0.208287
perf,cpu,gemm:acl,"postops+runtime_dims_2d",--mode=P --matmul --allow-enum-tags-only=false --dt=f32:bf16:f32 --stag=ab --wtag=Ba4b --dtag=ab --attr-post-ops=mul:f32 --attr-fpmath=bf16 3x20:20x4_n"postops+runtime_dims_2d",4.8e-07,0.0439453,0.0236816,0.0202689,0.0257452,0.0186443
perf,cpu,gemm:acl,"postops+runtime_dims_2d",--mode=P --matmul --allow-enum-tags-only=false --dt=f32:bf16:f32 --stag=ab --wtag=Ba4b --dtag=ab --attr-post-ops=relu --attr-fpmath=bf16 3x20:20x4_n"postops+runtime_dims_2d",4.8e-07,0.0483398,0.0217285,0.0220908,0.0227154,0.021131
perf,cpu,gemm:acl,"postops+runtime_dims_2d",--mode=P --matmul --allow-enum-tags-only=false --dt=f32:bf16:f32 --stag=ab --wtag=Ba4b --dtag=ab --attr-post-ops=sum --attr-fpmath=bf16 3x20:20x4_n"postops+runtime_dims_2d",4.8e-07,0.0407715,0.0214844,0.0223418,0.0226315,0.0212094
tests:9 passed:9 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0
total perf: min(ms):0.137939 avg(ms):0.147792
total: 27.22s; fill: 0.00s (0%);

To support the sbgemm primitive with blocked (pre-reordered) weights, I had to extend the Arm Compute Library Gemm operator. Here is the PR for that change; please review it as well:
https://review.mlplatform.org/c/ml/ComputeLibrary/+/13341
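The f32:bf16:f32 fast-math path exercised above (f32 activations and outputs, weights consumed as bf16) can be emulated in NumPy by truncating the weights' mantissa before the matmul. A rough sketch under the assumption of simple truncation (real kernels typically round to nearest even; this is not the ACL kernel):

```python
import numpy as np

def f32_to_bf16_trunc(x: np.ndarray) -> np.ndarray:
    """Emulate bf16 by zeroing the low 16 bits of each f32 value.
    bf16 keeps f32's sign and 8-bit exponent but only 7 mantissa bits."""
    u = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    return (u & np.uint32(0xFFFF0000)).view(np.float32)

rng = np.random.default_rng(0)
src = rng.standard_normal((3, 20)).astype(np.float32)   # f32 activations (3x20)
wei = rng.standard_normal((20, 4)).astype(np.float32)   # f32 weights (20x4)

dst_fast = src @ f32_to_bf16_trunc(wei)   # fast-math: weights downconverted to bf16
dst_ref = src @ wei                       # reference full-f32 path
# dst_fast differs from dst_ref only at roughly bf16 precision (~2-3 decimal digits)
```

This is why fast-math mode is an opt-in attribute: the caller accepts a small, bounded precision loss on the weights in exchange for the faster bf16 kernels.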

When fast-math mode is enabled and the graph is pre-compiled, the weight tensors come in reordered bfloat16 format.
@snadampal snadampal force-pushed the matmul_sbgemm_blocked_weights branch from ca3407d to 76784ba Compare December 17, 2024 19:57
@github-actions github-actions bot added the component:tests Codeowner: @oneapi-src/onednn-arch label Dec 17, 2024
@theComputeKid (Member) left a comment:

Looks good to me. As Crefeda said, we will merge once the corresponding commit in ACL is merged and included in a tagged release that we can take. Either as part of this change or separately, we will need to upgrade ACL in CI to test it when the time comes.

Labels
component:tests Codeowner: @oneapi-src/onednn-arch platform:cpu-aarch64 Codeowner: @oneapi-src/onednn-cpu-aarch64