cpu: aarch64: allow sbgemm config for matmul primitive #2068
base: main
Conversation
@@ -84,11 +84,8 @@ status_t init_conf_matmul(acl_matmul_conf_t &, memory_desc_t &src_md,
    } else {
        auto src_tag = memory_desc_matches_one_of_tag(
                src_md, acdb, abcd, abdc, abc, acb, ab, ba);
        auto wei_tag = memory_desc_matches_one_of_tag(
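For readers following the hunk above, here is a minimal, self-contained sketch of the "match against a list of candidate tags" dispatch that `memory_desc_matches_one_of_tag` performs. The `format_tag` enum and `memory_desc` struct below are simplified stand-ins invented for illustration; the real oneDNN function compares dims and strides inside `dnnl_memory_desc_t`, not a stored tag.

```cpp
#include <initializer_list>

// Hypothetical, simplified stand-ins for oneDNN's format tags and
// memory descriptor (the real definitions live in dnnl_types.h).
enum class format_tag { undef, ab, ba, abc, acb, abcd, abdc, acdb, Ab8a };

struct memory_desc {
    format_tag tag; // real descriptors encode dims/strides, not a tag
};

// Mimics memory_desc_matches_one_of_tag(): return the first candidate
// tag that the descriptor matches, or undef if none match.
format_tag matches_one_of(const memory_desc &md,
        std::initializer_list<format_tag> candidates) {
    for (format_tag t : candidates)
        if (md.tag == t) return t;
    return format_tag::undef;
}
```

The point of the change is which candidate list the weights descriptor is checked against: a blocked tag such as `Ab8a` only "matches" if it is in the list, so restricting the list to plain formats rejects pre-reordered weights.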
How is this different from the previous if block (lines 77 to 83), which deals with the FixedFormat case? I guess the check for plain formats is removed to support passing the blocked format Ab8a via wtag?
What happens when you choose a blocked format other than --wtag=Ab8a, e.g. --wtag=aBdc8b or ABcd8b8a?
Hi @cfRod, this is slightly different from the regular fixed-format scenario. It is required to support precompiled graphs, where the primitive gets created with already-reordered weight tensors, so their formats are blocked and more custom.
By the way, it's not just the Ab8a format: even ab, acbd, etc. arrive as blocked formats, which is why I combined the logic with the else part (lines 84 to 87) instead of format::any.
I added similar support to the inner_product primitive some time back; see that PR for more context: #1768. It was more straightforward there because everything was handled as a fixed format.
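To make the blocked-layout discussion concrete, here is a small sketch of the offset arithmetic an Ab8a layout implies for a 2-D M x K weights tensor: rows are grouped into blocks of 8, each block's rows are interleaved innermost, and blocks are laid out row-major over (block, column). This reflects my reading of oneDNN's blocked-layout convention and is illustrative only, not the library's actual implementation.

```cpp
#include <cstddef>

// Offset of element (i, j) in an Ab8a-blocked M x K tensor:
// outer index = which block of 8 rows, inner index = row within the
// block; within a block, the row index varies fastest.
std::size_t offset_Ab8a(std::size_t i, std::size_t j,
        std::size_t /*M*/, std::size_t K) {
    const std::size_t blk = 8;
    std::size_t outer = i / blk; // which block of rows
    std::size_t inner = i % blk; // position inside the block
    return (outer * K + j) * blk + inner;
}
```

So element (0, 1) lands at offset 8 (one full 8-row column block past the start), which is why a kernel expecting plain `ab` strides cannot consume such a tensor without a reorder.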
@@ -211,3 +211,11 @@

# fp4
--batch=test_matmul_fp4
Can you add a test case covering the change that enables the f32:bf16:f32 data-type combination, i.e. with --dt=f32:bf16:f32? Could you also paste the verbose output before and after the change?
Hi @cfRod, I have added unit tests for the SBGEMM (f32:bf16:f32) scenario, covering both the fixed format (format any) and the blocked layout.
Here is the DNNL verbose output for the new tests:
./benchdnn --matmul --mode=P --engine=cpu --allow-enum-tags-only=0 --batch=inputs/matmul/test_matmul_ci
Output template: perf,%engine%,%impl%,%name%,%prb%,%Gops%,%+ctime%,%-time%,%-Gflops%,%0time%,%0Gflops%
perf,cpu,gemm:acl,"postops+runtime_dims_2d",--mode=P --matmul --allow-enum-tags-only=false --stag=ab --wtag=Ab8a --dtag=ab --attr-post-ops=mul:f32 3x20:20x4_n"postops+runtime_dims_2d",4.8e-07,0.427246,0.0219727,0.0218453,0.0242582,0.0197871
perf,cpu,gemm:acl,"postops+runtime_dims_2d",--mode=P --matmul --allow-enum-tags-only=false --stag=ab --wtag=Ab8a --dtag=ab --attr-post-ops=relu 3x20:20x4_n"postops+runtime_dims_2d",4.8e-07,0.0617676,0.0214844,0.0223418,0.0227168,0.0211298
perf,cpu,gemm:acl,"postops+runtime_dims_2d",--mode=P --matmul --allow-enum-tags-only=false --stag=ab --wtag=Ab8a --dtag=ab --attr-post-ops=sum 3x20:20x4_n"postops+runtime_dims_2d",4.8e-07,0.043457,0.0212402,0.0225986,0.0223903,0.0214378
perf,cpu,gemm:acl,"postops+runtime_dims_2d",--mode=P --matmul --allow-enum-tags-only=false --dt=f32:bf16:f32 --stag=ab --dtag=ab --attr-post-ops=mul:f32 --attr-fpmath=bf16 3x20:20x4_n"postops+runtime_dims_2d",4.8e-07,0.0732422,0.00244141,0.196608,0.00275341,0.174329
perf,cpu,gemm:acl,"postops+runtime_dims_2d",--mode=P --matmul --allow-enum-tags-only=false --dt=f32:bf16:f32 --stag=ab --dtag=ab --attr-post-ops=relu --attr-fpmath=bf16 3x20:20x4_n"postops+runtime_dims_2d",4.8e-07,0.0361328,0.00195312,0.24576,0.00227654,0.210846
perf,cpu,gemm:acl,"postops+runtime_dims_2d",--mode=P --matmul --allow-enum-tags-only=false --dt=f32:bf16:f32 --stag=ab --dtag=ab --attr-post-ops=sum --attr-fpmath=bf16 3x20:20x4_n"postops+runtime_dims_2d",4.8e-07,0.0354004,0.00195312,0.24576,0.00230451,0.208287
perf,cpu,gemm:acl,"postops+runtime_dims_2d",--mode=P --matmul --allow-enum-tags-only=false --dt=f32:bf16:f32 --stag=ab --wtag=Ba4b --dtag=ab --attr-post-ops=mul:f32 --attr-fpmath=bf16 3x20:20x4_n"postops+runtime_dims_2d",4.8e-07,0.0439453,0.0236816,0.0202689,0.0257452,0.0186443
perf,cpu,gemm:acl,"postops+runtime_dims_2d",--mode=P --matmul --allow-enum-tags-only=false --dt=f32:bf16:f32 --stag=ab --wtag=Ba4b --dtag=ab --attr-post-ops=relu --attr-fpmath=bf16 3x20:20x4_n"postops+runtime_dims_2d",4.8e-07,0.0483398,0.0217285,0.0220908,0.0227154,0.021131
perf,cpu,gemm:acl,"postops+runtime_dims_2d",--mode=P --matmul --allow-enum-tags-only=false --dt=f32:bf16:f32 --stag=ab --wtag=Ba4b --dtag=ab --attr-post-ops=sum --attr-fpmath=bf16 3x20:20x4_n"postops+runtime_dims_2d",4.8e-07,0.0407715,0.0214844,0.0223418,0.0226315,0.0212094
tests:9 passed:9 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0
total perf: min(ms):0.137939 avg(ms):0.147792
total: 27.22s; fill: 0.00s (0%);
To support the sbgemm primitive with blocked (pre-reordered) weights, I had to extend the Arm Compute Library GEMM operator. Here is the PR for that change; please review it as well:
https://review.mlplatform.org/c/ml/ComputeLibrary/+/13341
Approved based on https://review.mlplatform.org/c/ml/ComputeLibrary/+/13341 getting merged.
This is required to support precompiled graphs, where the primitive gets created with already-reordered weight tensors, so their formats are blocked and more custom. When fastmath mode is enabled and the graph is pre-compiled, the weight tensors come in a reordered bfloat16 format.
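Since the pre-reordered weights arrive as bfloat16 under fastmath, it may help to see what that downconvert looks like numerically. Below is an illustrative f32-to-bf16 conversion with round-to-nearest-even (the common choice), written from first principles; it is not oneDNN's or ACL's actual reorder code, and NaN handling is omitted for brevity.

```cpp
#include <cstdint>
#include <cstring>

// Illustrative f32 -> bf16 downconvert (round-to-nearest-even):
// bf16 keeps the top 16 bits of an IEEE-754 f32 (sign, 8-bit
// exponent, 7-bit mantissa). NaNs are not treated specially here.
std::uint16_t f32_to_bf16(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    // Rounding bias: 0x7FFF plus the LSB of the kept mantissa,
    // so exact ties round to the even result.
    bits += 0x7FFF + ((bits >> 16) & 1);
    return static_cast<std::uint16_t>(bits >> 16);
}
```

Dropping 16 mantissa bits is what makes a bf16 fastmath GEMM faster but slightly less accurate than the f32 path, which is why it is gated behind --attr-fpmath=bf16 in the tests above.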
Looks good to me. As Crefeda said, we will merge once the corresponding commit in ACL is merged and included in a tagged release that we can take. Either as part of this change or separately, we will need to upgrade ACL in CI to test it when the time comes.
Description
This is required to support precompiled graphs where the primitive gets created with the reordered (already reordered) weight tensors, so their formats are blocked and more custom.
Fixes # (github issue)
Checklist
General
Do all unit and benchdnn tests (make test and make test_benchdnn_*) pass locally for each commit? Ran make test and ./benchdnn --matmul --mode=P --engine=cpu --allow-enum-tags-only=0 --batch=inputs/matmul/test_matmul_ci
Performance improvements
New features
Bug fixes
RFC PR