Symmetric QGEMM kernel for ARMv8 A55 chip (#10754)

The ARM A55 micro-architecture (with dot product instructions), like the A53, is widely used for the little cores in big.LITTLE configurations. The A55 has narrower memory load/store hardware: a 128b load instruction blocks the pipeline for 2 whole cycles, during which no other instruction can be executed. A 64b load instruction, on the other hand, can be dual-issued with many other instructions.

This change adds a Symmetric QGEMM kernel for the A55 micro-architecture, in which we replace

    ldr q4,[x1],#16

with

    ldr d4,[x1],#8
    ldr x11,[x1],#8
    ins v4.d[1],x11

so that we can try to hide the memory load cycles behind compute cycles in the kernel.

Co-authored-by: Chen Fu <[email protected]>
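The load substitution described in the commit message can be checked for equivalence with a small portable model. The sketch below is not code from the kernel: `Vec128`, `load_128`, `load_64x2`, and `loads_agree` are hypothetical names introduced for illustration, and the model only captures what the instructions leave in the register and the pointer, not their timing on an A55.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical model of a 128-bit NEON register as two 64-bit lanes.
struct Vec128 {
    std::uint64_t d[2];
};

// Models `ldr q4,[x1],#16`: one 128-bit load, then post-increment by 16.
Vec128 load_128(const std::uint8_t*& p) {
    assert(p != nullptr);
    Vec128 v;
    std::memcpy(v.d, p, 16);
    p += 16;
    return v;
}

// Models the three-instruction replacement sequence.
Vec128 load_64x2(const std::uint8_t*& p) {
    assert(p != nullptr);
    Vec128 v{};                  // a scalar `ldr d4` zeroes the upper 64 bits
    std::memcpy(&v.d[0], p, 8);  // ldr d4,[x1],#8
    p += 8;
    std::uint64_t x11;
    std::memcpy(&x11, p, 8);     // ldr x11,[x1],#8
    p += 8;
    v.d[1] = x11;                // ins v4.d[1],x11
    return v;
}

// Runs both load styles over the same buffer and compares register
// contents and pointer positions after every step.
bool loads_agree() {
    std::uint8_t buf[32];
    for (int i = 0; i < 32; ++i) {
        buf[i] = static_cast<std::uint8_t>(i * 7 + 1);
    }
    const std::uint8_t* a = buf;
    const std::uint8_t* b = buf;
    for (int k = 0; k < 2; ++k) {
        Vec128 va = load_128(a);
        Vec128 vb = load_64x2(b);
        if (std::memcmp(&va, &vb, sizeof va) != 0 || a != b) {
            return false;
        }
    }
    return a == buf + 32;
}
```

Both paths leave the same 16 bytes in the register and advance the pointer by 16, so the replacement changes only how the load is issued. The split form also occupies a general-purpose register (`x11` in the commit message) as a staging area for the upper half.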
microsoft · Mar 7, 2022 · 50a6f09
1 parent 55af7a9
commit 50a6f09

Showing 5 changed files with 955 additions and 3 deletions.
diff --git a/cmake/onnxruntime_mlas.cmake b/cmake/onnxruntime_mlas.cmake
@@ -62,6 +62,7 @@ function(setup_mlas_source_for_windows)
         ${MLAS_SRC_DIR}/arm64/SgemvKernelNeon.asm
         ${MLAS_SRC_DIR}/arm64/SymQgemmS8KernelNeon.asm
         ${MLAS_SRC_DIR}/arm64/SymQgemmS8KernelSDot.asm
+        ${MLAS_SRC_DIR}/arm64/SymQgemmS8KernelSDotLd64.asm
       )
     else()
       target_sources(onnxruntime_mlas PRIVATE
@@ -290,6 +291,7 @@ else()
           ${MLAS_SRC_DIR}/aarch64/SgemvKernelNeon.S
           ${MLAS_SRC_DIR}/aarch64/SymQgemmS8KernelNeon.S
           ${MLAS_SRC_DIR}/aarch64/SymQgemmS8KernelSdot.S
+          ${MLAS_SRC_DIR}/aarch64/SymQgemmS8KernelSdotLd64.S
           ${MLAS_SRC_DIR}/qgemm_kernel_neon.cpp
           ${MLAS_SRC_DIR}/qgemm_kernel_udot.cpp
           ${MLAS_SRC_DIR}/qgemm_kernel_sdot.cpp

diff --git a/onnxruntime/core/mlas/lib/aarch64/SymQgemmS8KernelSdot.S b/onnxruntime/core/mlas/lib/aarch64/SymQgemmS8KernelSdot.S
@@ -18,14 +18,15 @@ Abstract:
     constant. When the packed right hand side is cached, we achieves higher performance
     by avoid packing all together.
 
+    This version utilizes dot product instructions, and uses 128b loads
+
 --*/
 
 #include "asmmacro.h"
 #include "AssembleDotProduct.h"
 
 //
-// Stack frame layout for the symmetric convolution kernel.
-// d8-d15, x19-x30 need to be preserved if used
+// Stack frame layout d8-d15, x19-x30 need to be preserved if used
 //
         .equ    .LGemmS8S8KernelFrame_SavedRegisters,   (4 * 8)
         .equ    .LGemmS8S8KernelFrame_ColumnSumBuffer,  (0 + .LGemmS8S8KernelFrame_SavedRegisters)