
[ARM CPU] Add Fp16 kernels for MatMulNBits #22651

Closed · wants to merge 56 commits

Conversation

@fajin-corp (Contributor) commented Oct 30, 2024

Description

Add Fp16 kernels for MatMulNBits.
Support computation with an Fp16 A (activation) input at accuracy level 2.

BlkLen: 128 / Symmetric: 0 / HasBias: 1

| Threads | M | N | K | Fp32 Time | Fp16 Time | Fp16 latency reduction |
|---|---|---|---|---|---|---|
| 1 | 1 | 4096 | 3072 | 5086301 ns | 2912092 ns | 42.7% |
| 1 | 1 | 4096 | 11008 | 17866090 ns | 10989713 ns | 38.5% |
| 1 | 1 | 11008 | 3072 | 13763608 ns | 7844626 ns | 43.0% |
| 1 | 4096 | 4096 | 3072 | 2843439224 ns | 1954152587 ns | 31.3% |
| 8 | 1 | 4096 | 3072 | 627008 ns | 371404 ns | 40.8% |
| 8 | 1 | 4096 | 11008 | 2229758 ns | 1370499 ns | 38.5% |
| 8 | 1 | 11008 | 3072 | 1713451 ns | 1008165 ns | 41.2% |
| 8 | 4096 | 4096 | 3072 | 374325569 ns | 250992166 ns | 32.9% |

Motivation and Context

Add cross-device data type support.

@fajin-corp requested a review from a team as a code owner on October 30, 2024
@amarin16 (Collaborator) commented Oct 30, 2024

There seem to be conflicts in matmul_nbits.cc and matmul_4bits_test.cc. #Resolved

@fajin-corp force-pushed the fajin/mmnbfp16armsimd branch from 3e095fc to 98b1e5f on October 30, 2024
@edgchen1 (Contributor) left a comment

initial review

```cpp
    return CompInt8;
  }
  // Fallback to fp16. If fp16 optimized path is not available, it will further fall back to fp32.
  return CompFp16;
```
@edgchen1 (Contributor):

so this will return CompFp16 even if accuracy_level_attr is CompFp32?

@fajin-corp (Contributor, Author) commented Oct 31, 2024

I don't see a point in using CompFp32 for fp16 input if CompFp16 is available. Converting fp16 to fp32 does not bring more precision, and the casting only makes performance worse.

@edgchen1 (Contributor):

I agree that it doesn't make sense for fp16 input. For fp16 input, what do you think about treating the default accuracy level value (unset) as CompFp16 and treating an explicit accuracy level of CompFp32 as an error?

@fajin-corp (Contributor, Author):

If accuracy level 1 is given for fp16 input, maybe show a warning and use CompFp16?

@edgchen1 (Contributor):

sure, a warning is good too
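
The behavior the reviewers converged on can be sketched roughly as follows. This is an illustrative sketch only, not the code merged by this PR; the function name `GetComputeTypeFp16` and the literal attribute encodings (1 = fp32, 2 = fp16, 4 = int8) are assumptions made for the example.

```cpp
// Illustrative sketch (not the merged implementation) of resolving the
// accuracy_level attribute for an fp16 A input, per the discussion above.
MLAS_QNBIT_GEMM_COMPUTE_TYPE GetComputeTypeFp16(int64_t accuracy_level_attr) {
  constexpr int64_t kLevelFp32 = 1;  // assumed attribute encodings
  constexpr int64_t kLevelInt8 = 4;

  if (accuracy_level_attr == kLevelInt8) {
    return CompInt8;  // an explicit int8 request is honored
  }
  if (accuracy_level_attr == kLevelFp32) {
    // Upcasting fp16 to fp32 adds no precision and only costs performance,
    // so per the discussion: emit a warning (e.g. via the session logger)
    // and use the fp16 path instead of erroring out.
    return CompFp16;
  }
  // Unset (default) and explicit fp16 both select CompFp16. If no optimized
  // fp16 kernel is available, MLAS falls further back to fp32 internally.
  return CompFp16;
}
```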

@fajin-corp (Contributor, Author):

resolved

In reply to: 2445691891

@github-actions (bot) left a comment

You can commit the suggested changes from lintrunner.

```cpp
    }
}

void SQ4BitGemm_CompInt8(
```

Code scanning / CodeQL — Warning: Poorly documented large function

Poorly documented function: fewer than 2% comments for a function of 127 lines.
```cpp
PackedQuantBData = reinterpret_cast<std::byte*>(MlasAlignAddress(PackedQuantBWorkspace, 32));
QuantBBlkSum = reinterpret_cast<T*>(PackedQuantBData + PackedQuantBDataSize);
QuantBBlkSum = reinterpret_cast<T*>(MlasAlignAddress(QuantBBlkSum, MlasQNBitQuantBBlkSumAlignment()));
PackedQuantBScale = reinterpret_cast<T*>(reinterpret_cast<std::byte*>(QuantBBlkSum) + BlkSumSize);
```

Code scanning / CodeQL — Failure: Suspicious pointer scaling (High)

This pointer might have type float (size 4) or MLFloat16 (size 2), but this pointer arithmetic is done with type byte* (size 1).
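For context, the flagged lines carve several typed, aligned sub-buffers out of a single raw workspace allocation, and the byte-unit arithmetic is deliberate: sizes like PackedQuantBDataSize and BlkSumSize are byte counts, so the math must happen on a byte pointer before casting to T*. Below is a hedged, self-contained sketch of that pattern; the helper names `AlignAddress` and `CarveWorkspace` and the alignment constants are invented for illustration. CodeQL sees a T*-typed result produced by byte-scaled arithmetic and flags it, but as long as the sizes really are byte counts, the computation is correct.

```cpp
#include <cstddef>
#include <cstdint>

// Round p up to the next multiple of alignment (must be a power of two).
inline std::byte* AlignAddress(std::byte* p, std::uintptr_t alignment) {
  auto v = reinterpret_cast<std::uintptr_t>(p);
  v = (v + alignment - 1) & ~(alignment - 1);
  return reinterpret_cast<std::byte*>(v);
}

// Carve three regions out of one workspace allocation: packed quantized B
// data, per-block sums, and packed scales. All size arguments are in bytes.
template <typename T>
void CarveWorkspace(std::byte* workspace, size_t packed_data_size, size_t blk_sum_size,
                    std::byte*& packed_quant_b_data, T*& quant_b_blk_sum,
                    T*& packed_quant_b_scale) {
  packed_quant_b_data = AlignAddress(workspace, 32);
  // Advance past the packed data in byte units, then realign for the sums.
  std::byte* blk_sum_bytes = AlignAddress(packed_quant_b_data + packed_data_size, 64);
  quant_b_blk_sum = reinterpret_cast<T*>(blk_sum_bytes);
  // Scales start right after the block sums; arithmetic stays in byte units.
  packed_quant_b_scale = reinterpret_cast<T*>(blk_sum_bytes + blk_sum_size);
}
```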
@edgchen1 (Contributor) left a comment

looks good. had a few comments.

```cpp
template <typename ElementType>
typename std::enable_if_t<!std::is_same_v<ElementType, MLAS_FP16>, std::vector<ElementType>>
RandomVectorUniform(
```

@edgchen1 (Contributor):

nit: would it be simpler to have a specialization for MLAS_FP16 instead of two enable_ifs?

@fajin-corp (Contributor, Author):

Since this is a .h file included in multiple .cpp files, an explicit specialization would trigger redefinition errors, so I chose enable_if.
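
A minimal sketch of the two-enable_if shape being discussed; it assumes, for illustration, that MLAS_FP16 is constructible from float and exposes a ToFloat() conversion. Because both overloads remain function templates (only the SFINAE'd return type differs), they can live in a header included by multiple .cpp files without violating the one-definition rule, which is exactly the redefinition problem an explicit specialization would hit unless declared inline.

```cpp
#include <random>
#include <type_traits>
#include <vector>

// Primary overload: real types directly supported by
// std::uniform_real_distribution (float, double).
template <typename ElementType>
typename std::enable_if_t<!std::is_same_v<ElementType, MLAS_FP16>, std::vector<ElementType>>
RandomVectorUniform(size_t n, ElementType min_value, ElementType max_value) {
  std::default_random_engine generator(7);  // fixed seed for reproducible test data
  std::uniform_real_distribution<ElementType> distribution(min_value, max_value);
  std::vector<ElementType> v(n);
  for (auto& x : v) {
    x = distribution(generator);
  }
  return v;
}

// fp16 overload: draw in float and narrow, since
// std::uniform_real_distribution does not support half precision.
template <typename ElementType>
typename std::enable_if_t<std::is_same_v<ElementType, MLAS_FP16>, std::vector<ElementType>>
RandomVectorUniform(size_t n, ElementType min_value, ElementType max_value) {
  std::default_random_engine generator(7);
  std::uniform_real_distribution<float> distribution(min_value.ToFloat(), max_value.ToFloat());
  std::vector<ElementType> v(n);
  for (auto& x : v) {
    x = MLAS_FP16(distribution(generator));
  }
  return v;
}
```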

@fajin-corp force-pushed the fajin/mmnbfp16armsimd branch from 57ef96b to 037db3f on November 8, 2024
@fajin-corp force-pushed the fajin/mmnbfp16armsimd branch from 037db3f to 8050f0a on November 8, 2024
fajin-corp added a commit that referenced this pull request Nov 12, 2024

### Description
A break-down PR of #22651. Add fp16 kernels.

fajin-corp added a commit that referenced this pull request Nov 14, 2024

### Description
A break-down PR of #22651. Op API change only.
- add templates to functions and classes that support fp32 and fp16
- rename functions, classes, and files that support fp32 and fp16 from SQNBxxx to QNBxxx

fajin-corp added a commit that referenced this pull request Nov 15, 2024

### Description
A break-down PR of #22651.

@fajin-corp closed this Nov 15, 2024
ishwar-raut1 pushed the three break-down commits above to ishwar-raut1/onnxruntime, referencing this pull request, on Nov 19, 2024.
guschmue pushed the same three commits, referencing this pull request, on Dec 2, 2024.
ankitm3k pushed the same three commits to intel/onnxruntime, referencing this pull request, on Dec 11, 2024.