Update default M from 2 to 8 #35

a3213105 · 2023-11-01T06:01:09Z

This PR raises the default M threshold from 2 to 8: when the matrix dimension M <= 8, we use the ig kernel to boost bf16 GEMMs; otherwise we use the oneDNN kernel.
The performance data below shows that M=8 improves next-token performance for BS=4 and BS=8, especially BS=4.
This patch does not affect first-token latency, since M for the first token is larger than 8.
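
The dispatch rule described above can be sketched as follows. This is only an illustration, not the code changed by this PR: the names `SMALL_M_THRESHOLD` and `select_gemm_kernel` are hypothetical, and the usage loop assumes that M equals the batch size during next-token generation, which the description implies but does not state.

```python
# Hypothetical sketch of the threshold rule; not the project's actual code.
SMALL_M_THRESHOLD = 8  # default proposed in this PR (previously 2)


def select_gemm_kernel(m: int, threshold: int = SMALL_M_THRESHOLD) -> str:
    """Return which bf16 GEMM path would handle a matrix with M rows."""
    # Small M goes to the ig kernel, everything else to oneDNN.
    return "ig" if m <= threshold else "onednn"


if __name__ == "__main__":
    # Assumption: for next-token generation, M equals the batch size.
    for batch_size in (1, 2, 4, 8, 16):
        old = select_gemm_kernel(batch_size, threshold=2)
        new = select_gemm_kernel(batch_size)
        print(f"BS={batch_size}: old default -> {old}, new default -> {new}")
```

Under that assumption, only BS=4 and BS=8 switch kernels when the default moves from 2 to 8, which matches the batch sizes called out in the description.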

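| model | setting | BS=1 | BS=2 | BS=4 | BS=8 | BS=16 |
|---|---|---|---|---|---|---|
| chatglm-6b | original M=2 | 30.58 | 31.03 | 63.56 | 67.61 | 78.38 |
| chatglm-6b | M=7 | 29.54 | 32.58 | 41.28 | 68.14 | 78.2 |
| chatglm-6b | M=8 | 29.87 | 32.88 | 40.57 | 66.82 | 78.83 |
| chatglm2-6b | original M=2 | 30.7 | 31.61 | 63.11 | 67.32 | 83.61 |
| chatglm2-6b | M=7 | 29.94 | 33.3 | 41.1 | 70.68 | 83.49 |
| chatglm2-6b | M=8 | 31.76 | 31.84 | 41.01 | 63.84 | 78.83 |
| llama-7b | original M=2 | 32.38 | 36.47 | 71.6 | 78.45 | 92.83 |
| llama-7b | M=7 | 33.88 | 37.54 | 46.43 | 79.64 | 93.27 |
| llama-7b | M=8 | 32.64 | 36.59 | 46.04 | 75.14 | 93.18 |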
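
To make the BS=4 gain concrete, the short script below computes the relative change between the original M=2 row and the M=8 row. It assumes the figures are per-token latencies where lower is better; the PR gives no unit, so treat that reading as an assumption. The helper name `relative_reduction` is illustrative only.

```python
# BS=4 figures copied from the table in this PR: (original M=2, M=8).
# Assumption: values are latencies, so a smaller number is an improvement.
BS4 = {
    "chatglm-6b": (63.56, 40.57),
    "chatglm2-6b": (63.11, 41.01),
    "llama-7b": (71.6, 46.04),
}


def relative_reduction(before: float, after: float) -> float:
    """Fraction by which `after` is lower than `before`."""
    return (before - after) / before


if __name__ == "__main__":
    for model, (m2, m8) in BS4.items():
        print(f"{model}: {relative_reduction(m2, m8):.1%} lower at M=8")
```

For all three models the BS=4 figure drops by roughly 35 to 36 percent, while the BS=1, BS=2 and BS=16 columns stay close to their original values.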

Update default M=8, so when matrix M<=8, we will use ig kernel to boo…

3f67cc7

…st bf16 gemms, otherwise use oneDNN kernel.

changqi1 approved these changes Nov 1, 2023

Duyi-Wang merged commit 7e448f6 into intel:main Nov 2, 2023

a3213105 deleted the mulmal_default_M_size branch February 22, 2024 05:48

Duyi-Wang added a commit that referenced this pull request Aug 21, 2024

[Comm] Merge dlopen call into a inline func. (#35)

87867f2
