This repository has been archived by the owner on Aug 30, 2024. It is now read-only.

Enable Mixtral8x7B #138

Merged: 38 commits into main, Mar 4, 2024

Conversation

@intellinjun (Contributor) commented Feb 23, 2024

Type of Change

Model enabling

Description

Enable Mixtral-8x7B

  • Three new ops: mul_mat_id, argsort, and top_k (see the sketch below)
  • Two new hyperparameters: n_experts and n_experts_used
  • Uses the GGUF model format
  • Supports F32, F16, and Jblas
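A rough numpy sketch of what the three new ops compute together for MoE routing; this is illustrative only, not the C kernels added in this PR, and every name except n_experts and n_experts_used is hypothetical:

```python
import numpy as np

def moe_route(router_logits, expert_weights, x, n_experts_used=2):
    """Route one token through its top experts (conceptual sketch)."""
    order = np.argsort(-router_logits)        # argsort: rank experts by router score
    top_ids = order[:n_experts_used]          # top_k: keep the best n_experts_used
    gates = np.exp(router_logits[top_ids])
    gates /= gates.sum()                      # softmax over the selected experts
    out = np.zeros(expert_weights.shape[1])
    for gate, e in zip(gates, top_ids):       # mul_mat_id: matmul indexed by expert id
        out += gate * (expert_weights[e] @ x)
    return out

n_experts, d_in, d_out = 8, 16, 16
x = np.random.randn(d_in)
logits = np.random.randn(n_experts)
w = np.random.randn(n_experts, d_out, d_in)   # one weight matrix per expert
print(moe_route(logits, w, x).shape)          # (16,)
```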

@airMeng (Contributor) left a comment

MOE only applies to llama now?

Review threads on neural_speed/convert/convert_bloom.py and neural_speed/core/ne_layers.c (both resolved).
@a32543254 (Contributor) left a comment

LGTM

Review threads on neural_speed/models/llama/llama_utils.cpp, neural_speed/models/llama/llama.cpp, and neural_speed/core/layers/Ops.h (all resolved).
@intellinjun (Contributor, Author)

> MOE only applies to llama now?

MoE only applies to Mixtral; compared with Llama, Mixtral differs only in the FFN.
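A minimal numpy sketch of that difference, assuming the standard SwiGLU FFN; shapes and helper names are illustrative, not the neural_speed implementation:

```python
import numpy as np

def silu(v):
    return v / (1.0 + np.exp(-v))

def llama_ffn(x, w_gate, w_up, w_down):
    # dense SwiGLU FFN used by Llama: down(silu(gate(x)) * up(x))
    return w_down @ (silu(w_gate @ x) * (w_up @ x))

def mixtral_moe_ffn(x, router, experts, n_experts_used=2):
    # Mixtral: same block, but the FFN is picked per token from a set of experts
    logits = router @ x
    top_ids = np.argsort(-logits)[:n_experts_used]
    gates = np.exp(logits[top_ids])
    gates /= gates.sum()
    out = np.zeros_like(x)
    for g, e in zip(gates, top_ids):
        out += g * llama_ffn(x, *experts[e])   # each expert is just a Llama-style FFN
    return out

d, d_ff, n_experts = 8, 16, 4
experts = [(np.random.randn(d_ff, d), np.random.randn(d_ff, d), np.random.randn(d, d_ff))
           for _ in range(n_experts)]
print(mixtral_moe_ffn(np.random.randn(d), np.random.randn(n_experts, d), experts).shape)  # (8,)
```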

@a32543254 (Contributor)

Could you post some performance data for this new model here?

@intellinjun (Contributor, Author)

> Could you post some performance data for this new model here?

I will add it to CI and use CI to test the performance.

@Zhenzhong1 (Contributor) commented Feb 26, 2024

If the GGUF version is updated in the future, please add the GGUF format for Mixtral if possible.
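Not part of this PR, but as a quick way to check which MoE metadata a Mixtral GGUF carries, a hedged sketch using the `gguf` pip package; the file path is hypothetical, and the exact key names (e.g. llama.expert_count, llama.expert_used_count) should be verified against the GGUF version being targeted:

```python
from gguf import GGUFReader  # pip install gguf

# Print every metadata key that mentions experts; adjust the path to a real file.
reader = GGUFReader("mixtral-8x7b-q4.gguf")
for name in reader.fields:
    if "expert" in name:
        print(name)
```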

Signed-off-by: intellinjun <[email protected]>
@intellinjun (Contributor, Author)
| model | quant_type | input_token | output_token | cores/instance | first token | next token | total latency |
|---|---|---|---|---|---|---|---|
| mixtral-8x7b | q4_j_b128 | 32 | 32 | 32 | 803.09 | 184.41 | 6519.8 |
| mixtral-8x7b | q4_j_b128 | 32 | 32 | 48 | 836.7 | 208.27 | 7293.07 |
| mixtral-8x7b | q4_j_b128 | 32 | 32 | 56 | 741.93 | 210.23 | 7259.06 |
| mixtral-8x7b | q4_j_b128 | 1024 | 32 | 32 | 20802.97 | 189.49 | 26677.16 |
| mixtral-8x7b | q4_j_b128 | 1024 | 32 | 48 | 18736 | 211.79 | 25301.49 |
| mixtral-8x7b | q4_j_b128 | 1024 | 32 | 56 | 17763.37 | 217.67 | 24511.14 |
| mixtral-8x7b | q4_j_b32 | 32 | 32 | 32 | 915.76 | 201.5 | 7162.26 |
| mixtral-8x7b | q4_j_b32 | 32 | 32 | 48 | 893.81 | 215.74 | 7581.75 |
| mixtral-8x7b | q4_j_b32 | 32 | 32 | 56 | 867.74 | 221.13 | 7722.77 |
| mixtral-8x7b | q4_j_b32 | 1024 | 32 | 32 | 22315 | 200.47 | 28529.57 |
| mixtral-8x7b | q4_j_b32 | 1024 | 32 | 48 | 20216.37 | 222.51 | 27114.18 |
| mixtral-8x7b | q4_j_b32 | 1024 | 32 | 56 | 20029.26 | 230.25 | 27167.01 |

@intellinjun (Contributor, Author)
| model | quant_type | input_token | output_token | cores/instance | first token | next token | total latency |
|---|---|---|---|---|---|---|---|
| mixtral-8x7b | q4_j_b128 | 32 | 32 | 32 | 120.17 | 41.14 | 1395.51 |
| mixtral-8x7b | q4_j_b128 | 32 | 32 | 48 | 139.71 | 37.54 | 1303.45 |
| mixtral-8x7b | q4_j_b128 | 32 | 32 | 56 | 144.02 | 37.73 | 1313.65 |
| mixtral-8x7b | q4_j_b128 | 1024 | 32 | 32 | 1408.29 | 44.63 | 2791.82 |
| mixtral-8x7b | q4_j_b128 | 1024 | 32 | 48 | 1418.5 | 40 | 2658.5 |
| mixtral-8x7b | q4_j_b128 | 1024 | 32 | 56 | 1301.49 | 42.21 | 2610 |
| mixtral-8x7b | q4_j_b32 | 32 | 32 | 32 | 250.65 | 52.39 | 1874.74 |
| mixtral-8x7b | q4_j_b32 | 32 | 32 | 48 | 391.57 | 52.54 | 2020.31 |
| mixtral-8x7b | q4_j_b32 | 32 | 32 | 56 | 280.68 | 49.85 | 1826.03 |
| mixtral-8x7b | q4_j_b32 | 1024 | 32 | 32 | 3414.51 | 59.43 | 5256.84 |
| mixtral-8x7b | q4_j_b32 | 1024 | 32 | 48 | 2960.75 | 52.52 | 4588.87 |
| mixtral-8x7b | q4_j_b32 | 1024 | 32 | 56 | 2841.19 | 51.34 | 4432.73 |

These results use the mul_mat_id_silu fusion.

@luoyu-intel (Contributor)

> These results use the mul_mat_id_silu fusion.

It's SiLU that slows down the first-token inference, right?

@intellinjun (Contributor, Author)

> It's SiLU that slows down the first-token inference, right?

SiLU slows down the next-token inference; the first-token slowdown is caused by a problem with the thread setup in mul_mat_id.
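For reference, a conceptual numpy sketch of what the mul_mat_id_silu fusion changes: SiLU is applied right after each expert's gate matmul instead of as a separate elementwise pass, so the intermediate activations are only touched once. This is an illustration of the idea, not the neural_speed kernel; both versions compute the same values:

```python
import numpy as np

def silu(v):
    return v / (1.0 + np.exp(-v))

def moe_ffn_unfused(x, experts, top_ids, gates):
    gate_out = [experts[e][0] @ x for e in top_ids]    # expert gate matmuls (mul_mat_id)
    act = [silu(g) for g in gate_out]                  # separate SiLU pass
    return sum(g * (experts[e][2] @ (a * (experts[e][1] @ x)))
               for g, e, a in zip(gates, top_ids, act))

def moe_ffn_fused(x, experts, top_ids, gates):
    out = np.zeros_like(x)
    for g, e in zip(gates, top_ids):
        w_gate, w_up, w_down = experts[e]
        out += g * (w_down @ (silu(w_gate @ x) * (w_up @ x)))  # SiLU fused into the expert loop
    return out

d, d_ff = 8, 16
experts = [(np.random.randn(d_ff, d), np.random.randn(d_ff, d), np.random.randn(d, d_ff))
           for _ in range(4)]
x, top_ids, gates = np.random.randn(d), [0, 2], np.array([0.6, 0.4])
print(np.allclose(moe_ffn_unfused(x, experts, top_ids, gates),
                  moe_ffn_fused(x, experts, top_ids, gates)))  # True
```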

@a32543254 (Contributor)

Nice!

@VincyZhang merged commit 9bcb612 into main on Mar 4, 2024
11 checks passed
8 participants