This repository has been archived by the owner on Aug 30, 2024. It is now read-only.

Enable Mixtral8x7B #138

Merged: 38 commits into main, Mar 4, 2024

Conversation

@intellinjun (Contributor) commented Feb 23, 2024

Type of Change

Model enabling

Description

Enable Mixtral-8x7B

  • Three new ops: mul_mat_id, argsort, and top_k (see the sketch below)
  • Two new hyperparameters: n_experts and n_experts_used
  • Uses the GGUF model format
  • Supports F32, F16, and Jblas
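A rough numpy sketch of what the three new ops compute together for MoE routing; this is illustrative only, not the C kernels added in this PR, and every name except n_experts and n_experts_used is hypothetical:

```python
import numpy as np

def moe_route(router_logits, expert_weights, x, n_experts_used=2):
    """Route one token through its top experts (conceptual sketch)."""
    order = np.argsort(-router_logits)        # argsort: rank experts by router score
    top_ids = order[:n_experts_used]          # top_k: keep the best n_experts_used
    gates = np.exp(router_logits[top_ids])
    gates /= gates.sum()                      # softmax over the selected experts
    out = np.zeros(expert_weights.shape[1])
    for gate, e in zip(gates, top_ids):       # mul_mat_id: matmul indexed by expert id
        out += gate * (expert_weights[e] @ x)
    return out

n_experts, d_in, d_out = 8, 16, 16
x = np.random.randn(d_in)
logits = np.random.randn(n_experts)
w = np.random.randn(n_experts, d_out, d_in)   # one weight matrix per expert
print(moe_route(logits, w, x).shape)          # (16,)
```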

@airMeng (Contributor) left a comment

MOE only applies to llama now?

Review threads on neural_speed/convert/convert_bloom.py and neural_speed/core/ne_layers.c (both resolved).
@a32543254 (Contributor) left a comment

LGTM

Review threads on neural_speed/models/llama/llama_utils.cpp, neural_speed/models/llama/llama.cpp, and neural_speed/core/layers/Ops.h (all resolved).
@intellinjun (Contributor, Author)

> MOE only applies to llama now?

MoE only applies to Mixtral; compared with Llama, Mixtral differs only in the FFN.
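A minimal numpy sketch of that difference, assuming the standard SwiGLU FFN; shapes and helper names are illustrative, not the neural_speed implementation:

```python
import numpy as np

def silu(v):
    return v / (1.0 + np.exp(-v))

def llama_ffn(x, w_gate, w_up, w_down):
    # dense SwiGLU FFN used by Llama: down(silu(gate(x)) * up(x))
    return w_down @ (silu(w_gate @ x) * (w_up @ x))

def mixtral_moe_ffn(x, router, experts, n_experts_used=2):
    # Mixtral: same block, but the FFN is picked per token from a set of experts
    logits = router @ x
    top_ids = np.argsort(-logits)[:n_experts_used]
    gates = np.exp(logits[top_ids])
    gates /= gates.sum()
    out = np.zeros_like(x)
    for g, e in zip(gates, top_ids):
        out += g * llama_ffn(x, *experts[e])   # each expert is just a Llama-style FFN
    return out

d, d_ff, n_experts = 8, 16, 4
experts = [(np.random.randn(d_ff, d), np.random.randn(d_ff, d), np.random.randn(d, d_ff))
           for _ in range(n_experts)]
print(mixtral_moe_ffn(np.random.randn(d), np.random.randn(n_experts, d), experts).shape)  # (8,)
```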

@a32543254 (Contributor)

Could you post some performance data for this new model here?

@intellinjun (Contributor, Author)

> Could you post some performance data for this new model here?

I will add it to CI and use CI to test the performance.

@Zhenzhong1 (Contributor) commented Feb 26, 2024

If the GGUF version is updated in the future, please add the GGUF format for Mixtral if possible.
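Not part of this PR, but as a quick way to check which MoE metadata a Mixtral GGUF carries, a hedged sketch using the `gguf` pip package; the file path is hypothetical, and the exact key names (e.g. llama.expert_count, llama.expert_used_count) should be verified against the GGUF version being targeted:

```python
from gguf import GGUFReader  # pip install gguf

# Print every metadata key that mentions experts; adjust the path to a real file.
reader = GGUFReader("mixtral-8x7b-q4.gguf")
for name in reader.fields:
    if "expert" in name:
        print(name)
```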

Signed-off-by: intellinjun <[email protected]>
@intellinjun (Contributor, Author)
| model | quant_type | input_token | output_token | cores/instance | first token | next token | total latency |
|---|---|---|---|---|---|---|---|
| mixtral-8x7b | q4_j_b128 | 32 | 32 | 32 | 803.09 | 184.41 | 6519.8 |
| mixtral-8x7b | q4_j_b128 | 32 | 32 | 48 | 836.7 | 208.27 | 7293.07 |
| mixtral-8x7b | q4_j_b128 | 32 | 32 | 56 | 741.93 | 210.23 | 7259.06 |
| mixtral-8x7b | q4_j_b128 | 1024 | 32 | 32 | 20802.97 | 189.49 | 26677.16 |
| mixtral-8x7b | q4_j_b128 | 1024 | 32 | 48 | 18736 | 211.79 | 25301.49 |
| mixtral-8x7b | q4_j_b128 | 1024 | 32 | 56 | 17763.37 | 217.67 | 24511.14 |
| mixtral-8x7b | q4_j_b32 | 32 | 32 | 32 | 915.76 | 201.5 | 7162.26 |
| mixtral-8x7b | q4_j_b32 | 32 | 32 | 48 | 893.81 | 215.74 | 7581.75 |
| mixtral-8x7b | q4_j_b32 | 32 | 32 | 56 | 867.74 | 221.13 | 7722.77 |
| mixtral-8x7b | q4_j_b32 | 1024 | 32 | 32 | 22315 | 200.47 | 28529.57 |
| mixtral-8x7b | q4_j_b32 | 1024 | 32 | 48 | 20216.37 | 222.51 | 27114.18 |
| mixtral-8x7b | q4_j_b32 | 1024 | 32 | 56 | 20029.26 | 230.25 | 27167.01 |

@intellinjun (Contributor, Author)
| model | quant_type | input_token | output_token | cores/instance | first token | next token | total latency |
|---|---|---|---|---|---|---|---|
| mixtral-8x7b | q4_j_b128 | 32 | 32 | 32 | 120.17 | 41.14 | 1395.51 |
| mixtral-8x7b | q4_j_b128 | 32 | 32 | 48 | 139.71 | 37.54 | 1303.45 |
| mixtral-8x7b | q4_j_b128 | 32 | 32 | 56 | 144.02 | 37.73 | 1313.65 |
| mixtral-8x7b | q4_j_b128 | 1024 | 32 | 32 | 1408.29 | 44.63 | 2791.82 |
| mixtral-8x7b | q4_j_b128 | 1024 | 32 | 48 | 1418.5 | 40 | 2658.5 |
| mixtral-8x7b | q4_j_b128 | 1024 | 32 | 56 | 1301.49 | 42.21 | 2610 |
| mixtral-8x7b | q4_j_b32 | 32 | 32 | 32 | 250.65 | 52.39 | 1874.74 |
| mixtral-8x7b | q4_j_b32 | 32 | 32 | 48 | 391.57 | 52.54 | 2020.31 |
| mixtral-8x7b | q4_j_b32 | 32 | 32 | 56 | 280.68 | 49.85 | 1826.03 |
| mixtral-8x7b | q4_j_b32 | 1024 | 32 | 32 | 3414.51 | 59.43 | 5256.84 |
| mixtral-8x7b | q4_j_b32 | 1024 | 32 | 48 | 2960.75 | 52.52 | 4588.87 |
| mixtral-8x7b | q4_j_b32 | 1024 | 32 | 56 | 2841.19 | 51.34 | 4432.73 |

These results use the mul_mat_id_silu fusion.

@luoyu-intel (Contributor)

> These results use the mul_mat_id_silu fusion.

It's SiLU that slows down the first-token inference, right?

@intellinjun (Contributor, Author)

> It's SiLU that slows down the first-token inference, right?

SiLU slows down the next-token inference; the first-token slowdown is caused by a problem with the thread setup in mul_mat_id.
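For reference, a conceptual numpy sketch of what the mul_mat_id_silu fusion changes: SiLU is applied right after each expert's gate matmul instead of as a separate elementwise pass, so the intermediate activations are only touched once. This is an illustration of the idea, not the neural_speed kernel; both versions compute the same values:

```python
import numpy as np

def silu(v):
    return v / (1.0 + np.exp(-v))

def moe_ffn_unfused(x, experts, top_ids, gates):
    gate_out = [experts[e][0] @ x for e in top_ids]    # expert gate matmuls (mul_mat_id)
    act = [silu(g) for g in gate_out]                  # separate SiLU pass
    return sum(g * (experts[e][2] @ (a * (experts[e][1] @ x)))
               for g, e, a in zip(gates, top_ids, act))

def moe_ffn_fused(x, experts, top_ids, gates):
    out = np.zeros_like(x)
    for g, e in zip(gates, top_ids):
        w_gate, w_up, w_down = experts[e]
        out += g * (w_down @ (silu(w_gate @ x) * (w_up @ x)))  # SiLU fused into the expert loop
    return out

d, d_ff = 8, 16
experts = [(np.random.randn(d_ff, d), np.random.randn(d_ff, d), np.random.randn(d, d_ff))
           for _ in range(4)]
x, top_ids, gates = np.random.randn(d), [0, 2], np.array([0.6, 0.4])
print(np.allclose(moe_ffn_unfused(x, experts, top_ids, gates),
                  moe_ffn_fused(x, experts, top_ids, gates)))  # True
```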

@a32543254 (Contributor)

Nice!

@VincyZhang merged commit 9bcb612 into main on Mar 4, 2024
11 checks passed
8 participants