Support MiniCPM3. #9322
Conversation
CarryFun commented Sep 5, 2024
- I have read the contributing guidelines
- Self-reported review complexity:
- Low
- Medium
- High
@CarryFun Will this PR be merged these days? I'm not sure if it's blocked by #9396. If there are only problems with CI, I can help fix them. Would you mind if I submit a PR to https://github.com/OpenBMB/llama.cpp/tree/minicpm3?
Looks like the only CI failures are the linter failing whitespace checks.
Once these get resolved, I imagine we can move forward on merging this in?
const float scale_embd = 12.0f;
const float scale_depth = 1.4f;
After #9412 we will have different names for these, but we can fix this later.
@@ -12825,6 +12909,215 @@ struct llm_build_context {
        return gf;
    }

    struct ggml_cgraph * build_minicpm3() {
Is this graph very similar to `build_deepseek2()`? Can we do some code de-duplication?
Doing a diff between this and `build_deepseek2()`, there is a lot in common between these two.

Some notable differences that I'm seeing on a first pass:

- `minicpm3` supports scaling at several stages -- scaling the input embeddings, scaling the hidden states near the end of each layer, and finally near the end just prior to the output. This scaling factor could be set to 1 for `deepseek2` and/or skipped, and we could probably reach parity this way (see the sketch after this comment).
- `deepseek2` supports a "lite" mode that simplifies the q calculation in each layer by a decent bit. This option would just be disabled in the `minicpm3` branch.
- `deepseek2` supports MoE / shared expert calculations for generating `ffn_out`, and -- like "lite" mode -- this wouldn't be needed in the `minicpm3` branch.
- `deepseek2` does some prescaling on the `kq_scale` and `attn_factor` that `minicpm3` doesn't need to do. Not sure if this could be aligned or not -- will need to dig into #7416 more before I understand this.
- In calls to `ggml_rope_ext`, `minicpm3` uses rope factors that `deepseek2` simply sets to null. This could be aligned pretty easily.
- And finally, the final output is calculated differently in each network -- `deepseek2` gets the result output with a call to `ggml_mul_mat`, and `minicpm3` uses a call to `llm_build_lora_mm` -- that may be one of the largest structural changes that I saw when comparing the two.

I think these two implementations could be aligned, but it may take a bit of refactoring of the deepseek2 code as well. Not sure how to weigh the value of this effort vs. just maintaining two separate branches of code.
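To make the scaling point above concrete, here is a minimal standalone sketch (not code from this PR -- only `scale_embd = 12.0f` and `scale_depth = 1.4f` come from the diff; the layer count, the output scale, and the helper are made up for illustration) showing the three stages and why setting each factor to 1.0f would reduce them to no-ops for `deepseek2`:

```cpp
// Standalone sketch of the three MiniCPM3-style scaling stages, treating the
// hidden state as a plain vector of floats. Only scale_embd (12.0f) and
// scale_depth (1.4f) come from the diff above; n_layer and the output scale
// are illustrative placeholders.
#include <cmath>
#include <cstdio>
#include <vector>

static void scale_inplace(std::vector<float> & v, float s) {
    for (float & x : v) x *= s;
}

int main() {
    const float scale_embd  = 12.0f; // applied to the input embeddings
    const float scale_depth = 1.4f;  // applied to each layer's residual branch
    const int   n_layer     = 62;    // placeholder layer count

    std::vector<float> h = {0.1f, -0.2f, 0.3f};

    // 1) input embedding scaling -- a no-op if scale_embd == 1.0f
    scale_inplace(h, scale_embd);

    // 2) per-layer residual scaling: the branch output is multiplied by
    //    scale_depth / sqrt(n_layer) before being added back -- a no-op
    //    when that overall factor is 1.0f
    std::vector<float> branch = {0.01f, 0.02f, 0.03f};
    const float res_scale = scale_depth / std::sqrt((float) n_layer);
    for (size_t i = 0; i < h.size(); ++i) h[i] += branch[i] * res_scale;

    // 3) output scaling just before the head -- again a no-op at 1.0f
    const float scale_output = 1.0f / 7.0f; // placeholder value
    scale_inplace(h, scale_output);

    printf("h[0] = %f\n", h[0]);
    return 0;
}
```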
My view is that the best place to look for code de-duplication is in the generic `llm_build_xxx` and other similar functions that represent the basic building blocks of the models. Conditional branches in the implementation depending on the arch make the code harder to follow and may have a higher maintenance cost than some code duplication.

Additionally, I think it would be preferable to create an abstract interface for the models, and move the implementation of each one to a separate file without any coupling between models, but that will be harder to implement if the different models share the same code in this way.
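As a rough illustration of that trade-off (nothing here is llama.cpp code -- the types and helper names are invented), shared building blocks keep each architecture's builder separate and readable while the common steps live in one place, instead of a single builder full of per-arch conditionals:

```cpp
// Invented sketch of the "shared building block" pattern: per-arch builders
// stay separate, while a common step is factored into one helper, rather than
// one builder branching on `if (arch == ...)` throughout.
#include <cstdio>
#include <string>
#include <vector>

struct graph { std::vector<std::string> ops; };

// generic building block shared by every architecture (analogous in spirit to llm_build_xxx)
static void build_attention(graph & g, bool use_rope_factors) {
    g.ops.push_back(use_rope_factors ? "rope_ext(with factors)" : "rope_ext(null factors)");
    g.ops.push_back("attn");
}

static graph build_minicpm3() {
    graph g;
    g.ops.push_back("scale_embd"); // arch-specific step stays local
    build_attention(g, /*use_rope_factors=*/true);
    return g;
}

static graph build_deepseek2() {
    graph g;
    build_attention(g, /*use_rope_factors=*/false);
    g.ops.push_back("moe_ffn");    // arch-specific step stays local
    return g;
}

int main() {
    const graph g1 = build_minicpm3();
    const graph g2 = build_deepseek2();
    for (const auto & op : g1.ops) printf("minicpm3:  %s\n", op.c_str());
    for (const auto & op : g2.ops) printf("deepseek2: %s\n", op.c_str());
    return 0;
}
```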
> And finally, the final output is calculated differently in each network -- `deepseek2` gets the result output with a call to `ggml_mul_mat`, and `minicpm3` uses a call to `llm_build_lora_mm` -- that may be one of the largest structural changes that I saw when comparing the two.

That's probably a bug that will make loras that modify this tensor not work properly. `llm_build_lora_mm` is the right function to use when performing matrix multiplications with the weights.
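For background on why that matters, here is a simplified standalone sketch of the idea (this is not the actual `llm_build_lora_mm` implementation; the matrix types and names are stand-ins): a plain weight multiply ignores any adapter attached to the tensor, while a LoRA-aware multiply also adds the low-rank delta `scale * B (A x)`:

```cpp
// Simplified standalone sketch of why a LoRA-aware matmul differs from a plain
// one: y = W x  vs  y = W x + scale * B (A x). Not the actual llama.cpp code.
#include <cstdio>
#include <vector>

using vec = std::vector<float>;
using mat = std::vector<vec>; // row-major

static vec mul(const mat & m, const vec & x) {
    vec y(m.size(), 0.0f);
    for (size_t i = 0; i < m.size(); ++i)
        for (size_t j = 0; j < x.size(); ++j)
            y[i] += m[i][j] * x[j];
    return y;
}

// plain multiply: ignores any adapter attached to this weight
static vec mul_mat(const mat & w, const vec & x) { return mul(w, x); }

// LoRA-aware multiply: also applies the low-rank delta scale * B (A x)
static vec lora_mm(const mat & w, const mat & a, const mat & b, float scale, const vec & x) {
    vec y  = mul(w, x);
    vec ax = mul(a, x);
    vec d  = mul(b, ax);
    for (size_t i = 0; i < y.size(); ++i) y[i] += scale * d[i];
    return y;
}

int main() {
    mat w = {{1, 0}, {0, 1}};   // base weight
    mat a = {{1, 1}};           // rank-1 adapter: A is 1x2
    mat b = {{0.5f}, {0.5f}};   // B is 2x1
    vec x = {1, 2};

    vec plain = mul_mat(w, x);
    vec lora  = lora_mm(w, a, b, 1.0f, x);
    printf("plain: %f %f\n", plain[0], plain[1]);
    printf("lora : %f %f\n", lora[0],  lora[1]);
    return 0;
}
```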
@@ -1818,6 +1818,58 @@ def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iter
        return [(self.map_tensor_name(name), data_torch)]

@Model.register("MiniCPM3ForCausalLM")
Lint is currently breaking on this -- need to add an additional blank line above your class definition.
Thank you for your comments.
weights.reshape(n_head, 2, weights.shape[0] // n_head // 2, *weights.shape[1:])
.swapaxes(1, 2)
.reshape(weights.shape)
)
Lint is also breaking on this -- need to add an additional blank line below your class definition as well.
Thank you for your comments.
LGTM -- all CI is passing now! There are some refactoring optimizations that can probably come later, but for now I think this is probably good enough for getting the new model added. Anyone else willing to weigh in on it? I'm not so confident that I'm willing to mark as approved on my own.
@HanClinto Thanks, will merge soon. Just want to first fix the Docker CI on
Co-authored-by: 范睿凯 <[email protected]>