Support MiniCPM3. #9322
Conversation
CarryFun commented Sep 5, 2024
- I have read the contributing guidelines
- Self-reported review complexity:
- Low
- Medium
- High
@CarryFun Will this PR be merged these days? I'm not sure if it's blocked by #9396. If there are only problems with CI, I can help fix them. Would you mind if I submit a PR to https://github.com/OpenBMB/llama.cpp/tree/minicpm3?
Looks like the only CI failures are the linter failing whitespace checks.
Once these get resolved, I imagine we can move forward on merging this in?
const float scale_embd = 12.0f;
const float scale_depth = 1.4f;
After #9412 we will have different names for these, but we can fix this later.
@@ -12825,6 +12909,215 @@ struct llm_build_context {
        return gf;
    }

    struct ggml_cgraph * build_minicpm3() {
Is this graph very similar to `build_deepseek2()`? Can we do some code de-duplication?
Doing a diff between this and `build_deepseek2()`, there is a lot in common between these two.

Some notable differences that I'm seeing on a first pass:

- `minicpm3` supports scaling at several stages -- scaling the input embeddings, scaling the hidden states near the end of each layer, and finally near the end just prior to the output. This scaling factor could be set to 1 for `deepseek2` and/or skipped, and we could probably reach parity this way (see the sketch after this comment).
- `deepseek2` supports a "lite" mode that simplifies the q calculation in each layer by a decent bit. This option would just be disabled in the `minicpm3` branch.
- `deepseek2` supports MoE / shared expert calculations for generating `ffn_out`, and -- like "lite" mode -- this wouldn't be needed in the `minicpm3` branch.
- `deepseek2` does some prescaling on the `kq_scale` and `attn_factor` that `minicpm3` doesn't need to do. Not sure if this could be aligned or not -- will need to dig into #7416 more before I understand this.
- In calls to `ggml_rope_ext`, `minicpm3` uses rope factors that `deepseek2` simply sets to null. This could be aligned pretty easily.
- And finally, the final output is calculated differently in each network -- `deepseek2` gets the result output with a call to `ggml_mul_mat`, and `minicpm3` uses a call to `llm_build_lora_mm` -- that may be one of the largest structural changes that I saw when comparing the two.

I think these two implementations could be aligned, but it may take a bit of refactoring of the deepseek2 code as well. Not sure how to weigh the value of this effort vs. just maintaining two separate branches of code.
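To make the scaling point above concrete, here is a minimal standalone sketch (not code from this PR -- only `scale_embd = 12.0f` and `scale_depth = 1.4f` come from the diff; the layer count, the output scale, and the helper are made up for illustration) showing the three stages and why setting each factor to 1.0f would reduce them to no-ops for `deepseek2`:

```cpp
// Standalone sketch of the three MiniCPM3-style scaling stages, treating the
// hidden state as a plain vector of floats. Only scale_embd (12.0f) and
// scale_depth (1.4f) come from the diff above; n_layer and the output scale
// are illustrative placeholders.
#include <cmath>
#include <cstdio>
#include <vector>

static void scale_inplace(std::vector<float> & v, float s) {
    for (float & x : v) x *= s;
}

int main() {
    const float scale_embd  = 12.0f; // applied to the input embeddings
    const float scale_depth = 1.4f;  // applied to each layer's residual branch
    const int   n_layer     = 62;    // placeholder layer count

    std::vector<float> h = {0.1f, -0.2f, 0.3f};

    // 1) input embedding scaling -- a no-op if scale_embd == 1.0f
    scale_inplace(h, scale_embd);

    // 2) per-layer residual scaling: the branch output is multiplied by
    //    scale_depth / sqrt(n_layer) before being added back -- a no-op
    //    when that overall factor is 1.0f
    std::vector<float> branch = {0.01f, 0.02f, 0.03f};
    const float res_scale = scale_depth / std::sqrt((float) n_layer);
    for (size_t i = 0; i < h.size(); ++i) h[i] += branch[i] * res_scale;

    // 3) output scaling just before the head -- again a no-op at 1.0f
    const float scale_output = 1.0f / 7.0f; // placeholder value
    scale_inplace(h, scale_output);

    printf("h[0] = %f\n", h[0]);
    return 0;
}
```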
My view is that the best place to look for code de-duplication is in the generic `llm_build_xxx` and other similar functions that represent the basic building blocks of the models. Conditional branches in the implementation depending on the arch make the code harder to follow and may have a higher maintenance cost than some code duplication.

Additionally, I think it would be preferable to create an abstract interface for the models, and move the implementation of each one to a separate file without any coupling between models, but that will be harder to implement if the different models share the same code in this way.
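As a rough illustration of that trade-off (nothing here is llama.cpp code -- the types and helper names are invented), shared building blocks keep each architecture's builder separate and readable while the common steps live in one place, instead of a single builder full of per-arch conditionals:

```cpp
// Invented sketch of the "shared building block" pattern: per-arch builders
// stay separate, while a common step is factored into one helper, rather than
// one builder branching on `if (arch == ...)` throughout.
#include <cstdio>
#include <string>
#include <vector>

struct graph { std::vector<std::string> ops; };

// generic building block shared by every architecture (analogous in spirit to llm_build_xxx)
static void build_attention(graph & g, bool use_rope_factors) {
    g.ops.push_back(use_rope_factors ? "rope_ext(with factors)" : "rope_ext(null factors)");
    g.ops.push_back("attn");
}

static graph build_minicpm3() {
    graph g;
    g.ops.push_back("scale_embd"); // arch-specific step stays local
    build_attention(g, /*use_rope_factors=*/true);
    return g;
}

static graph build_deepseek2() {
    graph g;
    build_attention(g, /*use_rope_factors=*/false);
    g.ops.push_back("moe_ffn");    // arch-specific step stays local
    return g;
}

int main() {
    const graph g1 = build_minicpm3();
    const graph g2 = build_deepseek2();
    for (const auto & op : g1.ops) printf("minicpm3:  %s\n", op.c_str());
    for (const auto & op : g2.ops) printf("deepseek2: %s\n", op.c_str());
    return 0;
}
```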
> And finally, the final output is calculated differently in each network -- `deepseek2` gets the result output with a call to `ggml_mul_mat`, and `minicpm3` uses a call to `llm_build_lora_mm` -- that may be one of the largest structural changes that I saw when comparing the two.

That's probably a bug that will make loras that modify this tensor not work properly. `llm_build_lora_mm` is the right function to use when performing matrix multiplications with the weights.
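For background on why that matters, here is a simplified standalone sketch of the idea (this is not the actual `llm_build_lora_mm` implementation; the matrix types and names are stand-ins): a plain weight multiply ignores any adapter attached to the tensor, while a LoRA-aware multiply also adds the low-rank delta `scale * B (A x)`:

```cpp
// Simplified standalone sketch of why a LoRA-aware matmul differs from a plain
// one: y = W x  vs  y = W x + scale * B (A x). Not the actual llama.cpp code.
#include <cstdio>
#include <vector>

using vec = std::vector<float>;
using mat = std::vector<vec>; // row-major

static vec mul(const mat & m, const vec & x) {
    vec y(m.size(), 0.0f);
    for (size_t i = 0; i < m.size(); ++i)
        for (size_t j = 0; j < x.size(); ++j)
            y[i] += m[i][j] * x[j];
    return y;
}

// plain multiply: ignores any adapter attached to this weight
static vec mul_mat(const mat & w, const vec & x) { return mul(w, x); }

// LoRA-aware multiply: also applies the low-rank delta scale * B (A x)
static vec lora_mm(const mat & w, const mat & a, const mat & b, float scale, const vec & x) {
    vec y  = mul(w, x);
    vec ax = mul(a, x);
    vec d  = mul(b, ax);
    for (size_t i = 0; i < y.size(); ++i) y[i] += scale * d[i];
    return y;
}

int main() {
    mat w = {{1, 0}, {0, 1}};   // base weight
    mat a = {{1, 1}};           // rank-1 adapter: A is 1x2
    mat b = {{0.5f}, {0.5f}};   // B is 2x1
    vec x = {1, 2};

    vec plain = mul_mat(w, x);
    vec lora  = lora_mm(w, a, b, 1.0f, x);
    printf("plain: %f %f\n", plain[0], plain[1]);
    printf("lora : %f %f\n", lora[0],  lora[1]);
    return 0;
}
```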
@@ -1818,6 +1818,58 @@ def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iter
        return [(self.map_tensor_name(name), data_torch)]

@Model.register("MiniCPM3ForCausalLM")
Lint is currently breaking on this -- need to add an additional blank line above your class definition.
Thank you for your comments.
weights.reshape(n_head, 2, weights.shape[0] // n_head // 2, *weights.shape[1:])
.swapaxes(1, 2)
.reshape(weights.shape)
)
Lint is also breaking on this -- need to add an additional blank line below your class definition as well.
Thank you for your comments.
LGTM -- all CI is passing now! There are some refactoring optimizations that can probably come later, but for now I think this is probably good enough for getting the new model added. Anyone else willing to weigh in on it? I'm not so confident that I'm willing to mark as approved on my own.
@HanClinto Thanks, will merge soon. Just want to first fix the Docker CI on
Co-authored-by: 范睿凯 <[email protected]>