
MUSA: support ARM64 and enable dp4a, etc. #11843

Open · BodhiHu wants to merge 31 commits into master
Conversation

BodhiHu commented Feb 13, 2025

This PR does the following:

  1. enable dp4a on MUSA (see the sketch after this list);
  2. fix compile errors on MUSA ARM64;
  3. support the sparse-MoE parameter expert_weights_scale for MoE-sparsified LLaMA models.
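
For reference, a minimal sketch of what enabling dp4a amounts to, assuming the MUSA toolchain exposes a CUDA-compatible `__dp4a` intrinsic; the guard and fallback below are illustrative, not the exact diff in this PR:

```cpp
// Illustrative sketch: route ggml's 4x int8 dot product through the hardware
// dp4a instruction where available, with a portable scalar fallback.
static __device__ __forceinline__ int ggml_cuda_dp4a(const int a, const int b, int c) {
#if defined(GGML_USE_MUSA) || __CUDA_ARCH__ >= 610
    return __dp4a(a, b, c); // assumes the compiler lowers this to one instruction
#else
    // fallback: unpack the four signed 8-bit lanes and accumulate manually
    const int8_t * a8 = (const int8_t *) &a;
    const int8_t * b8 = (const int8_t *) &b;
    return c + a8[0]*b8[0] + a8[1]*b8[1] + a8[2]*b8[2] + a8[3]*b8[3];
#endif
}
```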

Tested with the following models:

ARM64:

MUSA SDK: 3.1.2
CPU compiler: clang-17

  • qwen2.5-1.5b-instruct-q8_0.gguf
  • qwen2.5-3b-instruct-q4_k_m.gguf
  • deepseek-r1-7B-Q4_K_M.gguf

x86:

MUSA SDK: 3.1.1
CPU compiler: clang-14

  • llama3_8b_q4_0.gguf
  • deepseek-r1_7b_q4_0.gguf
  • qwen2.5-3b-instruct-q4_k_m.gguf

github-actions bot added the documentation, build, Nvidia GPU, python, and ggml labels on Feb 13, 2025
BodhiHu changed the title from "[wip] MUSA: enable dp4a and fix compile errors on ARM64" to "MUSA: enable dp4a and fix compile errors on ARM64" on Feb 13, 2025
BodhiHu (Author) commented Feb 13, 2025

Hi @JohannesGaessler, @ggerganov, @slaren, @yeahdongcn,

Could you please help review this PR?

Thanks a lot.

JohannesGaessler (Collaborator) left a comment

The changes to the CUDA backend look fine to me other than the things I commented on. I don't know whether the changes for model support are correct.

Review threads (outdated, resolved): CMakeLists.txt, ggml/src/ggml-cuda/common.cuh, ggml/src/ggml-cuda/ggml-cuda.cu, ggml/src/ggml-cuda/mmq.cu
BodhiHu and others added 3 commits February 13, 2025 19:15
Co-authored-by: Johannes Gäßler <[email protected]>
Co-authored-by: Johannes Gäßler <[email protected]>
yeahdongcn (Contributor) commented Feb 13, 2025

Please run the functionality tests and the tests under the tests directory on amd64 as well.
BTW, I'm updating the MUSA SDK version to rc3.1.1. You may want to hold off until #11822 is reviewed and merged.

BodhiHu (Author) commented Feb 14, 2025

> The changes to the CUDA backend look fine to me other than the things I commented on. I don't know whether the changes for model support are correct.

Hi @JohannesGaessler, the changes to model support enable expert_weights_scale for MoE-sparsified LLaMA models.
I tested with the following LLaMA MoE model, and it runs well:

https://huggingface.co/llama-moe/LLaMA-MoE-v2-3_8B-2_8-sft

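For context, here is a hedged sketch of where expert_weights_scale enters the MoE graph build; it mirrors the shape of the MoE FFN construction in llama.cpp, but `select_top_k_experts` is a hypothetical helper and the surrounding code is illustrative:

```cpp
// Illustrative sketch: scale the routing weights of the selected experts by
// the per-model hparam expert_weights_scale (used by MoE-sparsified LLaMA).
ggml_tensor * logits  = ggml_mul_mat(ctx, gate_inp, cur);   // [n_expert, n_tokens]
ggml_tensor * probs   = ggml_soft_max(ctx, logits);
ggml_tensor * weights = select_top_k_experts(ctx, probs, n_expert_used); // hypothetical helper
if (hparams.expert_weights_scale != 1.0f) {
    weights = ggml_scale(ctx, weights, hparams.expert_weights_scale);
}
```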

BodhiHu (Author) commented Feb 14, 2025

> Please run the functionality tests and the tests under the tests directory on amd64 as well. BTW, I'm updating the MUSA SDK version to rc3.1.1. You may want to hold off until #11822 is reviewed and merged.

Hi @yeahdongcn, I see #11822 has been merged.

When running ./build/bin/test-backend-ops, there's an exception. I don't know if this also happens on your side or if it's a known issue:

  FLASH_ATTN_EXT(hs=256,nh=32,kv=1024,nb=35,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=f16,permute=[0,1,2,3]): not supported [MUSA0]
  FLASH_ATTN_EXT(hs=256,nh=32,kv=1024,nb=35,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=bf16,permute=[0,1,2,3]): not supported [MUSA0]
  FLASH_ATTN_EXT(hs=256,nh=32,kv=1024,nb=35,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=q8_0,permute=[0,1,2,3]): not supported [MUSA0]
  FLASH_ATTN_EXT(hs=256,nh=32,kv=1024,nb=35,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=q4_0,permute=[0,1,2,3]): not supported [MUSA0]
  CROSS_ENTROPY_LOSS(type=f32,ne=[10,5,4,3]): MUSA error: invalid argument
  current device: 0, in function ggml_cuda_cross_entropy_loss at /home/mm/bodhi/llama.cpp/ggml/src/ggml-cuda/cross-entropy-loss.cu:129
  musaFuncSetAttribute(cross_entropy_loss_back_f32<true>, musaFuncAttributeMaxDynamicSharedMemorySize, smpbo)
/home/mm/bodhi/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:73: MUSA error
[New LWP 178917]
[New LWP 178918]
[New LWP 178919]
[New LWP 178920]
[New LWP 178933]
[New LWP 178982]
[New LWP 179583]
[New LWP 179584]
[New LWP 179585]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/aarch64-linux-gnu/libthread_db.so.1".
0x0000ffff8d436800 in __GI___wait4 (pid=<optimized out>, stat_loc=0xffffc4a3e86c, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30      ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
#0  0x0000ffff8d436800 in __GI___wait4 (pid=<optimized out>, stat_loc=0xffffc4a3e86c, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30      in ../sysdeps/unix/sysv/linux/wait4.c
#1  0x0000aaaac10a8d44 in ggml_print_backtrace ()
#2  0x0000aaaac10a8cd8 in ggml_abort ()
#3  0x0000aaaac0f6c8cc in ggml_cuda_error(char const*, char const*, char const*, int, char const*) ()
#4  0x0000aaaac1054b2c in ggml_cuda_cross_entropy_loss(ggml_backend_cuda_context&, ggml_tensor*) ()
#5  0x0000aaaac0f715bc in ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) ()
#6  0x0000aaaac10bf404 in ggml_backend_compare_graph_backend ()
#7  0x0000aaaac0ee4e78 in test_case::eval(ggml_backend*, ggml_backend*, char const*) ()
#8  0x0000aaaac0ed1f14 in main ()
[Inferior 1 (process 178916) detached]
Aborted (core dumped)

FYI, the above CROSS_ENTROPY_LOSS op test error has been fixed.
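
For anyone hitting the same thing: the abort comes from musaFuncSetAttribute rejecting an opt-in dynamic shared memory size the device cannot provide. Here is a hedged sketch of the kind of guard that avoids it, following the CUDA backend's conventions (the smpbo field and 48 KiB default are assumptions here, and the exact fix in this PR may differ):

```cpp
// Illustrative sketch: raise the kernel's dynamic shared memory limit only
// when the device supports more than the default opt-out limit of 48 KiB.
const int id    = ggml_cuda_get_device();
const int smpbo = ggml_cuda_info().devices[id].smpbo; // shared mem per block, opt-in
if (smpbo > 48*1024) {
    CUDA_CHECK(cudaFuncSetAttribute(cross_entropy_loss_back_f32<true>,
        cudaFuncAttributeMaxDynamicSharedMemorySize, smpbo));
}
```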

@BodhiHu BodhiHu closed this Feb 14, 2025
@BodhiHu BodhiHu reopened this Feb 14, 2025
Review threads: ggml/src/ggml-cuda/common.cuh (two threads, outdated, resolved), ggml/src/ggml-cuda/cross-entropy-loss.cu (resolved), ggml/src/ggml-cuda/mmq.cu (outdated, resolved)
BodhiHu changed the title from "MUSA: enable dp4a and fix compile errors on ARM64" to "MUSA: support ARM64 and enable dp4a, etc." on Feb 17, 2025
Review thread: convert_hf_to_gguf.py (outdated, resolved)
BodhiHu (Author) commented Feb 18, 2025

Hi @yeahdongcn, the model-running issue on x86 has been fixed.
Tested with the following models, and they run well now:

  • llama3_8b_q4_0.gguf
  • deepseek-r1_7b_q4_0.gguf
  • qwen2.5-3b-instruct-q4_k_m.gguf

BodhiHu (Author) commented Feb 18, 2025

Hi @slaren, the LLaMA-MoE changes to convert_hf_to_gguf.py have been removed. Could you please review again? Thanks.

BodhiHu requested review from slaren and yeahdongcn on February 19, 2025 at 02:20
Review threads (outdated, resolved): docs/build.md, ggml/src/ggml-cuda/ggml-cuda.cu, ggml/src/ggml-cuda/common.cuh, src/llama-model.cpp
Labels
build (Compilation issues), documentation (Improvements or additions to documentation), ggml (changes relating to the ggml tensor library for machine learning), Nvidia GPU (Issues specific to Nvidia GPUs), python (python script changes)
Projects
None yet
4 participants