Any plans to merge the latest code of llama.cpp? #24
Comments
We are working on it. llama.cpp is evolving very fast with a lot of refactoring here and there, so it won't be very quick. |
https://github.com/nctu6/llama.cpp/commits/t-mac/ I have merged a version that includes all changes from your llama.cpp repository into the latest llama.cpp. I hope this helps. Regards. |
Exciting work and thread! T-MAC focuses on edge devices, and it would be very meaningful to merge it into llama.cpp given its significance in open LLM software. I packaged |
@nctu6 Fantastic work! Let me share some updates on our progress. After wrapping up some paper-related tasks, I've managed to spare some time to work on the merge. I'm building on the merge codebase by @QingtaoLi1 (https://github.com/QingtaoLi1/llama.cpp/pull/4/files). Your version has been incredibly helpful, and I will dig into the code to see what insights I can glean. Beyond the merge, I'm also working on some refactoring to prepare a clean pull request to the main llama.cpp repo. It will focus on:
Your code will be invaluable for our functionality testing. I'll keep you informed once the merge is complete, which might take a few days. Thanks again for your fantastic contribution! |
@knyipab Thanks! After the merge and some necessary refactoring, I will open a pull request to llama.cpp and hopefully t-mac can be merged as soon as possible. Excellent work to simplify the deployment on Android! It will also be a good demo of t-mac and I will test it. I will also publish the optimized performance data on Android after merging the openmp optimization. |
Hello, nctu6/llama.cpp@c03d69c#diff-f028a352a33ee20b42faca7dcc389e8f0f9c9a55e016cccffed45fe90bcc13f8R12967 I pulled your code and it failed to compile with errors: |
Hi, This merge is based on the master branch, and the type was removed from the official master branch three months ago, as indicated in the commit below. ggerganov/llama.cpp@95f57bb#diff-6d9ce99fcb6f51ff76f59e479f6e6fc0bb62edef7442805d7a5bb15b23996b5d Regards. |
I do not see it. |
I met the same issue as you. Have you resolved it? |
Hello,
Best regards |
First: I used this commit
then I fixed the cmake error by replacing |
Hi, |
I use run_pipeline.py too, but I ran into a problem:
Tried extensions .c .C .c++ .cc .cpp .cxx .cu .mpp .m .M .mm .ixx .cppm .h
CMake Error at ggml/src/CMakeLists.txt:1344 (add_library):
CMake Generate step failed. Build files cannot be regenerated correctly. |
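That add_library failure from CMake usually means a source file listed in ggml/src/CMakeLists.txt does not exist on disk, which in this pipeline can happen when the kernel compilation step did not actually produce its outputs. A minimal sanity check, assuming the directory layout used elsewhere in this thread:

# Paths below are assumptions based on the pipeline layout shown in this thread.
ls T-MAC/deploy/tuned                       # STEP.0 (compile.py -o tuned) should have written kernel sources here
rm -rf T-MAC/build && mkdir -p T-MAC/build  # then re-run the configure step from a clean build tree
cd T-MAC/build && cmake -DCMAKE_INSTALL_PREFIX=../install ..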
Thank you for the verification. |
@nctu6 @peytoncai @qw1319 @BodhiHu Sorry for the delayed update. After fixing several performance related issues, we have finally updated the llama.cpp version (#46). Now you can test qwen2 using https://huggingface.co/Qwen/Qwen2-7B-Instruct-GPTQ-Int4. If you encounter any error, feel free to open new issues. |
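For anyone who wants to try this, a rough end-to-end sketch (not the official instructions: the compile.py flags are copied from the gemma2 log below, and the local paths are placeholders):

# Download the GPTQ checkpoint, then run the T-MAC kernel compilation against it.
huggingface-cli download Qwen/Qwen2-7B-Instruct-GPTQ-Int4 --local-dir /models/Qwen2-7B-Instruct-GPTQ-Int4
cd T-MAC/deploy
python compile.py -o tuned -da -nt 4 -tb -gc -gs 128 -ags 64 -t -m gptq-auto -md /models/Qwen2-7B-Instruct-GPTQ-Int4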
Has anyone met this issue? It happens when I use the new llama.cpp version (#46). The issue is that llama.cpp changed how it is built and used on Android; |
But I compiled the Android kernels following the llama.cpp guide (https://github.com/ggerganov/llama.cpp/blob/master/docs/android.md) with |
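For reference, the cross-compile flow in that android.md guide looks roughly like this; it is only a sketch (exact flags differ between llama.cpp versions, and T-MAC specific CMake options are not shown), with $ANDROID_NDK assumed to point at the installed NDK:

mkdir build-android && cd build-android
cmake -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
      -DANDROID_ABI=arm64-v8a -DANDROID_PLATFORM=android-23 \
      -DCMAKE_C_FLAGS="-march=armv8.4a+dotprod" ..
cmake --build . --config Release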
I achieved the same performance as the previous version, and it is now much more stable. However, I'm curious why the performance on 8GEN3 hasn't improved the way it has on other devices (such as M2-Ultra and Surface Laptop 7); it should have benefited from dynamic dispatch. I'm setting aside time to investigate the cause. |
Sorry, but I didn't encounter this issue. |
This may be because the tuned environment is not compatible with the runtime environment: 1. the 8Gen3 has big, middle, and efficiency cores; 2. TVM tuning may not schedule work evenly across the thread pool, while the runtime (llama.cpp) uses its own thread pool to balance multi-thread scheduling. |
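One cheap way to test that hypothesis is to pin the llama.cpp threads to the big cores and see whether the tuned kernels recover their expected speed. A sketch only, assuming the binary and model were pushed to /data/local/tmp and that the big/prime cores are the higher-numbered CPUs (the mapping varies per device):

# 0xf0 pins to CPUs 4-7; adjust the mask for the phone's actual core layout.
adb shell taskset f0 /data/local/tmp/main -m /data/local/tmp/model.gguf -t 4 -p "hello" -n 32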
Qwen2
warning: not compiled with GPU offload support, --n-gpu-layers option will be ignored
warning: see main README.md for information on enabling GPU BLAS support
Log start
main: build = 2854 (70c312d)
main: built with clang version 17.0.6 (https://github.com/llvm/llvm-project 6009708b4367171ccdbf4b5905cb6a803753fe18) for x86_64-unknown-linux-gnu
main: seed = 1724130565
[13:09:25] /aaaa/T-MAC/3rdparty/llama.cpp/ggml-tmac.cpp:38: ggml_tmac_init
llama_model_loader: loaded meta data with 20 key-value pairs and 386 tensors from /aaaa/Qwen1.5-0.5B-Chat-GPTQ-Int4/ggml-model.in.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.name str = Qwen1.5-0.5B-Chat-GPTQ-Int4
llama_model_loader: - kv 2: qwen2.block_count u32 = 24
llama_model_loader: - kv 3: qwen2.context_length u32 = 32768
llama_model_loader: - kv 4: qwen2.embedding_length u32 = 1024
llama_model_loader: - kv 5: qwen2.feed_forward_length u32 = 2816
llama_model_loader: - kv 6: qwen2.attention.head_count u32 = 16
llama_model_loader: - kv 7: qwen2.attention.head_count_kv u32 = 16
llama_model_loader: - kv 8: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 9: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 10: general.file_type u32 = 32
llama_model_loader: - kv 11: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 12: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,151936] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 15: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 17: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 19: tokenizer.chat_template str = {% for message in messages %}{% if lo...
llama_model_loader: - type f32: 217 tensors
llama_model_loader: - type f16: 1 tensors
llama_model_loader: - type i4: 168 tensors
llama_model_load: error loading model: error loading model vocabulary: unknown pre-tokenizer type: 'qwen2'
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model ' /aaaa/Qwen1.5-0.5B-Chat-GPTQ-Int4/ggml-model.in.gguf'
main: error: unable to load model
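The "unknown pre-tokenizer type: 'qwen2'" error usually means the GGUF was written by a converter that knows the qwen2 pre-tokenizer while the runtime loading it does not. A quick check against the bundled sources (paths taken from the log above; the LLAMA_VOCAB_PRE_TYPE_QWEN2 identifier is the upstream llama.cpp name and may differ in this fork):

# No matches would suggest the bundled runtime predates qwen2 pre-tokenizer support.
grep -rn "LLAMA_VOCAB_PRE_TYPE_QWEN2" /aaaa/T-MAC/3rdparty/llama.cpp/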
gemma2
Running STEP.0: Compile kernels
Running command in /aaaa/T-MAC/deploy:
python compile.py -o tuned -da -nt 4 -tb -gc -gs 128 -ags 64 -t -m gptq-auto -md /aaaa/gemma-2-9b-it-gptq-4bit
Running STEP.1: Build T-MAC C++ CMakeFiles
Running command in /aaaa/T-MAC/build:
cmake -DCMAKE_INSTALL_PREFIX=/aaaa/T-MAC/install ..
Running STEP.2: Install T-MAC C++
Running command in /aaaa/T-MAC/build:
cmake --build . --target install --config Release
Running STEP.3: Convert HF to GGUF
Running command in /aaaa/T-MAC/3rdparty/llama.cpp:
python convert-hf-to-gguf-t-mac.py /aaaa/gemma-2-9b-it-gptq-4bit --outtype in --outfile /aaaa/gemma-2-9b-it-gptq-4bit/ggml-model.in.gguf --kcfg /aaaa/T-MAC/install/lib/kcfg.ini
Please check logs/2024-08-20-15-29-20.log for what's wrong
(tmac) root@4c5e2a287200:/aaaa/T-MAC# cat logs/2024-08-20-15-29-20.log
INFO:hf-to-gguf:Loading model: gemma-2-9b-it-gptq-4bit
Traceback (most recent call last):
File "convert-hf-to-gguf-t-mac.py", line 3421, in
main()
File "convert-hf-to-gguf-t-mac.py", line 3399, in main
model_class = Model.from_model_architecture(hparams["architectures"][0])
File "convert-hf-to-gguf-t-mac.py", line 318, in from_model_architecture
raise NotImplementedError(f'Architecture {arch!r} not supported!') from None
NotImplementedError: Architecture 'Gemma2ForCausalLM' not supported!
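The converter keeps a registry of supported HF architectures and Model.from_model_architecture simply looks the name up, so this traceback means nothing registers 'Gemma2ForCausalLM'; the bundled llama.cpp snapshot appears to predate Gemma 2 support. To list what this copy of the script does register (assuming it follows the upstream @Model.register(...) decorator pattern):

grep -n "Model.register" /aaaa/T-MAC/3rdparty/llama.cpp/convert-hf-to-gguf-t-mac.py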