Any plans to merge the latest code of llama.cpp? #24

Open
1 task
peytoncai opened this issue Aug 20, 2024 · 21 comments


peytoncai commented Aug 20, 2024

Qwen2

warning: not compiled with GPU offload support, --n-gpu-layers option will be ignored
warning: see main README.md for information on enabling GPU BLAS support
Log start
main: build = 2854 (70c312d)
main: built with clang version 17.0.6 (https://github.com/llvm/llvm-project 6009708b4367171ccdbf4b5905cb6a803753fe18) for x86_64-unknown-linux-gnu
main: seed = 1724130565
[13:09:25] /aaaa/T-MAC/3rdparty/llama.cpp/ggml-tmac.cpp:38: ggml_tmac_init
llama_model_loader: loaded meta data with 20 key-value pairs and 386 tensors from /aaaa/Qwen1.5-0.5B-Chat-GPTQ-Int4/ggml-model.in.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.name str = Qwen1.5-0.5B-Chat-GPTQ-Int4
llama_model_loader: - kv 2: qwen2.block_count u32 = 24
llama_model_loader: - kv 3: qwen2.context_length u32 = 32768
llama_model_loader: - kv 4: qwen2.embedding_length u32 = 1024
llama_model_loader: - kv 5: qwen2.feed_forward_length u32 = 2816
llama_model_loader: - kv 6: qwen2.attention.head_count u32 = 16
llama_model_loader: - kv 7: qwen2.attention.head_count_kv u32 = 16
llama_model_loader: - kv 8: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 9: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 10: general.file_type u32 = 32
llama_model_loader: - kv 11: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 12: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,151936] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 15: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 17: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 19: tokenizer.chat_template str = {% for message in messages %}{% if lo...
llama_model_loader: - type f32: 217 tensors
llama_model_loader: - type f16: 1 tensors
llama_model_loader: - type i4: 168 tensors
llama_model_load: error loading model: error loading model vocabulary: unknown pre-tokenizer type: 'qwen2'
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model ' /aaaa/Qwen1.5-0.5B-Chat-GPTQ-Int4/ggml-model.in.gguf'
main: error: unable to load model
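
The log above shows the loader rejecting the value of `tokenizer.ggml.pre`. As a rough aid for triaging similar reports, the following minimal sketch parses the `kv` lines of a loader log and checks the pre-tokenizer name against a set of known names. The function names and the `KNOWN` set are hypothetical and for illustration only; they are not taken from llama.cpp or T-MAC.

```python
import re

# Matches lines such as:
#   llama_model_loader: - kv 12: tokenizer.ggml.pre str = qwen2
KV_LINE = re.compile(r"^llama_model_loader: - kv\s+\d+:\s+(\S+)\s+(\S+)\s+=\s+(.*)$")


def parse_loader_metadata(log_text):
    """Return {key: (type, raw_value)} for every 'kv' line in a loader log."""
    metadata = {}
    for line in log_text.splitlines():
        match = KV_LINE.match(line.strip())
        if match:
            key, value_type, raw_value = match.groups()
            metadata[key] = (value_type, raw_value.strip())
    return metadata


def check_pre_tokenizer(metadata, known_pre_tokenizers):
    """Return an error string if the pre-tokenizer is not known, else None."""
    entry = metadata.get("tokenizer.ggml.pre")
    if entry is None:
        return None
    name = entry[1]
    if name not in known_pre_tokenizers:
        return f"unknown pre-tokenizer type: '{name}'"
    return None


sample_log = """\
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 11: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 12: tokenizer.ggml.pre str = qwen2
"""

# HYPOTHETICAL set of names an older build might recognize (illustration only).
KNOWN = {"default", "llama-bpe"}

meta = parse_loader_metadata(sample_log)
print(check_pre_tokenizer(meta, KNOWN))
```

If the name is missing from the set, the sketch reports the same kind of message as the log; adding `"qwen2"` to the set makes the check pass, which mirrors what a newer llama.cpp base would be expected to provide.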

gemma2

Running STEP.0: Compile kernels
Running command in /aaaa/T-MAC/deploy:
python compile.py -o tuned -da -nt 4 -tb -gc -gs 128 -ags 64 -t -m gptq-auto -md /aaaa/gemma-2-9b-it-gptq-4bit
Running STEP.1: Build T-MAC C++ CMakeFiles
Running command in /aaaa/T-MAC/build:
cmake -DCMAKE_INSTALL_PREFIX=/aaaa/T-MAC/install ..
Running STEP.2: Install T-MAC C++
Running command in /aaaa/T-MAC/build:
cmake --build . --target install --config Release
Running STEP.3: Convert HF to GGUF
Running command in /aaaa/T-MAC/3rdparty/llama.cpp:
python convert-hf-to-gguf-t-mac.py /aaaa/gemma-2-9b-it-gptq-4bit --outtype in --outfile /aaaa/gemma-2-9b-it-gptq-4bit/ggml-model.in.gguf --kcfg /aaaa/T-MAC/install/lib/kcfg.ini
Please check logs/2024-08-20-15-29-20.log for what's wrong
(tmac) root@4c5e2a287200:/aaaa/T-MAC# cat logs/2024-08-20-15-29-20.log
INFO:hf-to-gguf:Loading model: gemma-2-9b-it-gptq-4bit
Traceback (most recent call last):
  File "convert-hf-to-gguf-t-mac.py", line 3421, in <module>
    main()
  File "convert-hf-to-gguf-t-mac.py", line 3399, in main
    model_class = Model.from_model_architecture(hparams["architectures"][0])
  File "convert-hf-to-gguf-t-mac.py", line 318, in from_model_architecture
    raise NotImplementedError(f'Architecture {arch!r} not supported!') from None
NotImplementedError: Architecture 'Gemma2ForCausalLM' not supported!
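
The traceback suggests the convert script looks up a model class by architecture name and raises when no class is registered for it. The sketch below is a simplified, self-contained stand-in for that kind of registry, not the actual script; the class `Qwen2Model` and the registry layout are assumptions made for illustration.

```python
class Model:
    """Simplified stand-in for a converter base class with an architecture registry."""

    _model_classes = {}

    @classmethod
    def register(cls, *names):
        """Class decorator that maps one or more architecture names to a model class."""
        def decorator(model_class):
            for name in names:
                cls._model_classes[name] = model_class
            return model_class
        return decorator

    @classmethod
    def from_model_architecture(cls, arch):
        """Look up the class registered for `arch`, or raise if there is none."""
        try:
            return cls._model_classes[arch]
        except KeyError:
            raise NotImplementedError(f"Architecture {arch!r} not supported!") from None


# HYPOTHETICAL registration, for illustration only.
@Model.register("Qwen2ForCausalLM")
class Qwen2Model(Model):
    pass


try:
    Model.from_model_architecture("Gemma2ForCausalLM")
except NotImplementedError as exc:
    print(exc)
```

Under this pattern, supporting a new architecture means registering a class for its name, which is why an older script base fails on `Gemma2ForCausalLM` until it is updated.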

@kaleid-liner
Collaborator

We are working on it. llama.cpp is evolving very fast with a lot of refactoring here and there, so it won't be very quick.


nctu6 commented Sep 13, 2024

https://github.com/nctu6/llama.cpp/commits/t-mac/

I have merged a version that includes all changes from your llama.cpp repository into the latest llama.cpp.
It can be built and run successfully on the Ubuntu platform.

I hope this helps.
Thank you for your work.

Regards.


knyipab commented Sep 15, 2024

Exciting work and thread!

T-MAC focuses on edge devices, and merging it into llama.cpp would be very meaningful given llama.cpp's significance in open LLM software. I packaged ollama for Termux's TUR (pkg install tur-repo && pkg update && pkg install -y ollama will do). Once T-MAC is merged into llama.cpp or becomes part of it, it can be tested by, and made widely accessible to, Android and Termux users.

@kaleid-liner
Collaborator

kaleid-liner commented Sep 16, 2024

> https://github.com/nctu6/llama.cpp/commits/t-mac/
>
> I have merged a version that includes all changes from your llama.cpp repository into the latest llama.cpp. It can be built and run successfully on the Ubuntu platform.
>
> I hope this helps. Thank you for your work.
>
> Regards.

@nctu6 Fantastic work! Let me share some updates on our progress.

After wrapping up some paper-related tasks, I've managed to spare some time to work on the merge. I'm building on the merge codebase by @QingtaoLi1 (https://github.com/QingtaoLi1/llama.cpp/pull/4/files). Your version has been incredibly helpful, and I will dig into the code to see what insights I can glean.

Beyond the merge, I'm also working on some refactoring to prepare a clean pull request to the main llama.cpp repo. It will focus on:

  1. Remove unused legacy code
  2. Refactor the multi-threading scheduling in ggml.c to be consistent with the current dynamic OpenMP scheduling
  3. Refactor the convert script (convert_hf_to_gguf.py), to trim redundant code and ensure it applies to all models based on the Model base class.

Your code will be invaluable for our functionality testing. I'll keep you informed once the merge is complete, which might take a few days.

Thanks again for your fantastic contribution!

@kaleid-liner
Collaborator

> Exciting work and thread!
>
> T-MAC focuses on edge devices, and merging it into llama.cpp would be very meaningful given llama.cpp's significance in open LLM software. I packaged ollama for Termux's TUR (pkg install tur-repo && pkg update && pkg install -y ollama will do). Once T-MAC is merged into llama.cpp or becomes part of it, it can be tested by, and made widely accessible to, Android and Termux users.

@knyipab Thanks! After the merge and some necessary refactoring, I will open a pull request to llama.cpp, and hopefully T-MAC can be merged as soon as possible. Excellent work simplifying deployment on Android! It will also be a good demo of T-MAC, and I will test it. I will also publish the optimized performance data on Android after merging the OpenMP optimization.


BodhiHu commented Sep 24, 2024

> https://github.com/nctu6/llama.cpp/commits/t-mac/
>
> I have merged a version that includes all changes from your llama.cpp repository into the latest llama.cpp. It can be built and run successfully on the Ubuntu platform.
>
> I hope this helps. Thank you for your work.
>
> Regards.

Hello,
Have you tested your code? The task type has been removed from upstream llama.cpp:

nctu6/llama.cpp@c03d69c#diff-f028a352a33ee20b42faca7dcc389e8f0f9c9a55e016cccffed45fe90bcc13f8R12967

I pulled your code and it failed to compile with errors:

ggml.c:12987:16: error: no member named 'type' in 'struct ggml_compute_params'
 12987 |                         if (params->type == GGML_TASK_TYPE_INIT) {


nctu6 commented Sep 24, 2024

Hi,

This merge is based on the master branch, and the type was removed from the official master branch three months ago, as indicated in the commit below.

ggerganov/llama.cpp@95f57bb#diff-6d9ce99fcb6f51ff76f59e479f6e6fc0bb62edef7442805d7a5bb15b23996b5d

Regards.


qw1319 commented Sep 24, 2024

> https://github.com/nctu6/llama.cpp/commits/t-mac/
>
> I have merged a version that includes all changes from your llama.cpp repository into the latest llama.cpp. It can be built and run successfully on the Ubuntu platform.
>
> I hope this helps. Thank you for your work.
>
> Regards.

I don't see t-mac/tmac_gemm_wrapper.h, kernel.cc, or kernel.h. Where are they? Also, I see option(LLAMA_TMAC "llama: use TMAC" OFF) in your CMake file. Are you not compiling T-MAC?


qw1319 commented Sep 24, 2024

> Hi,
>
> This merge is based on the master branch, and the type was removed from the official master branch three months ago, as indicated in the commit below.
>
> ggerganov/llama.cpp@95f57bb#diff-6d9ce99fcb6f51ff76f59e479f6e6fc0bb62edef7442805d7a5bb15b23996b5d
>
> Regards.

I ran into the same problem as you. Have you resolved it?
When I reset to commit 95f57bb and cherry-pick your merge commit c03d69c6818b0bbac06efcb025ea892d9d9ef90a, I get many merge conflicts.


nctu6 commented Sep 24, 2024

Hello,

  1. Please refer to the following link for the T-MAC build instructions:
    T-MAC Build Guide

    Example from the official README:

    Running STEP.4: Build llama.cpp CMakeFiles
      Running command in /Users/user/jianyu/T-MAC/3rdparty/llama.cpp/build:
        cmake .. -DLLAMA_TMAC=ON -DCMAKE_PREFIX_PATH=/Users/user/jianyu/T-MAC/install/lib/cmake/t-mac -DCMAKE_BUILD_TYPE=Release -DLLAMA_LLAMAFILE_DEFAULT=OFF -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++
  2. The purpose is to integrate T-MAC into the latest version of llama.cpp (master branch).
    Therefore, the commit I used was the most recent at that time.
    (Commit: 449ccfb6f5f1bbd70e04f75a330d9d7c1af82187)

Best regards


qw1319 commented Sep 25, 2024

> Hello,
>
>   1. Please refer to the following link for the T-MAC build instructions:
>     T-MAC Build Guide
>     Example from the official README:
>     Running STEP.4: Build llama.cpp CMakeFiles
>       Running command in /Users/user/jianyu/T-MAC/3rdparty/llama.cpp/build:
>         cmake .. -DLLAMA_TMAC=ON -DCMAKE_PREFIX_PATH=/Users/user/jianyu/T-MAC/install/lib/cmake/t-mac -DCMAKE_BUILD_TYPE=Release -DLLAMA_LLAMAFILE_DEFAULT=OFF -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++
>   2. The purpose is to integrate T-MAC into the latest version of llama.cpp (master branch).
>     Therefore, the commit I used was the most recent at that time.
>     (Commit: 449ccfb6f5f1bbd70e04f75a330d9d7c1af82187)
>
> Best regards

First, I checked out commit 449ccfb6f5f1bbd70e04f75a330d9d7c1af82187, cherry-picked your merge commit c03d69c6818b0bbac06efcb025ea892d9d9ef90a, and compiled the same way you did, but I hit this error:
`CMake Error at ggml/src/CMakeLists.txt:1344 (add_library):
Cannot find source file:

ggml-tmac.h`

Then I fixed the CMake error by changing the header entry to set(GGML_HEADERS_TMAC ../include/ggml-tmac.h), but I hit another error:
/llama.cpp/ggml/src/ggml.c:12967:16: error: no member named 'type' in 'struct ggml_compute_params'
12967 | if (params->type == GGML_TASK_TYPE_FINALIZE) {
      | ~~~~~~ ^
/llama.cpp/ggml/src/ggml.c:12967:24: error: use of undeclared identifier 'GGML_TASK_TYPE_FINALIZE'
12967 | if (params->type == GGML_TASK_TYPE_FINALIZE) {
/llama.cpp/ggml/src/ggml.c:12987:16: error: no member named 'type' in 'struct ggml_compute_params'
12987 | if (params->type == GGML_TASK_TYPE_INIT) {
/llama.cpp/ggml/src/ggml.c:12987:24: error: use of undeclared identifier 'GGML_TASK_TYPE_INIT'
12987 | if (params->type == GGML_TASK_TYPE_INIT) {
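
Since the compile errors all stem from references to the removed task type, one way to gauge how much of a merged tree still depends on it is to scan the sources for those identifiers. The following is a minimal sketch; the helper name and the sample C snippet are made up for illustration, and the patterns cover only the two identifiers that appear in the errors above.

```python
import re

# Patterns for the identifiers named in the compiler errors.
STALE_PATTERNS = [
    re.compile(r"params->type\b"),
    re.compile(r"\bGGML_TASK_TYPE_[A-Z_]+\b"),
]


def find_stale_task_type_uses(source_text):
    """Return (line_number, stripped_line) for each line that references the removed task type."""
    hits = []
    for number, line in enumerate(source_text.splitlines(), start=1):
        if any(pattern.search(line) for pattern in STALE_PATTERNS):
            hits.append((number, line.strip()))
    return hits


# HYPOTHETICAL C snippet used only to exercise the scanner.
sample = """\
static void compute(const struct ggml_compute_params * params) {
    if (params->type == GGML_TASK_TYPE_INIT) {
        return;
    }
    if (params->ith == 0) { /* unaffected */ }
}
"""

for number, line in find_stale_task_type_uses(sample):
    print(f"{number}: {line}")
```

In practice the same function could be run over the text of ggml.c and the T-MAC sources; any reported line would need to be ported to whatever mechanism the newer upstream code uses instead.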


nctu6 commented Sep 25, 2024

Hi,
You don't need to change anything.
What I do is simply use the build script run_pipeline.py provided by T-MAC.
Everything works fine when calling run_pipeline.py.
Regards.


qw1319 commented Sep 25, 2024

> Hi, You don't need to change anything. What I do is simply use the build script run_pipeline.py provided by T-MAC. Everything works fine when calling run_pipeline.py. Regards.

I use run_pipeline.py too, but I hit this error:
`-- The C compiler identification is Clang 17.0.6
-- The CXX compiler identification is Clang 17.0.6
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /002data/alexjiao/T-MAC/build/clang+llvm-17.0.6-x86_64-linux-gnu-ubuntu-22.04/bin/clang - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /002data/alexjiao/T-MAC/build/clang+llvm-17.0.6-x86_64-linux-gnu-ubuntu-22.04/bin/clang++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: /usr/bin/git (found version "2.25.1")
-- Looking for pthread.h
-- Looking for pthread.h - found
-- OpenMP found
-- Using llamafile
-- TMAC found
-- Warning: ccache not found - consider installing it for faster compilation or disable this warning with GGML_CCACHE=OFF
-- CMAKE_SYSTEM_PROCESSOR: x86_64
-- x86 detected
-- Configuring done
CMake Error at ggml/src/CMakeLists.txt:1344 (add_library):
Cannot find source file:

ggml-tmac.h

Tried extensions .c .C .c++ .cc .cpp .cxx .cu .mpp .m .M .mm .ixx .cppm .h
.hh .h++ .hm .hpp .hxx .in .txx .f .F .for .f77 .f90 .f95 .f03 .hip .ispc

CMake Error at ggml/src/CMakeLists.txt:1344 (add_library):
No SOURCES given to target: ggml

CMake Generate step failed. Build files cannot be regenerated correctly.`


nctu6 commented Sep 26, 2024

Thank you for the verification.
I will set up a new environment and test it again to identify the issue.
Regards.

kaleid-liner added a commit that referenced this issue Oct 10, 2024
* [WIP] Merge latest llama.cpp

* Adapt scripts to latest llama.cpp

* Fix run_pipe.py cmake error

* Attempt to optimize performance on arm cpus

* Support armv8.7a+ cpus

* Finish merging and rebasing llama.cpp
@kaleid-liner
Collaborator

@nctu6 @peytoncai @qw1319 @BodhiHu Sorry for the delayed update. After fixing several performance related issues, we have finally updated the llama.cpp version (#46). Now you can test qwen2 using https://huggingface.co/Qwen/Qwen2-7B-Instruct-GPTQ-Int4. If you encounter any error, feel free to open new issues.


qw1319 commented Oct 11, 2024

Has anyone else hit this problem? When I use the new llama.cpp version (#46), I get this error on Android:
CANNOT LINK EXECUTABLE "/data/local/tmp/bin/llama-cli": library "libllama.so" not found: needed by main executable

This happens because llama.cpp changed how it is used on Android;
you may need to update your README and run_pipeline.py.
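
The linker message means the Android dynamic loader could not find `libllama.so` on its search path. One common workaround, assuming the shared libraries have already been pushed to a `lib` directory next to `bin` on the device, is to set `LD_LIBRARY_PATH` when launching the binary. The sketch below only builds the `adb shell` command; the helper name, the directory layout, and the model path are assumptions for illustration, not something confirmed in this thread.

```python
import shlex


def build_adb_run_command(device_root, model_path, prompt, lib_subdir="lib"):
    """Build an `adb shell` invocation that runs llama-cli with LD_LIBRARY_PATH set.

    ASSUMPTION: the shared libraries live in <device_root>/<lib_subdir> and the
    executable in <device_root>/bin, matching the path shown in the error.
    """
    lib_dir = f"{device_root}/{lib_subdir}"
    remote = " ".join([
        f"LD_LIBRARY_PATH={shlex.quote(lib_dir)}",
        shlex.quote(f"{device_root}/bin/llama-cli"),
        "-m", shlex.quote(model_path),
        "-p", shlex.quote(prompt),
    ])
    return ["adb", "shell", remote]


# HYPOTHETICAL paths for illustration only.
command = build_adb_run_command("/data/local/tmp", "/data/local/tmp/model.gguf", "Hello")
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run` on the host. Whether this resolves the report above depends on how the libraries were built and where they were pushed.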


qw1319 commented Oct 11, 2024

However, when I build for Android following the llama.cpp guide (https://github.com/ggerganov/llama.cpp/blob/master/docs/android.md) with -DGGML_TMAC=ON -DCMAKE_PREFIX_PATH=/our_path, I get worse performance compared to the previous version.

@kaleid-liner
Collaborator

> However, when I build for Android following the llama.cpp guide (https://github.com/ggerganov/llama.cpp/blob/master/docs/android.md) with -DGGML_TMAC=ON -DCMAKE_PREFIX_PATH=/our_path, I get worse performance compared to the previous version.

I achieved the same performance as the previous version, and the performance is now much more stable. However, I'm curious why the performance on 8GEN3 hasn't improved like it has on other devices (such as M2-Ultra and Surface Laptop 7). It should have benefited from dynamic dispatch. I'm setting aside time to investigate the cause.

@kaleid-liner
Collaborator

> Has anyone else hit this problem? When I use the new llama.cpp version (#46), I get this error on Android: CANNOT LINK EXECUTABLE "/data/local/tmp/bin/llama-cli": library "libllama.so" not found: needed by main executable
>
> This happens because llama.cpp changed how it is used on Android; you may need to update your README and run_pipeline.py.

Sorry, but I didn't encounter this issue.


qw1319 commented Oct 18, 2024

> > Has anyone else hit this problem? When I use the new llama.cpp version (#46), I get this error on Android: CANNOT LINK EXECUTABLE "/data/local/tmp/bin/llama-cli": library "libllama.so" not found: needed by main executable
> > This happens because llama.cpp changed how it is used on Android; you may need to update your README and run_pipeline.py.
>
> Sorry, but I didn't encounter this issue.

Image
Like this (https://github.com/ggerganov/llama.cpp/blob/master/docs/android.md)


qw1319 commented Oct 31, 2024

> > However, when I build for Android following the llama.cpp guide (https://github.com/ggerganov/llama.cpp/blob/master/docs/android.md) with -DGGML_TMAC=ON -DCMAKE_PREFIX_PATH=/our_path, I get worse performance compared to the previous version.
>
> I achieved the same performance as the previous version, and the performance is now much more stable. However, I'm curious why the performance on 8GEN3 hasn't improved like it has on other devices (such as M2-Ultra and Surface Laptop 7). It should have benefited from dynamic dispatch. I'm setting aside time to investigate the cause.

This may be because the tuning environment is not compatible with the runtime environment: (1) the 8GEN3 has big, middle, and efficiency cores; (2) TVM tuning may not use balanced thread-pool scheduling across multiple threads, whereas the runtime environment (llama.cpp) uses a thread pool to balance multi-threaded scheduling.
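
To make the scheduling argument concrete, here is a toy model, not T-MAC's or llama.cpp's actual scheduler, that compares an equal static split with chunked dynamic dispatch on cores of different speeds. The core speeds, work size, and chunk size are invented numbers chosen only to illustrate the effect.

```python
import heapq


def static_makespan(total_work, core_speeds):
    """Every core gets an equal share, so the slowest core sets the finish time."""
    share = total_work / len(core_speeds)
    return max(share / speed for speed in core_speeds)


def dynamic_makespan(total_work, core_speeds, chunk):
    """Each core takes the next chunk of work as soon as it becomes idle."""
    # Min-heap of (time at which the core becomes free, core index).
    free_at = [(0.0, index) for index in range(len(core_speeds))]
    heapq.heapify(free_at)
    remaining = total_work
    finish = 0.0
    while remaining > 0:
        now, core = heapq.heappop(free_at)
        work = min(chunk, remaining)
        remaining -= work
        done = now + work / core_speeds[core]
        finish = max(finish, done)
        heapq.heappush(free_at, (done, core))
    return finish


# HYPOTHETICAL relative speeds: one big core, two middle cores, one efficiency core.
speeds = [3.0, 2.0, 2.0, 1.0]

print("static :", static_makespan(240, speeds))
print("dynamic:", dynamic_makespan(240, speeds, chunk=4))
```

In this model the static split is bound by the efficiency core, while dynamic dispatch finishes close to the ideal of total work divided by total speed. Kernels tuned under one scheduling assumption may therefore behave differently when run under another, which is consistent with the hypothesis in the comment above.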
