Any plans to merge the latest code of llama.cpp? #24
Comments
We are working on it. llama.cpp is evolving very fast with a lot of refactoring here and there, so it won't be very quick. |
https://github.com/nctu6/llama.cpp/commits/t-mac/ I have merged a version that includes all changes from your llama.cpp repository into the latest llama.cpp. I hope this helps. Regards. |
Exciting work and thread! T-MAC focuses on edge devices, and it would be very meaningful to merge it into llama.cpp given its significance in open LLM software. I packaged |
@nctu6 Fantastic work! Let me share some updates on our progress. After wrapping up some paper-related tasks, I've managed to spare some time to work on the merge. I'm building on the merge codebase by @QingtaoLi1 (https://github.com/QingtaoLi1/llama.cpp/pull/4/files). Your version has been incredibly helpful, and I will dig into the code to see what insights I can glean. Beyond the merge, I'm also working on some refactoring to prepare a clean pull request to the main llama.cpp repo. It will focus on:
Your code will be invaluable for our functionality testing. I'll keep you informed once the merge is complete, which might take a few days. Thanks again for your fantastic contribution! |
@knyipab Thanks! After the merge and some necessary refactoring, I will open a pull request to llama.cpp and hopefully t-mac can be merged as soon as possible. Excellent work to simplify the deployment on Android! It will also be a good demo of t-mac and I will test it. I will also publish the optimized performance data on Android after merging the openmp optimization. |
Hello, nctu6/llama.cpp@c03d69c#diff-f028a352a33ee20b42faca7dcc389e8f0f9c9a55e016cccffed45fe90bcc13f8R12967 I pulled your code and it failed to compile with errors: |
Hi, This merge is based on the master branch, and the type was removed from the official master branch three months ago, as indicated in the commit below. ggerganov/llama.cpp@95f57bb#diff-6d9ce99fcb6f51ff76f59e479f6e6fc0bb62edef7442805d7a5bb15b23996b5d Regards. |
I do not see it. |
I met the same issue as you. Have you resolved it? |
Hello,
Best regards |
First: I used this commit
then I fixed the cmake error by replacing |
Hi, |
I use run_pipeline.py too, but I ran into a problem:
Tried extensions .c .C .c++ .cc .cpp .cxx .cu .mpp .m .M .mm .ixx .cppm .h
CMake Error at ggml/src/CMakeLists.txt:1344 (add_library):
CMake Generate step failed. Build files cannot be regenerated correctly. |
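That add_library failure from CMake usually means a source file listed in ggml/src/CMakeLists.txt does not exist on disk, which in this pipeline can happen when the kernel compilation step did not actually produce its outputs. A minimal sanity check, assuming the directory layout used elsewhere in this thread:

# Paths below are assumptions based on the pipeline layout shown in this thread.
ls T-MAC/deploy/tuned                       # STEP.0 (compile.py -o tuned) should have written kernel sources here
rm -rf T-MAC/build && mkdir -p T-MAC/build  # then re-run the configure step from a clean build tree
cd T-MAC/build && cmake -DCMAKE_INSTALL_PREFIX=../install ..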
Thank you for the verification. |
@nctu6 @peytoncai @qw1319 @BodhiHu Sorry for the delayed update. After fixing several performance related issues, we have finally updated the llama.cpp version (#46). Now you can test qwen2 using https://huggingface.co/Qwen/Qwen2-7B-Instruct-GPTQ-Int4. If you encounter any error, feel free to open new issues. |
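For anyone who wants to try this, a rough end-to-end sketch (not the official instructions: the compile.py flags are copied from the gemma2 log below, and the local paths are placeholders):

# Download the GPTQ checkpoint, then run the T-MAC kernel compilation against it.
huggingface-cli download Qwen/Qwen2-7B-Instruct-GPTQ-Int4 --local-dir /models/Qwen2-7B-Instruct-GPTQ-Int4
cd T-MAC/deploy
python compile.py -o tuned -da -nt 4 -tb -gc -gs 128 -ags 64 -t -m gptq-auto -md /models/Qwen2-7B-Instruct-GPTQ-Int4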
Has anyone met this issue? It happens when I use the new llama.cpp version (#46). The issue is that llama.cpp changed how it is built and used on Android; |
But I compiled the Android kernels following the llama.cpp guide (https://github.com/ggerganov/llama.cpp/blob/master/docs/android.md) with |
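For reference, the cross-compile flow in that android.md guide looks roughly like this; it is only a sketch (exact flags differ between llama.cpp versions, and T-MAC specific CMake options are not shown), with $ANDROID_NDK assumed to point at the installed NDK:

mkdir build-android && cd build-android
cmake -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
      -DANDROID_ABI=arm64-v8a -DANDROID_PLATFORM=android-23 \
      -DCMAKE_C_FLAGS="-march=armv8.4a+dotprod" ..
cmake --build . --config Release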
I achieved the same performance as the previous version, and it is now much more stable. However, I'm curious why the performance on 8GEN3 hasn't improved the way it has on other devices (such as M2-Ultra and Surface Laptop 7); it should have benefited from dynamic dispatch. I'm setting aside time to investigate the cause. |
Sorry, but I didn't encounter this issue. |
This may be because the tuned environment is not compatible with the runtime environment: 1. the 8Gen3 has big, middle, and efficiency cores; 2. TVM tuning may not schedule work evenly across the thread pool, while the runtime (llama.cpp) uses its own thread pool to balance multi-thread scheduling. |
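One cheap way to test that hypothesis is to pin the llama.cpp threads to the big cores and see whether the tuned kernels recover their expected speed. A sketch only, assuming the binary and model were pushed to /data/local/tmp and that the big/prime cores are the higher-numbered CPUs (the mapping varies per device):

# 0xf0 pins to CPUs 4-7; adjust the mask for the phone's actual core layout.
adb shell taskset f0 /data/local/tmp/main -m /data/local/tmp/model.gguf -t 4 -p "hello" -n 32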
Qwen2
warning: not compiled with GPU offload support, --n-gpu-layers option will be ignored
warning: see main README.md for information on enabling GPU BLAS support
Log start
main: build = 2854 (70c312d)
main: built with clang version 17.0.6 (https://github.com/llvm/llvm-project 6009708b4367171ccdbf4b5905cb6a803753fe18) for x86_64-unknown-linux-gnu
main: seed = 1724130565
[13:09:25] /aaaa/T-MAC/3rdparty/llama.cpp/ggml-tmac.cpp:38: ggml_tmac_init
llama_model_loader: loaded meta data with 20 key-value pairs and 386 tensors from /aaaa/Qwen1.5-0.5B-Chat-GPTQ-Int4/ggml-model.in.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.name str = Qwen1.5-0.5B-Chat-GPTQ-Int4
llama_model_loader: - kv 2: qwen2.block_count u32 = 24
llama_model_loader: - kv 3: qwen2.context_length u32 = 32768
llama_model_loader: - kv 4: qwen2.embedding_length u32 = 1024
llama_model_loader: - kv 5: qwen2.feed_forward_length u32 = 2816
llama_model_loader: - kv 6: qwen2.attention.head_count u32 = 16
llama_model_loader: - kv 7: qwen2.attention.head_count_kv u32 = 16
llama_model_loader: - kv 8: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 9: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 10: general.file_type u32 = 32
llama_model_loader: - kv 11: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 12: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,151936] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 15: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 17: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 19: tokenizer.chat_template str = {% for message in messages %}{% if lo...
llama_model_loader: - type f32: 217 tensors
llama_model_loader: - type f16: 1 tensors
llama_model_loader: - type i4: 168 tensors
llama_model_load: error loading model: error loading model vocabulary: unknown pre-tokenizer type: 'qwen2'
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model ' /aaaa/Qwen1.5-0.5B-Chat-GPTQ-Int4/ggml-model.in.gguf'
main: error: unable to load model
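The "unknown pre-tokenizer type: 'qwen2'" error usually means the GGUF was written by a converter that knows the qwen2 pre-tokenizer while the runtime loading it does not. A quick check against the bundled sources (paths taken from the log above; the LLAMA_VOCAB_PRE_TYPE_QWEN2 identifier is the upstream llama.cpp name and may differ in this fork):

# No matches would suggest the bundled runtime predates qwen2 pre-tokenizer support.
grep -rn "LLAMA_VOCAB_PRE_TYPE_QWEN2" /aaaa/T-MAC/3rdparty/llama.cpp/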
gemma2
Running STEP.0: Compile kernels
Running command in /aaaa/T-MAC/deploy:
python compile.py -o tuned -da -nt 4 -tb -gc -gs 128 -ags 64 -t -m gptq-auto -md /aaaa/gemma-2-9b-it-gptq-4bit
Running STEP.1: Build T-MAC C++ CMakeFiles
Running command in /aaaa/T-MAC/build:
cmake -DCMAKE_INSTALL_PREFIX=/aaaa/T-MAC/install ..
Running STEP.2: Install T-MAC C++
Running command in /aaaa/T-MAC/build:
cmake --build . --target install --config Release
Running STEP.3: Convert HF to GGUF
Running command in /aaaa/T-MAC/3rdparty/llama.cpp:
python convert-hf-to-gguf-t-mac.py /aaaa/gemma-2-9b-it-gptq-4bit --outtype in --outfile /aaaa/gemma-2-9b-it-gptq-4bit/ggml-model.in.gguf --kcfg /aaaa/T-MAC/install/lib/kcfg.ini
Please check logs/2024-08-20-15-29-20.log for what's wrong
(tmac) root@4c5e2a287200:/aaaa/T-MAC# cat logs/2024-08-20-15-29-20.log
INFO:hf-to-gguf:Loading model: gemma-2-9b-it-gptq-4bit
Traceback (most recent call last):
File "convert-hf-to-gguf-t-mac.py", line 3421, in
main()
File "convert-hf-to-gguf-t-mac.py", line 3399, in main
model_class = Model.from_model_architecture(hparams["architectures"][0])
File "convert-hf-to-gguf-t-mac.py", line 318, in from_model_architecture
raise NotImplementedError(f'Architecture {arch!r} not supported!') from None
NotImplementedError: Architecture 'Gemma2ForCausalLM' not supported!
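The converter keeps a registry of supported HF architectures and Model.from_model_architecture simply looks the name up, so this traceback means nothing registers 'Gemma2ForCausalLM'; the bundled llama.cpp snapshot appears to predate Gemma 2 support. To list what this copy of the script does register (assuming it follows the upstream @Model.register(...) decorator pattern):

grep -n "Model.register" /aaaa/T-MAC/3rdparty/llama.cpp/convert-hf-to-gguf-t-mac.py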