MPS Support - CMAKE_ARGS="LLAMA_METAL=1" #317
I was able to patch this locally:

```python
# Load the library
def _load_shared_library(lib_base_name: str):
    # Determine the file extension based on the platform
    if sys.platform.startswith("linux"):
        lib_ext = ".so"
    elif sys.platform == "darwin":
        lib_ext = ".dylib"  # <<< Was also ".so"
```

However, I still seem to get crashes trying to load models subsequently.

Edit: Best I can tell, it's failing to load the Metal shader for some reason, and it seems like that's supposed to be embedded into the dylib somehow?

```objc
// read the source from "ggml-metal.metal" into a string and use newLibraryWithSource
{
    NSError * error = nil;

    //NSString * path = [[NSBundle mainBundle] pathForResource:@"../../examples/metal/metal" ofType:@"metal"];
    NSString * path = [[NSBundle mainBundle] pathForResource:@"ggml-metal" ofType:@"metal"];
    fprintf(stderr, "%s: loading '%s'\n", __func__, [path UTF8String]);

    NSString * src = [NSString stringWithContentsOfFile:path encoding:NSUTF8StringEncoding error:&error];
    if (error) {
        fprintf(stderr, "%s: error: %s\n", __func__, [[error description] UTF8String]);
        exit(1);
    }
```
|
"Fixed" the second issue by copying |
So it seems like the upstream llama.cpp Makefile and CMakeLists disagree about what the extension of the shared library should be. Per this discussion, you can force libllama to be generated with the
Not clear to me if this might negatively impact other platforms, but it's enough to make FORCE_CMAKE builds generate the expected |
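For illustration, one Python-side way to tolerate the extension mismatch is to probe both `.so` and `.dylib` on macOS instead of assuming a single suffix. This is a minimal sketch, not llama-cpp-python's shipped loader; the search directory and naming convention here are assumptions:

```python
import ctypes
import pathlib
import sys


def _load_shared_library(lib_base_name: str) -> ctypes.CDLL:
    # Candidate extensions per platform; on macOS the Makefile and CMake
    # builds have produced different suffixes, so try both.
    if sys.platform.startswith("linux"):
        lib_exts = [".so"]
    elif sys.platform == "darwin":
        lib_exts = [".so", ".dylib"]
    else:
        lib_exts = [".dll"]

    # Search next to this file (assumed layout: library lives in the package dir).
    base_path = pathlib.Path(__file__).parent
    for lib_ext in lib_exts:
        lib_path = base_path / f"lib{lib_base_name}{lib_ext}"
        if lib_path.exists():
            return ctypes.CDLL(str(lib_path))
    raise FileNotFoundError(f"lib{lib_base_name} not found in {base_path}")
```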
Thanks @zach-brockway, I can successfully get it to load with this bit:
This goes into llama_cpp.py in the site-packages folder. However, it still only uses the CPU, not the GPU, even when I copy llama.cpp/ggml-metal.metal to site-packages/llama_cpp. I suspect it's because I don't know where this is supposed to go:
Also where does this function go?
|
That was a modification I had to make to |
Many thanks, but worried this may be a dead end. If I use this version in CMakeLists.txt:
I get this error:
If I try to make it more explicit:
I get the same error. I really appreciate your help trying to work around this, but I think you are right, this needs to happen upstream. It works fine from the command line, but interfacing with the Python package makes it very difficult. |
Sorry for the slow reply, I should be able to get access to an M1 tonight and get this sorted, cheers. |
@abetlen sounds awesome. Please let me know if you're having issues and I'll let you ssh into one of mine :) |
@abetlen hey any updates? This would be an amazing update! |
I tried copying
Then recompiled it. |
I added an option to
@zach-brockway can you expand on this? |
@abetlen how should we install llama-cpp-python to make it work with LLAMA_METAL?
|
Sure! So the To work around this, I ended up running the equivalent of the following command: It seems like better upstream fixes might be something like having the shared library look alongside where it's located on disk, or ideally even embedding the shader into the dylib at build time somehow (since if the compiled code and shader get out of sync, that can also cause crashes). |
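A rough sketch of that workaround follows: copy `ggml-metal.metal` from a llama.cpp checkout into the installed `llama_cpp` package directory, next to the shared library, so `ggml_metal_init` can find it at runtime. The checkout path and the `copy_metal_shader` helper are assumptions for illustration, not part of llama-cpp-python:

```python
import importlib.util
import shutil
from pathlib import Path


def copy_metal_shader(llama_cpp_checkout: str) -> Path:
    """Copy ggml-metal.metal next to the installed llama_cpp shared library."""
    spec = importlib.util.find_spec("llama_cpp")
    pkg_dir = Path(spec.origin).parent              # e.g. .../site-packages/llama_cpp
    shader = Path(llama_cpp_checkout) / "ggml-metal.metal"
    return Path(shutil.copy2(shader, pkg_dir))


if __name__ == "__main__":
    # Assumed location of the llama.cpp sources; adjust to your checkout.
    print(copy_metal_shader("./vendor/llama.cpp"))
```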
Thanks, I tried to do the same. In my case my python is located at |
A workaround might be to use |
@zach-brockway I think you're right that this requires a change to how llama.cpp is built as a shared library. I'll try to work on a PR for that but I only have remote access to a Mac so if anyone else is a better cmake ninja and has a mac in front of them I would really appreciate the help. |
Trying to build as a shared object as part of another project yields this result. Best to ignore the problem with python and focus on the core issue. I'll see if I can wrangle a fix |
@abetlen I've spent an hour or so trying different build variants to isolate (and fix?) this issue, so far without success. I have had llama-cpp-python working a couple of times, but haven't yet isolated a reproducible/working install process for macOS Metal. |
Just pushed v0.1.62 that includes Metal support, let me know if that works! |
@abetlen: installed with
|
@abetlen running
|
Have you updated the library with the latest changes?
|
@ianscrivener Yes, I've updated the library. The solution was to pass n_gpu_layers=1 into the constructor:
Without that the model doesn't use GPU. Sorry for the false alarm. |
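For reference, the constructor call being described looks like this (the model path is a placeholder):

```python
from llama_cpp import Llama

# n_gpu_layers=1 is what enables Metal offload here; without it, inference stays on the CPU.
llm = Llama(model_path="./path/to/model.ggmlv3.q4_0.bin", n_gpu_layers=1)
```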
Great... working beautifully now. 🤙
Good work all!! 🏆
Many thanks 🙏
|
Set |
My tests with Llama 2 7B, 13B and 70B models on my Mac M1 with 64 GB RAM are here: Summary of results:
|
@karrtikiyer Model:

Installation:

```sh
conda create -n llamaM1 python=3.9.16
conda activate llamaM1
CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install -U llama-cpp-python --no-cache-dir
python testM1llama.py
```

Working code for M1 Metal GPU:

```python
from llama_cpp import Llama

model_path = './llama-2-13b-chat.ggmlv3.q4_0.bin'
lm = Llama(model_path,
           n_ctx=2048,
           n_gpu_layers=600)
output = lm("Provide a Python function that gets input a positive integer and output a list of it prime factors.",
            max_tokens=1000,
            stream=True)
for token in output:
    print(token['choices'][0]['text'], end='', flush=True)
```

Code output:

llama.cpp: loading model from ./llama-2-13b-chat.ggmlv3.q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_head_kv = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 1
llama_model_load_internal: rnorm_eps = 1.0e-06
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.11 MB
llama_model_load_internal: mem required = 7477.72 MB (+ 1600.00 MB per state)
llama_new_context_with_model: kv self size = 1600.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Users/miniforge3/envs/llamaM1/lib/python3.9/site-packages/llama_cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add 0x15503f680
ggml_metal_init: loaded kernel_add_row 0x155041820
ggml_metal_init: loaded kernel_mul 0x123e08390
ggml_metal_init: loaded kernel_mul_row 0x123e089b0
ggml_metal_init: loaded kernel_scale 0x123e09d80
ggml_metal_init: loaded kernel_silu 0x123e0a410
ggml_metal_init: loaded kernel_relu 0x123e092d0
ggml_metal_init: loaded kernel_gelu 0x123e09530
ggml_metal_init: loaded kernel_soft_max 0x123e0b7d0
ggml_metal_init: loaded kernel_diag_mask_inf 0x1551795c0
ggml_metal_init: loaded kernel_get_rows_f16 0x155179980
ggml_metal_init: loaded kernel_get_rows_q4_0 0x15517ae20
ggml_metal_init: loaded kernel_get_rows_q4_1 0x155179be0
ggml_metal_init: loaded kernel_get_rows_q2_K 0x15517be50
ggml_metal_init: loaded kernel_get_rows_q3_K 0x153fa2b90
ggml_metal_init: loaded kernel_get_rows_q4_K 0x153fa3770
ggml_metal_init: loaded kernel_get_rows_q5_K 0x153fa8760
ggml_metal_init: loaded kernel_get_rows_q6_K 0x153fa8e50
ggml_metal_init: loaded kernel_rms_norm 0x153fa9540
ggml_metal_init: loaded kernel_norm 0x153fa9c70
ggml_metal_init: loaded kernel_mul_mat_f16_f32 0x153faa400
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32 0x153faac80
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32 0x153fab490
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32 0x153fac590
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32 0x153facd20
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32 0x153fad4e0
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32 0x153fadc80
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32 0x153fae9c0
ggml_metal_init: loaded kernel_rope 0x153faef00
ggml_metal_init: loaded kernel_alibi_f32 0x155040e90
ggml_metal_init: loaded kernel_cpy_f32_f16 0x155041d40
ggml_metal_init: loaded kernel_cpy_f32_f32 0x155042310
ggml_metal_init: loaded kernel_cpy_f16_f16 0x1550434c0
ggml_metal_init: recommendedMaxWorkingSetSize = 49152.00 MB
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: maxTransferRate = built-in GPU
llama_new_context_with_model: max tensor size = 87.89 MB
ggml_metal_add_buffer: allocated 'data ' buffer, size = 6984.06 MB, ( 6984.52 / 49152.00)
ggml_metal_add_buffer: allocated 'eval ' buffer, size = 12.00 MB, ( 6996.52 / 49152.00)
ggml_metal_add_buffer: allocated 'kv ' buffer, size = 1602.00 MB, ( 8598.52 / 49152.00)
ggml_metal_add_buffer: allocated 'scr0 ' buffer, size = 290.00 MB, ( 8888.52 / 49152.00)
ggml_metal_add_buffer: allocated 'scr1 ' buffer, size = 192.00 MB, ( 9080.52 / 49152.00)
AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
Example:
>>> factor(5)
[1, 5]
>>> factor(25)
[3, 5]
Note: The input integer is always positive, so you can assume that the input is a non-negative integer.
Here are some hints to help you write this function:
* A prime number is a positive integer that is divisible only by itself and 1.
* You can use the built-in `isprime` function from the `math.gcd` module to check if an integer is prime.
* You can use a loop to iterate over the range of possible divisors (2 to n/2, where n is the input integer) and check if each one is a factor.
* If you find a prime factor, you can add it to the list of factors and continue iterating until you have found all the prime factors.
def factor(n):
|
Here's the documentation for installing on macOS: https://github.com/abetlen/llama-cpp-python/blob/main/docs/install/macos.md. It seems you missed the PyTorch step... |
I have a MacBook M1 Pro with 16 GB RAM. I am trying to run the model on the GPU using the lines below, and that works fine. However, when trying to run it using the code below, I received the following error.
Below is the error. Can anyone advise?
|
@ahmed-man3, thoughts:
(2) Did you download the .gguf model, or convert it yourself? |
Thank you for your prompt support. The required installation (1) has been completed as suggested. For the model, yes, it has been downloaded as .gguf from TheBloke using this link: https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF/tree/main |
Unsure - perhaps try:
|
Your support is very much appreciated. It works fine now after applying #1 & #5. Thank you. |
Good to hear. 🏆 |
Why are we passing "LLAMA_METAL=1"? |
Previously "To disable the Metal build at compile time use the |
Thank you so much |
The main llama.cpp has been updated to support GPUs on Macs with the following flag (tested on my system):

```sh
LLAMA_METAL=1 make -j && ./main -m /Downloads/guanaco-65B.ggmlv3.q4_0.bin -p "I believe the meaning of life is" --ignore-eos -n 64 -ngl 1
```

It looks like the following flag needs to be added to the CMake options:

```sh
CMAKE_ARGS="LLAMA_METAL=1" FORCE_CMAKE=1 pip install -e .
```

While it appears that it installs successfully, the library cannot be loaded.
This happens regardless of whether the GitHub repo or PyPI is used.