MPS Support - CMAKE_ARGS="LLAMA_METAL=1" #317

Closed
leedrake5 opened this issue Jun 5, 2023 · 38 comments
Labels
build hardware Hardware specific issue

Comments

@leedrake5

leedrake5 commented Jun 5, 2023

The main llama.cpp has been updated to support GPUs on Macs with the following flag (tested on my system):

LLAMA_METAL=1 make -j && ./main -m /Downloads/guanaco-65B.ggmlv3.q4_0.bin -p "I believe the meaning of life is" --ignore-eos -n 64 -ngl 1

It looks like the following flag needs to be added to the CMake options:

CMAKE_ARGS="LLAMA_METAL=1" FORCE_CMAKE=1 pip install -e .

While it appears that it installs successfully, the library cannot be loaded.

>>> from llama_cpp import Llama, LlamaCache
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/homebrew/Caskroom/mambaforge/base/lib/python3.10/site-packages/llama_cpp/__init__.py", line 1, in <module>
    from .llama_cpp import *
  File "/opt/homebrew/Caskroom/mambaforge/base/lib/python3.10/site-packages/llama_cpp/llama_cpp.py", line 73, in <module>
    _lib = _load_shared_library(_lib_base_name)
  File "/opt/homebrew/Caskroom/mambaforge/base/lib/python3.10/site-packages/llama_cpp/llama_cpp.py", line 64, in _load_shared_library
    raise FileNotFoundError(
FileNotFoundError: Shared library with base name 'llama' not found

This happens regardless of whether the GitHub repo or PyPI is used.

@zach-brockway

zach-brockway commented Jun 5, 2023

I was able to patch this locally:

# Load the library
def _load_shared_library(lib_base_name: str):
    # Determine the file extension based on the platform
    if sys.platform.startswith("linux"):
        lib_ext = ".so"
    elif sys.platform == "darwin":
        lib_ext = ".dylib" # <<< Was also ".so"

However, I still seem to get crashes trying to load models subsequently:

llama_model_load_internal: mem required  = 2532.67 MB (+ 3124.00 MB per state)
....................................................................................................
llama_init_from_file: kv self size  = 3120.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '(null)'
ggml_metal_init: error: Error Domain=NSCocoaErrorDomain Code=258 "The file name is invalid."
/Users/voltrondata/github-actions-runner/_work/crossbow/crossbow/arrow/cpp/src/arrow/filesystem/s3fs.cc:2598:  arrow::fs::FinalizeS3 was not called even though S3 was initialized.  This could lead to a segmentation fault at exit

Edit: Best I can tell, it's failing to load the Metal shader for some reason, and it seems like that's supposed to be embedded into the dylib somehow?

    // read the source from "ggml-metal.metal" into a string and use newLibraryWithSource
    {
        NSError * error = nil;

        //NSString * path = [[NSBundle mainBundle] pathForResource:@"../../examples/metal/metal" ofType:@"metal"];
        NSString * path = [[NSBundle mainBundle] pathForResource:@"ggml-metal" ofType:@"metal"];
        fprintf(stderr, "%s: loading '%s'\n", __func__, [path UTF8String]);

        NSString * src  = [NSString stringWithContentsOfFile:path encoding:NSUTF8StringEncoding error:&error];
        if (error) {
            fprintf(stderr, "%s: error: %s\n", __func__, [[error description] UTF8String]);
            exit(1);
        }

@gjmulder gjmulder added build hardware Hardware specific issue labels Jun 5, 2023
@gjmulder gjmulder changed the title MPS Support MPS Support - CMAKE_ARGS="LLAMA_METAL=1" Jun 5, 2023
@zach-brockway

"Fixed" the second issue by copying llama.cpp/ggml-metal.metal to the same directory as my python binary!

@zach-brockway

So it seems like the upstream llama.cpp Makefile and CMakeLists disagree about what the extension of the shared library should be. Per this discussion, you can force libllama to be generated with the .so extension instead of .dylib by adding the MODULE keyword here:

add_library(llama MODULE
            llama.cpp
            llama.h
            llama-util.h
            )

Not clear to me if this might negatively impact other platforms, but it's enough to make FORCE_CMAKE builds generate the expected libllama.so, rather than a libllama.dylib that the Python has trouble finding.
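
To check which extension a given build actually produced, one could list the libllama.* files in the installed package directory. This is only a minimal sketch: the helper name find_llama_libs is hypothetical, and the directory to pass in is wherever site-packages/llama_cpp lives on your system.

```python
from pathlib import Path


def find_llama_libs(package_dir):
    """Return the sorted names of libllama.* files found directly in package_dir."""
    return sorted(p.name for p in Path(package_dir).glob("libllama.*"))


# Example (path is an assumption; adjust to your environment):
# print(find_llama_libs("/path/to/site-packages/llama_cpp"))
```

If the result contains only libllama.dylib while the loader looks for libllama.so, that would match the FileNotFoundError reported above.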

@leedrake5
Author

leedrake5 commented Jun 5, 2023

Thanks @zach-brockway, I can successfully get it to load with this bit:

# Load the library
def _load_shared_library(lib_base_name: str):
    # Determine the file extension based on the platform
    if sys.platform.startswith("linux"):
        lib_ext = ".so"
    elif sys.platform == "darwin":
        lib_ext = ".dylib" # <<< Was also ".so"

This goes into llama_cpp.py in the site-packages folder. However, it still only uses the CPU, not the GPU, even when I copy llama.cpp/ggml-metal.metal to site-packages/llama_cpp. I suspect it's because I don't know where this is supposed to go:

    // read the source from "ggml-metal.metal" into a string and use newLibraryWithSource
    {
        NSError * error = nil;

        //NSString * path = [[NSBundle mainBundle] pathForResource:@"../../examples/metal/metal" ofType:@"metal"];
        NSString * path = [[NSBundle mainBundle] pathForResource:@"ggml-metal" ofType:@"metal"];
        fprintf(stderr, "%s: loading '%s'\n", __func__, [path UTF8String]);

        NSString * src  = [NSString stringWithContentsOfFile:path encoding:NSUTF8StringEncoding error:&error];
        if (error) {
            fprintf(stderr, "%s: error: %s\n", __func__, [[error description] UTF8String]);
            exit(1);
        }

Also where does this function go?

add_library(llama MODULE
            llama.cpp
            llama.h
            llama-util.h
            )

@zach-brockway

Also where does this function go?

add_library(llama MODULE
            llama.cpp
            llama.h
            llama-util.h
            )

That was a modification I had to make to vendor/llama.cpp/CMakeLists.txt (line 412). The workarounds are starting to pile up at every level! But I'm pretty sure llama.cpp will in fact want to take a fix for this upstream. Either their static Makefile should output .dylib, or their CMakeLists.txt should output .so; no good reason for the current discrepancy.

@leedrake5
Author

Many thanks, but worried this may be a dead end. If I use this version in CMakeLists.txt:

add_library(llama MODULE
            llama.cpp
            llama.h
            llama-util.h
            )

I get this error:

CMake Error at CMakeLists.txt:418 (target_include_directories):
  Cannot specify include directories for target "llama" which is not built by
  this project.
CMake Error at tests/CMakeLists.txt:4 (target_link_libraries):
  Target "llama" of type MODULE_LIBRARY may not be linked into another
  target.  One may link only to INTERFACE, OBJECT, STATIC or SHARED
  libraries, or to executables with the ENABLE_EXPORTS property set.
Call Stack (most recent call first):
  tests/CMakeLists.txt:9 (llama_add_test)

If I try to make it more explicit:

add_library(llama.so
            llama.cpp
            llama.h
            llama-util.h
            )

I get the same error. I really appreciate your help trying to work around this, but I think you are right: this needs to happen upstream. It works fine from the command line, but interfacing with the Python package makes it very difficult.

@abetlen
Owner

abetlen commented Jun 5, 2023

Sorry for the slow reply, I should be able to get access to an M1 tonight and get this sorted, cheers.

@mhenrichsen

@abetlen sounds awesome. Please let me know if you're having issues and I'll let you ssh into one of mine :)

@Jchang4

Jchang4 commented Jun 6, 2023

@abetlen hey any updates? This would be an amazing update!

@fungyeung

"Fixed" the second issue by copying llama.cpp/ggml-metal.metal to the same directory as my python binary!

I tried copying ggml-metal.metal to multiple locations but still got the "file name is invalid" error.
Eventually, I "fixed" it by hardcoding the absolute path in vendor/llama.cpp/ggml-metal.m around line 101:

//NSString * path = [[NSBundle mainBundle] pathForResource:@"ggml-metal" ofType:@"metal"];
NSString * path = @"/path/to/vendor/llama.cpp/ggml-metal.metal";

Then recompile it.

@abetlen
Owner

abetlen commented Jun 8, 2023

I added an option to llama_cpp.py to accept both .so and .dylib extensions on macOS.

"Fixed" the second issue by copying llama.cpp/ggml-metal.metal to the same directory as my python binary!

@zach-brockway can you expand on this?
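
For readers following along, the idea of accepting both extensions on macOS could look roughly like the sketch below. This is not the actual llama_cpp.py code; the function names candidate_lib_paths and first_existing are hypothetical, and only the Linux and macOS cases discussed in this thread are covered.

```python
import sys
from pathlib import Path


def candidate_lib_paths(base_dir, lib_base_name, platform=sys.platform):
    """List the shared-library paths a loader could try, in order."""
    if platform.startswith("linux"):
        exts = [".so"]
    elif platform == "darwin":
        # Per this thread, the Makefile build yields .so on macOS while the
        # CMake build yields .dylib, so both are tried.
        exts = [".so", ".dylib"]
    else:
        raise RuntimeError(f"Platform not covered by this sketch: {platform}")
    return [Path(base_dir) / f"lib{lib_base_name}{ext}" for ext in exts]


def first_existing(paths):
    """Return the first path that exists on disk, or None if none do."""
    return next((p for p in paths if p.exists()), None)
```

A loader built this way would pass the first existing path to ctypes.CDLL and raise FileNotFoundError only when no candidate is found.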

@lucasquinteiro

@abetlen how should we install llama-cpp-python to make it work with LLAMA_METAL?

I added an option to llama_cpp.py to accept both .so and .dylib extensions on macos.

@zach-brockway

"Fixed" the second issue by copying llama.cpp/ggml-metal.metal to the same directory as my python binary!

@zach-brockway can you expand on this?

Sure! So the ggml_metal_init errors I was receiving when attempting to load a model (loading '(null)' / error: Error Domain=NSCocoaErrorDomain Code=258 "The file name is invalid.") turned out to be attributable to the llama.cpp code I quoted in the edit to my first comment, where it tries to locate the ggml-metal.metal shader file using NSBundle pathForResource:ofType:.

To work around this, I ended up running the equivalent of the following command: cp vendor/llama.cpp/ggml-metal.metal $(dirname $(which python)) (the destination, in my case, was something like /opt/homebrew/Caskroom/miniconda/base/envs/mycondaenv/bin).

It seems like better upstream fixes might be something like having the shared library look alongside where it's located on disk, or ideally even embedding the shader into the dylib at build time somehow (since if the compiled code and shader get out of sync, that can also cause crashes).

@fungyeung

Thanks, I tried to do the same. In my case my python is located at venv/bin/python, so I copied ggml-metal.metal to venv/bin. It didn't work, though. The only way I could make it work was to hardcode the NSString path in ggml-metal.m.

@zach-brockway

Thanks, I tried to do the same. In my case my python is located at venv/bin/python, so I copied ggml-metal.metal to venv/bin. It didn't work, though. The only way I could make it work was to hardcode the NSString path in ggml-metal.m.

venv is a special case I think, the bin directory just contains symlinks to the underlying Python distribution that was active at the time you created the environment:

$ ls -lha python*
lrwxr-xr-x  1 zach  staff     7B May 21 22:33 python -> python3
lrwxr-xr-x  1 zach  staff    49B May 21 22:33 python3 -> /opt/homebrew/Caskroom/miniconda/base/bin/python3
lrwxr-xr-x  1 zach  staff     7B May 21 22:33 python3.9 -> python3

A workaround might be to use cp vendor/llama.cpp/ggml-metal.metal $(dirname $(realpath $(which python))), where realpath resolves the symlink first.
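
The same symlink-resolving copy can be sketched in Python. This is an illustration of the workaround above, not a confirmed fix; the helper names real_python_dir and copy_shader are hypothetical, and the shader path is whatever your checkout uses.

```python
import os
import shutil
import sys


def real_python_dir(python_path=sys.executable):
    """Directory of the interpreter after resolving symlinks (as found in a venv)."""
    return os.path.dirname(os.path.realpath(python_path))


def copy_shader(shader_path, python_path=sys.executable):
    """Copy the Metal shader next to the real interpreter and return the new path."""
    return shutil.copy(shader_path, real_python_dir(python_path))


# Example (path is an assumption):
# copy_shader("vendor/llama.cpp/ggml-metal.metal")
```

Inside a venv, real_python_dir() returns the bin directory of the underlying distribution rather than venv/bin.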

@abetlen
Owner

abetlen commented Jun 8, 2023

@zach-brockway I think you're right that this requires a change to how llama.cpp is built as a shared library. I'll try to work on a PR for that but I only have remote access to a Mac so if anyone else is a better cmake ninja and has a mac in front of them I would really appreciate the help.

@jacobfriedman

Trying to build as a shared object as part of another project yields this result. Best to ignore the problem with python and focus on the core issue. I'll see if I can wrangle a fix

@ianscrivener
Contributor

@zach-brockway I think you're right that this requires a change to how llama.cpp is built as a shared library. I'll try to work on a PR for that but I only have remote access to a Mac so if anyone else is a better cmake ninja and has a mac in front of them I would really appreciate the help.

@abetlen I've spent an hour or so trying different build variants to isolate (and fix?) this issue, so far without success. I have had llama-cpp-python working a couple of times, but haven't yet isolated a reproducible, working install process for macOS Metal.

@abetlen
Owner

abetlen commented Jun 10, 2023

Just pushed v0.1.62 that includes Metal support, let me know if that works!

@WojtekKowaluk

@abetlen: installed with CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install llama-cpp-python and it works:

INFO:Loading 7B...
INFO:llama.cpp weights detected: models/7B/Wizard-Vicuna-7B-Uncensored.ggmlv3.q4_0.bin

INFO:Cache capacity is 0 bytes
llama.cpp: loading model from models/7B/Wizard-Vicuna-7B-Uncensored.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: mem required  = 5407.71 MB (+ 1026.00 MB per state)
.
llama_init_from_file: kv self size  = 1024.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Users/wojtek/Documents/text-generation-webui/venv/lib/python3.10/site-packages/llama_cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add                            0x16a366ec0
ggml_metal_init: loaded kernel_mul                            0x16a3674f0
ggml_metal_init: loaded kernel_mul_row                        0x16a3678f0
ggml_metal_init: loaded kernel_scale                          0x16a367cf0
ggml_metal_init: loaded kernel_silu                           0x16a3680f0
ggml_metal_init: loaded kernel_relu                           0x16a3684f0
ggml_metal_init: loaded kernel_gelu                           0x16a3688f0
ggml_metal_init: loaded kernel_soft_max                       0x16a368e80
ggml_metal_init: loaded kernel_diag_mask_inf                  0x16a369280
ggml_metal_init: loaded kernel_get_rows_f16                   0x1209d5250
ggml_metal_init: loaded kernel_get_rows_q4_0                  0x1209ecd30
ggml_metal_init: loaded kernel_get_rows_q4_1                  0x1209edb70
ggml_metal_init: loaded kernel_get_rows_q2_k                  0x1209ee300
ggml_metal_init: loaded kernel_get_rows_q4_k                  0x16a369680
ggml_metal_init: loaded kernel_get_rows_q6_k                  0x16a369be0
ggml_metal_init: loaded kernel_rms_norm                       0x16a36a170
ggml_metal_init: loaded kernel_mul_mat_f16_f32                0x16a36a8b0
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32               0x16a36aff0
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32               0x1209ee8a0
ggml_metal_init: loaded kernel_mul_mat_q2_k_f32               0x1209ef0d0
ggml_metal_init: loaded kernel_mul_mat_q4_k_f32               0x1209ef670
ggml_metal_init: loaded kernel_mul_mat_q6_k_f32               0x16a36b730
ggml_metal_init: loaded kernel_rope                           0x16a36bf00
ggml_metal_init: loaded kernel_cpy_f32_f16                    0x16a36c7f0
ggml_metal_init: loaded kernel_cpy_f32_f32                    0x1209efc40
ggml_metal_add_buffer: allocated 'data            ' buffer, size =  3616.08 MB
ggml_metal_add_buffer: allocated 'eval            ' buffer, size =   768.00 MB
ggml_metal_add_buffer: allocated 'kv              ' buffer, size =  1026.00 MB
ggml_metal_add_buffer: allocated 'scr0            ' buffer, size =   512.00 MB
ggml_metal_add_buffer: allocated 'scr1            ' buffer, size =   512.00 MB
AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 
INFO:Loaded the model in 1.04 seconds.


llama_print_timings:        load time =  6887.87 ms
llama_print_timings:      sample time =   133.05 ms /    77 runs   (    1.73 ms per token)
llama_print_timings: prompt eval time =  6887.83 ms /    16 tokens (  430.49 ms per token)
llama_print_timings:        eval time =  8762.61 ms /    76 runs   (  115.30 ms per token)
llama_print_timings:       total time = 16282.09 ms
Output generated in 16.53 seconds (4.60 tokens/s, 76 tokens, context 16, seed 1703054888)
Llama.generate: prefix-match hit

llama_print_timings:        load time =  6887.87 ms
llama_print_timings:      sample time =   229.93 ms /    77 runs   (    2.99 ms per token)
llama_print_timings: prompt eval time =     0.00 ms /     1 tokens (    0.00 ms per token)
llama_print_timings:        eval time =  7226.16 ms /    77 runs   (   93.85 ms per token)
llama_print_timings:       total time =  8139.00 ms
Output generated in 8.44 seconds (9.01 tokens/s, 76 tokens, context 16, seed 1286945878)
Llama.generate: prefix-match hit

llama_print_timings:        load time =  6887.87 ms
llama_print_timings:      sample time =   133.86 ms /    77 runs   (    1.74 ms per token)
llama_print_timings: prompt eval time =     0.00 ms /     1 tokens (    0.00 ms per token)
llama_print_timings:        eval time =  7927.91 ms /    77 runs   (  102.96 ms per token)
llama_print_timings:       total time =  8573.75 ms
Output generated in 8.84 seconds (8.60 tokens/s, 76 tokens, context 16, seed 708232749)
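
To compare runs like these, the llama_print_timings lines can be turned into tokens per second. The sketch below assumes the log format shown in this thread (which may differ in other versions); parse_timings and tokens_per_second are hypothetical helper names.

```python
import re

# Matches e.g. "llama_print_timings:   eval time =  8762.61 ms /    76 runs ..."
_TIMING = re.compile(
    r"llama_print_timings:\s+(?P<name>[\w ]+?) time =\s*(?P<ms>[\d.]+) ms"
    r"(?: /\s*(?P<count>\d+) (?:runs|tokens))?"
)


def parse_timings(log_text):
    """Map each timing name to (milliseconds, count or None)."""
    timings = {}
    for line in log_text.splitlines():
        m = _TIMING.search(line)
        if m:
            count = int(m.group("count")) if m.group("count") else None
            timings[m.group("name")] = (float(m.group("ms")), count)
    return timings


def tokens_per_second(ms, count):
    """Convert a (milliseconds, count) pair into tokens per second."""
    return count / (ms / 1000.0)
```

For the first run above, the eval line (8762.61 ms for 76 runs) works out to about 8.67 tokens/s for generation alone, while the 4.60 tokens/s reported on the "Output generated" line divides the 76 tokens by the full 16.53 seconds, prompt evaluation included.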

@abetlen abetlen closed this as completed Jun 10, 2023
@pgagarinov

@abetlen running CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install llama-cpp-python inside a virtual environment or inside conda environment doesn't solve the problem - the model still only uses CPU:

llama.cpp: loading model from /Users/peter/_Git/_GPT/llama.cpp/models/wizardLM-7B.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32001
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: mem required  = 5407.72 MB (+ 1026.00 MB per state)
.
llama_init_from_file: kv self size  =  256.00 MB
AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |

llama_print_timings:        load time =   634.15 ms
llama_print_timings:      sample time =   229.50 ms /   333 runs   (    0.69 ms per token)
llama_print_timings: prompt eval time =   634.07 ms /    11 tokens (   57.64 ms per token)
llama_print_timings:        eval time = 13948.15 ms /   332 runs   (   42.01 ms per token)
llama_print_timings:       total time = 16233.21 ms

@lucasquinteiro

@abetlen running CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install llama-cpp-python inside a virtual environment or inside conda environment doesn't solve the problem - the model still only uses CPU:

Have you updated the library with the latest changes?

CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install -U llama-cpp-python --no-cache-dir

@pgagarinov

@ianscrivener Yes, I've updated the library.

The solution was to pass n_gpu_layers=1 into the constructor:

Llama(model_path=llama_path, n_gpu_layers=1)

Without that the model doesn't use GPU. Sorry for the false alarm.
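
A quick way to tell the two cases apart is to look for ggml_metal_init lines, which appear in the GPU runs posted in this thread and are absent from the CPU-only log above. The check below is a heuristic sketch based only on that log format; metal_initialized is a hypothetical helper that expects the log already saved as a string.

```python
def metal_initialized(log_text):
    """Return True if any log line starts with the ggml_metal_init prefix."""
    return any(
        line.startswith("ggml_metal_init:") for line in log_text.splitlines()
    )
```

A False result on a Metal-enabled build suggests checking that n_gpu_layers was passed to the constructor.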

@ianscrivener
Contributor

ianscrivener commented Jun 12, 2023 via email

@gjmulder
Contributor

Set n_gpu_layers=1000 to move all LLM layers to the GPU. Only reduce this number to less than the number of layers the LLM has if you are running low on GPU memory.
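
As a small illustration of that advice, the constructor arguments can be collected in one place so the offload setting is never forgotten. This is a sketch only: build_llama_kwargs is a hypothetical helper, and the default of 1000 simply follows the suggestion above.

```python
def build_llama_kwargs(model_path, n_gpu_layers=1000, **extra):
    """Collect keyword arguments for the Llama constructor.

    A large n_gpu_layers requests that all layers be offloaded; lower it
    if GPU memory is tight.
    """
    if n_gpu_layers < 0:
        raise ValueError("n_gpu_layers must be non-negative")
    kwargs = {"model_path": model_path, "n_gpu_layers": n_gpu_layers}
    kwargs.update(extra)
    return kwargs


# Usage (requires llama-cpp-python built with Metal; the path is an assumption):
# from llama_cpp import Llama
# llm = Llama(**build_llama_kwargs("./model.bin", n_ctx=2048))
```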

@karrtikiyer

I see that MPS is being used:

llama_init_from_file: kv self size  = 6093.75 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '~/.pyenv/versions/mambaforge/envs/gptwizards/lib/python3.10/site-packages/llama_cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add                            0x12ee8d010
ggml_metal_init: loaded kernel_mul                            0x12ee8d270
ggml_metal_init: loaded kernel_mul_row                        0x12ee8d4d0
ggml_metal_init: loaded kernel_scale                          0x12ee8d730
ggml_metal_init: loaded kernel_silu                           0x12ee8d990
ggml_metal_init: loaded kernel_relu                           0x12ee8dbf0
ggml_metal_init: loaded kernel_gelu                           0x12ee8de50
ggml_metal_init: loaded kernel_soft_max                       0x12ee8e0b0
ggml_metal_init: loaded kernel_diag_mask_inf                  0x12ee8e310
ggml_metal_init: loaded kernel_get_rows_f16                   0x12ee8e570
ggml_metal_init: loaded kernel_get_rows_q4_0                  0x12ee8e7d0
ggml_metal_init: loaded kernel_get_rows_q4_1                  0x12ee8ea30
ggml_metal_init: loaded kernel_get_rows_q2_k                  0x12ee8ec90
ggml_metal_init: loaded kernel_get_rows_q3_k                  0x12ee8eef0
ggml_metal_init: loaded kernel_get_rows_q4_k                  0x12ee8f150
ggml_metal_init: loaded kernel_get_rows_q5_k                  0x12ee8f3b0
ggml_metal_init: loaded kernel_get_rows_q6_k                  0x12ee8f610
ggml_metal_init: loaded kernel_rms_norm                       0x12ee8f870
ggml_metal_init: loaded kernel_mul_mat_f16_f32                0x12ee8fe10
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32               0x14b337050
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32               0x14b337470
ggml_metal_init: loaded kernel_mul_mat_q2_k_f32               0x14b337890
ggml_metal_init: loaded kernel_mul_mat_q3_k_f32               0x14b337cd0
ggml_metal_init: loaded kernel_mul_mat_q4_k_f32               0x14b338390
ggml_metal_init: loaded kernel_mul_mat_q5_k_f32               0x14b3388d0
ggml_metal_init: loaded kernel_mul_mat_q6_k_f32               0x14b338e10
ggml_metal_init: loaded kernel_rope                           0x14b339560
ggml_metal_init: loaded kernel_cpy_f32_f16                    0x14b339e50
ggml_metal_init: loaded kernel_cpy_f32_f32                    0x14b33a540
ggml_metal_add_buffer: allocated 'data            ' buffer, size = 14912.78 MB
ggml_metal_add_buffer: allocated 'eval            ' buffer, size =  1280.00 MB
ggml_metal_add_buffer: allocated 'kv              ' buffer, size =  6095.75 MB
ggml_metal_add_buffer: allocated 'scr0            ' buffer, size =   512.00 MB
ggml_metal_add_buffer: allocated 'scr1            ' buffer, size =   512.00 MB
AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |

However, Activity Monitor shows GPU usage at 0%. Can someone advise, please?
Screenshot 2023-06-16 at 8 26 25 PM

@alexshmmy

alexshmmy commented Aug 18, 2023

My tests with Llama 2 7B, 13B and 70B models on my M1 Mac (64 GB RAM) are here:
ggerganov/llama.cpp#2508 (comment)

Summary of results:

  • The models llama-2-7b-chat.ggmlv3.q8_0.bin, llama-2-13b-chat.ggmlv3.q8_0.bin and llama-2-70b-chat.ggmlv3.q4_0.bin work on the CPU (do not forget the parameter n_gqa = 8 for the 70B model).
  • The models llama-2-7b-chat.ggmlv3.q4_0.bin and llama-2-13b-chat.ggmlv3.q4_0.bin work on the GPU with Metal.
  • The models llama-2-13b-chat.ggmlv3.q8_0.bin and llama-2-70b-chat.ggmlv3.q4_0.bin do not work on the GPU.

@alexshmmy

alexshmmy commented Aug 18, 2023

@karrtikiyer
The following code runs on the GPU via Metal on my M1 (64 GB RAM, 32 GPU cores):

Screenshot 2023-08-18 at 16 04 46

Model:

  • llama-2-13b-chat.ggmlv3.q4_0.bin

Installation:

conda create -n llamaM1 python=3.9.16
conda activate llamaM1
CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install -U llama-cpp-python --no-cache-dir
python testM1llama.py

Working code for M1 metal GPU:

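from llama_cpp import Llama

model_path = './llama-2-13b-chat.ggmlv3.q4_0.bin'

# A value larger than the model's layer count offloads all layers to the GPU.
lm = Llama(model_path, n_ctx=2048, n_gpu_layers=600)

# stream=True yields partial completions as they are generated.
output = lm(
    "Provide a Python function that gets input a positive integer and output a list of it prime factors.",
    max_tokens=1000,
    stream=True,
)

# Print each streamed chunk as soon as it arrives.
for token in output:
    print(token['choices'][0]['text'], end='', flush=True)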

Code output:

llama.cpp: loading model from ./llama-2-13b-chat.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_head_kv  = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 1.0e-06
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =    0.11 MB
llama_model_load_internal: mem required  = 7477.72 MB (+ 1600.00 MB per state)
llama_new_context_with_model: kv self size  = 1600.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Users/miniforge3/envs/llamaM1/lib/python3.9/site-packages/llama_cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add                            0x15503f680
ggml_metal_init: loaded kernel_add_row                        0x155041820
ggml_metal_init: loaded kernel_mul                            0x123e08390
ggml_metal_init: loaded kernel_mul_row                        0x123e089b0
ggml_metal_init: loaded kernel_scale                          0x123e09d80
ggml_metal_init: loaded kernel_silu                           0x123e0a410
ggml_metal_init: loaded kernel_relu                           0x123e092d0
ggml_metal_init: loaded kernel_gelu                           0x123e09530
ggml_metal_init: loaded kernel_soft_max                       0x123e0b7d0
ggml_metal_init: loaded kernel_diag_mask_inf                  0x1551795c0
ggml_metal_init: loaded kernel_get_rows_f16                   0x155179980
ggml_metal_init: loaded kernel_get_rows_q4_0                  0x15517ae20
ggml_metal_init: loaded kernel_get_rows_q4_1                  0x155179be0
ggml_metal_init: loaded kernel_get_rows_q2_K                  0x15517be50
ggml_metal_init: loaded kernel_get_rows_q3_K                  0x153fa2b90
ggml_metal_init: loaded kernel_get_rows_q4_K                  0x153fa3770
ggml_metal_init: loaded kernel_get_rows_q5_K                  0x153fa8760
ggml_metal_init: loaded kernel_get_rows_q6_K                  0x153fa8e50
ggml_metal_init: loaded kernel_rms_norm                       0x153fa9540
ggml_metal_init: loaded kernel_norm                           0x153fa9c70
ggml_metal_init: loaded kernel_mul_mat_f16_f32                0x153faa400
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32               0x153faac80
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32               0x153fab490
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32               0x153fac590
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32               0x153facd20
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32               0x153fad4e0
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32               0x153fadc80
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32               0x153fae9c0
ggml_metal_init: loaded kernel_rope                           0x153faef00
ggml_metal_init: loaded kernel_alibi_f32                      0x155040e90
ggml_metal_init: loaded kernel_cpy_f32_f16                    0x155041d40
ggml_metal_init: loaded kernel_cpy_f32_f32                    0x155042310
ggml_metal_init: loaded kernel_cpy_f16_f16                    0x1550434c0
ggml_metal_init: recommendedMaxWorkingSetSize = 49152.00 MB
ggml_metal_init: hasUnifiedMemory             = true
ggml_metal_init: maxTransferRate              = built-in GPU
llama_new_context_with_model: max tensor size =    87.89 MB
ggml_metal_add_buffer: allocated 'data            ' buffer, size =  6984.06 MB, ( 6984.52 / 49152.00)
ggml_metal_add_buffer: allocated 'eval            ' buffer, size =    12.00 MB, ( 6996.52 / 49152.00)
ggml_metal_add_buffer: allocated 'kv              ' buffer, size =  1602.00 MB, ( 8598.52 / 49152.00)
ggml_metal_add_buffer: allocated 'scr0            ' buffer, size =   290.00 MB, ( 8888.52 / 49152.00)
ggml_metal_add_buffer: allocated 'scr1            ' buffer, size =   192.00 MB, ( 9080.52 / 49152.00)
AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 

Example:

>>> factor(5)
[1, 5]

>>> factor(25)
[3, 5]

Note: The input integer is always positive, so you can assume that the input is a non-negative integer.

Here are some hints to help you write this function:

* A prime number is a positive integer that is divisible only by itself and 1.
* You can use the built-in `isprime` function from the `math.gcd` module to check if an integer is prime.
* You can use a loop to iterate over the range of possible divisors (2 to n/2, where n is the input integer) and check if each one is a factor.
* If you find a prime factor, you can add it to the list of factors and continue iterating until you have found all the prime factors.

def factor(n):
# base case: if n = 1, return [1]
if n == 1:
return [1]

# recursive case: if n is not 1, find its prime factors and return a list of factors
factors = []
for i in range(2, int(n/2) + 1):
    if n % i == 0:
        factors.append(i)
        n = n // i
        while n % i == 0:
            factors.append(i)
            n = n // i

# check if n is prime, if it is, add it to the list of factors
if not any(x > 1 for x in factors):
    factors.append(n)

return factors

This function uses a loop to iterate over the range of possible divisors (2 to n/2) and checks if each one is a factor. If a prime factor is found, it is added to the list of factors and the iteration continues until all prime factors are found. The function also checks if the input integer is prime, and if so, it adds it to the list of factors.
Here's an example of how the function works:
>>> factor(5)
[1, 5]

The function starts by checking if 5 is prime. Since it is not prime (5 % 2 == 0), it iterates over the range of possible divisors (2 to 5/2 + 1 = 3). It finds that 5 is divisible by 3, so it adds 3 to the list of factors and continues iterating until all prime factors are found. The final list of factors is [1, 3, 5].
Note that this function assumes that the input integer is non-negative. If you need to handle negative integers as well, you can modify the function accordingly

llama_print_timings:        load time =  5757.52 ms
llama_print_timings:      sample time =  1421.70 ms /   566 runs   (    2.51 ms per token,   398.12 tokens per second)
llama_print_timings: prompt eval time =  5757.49 ms /    22 tokens (  261.70 ms per token,     3.82 tokens per second)
llama_print_timings:        eval time = 22935.47 ms /   565 runs   (   40.59 ms per token,    24.63 tokens per second)
llama_print_timings:       total time = 32983.57 ms
.

ggml_metal_free: deallocating
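
For anyone comparing runs, the throughput figures in the `llama_print_timings` lines can be recomputed from the raw milliseconds and counts. The sketch below is only an illustration: the regular expression and the helper names `parse_timings` and `tokens_per_second` are my own, not part of llama.cpp, and they assume the log format shown above.

```python
import re

# Matches lines such as:
#   llama_print_timings:        eval time = 22935.47 ms /   565 runs   (...)
# The "/ N runs|tokens" part is optional (load and total time have no count).
TIMING_RE = re.compile(
    r"llama_print_timings:\s*(?P<name>[a-z ]+?) time\s*=\s*(?P<ms>[\d.]+) ms"
    r"(?:\s*/\s*(?P<count>\d+) (?:runs|tokens))?"
)


def parse_timings(log_text):
    """Return {stage: (milliseconds, count or None)} for each timing line."""
    timings = {}
    for match in TIMING_RE.finditer(log_text):
        count = match.group("count")
        timings[match.group("name")] = (
            float(match.group("ms")),
            int(count) if count else None,
        )
    return timings


def tokens_per_second(ms, count):
    """Convert a duration in milliseconds and a token count to tokens/s."""
    return count / (ms / 1000.0)


# Timing lines copied from the log above.
SAMPLE = """\
llama_print_timings:        load time =  5757.52 ms
llama_print_timings: prompt eval time =  5757.49 ms /    22 tokens (  261.70 ms per token,     3.82 tokens per second)
llama_print_timings:        eval time = 22935.47 ms /   565 runs   (   40.59 ms per token,    24.63 tokens per second)
"""

if __name__ == "__main__":
    for stage, (ms, count) in parse_timings(SAMPLE).items():
        if count:
            print(f"{stage}: {tokens_per_second(ms, count):.2f} tokens/s")
```

The eval stage is the one that reflects generation speed; the prompt eval figure here is dominated by the one-off load.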

@ianscrivener
Contributor

Here's the documentation for installing on macOS: https://github.com/abetlen/llama-cpp-python/blob/main/docs/install/macos.md

Seems you missed the PyTorch step...

@ahmed-man3

I have a MacBook M1 Pro with 16 GB RAM. Running the model on the GPU with the command below works fine:
LLAMA_METAL=1 make -j && ./main -m ./models/llama-2-13b-chat.Q4_0.gguf -p "I believe the meaning of life is" --ignore-eos -n 64 -ngl 1

However, when I try to run it from Python with the code below, I get an error:

from llama_cpp import Llama

model_path = '/Users/asq/llama.cpp/models/llama-2-13b-chat.Q4_0.gguf'
lm = Llama(model_path,
             n_ctx = 2048,
             n_gpu_layers = 1)

output = lm("Give me a list of famous mathematicians between born from 1800 to 2000.",
              max_tokens = 1000, 
              stream = True)
  
for token in output:
    print(token['choices'][0]['text'], end='', flush=True)

Below is the error. Can anyone advise?

                                                                 ^
program_source:2349:66: error: explicit instantiation of 'kernel_mul_mm' does not refer to a function template, variable template, member function, member class, or static data member
template [[host_name("kernel_mul_mm_q2_K_f32")]] kernel mat_mm_t kernel_mul_mm<block_q2_K, QK_NL, dequantize_q2_K>;
                                                                 ^
program_source:2350:66: error: explicit instantiation of 'kernel_mul_mm' does not refer to a function template, variable template, member function, member class, or static data member
template [[host_name("kernel_mul_mm_q3_K_f32")]] kernel mat_mm_t kernel_mul_mm<block_q3_K, QK_NL, dequantize_q3_K>;
                                                                 ^
program_source:2351:66: error: explicit instantiation of 'kernel_mul_mm' does not refer to a function template, variable template, member function, member class, or static data member
template [[host_name("kernel_mul_mm_q4_K_f32")]] kernel mat_mm_t kernel_mul_mm<block_q4_K, QK_NL, dequantize_q4_K>;
                                                                 ^
program_source:2352:66: error: explicit instantiation of 'kernel_mul_mm' does not refer to a function template, variable template, member function, member class, or static data member
template [[host_name("kernel_mul_mm_q5_K_f32")]] kernel mat_mm_t kernel_mul_mm<block_q5_K, QK_NL, dequantize_q5_K>;
                                                                 ^
program_source:2353:66: error: explicit instantiation of 'kernel_mul_mm' does not refer to a function template, variable template, member function, member class, or static data member
template [[host_name("kernel_mul_mm_q6_K_f32")]] kernel mat_mm_t kernel_mul_mm<block_q6_K, QK_NL, dequantize_q6_K>;
                                                                 ^
}
llama_new_context_with_model: ggml_metal_init() failed
Traceback (most recent call last):
  File "/Users/asq/Documents/ML/Llama2-Chatbot-main/testMetal.py", line 4, in <module>
    lm = Llama(model_path,
  File "/Users/asq/opt/anaconda3/envs/llama/lib/python3.9/site-packages/llama_cpp/llama.py", line 350, in __init__
    assert self.ctx is not None
AssertionError

@ianscrivener
Contributor

ianscrivener commented Sep 19, 2023

@ahmed-man3,
I just tested your Python code. It works fine with llama-2-7b-chat.ggmlv3.q6_K.gguf on my MacBook M2 Pro with 16 GB RAM.

Thoughts:
(1) Make sure you (force) pull and install the latest llama-cpp-python (and hence llama.cpp), i.e.

CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install --upgrade --no-cache-dir llama-cpp-python
pip install 'llama-cpp-python[server]'

(2) Did you download the .gguf model, or convert it yourself?
To rule out a problem with the model, I use GGUF models from TheBloke. I have had issues with models from others.

@ahmed-man3

@ianscrivener

Thank you for your prompt support. I have completed the installation in (1) as suggested. As for the model, yes, it was downloaded as a .gguf from TheBloke using this link: https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF/tree/main

@ianscrivener
Contributor

Unsure. Perhaps try:

  1. move the .gguf file into the same directory as the Python file (I have had issues with absolute paths)
  2. see if the 13B model works with CPU only in llama-cpp-python
  3. try llama-cpp-python with ctx 1096
  4. try a different model, maybe llama-2-7b-chat.ggmlv3.q6_K.gguf
  5. try a different Python version (I'm using 3.10.12)
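
A rough sketch of how steps 1 to 3 and 5 could be scripted before loading the model. The helper names `model_beside_script` and `trial_settings` are made up for illustration; `n_ctx` and `n_gpu_layers` are the same `Llama(...)` arguments used in the code earlier in this thread, and I am assuming `n_gpu_layers=0` keeps everything on the CPU.

```python
import sys
from pathlib import Path


def model_beside_script(model_name, script_dir="."):
    """Return the path to a model file kept in the script's own directory.

    With the default script_dir the result stays a relative path.
    """
    path = Path(script_dir) / model_name
    if not path.is_file():
        raise FileNotFoundError(f"model file not found: {path}")
    return path


def trial_settings(n_ctx=1096):
    """Settings to try in order: CPU only first, then one layer on the GPU."""
    return [
        {"n_ctx": n_ctx, "n_gpu_layers": 0},
        {"n_ctx": n_ctx, "n_gpu_layers": 1},
    ]


if __name__ == "__main__":
    # Step 5: report which Python version is actually running the script.
    print("Python", ".".join(str(part) for part in sys.version_info[:3]))
    for settings in trial_settings():
        # Each dict could be passed on as:
        #   Llama(model_path=str(model_beside_script("model.gguf")), **settings)
        print(settings)
```

If the CPU-only settings load but the GPU ones do not, that points at the Metal build rather than the model file.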

@ahmed-man3

Your support is very much appreciated. It works fine now after applying #1 and #5. Thank you.

@ianscrivener
Contributor

Good to hear. 🏆

@shrijayan

Why are we passing "LLAMA_METAL=1"?

@ianscrivener
Contributor

ianscrivener commented Oct 31, 2023

Previously, LLAMA_METAL=1 was required when building for macOS with Metal, but Metal is now enabled by default.

"To disable the Metal build at compile time use the LLAMA_NO_METAL=1 flag or the LLAMA_METAL=OFF cmake option"
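
For the pip route, the same switch would go through `CMAKE_ARGS`. A minimal sketch, assuming the `-DLLAMA_METAL=OFF` form is accepted the same way as the `-DLLAMA_METAL=on` form from the install command earlier in this thread; `metal_build_env` is a hypothetical helper name, and flag names may differ in other llama-cpp-python versions.

```python
import os


def metal_build_env(enable_metal=True, base_env=None):
    """Build the environment for a forced CMake rebuild of llama-cpp-python.

    Assumption: -DLLAMA_METAL=OFF mirrors the -DLLAMA_METAL=on form used
    earlier in this thread.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CMAKE_ARGS"] = "-DLLAMA_METAL=on" if enable_metal else "-DLLAMA_METAL=OFF"
    env["FORCE_CMAKE"] = "1"
    return env


# Same pip arguments as the install command earlier in this thread.
PIP_COMMAND = ["pip", "install", "--upgrade", "--no-cache-dir", "llama-cpp-python"]

if __name__ == "__main__":
    env = metal_build_env(enable_metal=False)
    print("CMAKE_ARGS=" + env["CMAKE_ARGS"], "FORCE_CMAKE=" + env["FORCE_CMAKE"])
    # To actually rebuild, uncomment:
    # import subprocess
    # subprocess.run(PIP_COMMAND, env=env, check=True)
```

The rebuild itself is left commented out so that running the snippet only prints the settings it would use.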

@shrijayan

shrijayan commented Oct 31, 2023

Thank you so much
