Running with Metal for llama-2-13b-chat.ggmlv3.q8_0.bin with -ngl throws unimplemented error #2508
Comments
That's true. Q8_0 is not supported under Metal as of now. Same for Q5_0 and Q5_1.
Just out of curiosity, is there a technical limitation as to why these aren't supported, or have they just not been implemented?
No limitations - should be easy to support. PRs welcome.
@ggerganov if this isn't too hard to do, I can try to take a look if you give me some pointers, but I haven't worked with C/C++ in many years and I'm extremely rusty. I would be interested to compare 70B q4 and q8; that's what prompted my post. I just want to check how much quantization can degrade the biggest models.
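For anyone who picks this up: the missing pieces are essentially Metal versions of the Q8_0 get_rows/mul_mat kernels, analogous to the Q4_0 ones listed in the ggml_metal_init logs below. As a non-authoritative sketch of what the dequantization step has to do, here it is in Python, assuming ggml's block_q8_0 layout (one fp16 scale followed by 32 int8 quants per block); the helper name is made up for illustration:

```python
import numpy as np

QK8_0 = 32  # ggml Q8_0 block size: 32 weights per block

def dequantize_q8_0(raw: bytes) -> np.ndarray:
    """Illustrative sketch: decode ggml Q8_0 blocks to float32.
    Each block is 34 bytes: a float16 scale d, then 32 int8 quants;
    the dequantized weight is simply q * d."""
    block_bytes = 2 + QK8_0
    n_blocks = len(raw) // block_bytes
    out = np.empty(n_blocks * QK8_0, dtype=np.float32)
    for i in range(n_blocks):
        block = raw[i * block_bytes:(i + 1) * block_bytes]
        d = np.frombuffer(block[:2], dtype=np.float16)[0]
        q = np.frombuffer(block[2:], dtype=np.int8)
        out[i * QK8_0:(i + 1) * QK8_0] = q.astype(np.float32) * np.float32(d)
    return out
```

A Metal kernel version would do the same per-block math on the GPU; the existing kernel_get_rows_q4_0 / kernel_mul_mat_q4_0_f32 pairs in ggml-metal.metal look like the natural templates.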
For me it works great with pyllamacpp:

```python
from pyllamacpp.model import Model

input = "I want you to act as a physician. Explain what superconductors are."
model_path = './llama-2-13b-chat.ggmlv3.q8_0.bin'
model = Model(model_path)
for token in model.generate(input):
    print(token, end='', flush=True)
```

Output of code ($ python testLLM13B.py):

llama.cpp: loading model from ./llama-2-13b-chat.ggmlv3.q8_0.bin
llama.cpp: loading model from ./llama-2-13b-chat.ggmlv3.q8_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 7 (mostly Q8_0)
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.09 MB
llama_model_load_internal: mem required = 15237.95 MB (+ 3216.00 MB per state)
.
llama_init_from_file: kv self size = 800.00 MB
Explain their properties and the potential benefits they offer.
Superconductors are materials that exhibit zero electrical resistance when cooled below a certain temperature, known as the critical temperature (Tc). This means that superconductors can conduct electricity with perfect efficiency and without any loss of energy.
The properties of superconductors include:
1. Zero electrical resistance: Superconductors have zero electrical resistance when cooled below Tc, which makes them ideal for high-power applications such as power transmission and storage.
2. Perfect diamagnetism: Superconductors expel magnetic fields when cooled below Tc, which makes them useful in MRI machines and other medical applications.
3. Quantum levitation: Superconductors can levitate above a magnet when cooled below Tc, which has potential applications in transportation and energy storage.
4. High-temperature superconductivity: Some superconductors have critical temperatures above the boiling point of liquid nitrogen (77 K), making them more practical for real-world applications.
The potential benefits of superconductors include:
1. More efficient power transmission and storage: Superconductors can transmit and store electricity with perfect efficiency, which could lead to significant energy savings and reduced carbon emissions.
2. Improved medical imaging: Superconducting magnets are used in MRI machines, which provide higher-resolution images and faster scan times than traditional magnets.
3. High-speed transportation: Superconductors could be used to create magnetic levitation trains that are faster and more efficient than conventional trains.
4. Enhanced security: Superconducting sensors can detect even slight changes in magnetic fields, which could be useful in security applications such as intrusion detection.
5. Energy storage: Superconductors could be used to store energy generated by renewable sources such as wind and solar power, which could help to reduce our reliance on fossil fuels.
Overall, superconductors have the potential to revolutionize a wide range of industries and provide significant benefits to society. However, more research is needed to fully understand their properties and potential applications.
Is this running on CPU or Metal? 8-bit works fine on CPU.
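For reference, with llama-cpp-python (the wrapper used below; pyllamacpp may differ) the CPU/Metal split is controlled by n_gpu_layers: 0 keeps everything on the CPU, and in this version the ggml_metal_init lines only appear in the log when layers are actually offloaded. A minimal sketch, assuming a Metal build and a Metal-supported quantization like q4_0:

```python
from llama_cpp import Llama

# CPU-only run: no ggml_metal_init lines in the log
lm_cpu = Llama('./llama-2-13b-chat.ggmlv3.q4_0.bin', n_gpu_layers=0)

# Metal run: offload all 40 layers of the 13B model
# (values larger than n_layer just offload everything)
lm_gpu = Llama('./llama-2-13b-chat.ggmlv3.q4_0.bin', n_gpu_layers=40)
```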
So far, on my Mac M1 Max (64 GB RAM, 10-core CPU, 32-core GPU):

Installation:

```sh
conda create -n llamaM1 python=3.9.16
conda activate llamaM1
CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install -U llama-cpp-python --no-cache-dir
python testM1llama.py
```

Working code for M1 Metal GPU:

```python
from llama_cpp import Llama

model_path = './llama-2-13b-chat.ggmlv3.q4_0.bin'
lm = Llama(model_path,
           n_ctx=2048,
           n_gpu_layers=130)
output = lm("Give me a list of famous mathematicians between born from 1800 to 2000.",
            max_tokens=1000,
            stream=True)
for token in output:
    print(token['choices'][0]['text'], end='', flush=True)
```

Code output:

llama.cpp: loading model from ./llama-2-13b-chat.ggmlv3.q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_head_kv = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 1
llama_model_load_internal: rnorm_eps = 1.0e-06
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.11 MB
llama_model_load_internal: mem required = 7477.72 MB (+ 1600.00 MB per state)
llama_new_context_with_model: kv self size = 1600.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Users/miniforge3/envs/llamaM1/lib/python3.9/site-packages/llama_cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add 0x106d3d160
ggml_metal_init: loaded kernel_add_row 0x106d3f350
ggml_metal_init: loaded kernel_mul 0x106e05250
ggml_metal_init: loaded kernel_mul_row 0x106e05a40
ggml_metal_init: loaded kernel_scale 0x106e066a0
ggml_metal_init: loaded kernel_silu 0x106e072e0
ggml_metal_init: loaded kernel_relu 0x106e05ca0
ggml_metal_init: loaded kernel_gelu 0x106e079c0
ggml_metal_init: loaded kernel_soft_max 0x107204810
ggml_metal_init: loaded kernel_diag_mask_inf 0x106e08830
ggml_metal_init: loaded kernel_get_rows_f16 0x106e08a90
ggml_metal_init: loaded kernel_get_rows_q4_0 0x106e09400
ggml_metal_init: loaded kernel_get_rows_q4_1 0x106e09cd0
ggml_metal_init: loaded kernel_get_rows_q2_K 0x106e0a3c0
ggml_metal_init: loaded kernel_get_rows_q3_K 0x106e0aa90
ggml_metal_init: loaded kernel_get_rows_q4_K 0x106e0b190
ggml_metal_init: loaded kernel_get_rows_q5_K 0x106e0b890
ggml_metal_init: loaded kernel_get_rows_q6_K 0x106e0bf90
ggml_metal_init: loaded kernel_rms_norm 0x106e0c6b0
ggml_metal_init: loaded kernel_norm 0x106e0ce20
ggml_metal_init: loaded kernel_mul_mat_f16_f32 0x106e0de10
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32 0x106e0e620
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32 0x12a7a4690
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32 0x12a7a4cf0
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32 0x12a7a5cc0
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32 0x12a7a6480
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32 0x12a7a6c10
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32 0x12a7a7390
ggml_metal_init: loaded kernel_rope 0x12a7a53f0
ggml_metal_init: loaded kernel_alibi_f32 0x106d3e600
ggml_metal_init: loaded kernel_cpy_f32_f16 0x106d3f860
ggml_metal_init: loaded kernel_cpy_f32_f32 0x106d3fe30
ggml_metal_init: loaded kernel_cpy_f16_f16 0x106d40fc0
ggml_metal_init: recommendedMaxWorkingSetSize = 49152.00 MB
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: maxTransferRate = built-in GPU
llama_new_context_with_model: max tensor size = 87.89 MB
ggml_metal_add_buffer: allocated 'data ' buffer, size = 6984.06 MB, ( 6984.52 / 49152.00)
ggml_metal_add_buffer: allocated 'eval ' buffer, size = 12.00 MB, ( 6996.52 / 49152.00)
ggml_metal_add_buffer: allocated 'kv ' buffer, size = 1602.00 MB, ( 8598.52 / 49152.00)
ggml_metal_add_buffer: allocated 'scr0 ' buffer, size = 290.00 MB, ( 8888.52 / 49152.00)
ggml_metal_add_buffer: allocated 'scr1 ' buffer, size = 192.00 MB, ( 9080.52 / 49152.00)
AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
I'm looking for the most famous mathematicians of all time, and I want to know who the most influential mathematicians are in different areas of mathematics. Please provide a list of famous mathematicians that meet my criteria:
Born between 1800 and 2000
Made significant contributions to their respective fields (such as calculus, geometry, number theory, etc.)
Are widely recognized for their work and have had a lasting impact on the field of mathematics.
Here is a list of famous mathematicians that meet your criteria:
1. Carl Friedrich Gauss (1777-1855) - Gauss made significant contributions to number theory, geometry, and calculus. He is considered one of the greatest mathematicians of all time and is known as the "prince of mathematics."
2. Georg Cantor (1845-1918) - Cantor developed the theory of set theory and transfinite numbers, which revolutionized mathematics and had a lasting impact on modern mathematics.
3. David Hilbert (1862-1943) - Hilbert is known for his work on infinite-dimensional vector spaces, calculus, and number theory. He is considered one of the most important mathematicians of the 20th century.
4. Emmy Noether (1882-1935) - Noether made significant contributions to abstract algebra and is known for her work on symmetries in physics. She is considered one of the most important female mathematicians of all time.
5. Albert Einstein (1879-1955) - Einstein is known for his work on relativity, which had a lasting impact on modern physics and mathematics. He is also known for his work on Brownian motion and the photoelectric effect.
6. Andrew Wiles (1953-present) - Wiles made headlines in 1994 when he proved Fermat's Last Theorem, which had been unsolved for over 350 years. He is considered one of the most important mathematicians of the 20th century.
7. Grigori Perelman (1966-present) - Perelman made significant contributions to the field of geometry and is known for his work on the Poincaré conjecture, which was solved in 2003. He is considered one of the most important mathematicians of the 21st century.
8. Terence Tao (1975-present) - Tao is a polymath who has made significant contributions to many areas of mathematics, including harmonic analysis, partial differential equations, and number theory. He is considered one of the most important mathematicians of the 21st century.
9. Maryam Mirzakhani (1978-2017) - Mirzakhani was a brilliant mathematician who made significant contributions to the field of geometry and is known for her work on the dynamics and symmetry of curved spaces. She was the first woman to win the Fields Medal, which is considered the most prestigious award in mathematics.
10. Ngô Bảo Châu (1972-present) - Châu is a Vietnamese-French mathematician who has made significant contributions to number theory and algebraic geometry. He was awarded the Fields Medal in 2010 for his work on the Langlands program, which is a vast web of connections between different areas of mathematics.
Please note that this is not an exhaustive list, and there are many other famous mathematicians who have made significant contributions to their respective fields. However, these individuals are widely recognized as some of the most influential mathematicians of all time
llama_print_timings: load time = 2044.25 ms
llama_print_timings: sample time = 617.99 ms / 808 runs ( 0.76 ms per token, 1307.46 tokens per second)
llama_print_timings: prompt eval time = 2044.22 ms / 24 tokens ( 85.18 ms per token, 11.74 tokens per second)
llama_print_timings: eval time = 31352.04 ms / 807 runs ( 38.85 ms per token, 25.74 tokens per second)
llama_print_timings: total time = 35253.11 ms
.
ggml_metal_free: deallocating

Non-working code for M1 Metal GPU:

```python
from llama_cpp import Llama

model_path = './llama-2-70b-chat.ggmlv3.q4_0.bin'
lm = Llama(model_path,
           n_ctx=2048,
           n_gpu_layers=130,
           n_gqa=8)
output = lm("Give me a list of famous mathematicians between born from 1800 to 2000.",
            max_tokens=1000,
            stream=True)
for token in output:
    print(token['choices'][0]['text'], end='', flush=True)
```

Code output:

llama.cpp: loading model from ./llama-2-70b-chat.ggmlv3.q4_0.bin
llama_model_load_internal: warning: assuming 70B model based on GQA == 8
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 8192
llama_model_load_internal: n_mult = 4096
llama_model_load_internal: n_head = 64
llama_model_load_internal: n_head_kv = 8
llama_model_load_internal: n_layer = 80
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 8
llama_model_load_internal: rnorm_eps = 1.0e-06
llama_model_load_internal: n_ff = 28672
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: model size = 70B
llama_model_load_internal: ggml ctx size = 0.21 MB
llama_model_load_internal: mem required = 37854.96 MB (+ 640.00 MB per state)
llama_new_context_with_model: kv self size = 640.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Users/miniforge3/envs/llamaM1/lib/python3.9/site-packages/llama_cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add 0x11ee961d0
ggml_metal_init: loaded kernel_add_row 0x11ee98480
ggml_metal_init: loaded kernel_mul 0x10eebf420
ggml_metal_init: loaded kernel_mul_row 0x10eec0120
ggml_metal_init: loaded kernel_scale 0x10eebf680
ggml_metal_init: loaded kernel_silu 0x10eec1430
ggml_metal_init: loaded kernel_relu 0x10eec0380
ggml_metal_init: loaded kernel_gelu 0x10eec1b70
ggml_metal_init: loaded kernel_soft_max 0x10eec27e0
ggml_metal_init: loaded kernel_diag_mask_inf 0x10eec2c70
ggml_metal_init: loaded kernel_get_rows_f16 0x10eec3730
ggml_metal_init: loaded kernel_get_rows_q4_0 0x10eec3df0
ggml_metal_init: loaded kernel_get_rows_q4_1 0x10eec4680
ggml_metal_init: loaded kernel_get_rows_q2_K 0x10eec4d70
ggml_metal_init: loaded kernel_get_rows_q3_K 0x10ef93be0
ggml_metal_init: loaded kernel_get_rows_q4_K 0x10ef949d0
ggml_metal_init: loaded kernel_get_rows_q5_K 0x104218670
ggml_metal_init: loaded kernel_get_rows_q6_K 0x10ef94c30
ggml_metal_init: loaded kernel_rms_norm 0x10ef959f0
ggml_metal_init: loaded kernel_norm 0x10ef96610
ggml_metal_init: loaded kernel_mul_mat_f16_f32 0x10ef95f90
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32 0x10ef96ed0
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32 0x10ef97780
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32 0x10ef985a0
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32 0x10ef98ed0
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32 0x10ef99e60
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32 0x10ef9a5f0
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32 0x10ef9b260
ggml_metal_init: loaded kernel_rope 0x10ef9b860
ggml_metal_init: loaded kernel_alibi_f32 0x10431e330
ggml_metal_init: loaded kernel_cpy_f32_f16 0x10431ef10
ggml_metal_init: loaded kernel_cpy_f32_f32 0x10431f9e0
ggml_metal_init: loaded kernel_cpy_f16_f16 0x10431fc40
ggml_metal_init: recommendedMaxWorkingSetSize = 49152.00 MB
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: maxTransferRate = built-in GPU
llama_new_context_with_model: max tensor size = 205.08 MB
ggml_metal_add_buffer: allocated 'data ' buffer, size = 36864.00 MB, offs = 0
ggml_metal_add_buffer: allocated 'data ' buffer, size = 412.30 MB, offs = 38439649280, (37276.75 / 49152.00)
ggml_metal_add_buffer: allocated 'eval ' buffer, size = 24.00 MB, (37300.75 / 49152.00)
ggml_metal_add_buffer: allocated 'kv ' buffer, size = 642.00 MB, (37942.75 / 49152.00)
ggml_metal_add_buffer: allocated 'scr0 ' buffer, size = 456.00 MB, (38398.75 / 49152.00)
ggml_metal_add_buffer: allocated 'scr1 ' buffer, size = 304.00 MB, (38702.75 / 49152.00)
AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
GGML_ASSERT: /private/var/folders/fw/wjnxhm6n7bv6bwlk4pkxtdq00000gp/T/pip-install-lt5z7o3y/llama-cpp-python_6789b9807ac84e2ab2c3dcb9e071c493/vendor/llama.cpp/ggml-metal.m:612: ne02 == ne12
GGML_ASSERT: /private/var/folders/fw/wjnxhm6n7bv6bwlk4pkxtdq00000gp/T/pip-install-lt5z7o3y/llama-cpp-python_6789b9807ac84e2ab2c3dcb9e071c493/vendor/llama.cpp/ggml-metal.m:612: ne02 == ne12
Abort trap: 6
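A plausible reading of that assertion (not verified against this exact source revision): ne02 == ne12 in ggml-metal.m requires the two mat-mul operands to have the same size in dimension 2 (the head dimension), but with GQA the 70B model has n_head = 64 query heads sharing n_head_kv = 8 KV heads, so the attention mat-mul needs a broadcast that the Metal kernels don't implement yet; the CPU path handles it, which is why the same model runs without offloading. A numpy sketch of the mismatch (shapes only, purely illustrative):

```python
import numpy as np

n_head, n_head_kv, head_dim = 64, 8, 128     # 70B: n_gqa = 64 / 8 = 8
q = np.random.randn(n_head, 1, head_dim)      # one query token per head
k = np.random.randn(n_head_kv, 32, head_dim)  # 32 cached keys per KV head

# A kernel that insists on equal head counts (ne02 == ne12) fails here:
# q has 64 slices along dim 0, k only 8. The CPU path effectively
# broadcasts each KV head across n_gqa query heads instead:
group = n_head // n_head_kv
k_expanded = k[np.arange(n_head) // group]    # (64, 32, 128)
scores = np.einsum('hqd,hkd->hqk', q, k_expanded)
print(scores.shape)                           # (64, 1, 32)
```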
Got the same issue while using the Metal GPU.
I compiled with `LLAMA_METAL=1 make` on an M2 Max.

```sh
./main -m ./models/13B/llama-2-13b-chat.ggmlv3.q8_0.bin -ngl 8
```

This should at least not throw any error (I know I have to specify more specific params), but it throws:

```
GGML_ASSERT: ggml-metal.m:905: false && "not implemented"
zsh: abort ./main -m ./models/13B/llama-2-13b-chat.ggmlv3.q8_0.bin --temp 0.0 -n -1 1.1
```

I'm on an Apple M2 Max and got the weights from https://huggingface.co/TheBloke...
I tried the q4 version and it worked. So is this not supported for q8?
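Per the first comment above, Q8_0 simply has no Metal kernels yet, so -ngl with a q8_0 model hits the not-implemented assert. Until those kernels land, a workaround sketch, shown with llama-cpp-python as used earlier in the thread (the same idea applies to ./main by passing -ngl 0):

```python
from llama_cpp import Llama

# Q8_0: keep all layers on the CPU; Metal has no Q8_0 kernels yet
lm_q8 = Llama('./models/13B/llama-2-13b-chat.ggmlv3.q8_0.bin', n_gpu_layers=0)

# Q4_0: fine to offload, as in the working 13B example above
lm_q4 = Llama('./models/13B/llama-2-13b-chat.ggmlv3.q4_0.bin', n_gpu_layers=40)
```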