Bump llama-cpp-python to 0.2.18 #4611

Merged
merged 10 commits into from
Nov 17, 2023

oobabooga
Owner

GPU offloading didn't work in 0.2.17, and now 0.2.18 crashes with "Illegal instruction (core dumped)" when I try to load a model. I'll leave this PR here until this gets figured out.

@oobabooga oobabooga changed the base branch from main to dev November 16, 2023 01:33
@mjameson

Thanks, this enables Falcon 180B functionality on my Mac Studio.

@Ph0rk0z
Contributor

Ph0rk0z commented Nov 16, 2023

I am not having any issues using the MMQ kernels. I may have lost about 0.3 t/s on a 70b model with an empty context. The new Min_P still needs to be added to the pure llama.cpp loader, along with all the samplers that exllama added.

@oobabooga
Owner Author

I think the missing exllamav2 samplers should be covered now in 58c6001.

@oobabooga
Owner Author

oobabooga commented Nov 16, 2023

Min_P cannot be added to the llama.cpp loader yet because while the python bindings for the C++ sampling functions are available, the parameter is not implemented in the _create_completion function:

https://github.com/abetlen/llama-cpp-python/blob/main/llama_cpp/llama.py#L1294
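
For readers unfamiliar with the sampler under discussion, here is a minimal sketch of what Min_P filtering does: it drops every token whose probability is below `min_p` times the probability of the most likely token, then renormalizes. This is a plain-Python illustration, not the llama.cpp or llama-cpp-python implementation, and the function name `min_p_filter` is hypothetical.

```python
import math


def min_p_filter(logits, min_p):
    """Zero out tokens below min_p * (top probability), then renormalize.

    Illustrative only: `min_p_filter` is a hypothetical helper, not part of
    llama.cpp or llama-cpp-python.
    """
    # Softmax with max-subtraction for numerical stability.
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # The cutoff scales with the confidence of the most likely token.
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]

    # Renormalize the surviving probabilities so they sum to 1.
    kept_total = sum(kept)
    return [p / kept_total for p in kept]


filtered = min_p_filter([2.0, 1.0, -5.0], min_p=0.1)
```

With these inputs the third token falls far below the cutoff and is removed, while the first two keep their relative order.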

I added the (apparently) new "seed" generation parameter though, which replaces the previous useless "seed" loading parameter.


I'm studying the possibility of removing the "CPU only" version of llama-cpp-python from the requirements, and installing only a single version for CUDA / AMD users with the original "llama_cpp" namespace instead of "llama_cpp_cuda".

The modified namespace seems to be causing problems, and I don't see the use case for 2 libraries now that the installer handles the AVX2, no AVX2, CUDA, and CPU only cases automatically.
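
To make the two-namespace setup concrete, here is a rough sketch of how a loader could pick whichever module is installed. It assumes the module names `llama_cpp_cuda` and `llama_cpp` mentioned in this thread; the helper `first_importable` is hypothetical and is not the webui's actual loading code.

```python
import importlib


def first_importable(candidates):
    """Return (name, module) for the first candidate that imports.

    Returns (None, None) if none of them can be imported. Hypothetical
    helper for illustration only.
    """
    for name in candidates:
        try:
            return name, importlib.import_module(name)
        except ImportError:
            # Not installed (or one of its own imports is missing): try the next.
            continue
    return None, None


# Intended use, preferring the CUDA build when it is present:
#   name, backend = first_importable(["llama_cpp_cuda", "llama_cpp"])
```

Installing a single package under the original namespace would make this kind of lookup unnecessary, which is the simplification being proposed here.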

Requesting feedback from @jllllll on whether this makes sense.

@oobabooga oobabooga merged commit 923c8e2 into dev Nov 17, 2023
oobabooga added a commit that referenced this pull request Nov 17, 2023
@jllllll
Contributor

jllllll commented Nov 17, 2023

@oobabooga
The original reason for having separate packages for both the CPU and CUDA versions was to allow for easier testing since the CUDA version can't fully switch off the CUDA code. There were some other reasons as well, but I don't remember them.

I am working on a fix for the issues that have arisen with this.
You can see some discussion as to the cause here:
jllllll/llama-cpp-python-cuBLAS-wheels#21
abetlen/llama-cpp-python#922

@oobabooga
Owner Author

@jllllll I was inclined to keep only the CUDA wheels for simplicity. Some people were confused by the "cpu" checkbox in the llama.cpp loader, and I also haven't seen anyone using the "cpu" option recently. But if you feel like it's best to keep this option, then we can keep it.

I am at this very moment trying to build wheels using your workflows with the -DLLAMA_CUDA_FORCE_MMQ=ON flag added, because without it, performance in the latest llama.cpp drops immensely on GPUs without tensor cores. See the reports here, here, and here.

There was also a report of higher memory usage on a Mac in the latest version; I don't know what is up with that.

Here is my commit (I was going to PR it to you later if it works): oobabooga/llama-cpp-python-cuBLAS-wheels@beb1b54

My wheels are still building so I haven't tested them. Your workflows are mindblowing by the way.

@jllllll
Contributor

jllllll commented Nov 17, 2023

I'll do a local build with that and test it on my 1080ti to see what the performance difference is.
Does that flag have a negative impact on newer GPU performance?

@oobabooga
Owner Author

oobabooga commented Nov 17, 2023

Yes, it makes fully offloaded performance for a 13b model on a 3090 go from ~45 tokens/second to ~30 tokens/second (or something like that). This optimization was introduced in ggerganov/llama.cpp#3776, but it doesn't work for all GPUs.

I think that there are plans for detecting the GPU model automatically at runtime in llama.cpp, but for now, the switch has to be made at compile time. I think that the best we can do is target the lower end GPUs until that update happens in llama.cpp.
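
Since the switch is made at compile time, a local source build has to pass the flag through to CMake. The sketch below only assembles the environment and command and does not run anything. It assumes that llama-cpp-python reads CMake options from the `CMAKE_ARGS` environment variable and that `-DLLAMA_CUBLAS=on` enables the CUDA backend, as its README described around this release; neither detail comes from this thread, and the helper `build_install_command` is hypothetical.

```python
import os
import sys


def build_install_command(version, force_mmq):
    """Return (env, argv) for a from-source install; nothing is executed here.

    Assumptions (not from this thread): CMAKE_ARGS is honored by the
    llama-cpp-python build, and -DLLAMA_CUBLAS=on selects the CUDA backend.
    """
    cmake_args = ["-DLLAMA_CUBLAS=on"]
    if force_mmq:
        # The compile-time switch discussed above.
        cmake_args.append("-DLLAMA_CUDA_FORCE_MMQ=ON")

    env = dict(os.environ)
    env["CMAKE_ARGS"] = " ".join(cmake_args)

    argv = [
        sys.executable, "-m", "pip", "install",
        f"llama-cpp-python=={version}",
        "--no-cache-dir", "--force-reinstall",
    ]
    return env, argv


env, argv = build_install_command("0.2.18", force_mmq=True)
# To actually build: subprocess.run(argv, env=env, check=True)
```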

@jllllll
Contributor

jllllll commented Nov 17, 2023

Without flag: 20-23 t/s
With flag: 24-37 t/s

Hopefully they will add GPU detection soon. That is something that has been needed for a while now.
They could just as easily hard-code the MMQ code for non-tensor GPUs in C++. No detection needed.
I'll start rebuilding llama-cpp-python-cuda wheels for 0.2.18. After that, I'll go through 0.2.14-0.2.17.

@oobabooga
Owner Author

I ran a test on my GTX 1650 and couldn't get the old performance back with my workflow wheels:

0.2.11:

llama_print_timings:        load time =  8880.86 ms
llama_print_timings:      sample time =   109.16 ms /   200 runs   (    0.55 ms per token,  1832.16 tokens per second)
llama_print_timings: prompt eval time = 78151.46 ms /  3200 tokens (   24.42 ms per token,    40.95 tokens per second)
llama_print_timings:        eval time = 133905.11 ms /   199 runs   (  672.89 ms per token,     1.49 tokens per second)
llama_print_timings:       total time = 212725.49 ms

0.2.18 without -DLLAMA_CUDA_FORCE_MMQ=ON:

llama_print_timings:        load time =   41630.44 ms
llama_print_timings:      sample time =     126.56 ms /   200 runs   (    0.63 ms per token,  1580.24 tokens per second)
llama_print_timings: prompt eval time =  289150.58 ms /  3200 tokens (   90.36 ms per token,    11.07 tokens per second)
llama_print_timings:        eval time =  147013.30 ms /   199 runs   (  738.76 ms per token,     1.35 tokens per second)
llama_print_timings:       total time =  437592.15 ms

0.2.18 with -DLLAMA_CUDA_FORCE_MMQ=ON (supposedly):

https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/wheels/llama_cpp_python-0.2.18+cu121-cp311-cp311-manylinux_2_31_x86_64.whl

llama_print_timings:        load time =   41533.16 ms
llama_print_timings:      sample time =     121.64 ms /   200 runs   (    0.61 ms per token,  1644.17 tokens per second)
llama_print_timings: prompt eval time =  287120.45 ms /  3200 tokens (   89.73 ms per token,    11.15 tokens per second)
llama_print_timings:        eval time =  138205.92 ms /   199 runs   (  694.50 ms per token,     1.44 tokens per second)
llama_print_timings:       total time =  426661.81 ms

Most likely I put -DLLAMA_CUDA_FORCE_MMQ=ON in the wrong places.

@jllllll
Contributor

jllllll commented Nov 17, 2023

This is what I did: jllllll/llama-cpp-python-cuBLAS-wheels@f6d1e53
I didn't use that flag for AMD GPU builds as I don't know what effect that flag will have or if it has one at all on AMD.

@oobabooga
Owner Author

I think that your wheels will work. My logs say that MMQ was not being forced in my build even though it should have been:

ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
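
A quick way to verify a wheel is to check these two startup lines programmatically. The sketch below parses captured log text in the format shown above; the helper name `parse_cublas_flags` is hypothetical, and it assumes the lines keep exactly this `ggml_init_cublas: NAME: yes/no` shape.

```python
import re

# Matches lines such as "ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no".
FLAG_LINE = re.compile(r"^ggml_init_cublas:\s*(\w+):\s*(yes|no)\s*$")


def parse_cublas_flags(log_text):
    """Map each reported flag name to True ("yes") or False ("no")."""
    flags = {}
    for line in log_text.splitlines():
        match = FLAG_LINE.match(line.strip())
        if match:
            flags[match.group(1)] = match.group(2) == "yes"
    return flags


sample = """ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes"""
flags = parse_cublas_flags(sample)
```

A build where the flag took effect would be expected to report `GGML_CUDA_FORCE_MMQ` as `yes`.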

@jllllll
Contributor

jllllll commented Nov 18, 2023

CUDA wheels are built. Currently building the ROCm wheels.

All of the relevant 0.2.18 wheels should be rebuilt now.

@oobabooga
Owner Author

oobabooga commented Nov 18, 2023

Thank you! This fixes the broken prompt processing times. The speed is not better than what it was in late September, but at least it's not a lot worse:

0.2.11:

llama_print_timings:        load time =  8880.86 ms
llama_print_timings:      sample time =   109.16 ms /   200 runs   (    0.55 ms per token,  1832.16 tokens per second)
llama_print_timings: prompt eval time = 78151.46 ms /  3200 tokens (   24.42 ms per token,    40.95 tokens per second)
llama_print_timings:        eval time = 133905.11 ms /   199 runs   (  672.89 ms per token,     1.49 tokens per second)
llama_print_timings:       total time = 212725.49 ms

0.2.18 (jllllll version):

llama_print_timings:        load time =    9984.09 ms
llama_print_timings:      sample time =     124.98 ms /   200 runs   (    0.62 ms per token,  1600.19 tokens per second)
llama_print_timings: prompt eval time =   90034.89 ms /  3200 tokens (   28.14 ms per token,    35.54 tokens per second)
llama_print_timings:        eval time =  141965.92 ms /   199 runs   (  713.40 ms per token,     1.40 tokens per second)
llama_print_timings:       total time =  233312.03 ms

0.2.18 without -DLLAMA_CUDA_FORCE_MMQ=ON:

llama_print_timings:        load time =   41630.44 ms
llama_print_timings:      sample time =     126.56 ms /   200 runs   (    0.63 ms per token,  1580.24 tokens per second)
llama_print_timings: prompt eval time =  289150.58 ms /  3200 tokens (   90.36 ms per token,    11.07 tokens per second)
llama_print_timings:        eval time =  147013.30 ms /   199 runs   (  738.76 ms per token,     1.35 tokens per second)
llama_print_timings:       total time =  437592.15 ms
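
The comparison above can be reproduced from the raw logs. Here is a small sketch that extracts the timing lines and computes the prompt-processing slowdown between 0.2.11 and the unflagged 0.2.18 build, using the two prompt eval lines quoted in this comment. The helper `parse_timings` is hypothetical and assumes the `llama_print_timings` format shown here.

```python
import re

# Captures the timing name, elapsed milliseconds, and optional run/token count.
TIMING = re.compile(
    r"llama_print_timings:\s*(?P<name>[a-z ]+?) time\s*=\s*(?P<ms>[\d.]+) ms"
    r"(?:\s*/\s*(?P<count>\d+)\s+(?:runs|tokens))?"
)


def parse_timings(log_text):
    """Return {name: {"ms": float, "per_second": float or None}}."""
    timings = {}
    for match in TIMING.finditer(log_text):
        ms = float(match.group("ms"))
        count = match.group("count")
        rate = int(count) / (ms / 1000.0) if count else None
        timings[match.group("name")] = {"ms": ms, "per_second": rate}
    return timings


old = parse_timings(
    "llama_print_timings: prompt eval time = 78151.46 ms /  3200 tokens"
)
new = parse_timings(
    "llama_print_timings: prompt eval time =  289150.58 ms /  3200 tokens"
)
slowdown = new["prompt eval"]["ms"] / old["prompt eval"]["ms"]
```

For these two lines the ratio comes out to roughly 3.7, which is the prompt processing regression that the rebuilt wheels fix.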

I have kept the llama_cpp_cuda libraries and the cpu option in this new PR: #4637

@oobabooga oobabooga deleted the llamacpp-bump branch November 18, 2023 21:39