Reproduce temp=0 llama.cpp results with some consistency (#28)
Could something like google/gemma.cpp#23 be happening here? Basically, the way quantization is implemented seems to result in lower performance on some architectures.
Exactly. It seems that quantizing the hidden state to q8_0 is not a good idea (see ggerganov/llama.cpp#4755; it is unfortunate that the bot closed that issue).
FWIW we're (gemma.cpp) actually using fp32.
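For intuition, here is a minimal sketch of the q8_0 round trip (not Llama2.jl's actual code; block size 32 and round-to-nearest mirror llama.cpp's q8_0, though the real format stores the scale as fp16 rather than Float32). It shows the error that quantizing the hidden state introduces at every layer:

```julia
# q8_0-style round trip: blocks of 32 values, one scale per block, Int8 quants.
function q8_0_roundtrip(x::Vector{Float32})
    @assert length(x) % 32 == 0
    y = similar(x)
    for block in Iterators.partition(eachindex(x), 32)
        d = maximum(abs, view(x, block)) / 127f0   # per-block scale
        d = ifelse(d == 0f0, 1f0, d)               # avoid 0/0 on all-zero blocks
        for i in block
            q = round(Int8, clamp(x[i] / d, -127f0, 127f0))  # 8-bit quant
            y[i] = q * d                                     # dequantize
        end
    end
    return y
end

x = randn(Float32, 4096)                       # hidden state of a typical width
println(maximum(abs, x .- q8_0_roundtrip(x)))  # nonzero: error accrues per layer
```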
With 42001c5, the zero-temperature behavior now matches the Metal backend of llama.cpp much more closely:

Llama2.jl (at 42001c5): (output omitted)

llama.cpp (at ggerganov/llama.cpp@637e9a8): (output omitted)

This is using the …
This is now fixed with the new vecdot routines: 587d270.
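For readers following along, a hedged sketch of the idea behind such routines (names and layout are illustrative, not the actual code in 587d270): keep the activations in Float32 and accumulate each quantized weight block in Float32, applying the block scale exactly once per block rather than quantizing the hidden state:

```julia
# Illustrative block container: 32 Int8 quants plus one Float32 scale.
struct BlockQ8
    d::Float32
    qs::NTuple{32,Int8}
end

# Dot product of q8_0-style weight blocks with Float32 activations,
# accumulating in Float32 and applying each block's scale once per block.
function vecdot_q8_f32(blocks::Vector{BlockQ8}, x::Vector{Float32})
    @assert 32 * length(blocks) == length(x)
    acc = 0f0
    for (b, blk) in enumerate(blocks)
        s = 0f0
        base = 32 * (b - 1)
        for i in 1:32
            s += blk.qs[i] * x[base + i]  # Int8 quant times Float32 activation
        end
        acc += blk.d * s                  # block scale applied once
    end
    return acc
end
```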
We need to find a way to detect what causes the differences between the two implementations.
The task is to get the same, or at least very similar, results at temp=0. We ran some tests with the new `.gguf` files, since that format has seen such wide adoption.

Llama2.jl test: (omitted)

llama.cpp `.gguf` test:
```
./main -m /Users/lukasmayrhofer/Downloads/llama-2-7b-chat.Q4_K_S.gguf --samplers "temp" --temp 0 -p "Tim was happy."
```
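Note that at temp=0 sampling degenerates to greedy argmax decoding, so both implementations are fully deterministic and their outputs are directly comparable token by token. A one-line sketch of that rule:

```julia
# At temp = 0 there is no randomness: the next token is simply the argmax.
sample_temp0(logits::Vector{Float32}) = argmax(logits)
```

This is also why tiny numerical differences matter here: when the top two logits are nearly tied, an error on the order of the quantization noise is enough to flip the argmax and send the two generations down different paths.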
Current Llama2.jl results: (omitted)

Current llama.cpp results: (omitted)
We need an efficient way to find out what causes the differences between the two.
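One way to localize the cause (a sketch; how the token streams are obtained is left to each implementation) is to compare the two greedy generations and find the first position where they disagree:

```julia
# Return the first index where two token streams diverge, or `nothing`
# if one is a prefix of the other (compare up to the shorter length).
function first_divergence(a::Vector{Int}, b::Vector{Int})
    for i in 1:min(length(a), length(b))
        a[i] != b[i] && return i
    end
    return nothing
end
```

Everything before that index agreed, so the numerical discrepancy enters at that single step; dumping and diffing both implementations' logits at exactly that position narrows the search to one forward pass.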