Reproduce temp=0 llama.cpp results with some consistency (#28)
Could something like google/gemma.cpp#23 be happening here? Basically, the way quantization is implemented seems to result in lower performance on some architectures.
Exactly. It seems that quantizing the hidden state to q8_0 is not a good idea (see ggerganov/llama.cpp#4755; it is unfortunate that the bot closed that issue).
FWIW we're (gemma.cpp) actually using fp32.
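For intuition, here is a minimal sketch of the q8_0 round trip (not Llama2.jl's actual code; block size 32 and round-to-nearest mirror llama.cpp's q8_0, though the real format stores the scale as fp16 rather than Float32). It shows the error that quantizing the hidden state introduces at every layer:

```julia
# q8_0-style round trip: blocks of 32 values, one scale per block, Int8 quants.
function q8_0_roundtrip(x::Vector{Float32})
    @assert length(x) % 32 == 0
    y = similar(x)
    for block in Iterators.partition(eachindex(x), 32)
        d = maximum(abs, view(x, block)) / 127f0   # per-block scale
        d = ifelse(d == 0f0, 1f0, d)               # avoid 0/0 on all-zero blocks
        for i in block
            q = round(Int8, clamp(x[i] / d, -127f0, 127f0))  # 8-bit quant
            y[i] = q * d                                     # dequantize
        end
    end
    return y
end

x = randn(Float32, 4096)                       # hidden state of a typical width
println(maximum(abs, x .- q8_0_roundtrip(x)))  # nonzero: error accrues per layer
```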
With 42001c5, the zero-temperature behavior now matches the Metal backend of llama.cpp much more closely:

Llama2.jl (at 42001c5): (output omitted)

llama.cpp (at ggerganov/llama.cpp@637e9a8): (output omitted)

This is using the …
This is now fixed with the new vecdot routines: 587d270.
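For readers following along, a hedged sketch of the idea behind such routines (names and layout are illustrative, not the actual code in 587d270): keep the activations in Float32 and accumulate each quantized weight block in Float32, applying the block scale exactly once per block rather than quantizing the hidden state:

```julia
# Illustrative block container: 32 Int8 quants plus one Float32 scale.
struct BlockQ8
    d::Float32
    qs::NTuple{32,Int8}
end

# Dot product of q8_0-style weight blocks with Float32 activations,
# accumulating in Float32 and applying each block's scale once per block.
function vecdot_q8_f32(blocks::Vector{BlockQ8}, x::Vector{Float32})
    @assert 32 * length(blocks) == length(x)
    acc = 0f0
    for (b, blk) in enumerate(blocks)
        s = 0f0
        base = 32 * (b - 1)
        for i in 1:32
            s += blk.qs[i] * x[base + i]  # Int8 quant times Float32 activation
        end
        acc += blk.d * s                  # block scale applied once
    end
    return acc
end
```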
We need to find a way to detect what causes the differences between the two implementations.
The task is to get the same, or at least very similar, results at temp=0. We ran some tests with the new `.gguf` files, since that format has seen such wide adoption.

Llama2.jl test: (omitted)

llama.cpp `.gguf` test:
```
./main -m /Users/lukasmayrhofer/Downloads/llama-2-7b-chat.Q4_K_S.gguf --samplers "temp" --temp 0 -p "Tim was happy."
```
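Note that at temp=0 sampling degenerates to greedy argmax decoding, so both implementations are fully deterministic and their outputs are directly comparable token by token. A one-line sketch of that rule:

```julia
# At temp = 0 there is no randomness: the next token is simply the argmax.
sample_temp0(logits::Vector{Float32}) = argmax(logits)
```

This is also why tiny numerical differences matter here: when the top two logits are nearly tied, an error on the order of the quantization noise is enough to flip the argmax and send the two generations down different paths.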
Current Llama2.jl results: (omitted)

Current llama.cpp results: (omitted)
We need an efficient way to find out what causes the differences between the two.
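One way to localize the cause (a sketch; how the token streams are obtained is left to each implementation) is to compare the two greedy generations and find the first position where they disagree:

```julia
# Return the first index where two token streams diverge, or `nothing`
# if one is a prefix of the other (compare up to the shorter length).
function first_divergence(a::Vector{Int}, b::Vector{Int})
    for i in 1:min(length(a), length(b))
        a[i] != b[i] && return i
    end
    return nothing
end
```

Everything before that index agreed, so the numerical discrepancy enters at that single step; dumping and diffing both implementations' logits at exactly that position narrows the search to one forward pass.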