Several GPU models behave erratically compared to CPU execution #12374

Open
pepijndevos opened this issue Nov 10, 2024 · 21 comments

@pepijndevos
Contributor

pepijndevos commented Nov 10, 2024

Here is a trace from my Intel Arc A770 via Docker:

$ ollama run  deepseek-coder-v2
>>> write fizzbuzz
"""""""""""""""""""""""""""""""

And here is a trace from Arch Linux running on the CPU:

$ ollama run  deepseek-coder-v2 
>>> write fizzbuzz
 Certainly! FizzBuzz is a classic programming task, often used in job interviews to test basic understanding of loops and conditionals. The task goes like this:

1. Print numbers from 1 to 100.
2. For multiples of 3, print "Fizz".
3. For multiples of 5, print "Buzz".
4. For multiples of both 3 and 5 (i.e., multiples of 15), print "FizzBuzz".

Here's a simple implementation in Python:

for i in range(1, 101):
    if i % 15 == 0:
        print("FizzBuzz")
    elif i % 3 == 0:
        print("Fizz")
    elif i % 5 == 0:
        print("Buzz")
    else:
        print(i)

This code will output the numbers from 1 to 100, replacing multiples of 3 with "Fizz", multiples of 5 with "Buzz", and multiples of both 3 and 5 with "FizzBuzz".

For Docker I'm using https://github.com/mattcurf/ollama-intel-gpu due to #12372

ollama logs:

ollama-intel-gpu  | time=2024-11-10T20:25:23.772Z level=INFO source=server.go:395 msg="starting llama server" cmd="/tmp/ollama3494697786/runners/cpu_avx2/ollama_llama_server --model /root/.ollama/models/blobs/sha256-5ff0abeeac1d2dbdd5455c0b49ba3b29a9ce3c1fb181b2eef2e948689d55d046 --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 999 --no-mmap --parallel 4 --port 40951"
ollama-intel-gpu  | time=2024-11-10T20:25:23.773Z level=INFO source=sched.go:450 msg="loaded runners" count=1
ollama-intel-gpu  | time=2024-11-10T20:25:23.773Z level=INFO source=server.go:595 msg="waiting for llama runner to start responding"
ollama-intel-gpu  | time=2024-11-10T20:25:23.773Z level=INFO source=server.go:629 msg="waiting for server to become available" status="llm server error"
ollama-intel-gpu  | INFO [main] build info | build=1 commit="6cbbf2a" tid="139094668663808" timestamp=1731270323
ollama-intel-gpu  | INFO [main] system info | n_threads=16 n_threads_batch=-1 system_info="AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="139094668663808" timestamp=1731270323 total_threads=32
ollama-intel-gpu  | INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="31" port="40951" tid="139094668663808" timestamp=1731270323
ollama-intel-gpu  | llama_model_loader: loaded meta data with 38 key-value pairs and 377 tensors from /root/.ollama/models/blobs/sha256-5ff0abeeac1d2dbdd5455c0b49ba3b29a9ce3c1fb181b2eef2e948689d55d046 (version GGUF V3 (latest))
ollama-intel-gpu  | llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
ollama-intel-gpu  | llama_model_loader: - kv   0:                       general.architecture str              = deepseek2
ollama-intel-gpu  | llama_model_loader: - kv   1:                               general.name str              = DeepSeek-Coder-V2-Lite-Instruct
ollama-intel-gpu  | llama_model_loader: - kv   2:                      deepseek2.block_count u32              = 27
ollama-intel-gpu  | llama_model_loader: - kv   3:                   deepseek2.context_length u32              = 163840
ollama-intel-gpu  | llama_model_loader: - kv   4:                 deepseek2.embedding_length u32              = 2048
ollama-intel-gpu  | llama_model_loader: - kv   5:              deepseek2.feed_forward_length u32              = 10944
ollama-intel-gpu  | llama_model_loader: - kv   6:             deepseek2.attention.head_count u32              = 16
ollama-intel-gpu  | llama_model_loader: - kv   7:          deepseek2.attention.head_count_kv u32              = 16
ollama-intel-gpu  | llama_model_loader: - kv   8:                   deepseek2.rope.freq_base f32              = 10000.000000
ollama-intel-gpu  | llama_model_loader: - kv   9: deepseek2.attention.layer_norm_rms_epsilon f32              = 0.000001
ollama-intel-gpu  | llama_model_loader: - kv  10:                deepseek2.expert_used_count u32              = 6
ollama-intel-gpu  | llama_model_loader: - kv  11:                          general.file_type u32              = 2
ollama-intel-gpu  | llama_model_loader: - kv  12:        deepseek2.leading_dense_block_count u32              = 1
ollama-intel-gpu  | llama_model_loader: - kv  13:                       deepseek2.vocab_size u32              = 102400
ollama-intel-gpu  | llama_model_loader: - kv  14:           deepseek2.attention.kv_lora_rank u32              = 512
ollama-intel-gpu  | llama_model_loader: - kv  15:             deepseek2.attention.key_length u32              = 192
ollama-intel-gpu  | llama_model_loader: - kv  16:           deepseek2.attention.value_length u32              = 128
ollama-intel-gpu  | llama_model_loader: - kv  17:       deepseek2.expert_feed_forward_length u32              = 1408
ollama-intel-gpu  | llama_model_loader: - kv  18:                     deepseek2.expert_count u32              = 64
ollama-intel-gpu  | llama_model_loader: - kv  19:              deepseek2.expert_shared_count u32              = 2
ollama-intel-gpu  | llama_model_loader: - kv  20:             deepseek2.expert_weights_scale f32              = 1.000000
ollama-intel-gpu  | llama_model_loader: - kv  21:             deepseek2.rope.dimension_count u32              = 64
ollama-intel-gpu  | llama_model_loader: - kv  22:                deepseek2.rope.scaling.type str              = yarn
ollama-intel-gpu  | llama_model_loader: - kv  23:              deepseek2.rope.scaling.factor f32              = 40.000000
ollama-intel-gpu  | llama_model_loader: - kv  24: deepseek2.rope.scaling.original_context_length u32              = 4096
ollama-intel-gpu  | llama_model_loader: - kv  25: deepseek2.rope.scaling.yarn_log_multiplier f32              = 0.070700
ollama-intel-gpu  | llama_model_loader: - kv  26:                       tokenizer.ggml.model str              = gpt2
ollama-intel-gpu  | llama_model_loader: - kv  27:                         tokenizer.ggml.pre str              = deepseek-llm
ollama-intel-gpu  | llama_model_loader: - kv  28:                      tokenizer.ggml.tokens arr[str,102400]  = ["!", "\"", "#", "$", "%", "&", "'", ...
ollama-intel-gpu  | llama_model_loader: - kv  29:                  tokenizer.ggml.token_type arr[i32,102400]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
ollama-intel-gpu  | llama_model_loader: - kv  30:                      tokenizer.ggml.merges arr[str,99757]   = ["Ġ Ġ", "Ġ t", "Ġ a", "i n", "h e...
ollama-intel-gpu  | llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 100000
ollama-intel-gpu  | llama_model_loader: - kv  32:                tokenizer.ggml.eos_token_id u32              = 100001
ollama-intel-gpu  | llama_model_loader: - kv  33:            tokenizer.ggml.padding_token_id u32              = 100001
ollama-intel-gpu  | llama_model_loader: - kv  34:               tokenizer.ggml.add_bos_token bool             = true
ollama-intel-gpu  | llama_model_loader: - kv  35:               tokenizer.ggml.add_eos_token bool             = false
ollama-intel-gpu  | llama_model_loader: - kv  36:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
ollama-intel-gpu  | llama_model_loader: - kv  37:               general.quantization_version u32              = 2
ollama-intel-gpu  | llama_model_loader: - type  f32:  108 tensors
ollama-intel-gpu  | llama_model_loader: - type q4_0:  268 tensors
ollama-intel-gpu  | llama_model_loader: - type q6_K:    1 tensors
ollama-intel-gpu  | llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
ollama-intel-gpu  | llm_load_vocab: special tokens cache size = 2400
ollama-intel-gpu  | llm_load_vocab: token to piece cache size = 0.6661 MB
ollama-intel-gpu  | llm_load_print_meta: format           = GGUF V3 (latest)
ollama-intel-gpu  | llm_load_print_meta: arch             = deepseek2
ollama-intel-gpu  | llm_load_print_meta: vocab type       = BPE
ollama-intel-gpu  | llm_load_print_meta: n_vocab          = 102400
ollama-intel-gpu  | llm_load_print_meta: n_merges         = 99757
ollama-intel-gpu  | llm_load_print_meta: vocab_only       = 0
ollama-intel-gpu  | llm_load_print_meta: n_ctx_train      = 163840
ollama-intel-gpu  | llm_load_print_meta: n_embd           = 2048
ollama-intel-gpu  | llm_load_print_meta: n_layer          = 27
ollama-intel-gpu  | llm_load_print_meta: n_head           = 16
ollama-intel-gpu  | llm_load_print_meta: n_head_kv        = 16
ollama-intel-gpu  | llm_load_print_meta: n_rot            = 64
ollama-intel-gpu  | llm_load_print_meta: n_swa            = 0
ollama-intel-gpu  | llm_load_print_meta: n_embd_head_k    = 192
ollama-intel-gpu  | llm_load_print_meta: n_embd_head_v    = 128
ollama-intel-gpu  | llm_load_print_meta: n_gqa            = 1
ollama-intel-gpu  | llm_load_print_meta: n_embd_k_gqa     = 3072
ollama-intel-gpu  | llm_load_print_meta: n_embd_v_gqa     = 2048
ollama-intel-gpu  | llm_load_print_meta: f_norm_eps       = 0.0e+00
ollama-intel-gpu  | llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
ollama-intel-gpu  | llm_load_print_meta: f_clamp_kqv      = 0.0e+00
ollama-intel-gpu  | llm_load_print_meta: f_max_alibi_bias = 0.0e+00
ollama-intel-gpu  | llm_load_print_meta: f_logit_scale    = 0.0e+00
ollama-intel-gpu  | llm_load_print_meta: n_ff             = 10944
ollama-intel-gpu  | llm_load_print_meta: n_expert         = 64
ollama-intel-gpu  | llm_load_print_meta: n_expert_used    = 6
ollama-intel-gpu  | llm_load_print_meta: causal attn      = 1
ollama-intel-gpu  | llm_load_print_meta: pooling type     = 0
ollama-intel-gpu  | llm_load_print_meta: rope type        = 0
ollama-intel-gpu  | llm_load_print_meta: rope scaling     = yarn
ollama-intel-gpu  | llm_load_print_meta: freq_base_train  = 10000.0
ollama-intel-gpu  | llm_load_print_meta: freq_scale_train = 0.025
ollama-intel-gpu  | llm_load_print_meta: n_ctx_orig_yarn  = 4096
ollama-intel-gpu  | llm_load_print_meta: rope_finetuned   = unknown
ollama-intel-gpu  | llm_load_print_meta: ssm_d_conv       = 0
ollama-intel-gpu  | llm_load_print_meta: ssm_d_inner      = 0
ollama-intel-gpu  | llm_load_print_meta: ssm_d_state      = 0
ollama-intel-gpu  | llm_load_print_meta: ssm_dt_rank      = 0
ollama-intel-gpu  | llm_load_print_meta: ssm_dt_b_c_rms   = 0
ollama-intel-gpu  | llm_load_print_meta: model type       = 16B
ollama-intel-gpu  | llm_load_print_meta: model ftype      = Q4_0
ollama-intel-gpu  | llm_load_print_meta: model params     = 15.71 B
ollama-intel-gpu  | llm_load_print_meta: model size       = 8.29 GiB (4.53 BPW) 
ollama-intel-gpu  | llm_load_print_meta: general.name     = DeepSeek-Coder-V2-Lite-Instruct
ollama-intel-gpu  | llm_load_print_meta: BOS token        = 100000 '<|begin▁of▁sentence|>'
ollama-intel-gpu  | llm_load_print_meta: EOS token        = 100001 '<|end▁of▁sentence|>'
ollama-intel-gpu  | llm_load_print_meta: PAD token        = 100001 '<|end▁of▁sentence|>'
ollama-intel-gpu  | llm_load_print_meta: LF token         = 126 'Ä'
ollama-intel-gpu  | llm_load_print_meta: EOG token        = 100001 '<|end▁of▁sentence|>'
ollama-intel-gpu  | llm_load_print_meta: max token length = 256
ollama-intel-gpu  | llm_load_print_meta: n_layer_dense_lead   = 1
ollama-intel-gpu  | llm_load_print_meta: n_lora_q             = 0
ollama-intel-gpu  | llm_load_print_meta: n_lora_kv            = 512
ollama-intel-gpu  | llm_load_print_meta: n_ff_exp             = 1408
ollama-intel-gpu  | llm_load_print_meta: n_expert_shared      = 2
ollama-intel-gpu  | llm_load_print_meta: expert_weights_scale = 1.0
ollama-intel-gpu  | llm_load_print_meta: rope_yarn_log_mul    = 0.0707
ollama-intel-gpu  | time=2024-11-10T20:25:24.024Z level=INFO source=server.go:629 msg="waiting for server to become available" status="llm server loading model"
ollama-intel-gpu  | ggml_sycl_init: GGML_SYCL_FORCE_MMQ:   no
ollama-intel-gpu  | ggml_sycl_init: SYCL_USE_XMX: yes
ollama-intel-gpu  | ggml_sycl_init: found 1 SYCL devices:
ollama-intel-gpu  | llm_load_tensors: ggml ctx size =    0.32 MiB
ollama-intel-gpu  | llm_load_tensors: offloading 27 repeating layers to GPU
ollama-intel-gpu  | llm_load_tensors: offloading non-repeating layers to GPU
ollama-intel-gpu  | llm_load_tensors: offloaded 28/28 layers to GPU
ollama-intel-gpu  | llm_load_tensors:      SYCL0 buffer size =  8376.27 MiB
ollama-intel-gpu  | llm_load_tensors:  SYCL_Host buffer size =   112.50 MiB
ollama-intel-gpu  | llama_new_context_with_model: n_ctx      = 8192
ollama-intel-gpu  | llama_new_context_with_model: n_batch    = 512
ollama-intel-gpu  | llama_new_context_with_model: n_ubatch   = 512
ollama-intel-gpu  | llama_new_context_with_model: flash_attn = 0
ollama-intel-gpu  | llama_new_context_with_model: freq_base  = 10000.0
ollama-intel-gpu  | llama_new_context_with_model: freq_scale = 0.025
ollama-intel-gpu  | [SYCL] call ggml_check_sycl
ollama-intel-gpu  | ggml_check_sycl: GGML_SYCL_DEBUG: 0
ollama-intel-gpu  | ggml_check_sycl: GGML_SYCL_F16: no
ollama-intel-gpu  | found 1 SYCL devices:
ollama-intel-gpu  | |  |                   |                                       |       |Max    |        |Max  |Global |                     |
ollama-intel-gpu  | |  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
ollama-intel-gpu  | |ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
ollama-intel-gpu  | |--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
ollama-intel-gpu  | | 0| [level_zero:gpu:0]|                Intel Arc A770 Graphics|    1.6|    512|    1024|   32| 16225M|            1.3.31294|
ollama-intel-gpu  | llama_kv_cache_init:      SYCL0 KV buffer size =  2160.00 MiB
ollama-intel-gpu  | llama_new_context_with_model: KV self size  = 2160.00 MiB, K (f16): 1296.00 MiB, V (f16):  864.00 MiB
ollama-intel-gpu  | llama_new_context_with_model:  SYCL_Host  output buffer size =     1.59 MiB
ollama-intel-gpu  | llama_new_context_with_model:      SYCL0 compute buffer size =   339.13 MiB
ollama-intel-gpu  | llama_new_context_with_model:  SYCL_Host compute buffer size =    38.01 MiB
ollama-intel-gpu  | llama_new_context_with_model: graph nodes  = 1951
ollama-intel-gpu  | llama_new_context_with_model: graph splits = 110
ollama-intel-gpu  | [1731270330] warming up the model with an empty run
ollama-intel-gpu  | INFO [main] model loaded | tid="139094668663808" timestamp=1731270335
ollama-intel-gpu  | time=2024-11-10T20:25:35.563Z level=INFO source=server.go:634 msg="llama runner started in 11.79 seconds"
ollama-intel-gpu  | [GIN] 2024/11/10 - 20:25:35 | 200 |  11.81337377s |       127.0.0.1 | POST     "/api/chat"
ollama-intel-gpu  | [GIN] 2024/11/10 - 20:25:46 | 200 |       22.86µs |       127.0.0.1 | HEAD     "/"
ollama-intel-gpu  | [GIN] 2024/11/10 - 20:25:46 | 200 |    6.807262ms |       127.0.0.1 | POST     "/api/show"
ollama-intel-gpu  | [GIN] 2024/11/10 - 20:25:46 | 200 |    6.526006ms |       127.0.0.1 | POST     "/api/chat"
ollama-intel-gpu  | check_double_bos_eos: Added a BOS token to the prompt as specified by the model but the prompt also starts with a BOS token. So now the final prompt starts with 2 BOS tokens. Are you sure this is what you want?
ollama-intel-gpu  | [GIN] 2024/11/10 - 20:25:59 | 200 |  9.400866991s |       127.0.0.1 | POST     "/api/chat"
@sgwhat
Contributor

sgwhat commented Nov 11, 2024

Hi @pepijndevos, we have reproduced your issue and are working on finding a solution. We will inform you ASAP.

@pepijndevos
Contributor Author

I ran into similar but less obvious problems where qwen2.5-coder:14b will just get stuck in repeating patterns or suddenly start talking about something completely different, while running on CPU reliably produces sensible results.

| Q      | Output| Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data

@rynprrk

rynprrk commented Nov 25, 2024

> I ran into similar but less obvious problems where qwen2.5-coder:14b will just get stuck in repeating patterns or suddenly start talking about something completely different, while running on CPU reliably produces sensible results.
>
> | Q      | Output| Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data

I was able to reproduce the issue. I have a burning suspicion that this has to do with the way memory is being shared. I am running an Arc A750 with the iGPU disabled. Since the card only has 8GB of GDDR6, I can realistically only load one 8B-parameter model reliably. When loading multiple models (where total memory > 8GB) I see similar behavior.

My speculation is that something is going wrong when accessing models that share GPU and system memory.
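
One way to narrow this down (just a sketch, not an official repro script; the model name, prompt, and seed here are placeholders) is to send the exact same request through the Ollama API twice, once fully offloaded and once forced onto the CPU, with sampling pinned so the two runs are comparable:

```
# all layers offloaded to the GPU
curl http://localhost:11434/api/generate -d '{
  "model": "qwen2.5-coder:14b",
  "prompt": "write fizzbuzz",
  "stream": false,
  "options": {"temperature": 0, "seed": 42, "num_gpu": 999}
}'

# same request forced onto the CPU (num_gpu 0 = no offloaded layers)
curl http://localhost:11434/api/generate -d '{
  "model": "qwen2.5-coder:14b",
  "prompt": "write fizzbuzz",
  "stream": false,
  "options": {"temperature": 0, "seed": 42, "num_gpu": 0}
}'
```

If the GPU run degrades while the CPU run stays coherent under identical options, that points at the backend rather than at sampling or the prompt.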

@qiuxin2012
Contributor

> I ran into similar but less obvious problems where qwen2.5-coder:14b will just get stuck in repeating patterns or suddenly start talking about something completely different, while running on CPU reliably produces sensible results.
>
> | Q      | Output| Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data

We may have fixed this two weeks ago; could you update your ipex-llm and try again?

@pepijndevos
Contributor Author

I don't have an easily reproducible test case for the repeating patterns, but the original issue persists:

deepseek-coder-v2:latest
""""""""""""""""""""""""""""""" 

This is after a docker pull, and after running `pip install --pre --upgrade ipex-llm[cpp]` inside the image as well.

@rynprrk

rynprrk commented Nov 26, 2024

Here are a couple of examples of wrong output I was able to produce. I ran these queries 'one after the other'.

Access ollama instance:
C:\Users\sandbox\watsonx-code-assistant\ollama>ollama run granite-code:8b --verbose

OK Output #1

>>> Tell me a story
Once upon a time, in a far-off land, there was a young prince named Prince Pilgrim. He was a
kind and generous man who loved to help others. One day, he received a summons from the king of
the neighboring kingdoms. The king had heard that Prince Pilgrim was on a quest for treasure and
asked for his help in finding it.

Prince Pilgrim accepted the king's invitation and set out on his journey, traveling through the
land and meeting many new friends along the way. He encountered all sorts of challenges and
obstacles, but he never lost sight of his goal. Finally, after several days of travel, he came
upon a hidden treasure in a remote cave.

The king was overjoyed to hear that Prince Pilgrim had found the treasure and presented him with
a beautiful golden chest filled with jewels and other valuable items. Prince Pilgrim was
grateful for the kindness of the king and the blessings of his friends, and he returned home to
share his wealth with the people of his kingdom.

From that day on, Prince Pilgrim continued to live a life of generosity and kindness, spreading
the love of money and treasure wherever he went. He became known as a wise and just ruler, and
his kingdom was celebrated for its prosperity and good governance for centuries to come.


total duration:       5.555094s
load duration:        10.1805ms
prompt eval count:    12 token(s)
prompt eval duration: 167.7ms
prompt eval rate:     71.56 tokens/s
eval count:           324 token(s)
eval duration:        5.376107s
eval rate:            60.27 tokens/s

OK Output #2

>>> write an example C++ program that sorts an array of 1000 numbers ranging between 0 and 9999 using
... mergesort algorithm. Explain what each function does
Here is a sample C++ program that implements the Merge Sort algorithm to sort an array of 1000
numbers between 0 and 9999:

```c++
#include <iostream>
#include <vector>
using namespace std;

void merge(vector<int>& arr, int left, int mid, int right) {
    // Merge two sorted subarrays arr[left...mid] and arr[mid+1...right] into a sorted array
arr[left...right]
    vector<int> temp(arr.begin() + left, arr.begin() + right + 1);
    int i = 0;
    int j = mid - left + 1;
    int k = left;

    while (i < j && k <= right) {
        if (temp[i] < temp[j]) {
            arr[k] = temp[i];
            i++;
        } else {
            arr[k] = temp[j];
            j++;
        }
        k++;
    }

    // Copy remaining elements from left subarray
    while (i < j) {
        arr[k] = temp[i];
        i++;
        k++;
    }
}

void mergeSort(vector<int>& arr, int left, int right) {
    if (left < right) {
        int mid = left + (right - left) / 2;

        // Sort first and second halves
        mergeSort(arr, left, mid);
        mergeSort(arr, mid + 1, right);

        // Merge two sorted subarrays into a single sorted array
        merge(arr, left, mid, right);
    }
}

int main() {
    vector<int> arr(1000);

    for (int i = 0; i < 1000; i++) {
        arr[i] = rand() % 10000; // Fill the array with random numbers between 0 and 9999
    }

    cout << "Unsorted array: ";
    for (int num : arr) {
        cout << num << " ";
    }
    cout << endl;

    mergeSort(arr, 0, 999);

    cout << "Sorted array: ";
    for (int num : arr) {
        cout << num << " ";
    }
    cout << endl;

    return 0;
}
```

The program starts by defining two functions: `merge` and `mergeSort`. The `merge` function
takes four arguments: a reference to the array to be sorted (`arr`), the left and right indices
of the subarray being merged. It merges two sorted subarrays into a single sorted array using
the Merge Sort algorithm.

The `mergeSort` function also takes four arguments: a reference to the array to be sorted
(`arr`), the left and right indices of the subarray being sorted. It recursively splits the
subarray into smaller halves until each subarray contains only one element, and then merges the
sorted subarrays back together using the `merge` function.

In the `main` function, a vector `arr` is initialized with 1000 elements, each of which is
assigned a random number between 0 and 9999. The array is then printed to the console before
being sorted using the `mergeSort` function. Finally, the sorted array is printed to the console
again.


total duration:       12.8139359s
load duration:        9.9956ms
prompt eval count:    383 token(s)
prompt eval duration: 228.527ms
prompt eval rate:     1675.95 tokens/s
eval count:           729 token(s)
eval duration:        12.5698s
eval rate:            58.00 tokens/s

Issue starts here

>>> write an example C++ program that sorts an array of 1000 numbers ranging between 0 and 9999 using
... quick sort algorithm. Explain what each function does
Here is a sample C++ program that implements the Quick Sort algorithm to sort an array of 1000
numbers between 0 and 9999:

```c++
#include <iostream>
#include <vector>
using namespace std;

int partition(vector<int>& arr, int left, int right) {
    // Select a pivot element from the array
    int pivot = arr[right];

    // Partition the array into two subarrays: elements less than or equal to the pivot and
elements greater than than than than than than than than than than than than than than than than
than than than than than than than than than than than than than than


total duration:       4.0945746s
load duration:        11.7049ms
prompt eval count:    1791 token(s)
prompt eval duration: 1.423731s
prompt eval rate:     1257.96 tokens/s
eval count:           144 token(s)
eval duration:        2.629295s
eval rate:            54.77 tokens/s
>>> write an example C++ program that sorts an array of 1000 numbers ranging between 0 and 9999 using
... heap sort algorithm. Explain what each function does
Here is a sample C++ program that implements the Heap Sort algorithm to sort an array of 1000
numbers between 0 and 9999:

```c++
#include <iostream>
#include <vector>
using namespace std;

void heapify(vector<int>&>>>.>..<<<..>>>.++.>,.[,.[]<-.",
    "v(">><,>[",
    "v([<<]+>>+)+<.>",
    ".<<<.>.>>>.<<<<<.",
    "<.<.>>>.<<<'<-'>>>-.",
    "v(.<<<..>>>.<<<'<-'>>>-)",
    "v(.<<.>>>.<<<'<-'>>>-",
    "v(<.<<<..>>>.<<<'<-'>>>-",
    "v(<<.<<<..>>>.<<<'<-'>>>-",
    "v(.<<<..>>>.<<<'<-'>>>-"
  ],
  "trim": true
}


total duration:       3.827916s
load duration:        11.8317ms
prompt eval count:    1982 token(s)
prompt eval duration: 157.058ms
prompt eval rate:     12619.54 tokens/s
eval count:           210 token(s)
eval duration:        3.627043s
eval rate:            57.90 tokens/s

Here's my ollama log:

time=2024-11-26T09:48:11.201-07:00 level=INFO source=memory.go:309 msg="offload to cpu" layers.requested=-1 layers.model=37 layers.offload=0 layers.split="" memory.available="[22.8 GiB]" memory.required.full="6.1 GiB" memory.required.partial="0 B" memory.required.kv="1.1 GiB" memory.required.allocations="[6.1 GiB]" memory.weights.total="5.2 GiB" memory.weights.repeating="5.1 GiB" memory.weights.nonrepeating="157.5 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="585.0 MiB"
time=2024-11-26T09:48:11.203-07:00 level=INFO source=server.go:395 msg="starting llama server" cmd="C:\\Users\\sandbox\\watsonx-code-assistant\\ollama\\dist\\windows-amd64\\lib\\ollama\\runners\\cpu_avx2\\ollama_llama_server.exe --model C:\\Users\\sandbox\\.ollama\\models\\blobs\\sha256-bf481f838ba0e13524bde2f44cfd57a0eefb1c422da885fb26ca6cf12bea11fa --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 999 --no-mmap --parallel 4 --port 54688"
time=2024-11-26T09:48:11.221-07:00 level=INFO source=sched.go:450 msg="loaded runners" count=1
time=2024-11-26T09:48:11.234-07:00 level=INFO source=server.go:595 msg="waiting for llama runner to start responding"
time=2024-11-26T09:48:11.234-07:00 level=INFO source=server.go:629 msg="waiting for server to become available" status="llm server error"
INFO [wmain] build info | build=1 commit="487f188" tid="8496" timestamp=1732639691
INFO [wmain] system info | n_threads=16 n_threads_batch=-1 system_info="AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="8496" timestamp=1732639691 total_threads=24
INFO [wmain] HTTP server listening | hostname="127.0.0.1" n_threads_http="23" port="54688" tid="8496" timestamp=1732639691
llama_model_loader: loaded meta data with 33 key-value pairs and 578 tensors from C:\Users\sandbox\.ollama\models\blobs\sha256-bf481f838ba0e13524bde2f44cfd57a0eefb1c422da885fb26ca6cf12bea11fa (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Granite 8b Code Instruct 128k
llama_model_loader: - kv   3:                           general.finetune str              = code-instruct-128k
llama_model_loader: - kv   4:                           general.basename str              = granite
llama_model_loader: - kv   5:                         general.size_label str              = 8B
llama_model_loader: - kv   6:                            general.license str              = apache-2.0
llama_model_loader: - kv   7:                               general.tags arr[str,3]       = ["code", "granite", "text-generation"]
llama_model_loader: - kv   8:                           general.datasets arr[str,9]       = ["bigcode/commitpackft", "TIGER-Lab/M...
llama_model_loader: - kv   9:                          llama.block_count u32              = 36
llama_model_loader: - kv  10:                       llama.context_length u32              = 128000
llama_model_loader: - kv  11:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 10000000.000000
llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  17:                          general.file_type u32              = 2
llama_model_loader: - kv  18:                           llama.vocab_size u32              = 49152
llama_model_loader: - kv  19:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  20:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  21:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  22:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  23:                         tokenizer.ggml.pre str              = refact
llama_model_loader: - kv  24:                      tokenizer.ggml.tokens arr[str,49152]   = ["<|endoftext|>", "<fim_prefix>", "<f...
llama_model_loader: - kv  25:                  tokenizer.ggml.token_type arr[i32,49152]   = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  26:                      tokenizer.ggml.merges arr[str,48891]   = ["Ġ Ġ", "ĠĠ ĠĠ", "ĠĠĠĠ ĠĠ...
llama_model_loader: - kv  27:                tokenizer.ggml.bos_token_id u32              = 0
llama_model_loader: - kv  28:                tokenizer.ggml.eos_token_id u32              = 0
llama_model_loader: - kv  29:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  30:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  31:                    tokenizer.chat_template str              = {% for message in messages %}\n{% if m...
llama_model_loader: - kv  32:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  325 tensors
llama_model_loader: - type q4_0:  252 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens cache size = 19
llm_load_vocab: token to piece cache size = 0.2826 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 49152
llm_load_print_meta: n_merges         = 48891
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 128000
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 36
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 128000
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 8.05 B
llm_load_print_meta: model size       = 4.27 GiB (4.56 BPW)
llm_load_print_meta: general.name     = Granite 8b Code Instruct 128k
llm_load_print_meta: BOS token        = 0 '<|endoftext|>'
llm_load_print_meta: EOS token        = 0 '<|endoftext|>'
llm_load_print_meta: UNK token        = 0 '<|endoftext|>'
llm_load_print_meta: PAD token        = 0 '<|endoftext|>'
llm_load_print_meta: LF token         = 145 'Ä'
llm_load_print_meta: EOT token        = 0 '<|endoftext|>'
llm_load_print_meta: EOG token        = 0 '<|endoftext|>'
llm_load_print_meta: max token length = 512
time=2024-11-26T09:48:11.487-07:00 level=INFO source=server.go:629 msg="waiting for server to become available" status="llm server loading model"
ggml_sycl_init: GGML_SYCL_FORCE_MMQ:   no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 1 SYCL devices:
llm_load_tensors: ggml ctx size =    0.48 MiB
llm_load_tensors: offloading 36 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 37/37 layers to GPU
llm_load_tensors:      SYCL0 buffer size =  4376.55 MiB
llm_load_tensors:  SYCL_Host buffer size =   157.50 MiB
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000000.0
llama_new_context_with_model: freq_scale = 1
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: no
found 1 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                Intel Arc A750 Graphics|    1.6|    448|    1024|   32|  8319M|            1.3.31441|
llama_kv_cache_init:      SYCL0 KV buffer size =  1152.00 MiB
llama_new_context_with_model: KV self size  = 1152.00 MiB, K (f16):  576.00 MiB, V (f16):  576.00 MiB
llama_new_context_with_model:  SYCL_Host  output buffer size =     0.81 MiB
llama_new_context_with_model:      SYCL0 compute buffer size =   104.00 MiB
llama_new_context_with_model:  SYCL_Host compute buffer size =    24.01 MiB
llama_new_context_with_model: graph nodes  = 1266
llama_new_context_with_model: graph splits = 2
INFO [wmain] model loaded | tid="8496" timestamp=1732639699
time=2024-11-26T09:48:19.573-07:00 level=INFO source=server.go:634 msg="llama runner started in 8.34 seconds"
[GIN] 2024/11/26 - 09:48:19 | 200 |    8.3843338s |       127.0.0.1 | POST     "/api/chat"
INFO [print_timings] prompt eval time     =     167.70 ms /    12 tokens (   13.98 ms per token,    71.56 tokens per second) | n_prompt_tokens_processed=12 n_tokens_second=71.55635062611806 slot_id=0 t_prompt_processing=167.70000000000002 t_token=13.975000000000001 task_id=3 tid="8496" timestamp=1732639711
INFO [print_timings] generation eval time =    5376.11 ms /   324 runs   (   16.59 ms per token,    60.27 tokens per second) | n_decoded=324 n_tokens_second=60.26665763906857 slot_id=0 t_token=16.592922839506173 t_token_generation=5376.107 task_id=3 tid="8496" timestamp=1732639711
INFO [print_timings]           total time =    5543.81 ms | slot_id=0 t_prompt_processing=167.70000000000002 t_token_generation=5376.107 t_total=5543.807 task_id=3 tid="8496" timestamp=1732639711
[GIN] 2024/11/26 - 09:48:31 | 200 |    5.5556511s |       127.0.0.1 | POST     "/api/chat"
INFO [print_timings] prompt eval time     =     228.53 ms /   383 tokens (    0.60 ms per token,  1675.95 tokens per second) | n_prompt_tokens_processed=383 n_tokens_second=1675.9507629295442 slot_id=0 t_prompt_processing=228.52700000000002 t_token=0.5966762402088773 task_id=334 tid="8496" timestamp=1732639731
INFO [print_timings] generation eval time =   12569.80 ms /   729 runs   (   17.24 ms per token,    58.00 tokens per second) | n_decoded=729 n_tokens_second=57.99614950118538 slot_id=0 t_token=17.24252400548697 t_token_generation=12569.800000000001 task_id=334 tid="8496" timestamp=1732639731
INFO [print_timings]           total time =   12798.33 ms | slot_id=0 t_prompt_processing=228.52700000000002 t_token_generation=12569.800000000001 t_total=12798.327000000001 task_id=334 tid="8496" timestamp=1732639731
[GIN] 2024/11/26 - 09:48:51 | 200 |   12.8139359s |       127.0.0.1 | POST     "/api/chat"
INFO [print_timings] prompt eval time     =     158.12 ms /  1159 tokens (    0.14 ms per token,  7330.02 tokens per second) | n_prompt_tokens_processed=1159 n_tokens_second=7330.0151153892375 slot_id=0 t_prompt_processing=158.117 t_token=0.1364253666954271 task_id=1072 tid="8496" timestamp=1732639747
INFO [print_timings] generation eval time =   13091.11 ms /   729 runs   (   17.96 ms per token,    55.69 tokens per second) | n_decoded=729 n_tokens_second=55.6866240958085 slot_id=0 t_token=17.95763374485597 t_token_generation=13091.115 task_id=1072 tid="8496" timestamp=1732639747
INFO [print_timings]           total time =   13249.23 ms | slot_id=0 t_prompt_processing=158.117 t_token_generation=13091.115 t_total=13249.232 task_id=1072 tid="8496" timestamp=1732639747
[GIN] 2024/11/26 - 09:49:07 | 200 |   13.2744328s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2024/11/26 - 09:49:17 | 200 |    2.7957381s |       127.0.0.1 | POST     "/api/chat"
INFO [print_timings] prompt eval time     =    1423.73 ms /  1791 tokens (    0.79 ms per token,  1257.96 tokens per second) | n_prompt_tokens_processed=1791 n_tokens_second=1257.962353843528 slot_id=0 t_prompt_processing=1423.731 t_token=0.7949363484087102 task_id=1970 tid="8496" timestamp=1732639769
INFO [print_timings] generation eval time =    2629.30 ms /   144 runs   (   18.26 ms per token,    54.77 tokens per second) | n_decoded=144 n_tokens_second=54.76753274166649 slot_id=0 t_token=18.258993055555557 t_token_generation=2629.295 task_id=1970 tid="8496" timestamp=1732639769
INFO [print_timings]           total time =    4053.03 ms | slot_id=0 t_prompt_processing=1423.731 t_token_generation=2629.295 t_total=4053.026 task_id=1970 tid="8496" timestamp=1732639769
[GIN] 2024/11/26 - 09:49:29 | 200 |    4.0945746s |       127.0.0.1 | POST     "/api/chat"
INFO [print_timings] prompt eval time     =     157.06 ms /  1982 tokens (    0.08 ms per token, 12619.54 tokens per second) | n_prompt_tokens_processed=1982 n_tokens_second=12619.541825312943 slot_id=0 t_prompt_processing=157.058 t_token=0.07924217961654893 task_id=2128 tid="8496" timestamp=1732639784
INFO [print_timings] generation eval time =    3627.04 ms /   210 runs   (   17.27 ms per token,    57.90 tokens per second) | n_decoded=210 n_tokens_second=57.898403741008856 slot_id=0 t_token=17.271633333333334 t_token_generation=3627.043 task_id=2128 tid="8496" timestamp=1732639784
INFO [print_timings]           total time =    3784.10 ms | slot_id=0 t_prompt_processing=157.058 t_token_generation=3627.043 t_total=3784.101 task_id=2128 tid="8496" timestamp=1732639784
[GIN] 2024/11/26 - 09:49:44 | 200 |     3.827916s |       127.0.0.1 | POST     "/api/chat"

@qiuxin2012
Contributor

@pepijndevos The endless answers from qwen2.5 are fixed. DeepSeek's wrong answers are caused by RMS norm; we are still working on it.

@leonardozcm
Contributor

@rynprrk In short, you can put `PARAMETER num_ctx 2048` (2048 or more) in your Modelfile to increase the context size; refer to https://github.com/ollama/ollama/blob/main/docs/modelfile.md#instructions

I believe this may have something to do with the prompt having reached the --ctx-size you set (or the default): any new token that comes next may be truncated, which is why your model keeps repeating the same word after a few turns.
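
For example (a minimal sketch; the base model tag and the new model name are only illustrative), a Modelfile that raises the context window, plus the commands to build and run it, could look like:

```
# Modelfile
FROM granite-code:8b
PARAMETER num_ctx 4096
```

```
ollama create granite-code-4k -f Modelfile
ollama run granite-code-4k --verbose
```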

Take llama.cpp as an example:

  1. If you set --ctx-size 1024:
bin/llama-cli -m /mnt/disk1/models/gguf/granite-8b-code-instruct-128k.Q4_0.gguf -f /home/arda/changmin/llama-cpp-bigdl/1k.txt --ctx-size 1024 -t 8 -e --color --temp 0 -ngl 99

It will repeat a word at the end:

The program starts by defining two functions: `merge` and `mergeSort`. The `merge` function takes in a vector of integers, `nums`, and the indices of the left and right indices `left`, `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `left` and `
  2. And if you increase --ctx-size to 2048:

The program starts by defining two functions: `merge` and `mergeSort`. The `merge` function takes in a vector of integers, `nums`, and the indices of the left and right subarrays to be merged. It creates two temporary vectors, `L` and `R`, to hold the subarrays. It then merges the subarrays into a single sorted array.

The `mergeSort` function takes in a vector of integers, `nums`, and the indices of the left and right subarrays to be sorted. It recursively calls itself on the left and right subarrays until the subarrays are sorted. It then calls the `merge` function to merge the sorted subarrays into a single sorted array.

In the `main` function, a vector of 1000 integers is created and filled with random values between 0 and 9999. The `mergeSort` function is then called on the vector, passing in the indices of the left and right subarrays. Finally, the sorted vector is printed to the console.

The mergesort algorithm works by dividing the array into two halves, sorting each half recursively, and then merging the sorted halves back together. The `merge` function takes in two sorted subarrays, `L` and `R`, and merges them into a single sorted array. It does this by comparing the elements in the subarrays and placing them in the correct order in the final sorted array.
 [end of text]

It gets the correct output. If there is a more reasonable approach, I will update here.

@qiuxin2012
Contributor

@pepijndevos We have fixed the wrong answers from the deepseek model; you can update to ipex-llm>=2.2.0b20241202 and try again.
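
If you want to confirm which build you are on before retrying (assuming the pip-based ipex-llm[cpp] install used earlier in this thread), something like:

```
pip show ipex-llm                             # installed version should be >= 2.2.0b20241202
pip install --pre --upgrade "ipex-llm[cpp]"   # pull the latest pre-release build
```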

@mordonez

mordonez commented Dec 5, 2024

Using Ollama with llama models, it is very easy to get erratic responses on the GPU, while the CPU works fine.

Responses like:

¡!!! ¡!!! ¡!!! ¡!!!, ¡!!! ¡!!! ¡!!! ¡!!!, ¡!!! ¡!!! ¡!!! ¡!!!, ¡!!! ¡!!! ¡!!! ¡!!!, ¡!!!¡!!!!!!!

@qiuxin2012
Contributor

> Using Ollama with llama models, it is very easy to get erratic responses on the GPU, while the CPU works fine.
>
> Responses like:
>
> ¡!!! ¡!!! ¡!!! ¡!!!, ¡!!! ¡!!! ¡!!! ¡!!!, ¡!!! ¡!!! ¡!!! ¡!!!, ¡!!! ¡!!! ¡!!! ¡!!!, ¡!!!¡!!!!!!!

Can you give us your GPU info, system info and model name? Then we can try to reproduce your case.

@mordonez

mordonez commented Dec 6, 2024

@qiuxin2012

GEEKOM GT1 with an Intel® Core™ Ultra 7 155H and Intel® Arc™ Graphics, using the Linux i915 driver (dkms module) with SR-IOV: https://github.com/strongtz/i915-sriov-dkms

(llm-cpp) root@ubuntu2404:~/llama-cpp# inxi -F
System:
  Host: ubuntu2404 Kernel: 6.8.0-49-generic arch: x86_64 bits: 64
  Console: pty pts/0 Distro: Ubuntu 24.04.1 LTS (Noble Numbat)
Machine:
  Type: Kvm System: QEMU product: Standard PC (Q35 + ICH9, 2009) v: pc-q35-9.0 serial: N/A
  Mobo: N/A model: N/A serial: N/A UEFI: EFI Development Kit II / OVMF v: 3.20230228-4
    date: 06/06/2023
CPU:
  Info: 8-core model: Intel Core Ultra 7 155H bits: 64 type: MCP cache: L2: 32 MiB
  Speed (MHz): avg: 2995 min/max: N/A cores: 1: 2995 2: 2995 3: 2995 4: 2995 5: 2995 6: 2995
    7: 2995 8: 2995
Graphics:
  Device-1: driver: bochs-drm v: N/A
  Device-2: Intel Meteor Lake-P [Intel Arc Graphics] driver: i915 v: kernel
  Display: server: X.org v: 1.21.1.11 with: Xwayland v: 23.2.6 driver: X: loaded: modesetting
    unloaded: fbdev,vesa gpu: bochs-drm tty: 158x55 resolution: 1280x800
  API: EGL v: 1.5 drivers: iris,swrast platforms: gbm,surfaceless,device
  API: OpenGL v: 4.6 compat-v: 4.5 note: console (EGL sourced) renderer: Mesa Intel Arc
    Graphics (MTL), llvmpipe (LLVM 17.0.6 256 bits)
Audio:
  Device-1: Intel 82801I HD Audio driver: snd_hda_intel
  API: ALSA v: k6.8.0-49-generic status: kernel-api
Network:
  Device-1: Red Hat Virtio network driver: virtio-pci
  IF: eth0 state: up speed: -1 duplex: unknown mac: 02:89:ac:08:74:35
Drives:
  Local Storage: total: 103.5 GiB used: 80.86 GiB (78.1%)
  ID-1: /dev/sda vendor: QEMU model: HARDDISK size: 103.5 GiB
Partition:
  ID-1: / size: 99.21 GiB used: 80.75 GiB (81.4%) fs: ext4 dev: /dev/sda1
  ID-2: /boot size: 880.4 MiB used: 104.1 MiB (11.8%) fs: ext4 dev: /dev/sda16
  ID-3: /boot/efi size: 104.3 MiB used: 6.1 MiB (5.8%) fs: vfat dev: /dev/sda15
Swap:
  Alert: No swap data was found.
Sensors:
  Src: lm-sensors+/sys Message: No sensor data found using /sys/class/hwmon or lm-sensors.
Info:
  Memory: total: 16.62 GiB available: 16.18 GiB used: 735.8 MiB (4.4%)
  Processes: 179 Uptime: 15m Init: systemd target: graphical (5) Shell: Bash inxi: 3.3.34
time=2024-12-06T12:25:41.822Z level=INFO source=routes.go:1172 msg="Listening on [::]:11434 (version 0.3.6-ipexllm-20241204)"
time=2024-12-06T12:25:41.823Z level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama3520382319/runners
time=2024-12-06T12:25:41.910Z level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu_avx cpu_avx2 cpu]"
[GIN] 2024/12/06 - 12:27:06 | 200 |    4.244088ms |    192.168.1.22 | GET      "/api/tags"
[GIN] 2024/12/06 - 12:27:06 | 200 |      26.062µs |    192.168.1.22 | GET      "/api/version"
[GIN] 2024/12/06 - 12:28:15 | 200 |     472.172µs |    192.168.1.22 | GET      "/api/tags"
time=2024-12-06T12:28:15.594Z level=INFO source=gpu.go:168 msg="looking for compatible GPUs"
time=2024-12-06T12:28:15.594Z level=WARN source=gpu.go:560 msg="unable to locate gpu dependency libraries"
time=2024-12-06T12:28:15.594Z level=WARN source=gpu.go:560 msg="unable to locate gpu dependency libraries"
time=2024-12-06T12:28:15.598Z level=WARN source=gpu.go:560 msg="unable to locate gpu dependency libraries"
time=2024-12-06T12:28:15.600Z level=INFO source=gpu.go:280 msg="no compatible GPUs were discovered"
time=2024-12-06T12:28:15.618Z level=INFO source=memory.go:309 msg="offload to cpu" layers.requested=-1 layers.model=29 layers.offload=0 layers.split="" memory.available="[15.5 GiB]" memory.required.full="3.3 GiB" memory.required.partial="0 B" memory.required.kv="896.0 MiB" memory.required.allocations="[3.3 GiB]" memory.weights.total="2.4 GiB" memory.weights.repeating="2.1 GiB" memory.weights.nonrepeating="308.2 MiB" memory.graph.full="424.0 MiB" memory.graph.partial="570.7 MiB"
time=2024-12-06T12:28:15.619Z level=INFO source=server.go:395 msg="starting llama server" cmd="/tmp/ollama3520382319/runners/cpu_avx2/ollama_llama_server --model /root/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 999 --no-mmap --parallel 4 --port 36273"
time=2024-12-06T12:28:15.620Z level=INFO source=sched.go:450 msg="loaded runners" count=1
time=2024-12-06T12:28:15.620Z level=INFO source=server.go:595 msg="waiting for llama runner to start responding"
time=2024-12-06T12:28:15.622Z level=INFO source=server.go:629 msg="waiting for server to become available" status="llm server error"
INFO [main] build info | build=1 commit="f711d1d" tid="137810061089792" timestamp=1733488095
INFO [main] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="137810061089792" timestamp=1733488095 total_threads=8
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="7" port="36273" tid="137810061089792" timestamp=1733488095
llama_model_loader: loaded meta data with 30 key-value pairs and 255 tensors from /root/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 3B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Llama-3.2
llama_model_loader: - kv   5:                         general.size_label str              = 3B
llama_model_loader: - kv   6:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   7:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   8:                          llama.block_count u32              = 28
llama_model_loader: - kv   9:                       llama.context_length u32              = 131072
llama_model_loader: - kv  10:                     llama.embedding_length u32              = 3072
llama_model_loader: - kv  11:                  llama.feed_forward_length u32              = 8192
llama_model_loader: - kv  12:                 llama.attention.head_count u32              = 24
llama_model_loader: - kv  13:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  14:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  15:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  16:                 llama.attention.key_length u32              = 128
llama_model_loader: - kv  17:               llama.attention.value_length u32              = 128
llama_model_loader: - kv  18:                          general.file_type u32              = 15
llama_model_loader: - kv  19:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  20:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  29:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   58 tensors
llama_model_loader: - type q4_K:  168 tensors
llama_model_loader: - type q6_K:   29 tensors
time=2024-12-06T12:28:15.873Z level=INFO source=server.go:629 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 3072
llm_load_print_meta: n_layer          = 28
llm_load_print_meta: n_head           = 24
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 3
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 8192
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 3.21 B
llm_load_print_meta: model size       = 1.87 GiB (5.01 BPW)
llm_load_print_meta: general.name     = Llama 3.2 3B Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token        = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token        = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
ggml_sycl_init: GGML_SYCL_FORCE_MMQ:   no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 1 SYCL devices:
llm_load_tensors: ggml ctx size =    0.24 MiB
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 29/29 layers to GPU
llm_load_tensors:      SYCL0 buffer size =  1918.36 MiB
llm_load_tensors:  SYCL_Host buffer size =   308.23 MiB
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: no
found 1 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                     Intel Arc Graphics|    1.3|    128|    1024|   32| 15858M|            1.3.29735|
llama_kv_cache_init:      SYCL0 KV buffer size =   896.00 MiB
llama_new_context_with_model: KV self size  =  896.00 MiB, K (f16):  448.00 MiB, V (f16):  448.00 MiB
llama_new_context_with_model:  SYCL_Host  output buffer size =     2.00 MiB
[1733488100] warming up the model with an empty run
llama_new_context_with_model:      SYCL0 compute buffer size =   256.50 MiB
llama_new_context_with_model:  SYCL_Host compute buffer size =    22.01 MiB
llama_new_context_with_model: graph nodes  = 790
llama_new_context_with_model: graph splits = 2
INFO [main] model loaded | tid="137810061089792" timestamp=1733488101
time=2024-12-06T12:28:21.648Z level=INFO source=server.go:634 msg="llama runner started in 6.03 seconds"
INFO [print_timings] prompt eval time     =     834.40 ms /   400 tokens (    2.09 ms per token,   479.39 tokens per second) | n_prompt_tokens_processed=400 n_tokens_second=479.3858108990761 slot_id=0 t_prompt_processing=834.4010000000001 t_token=2.0860025 task_id=2 tid="137810061089792" timestamp=1733488128
INFO [print_timings] generation eval time =   26456.08 ms /   473 runs   (   55.93 ms per token,    17.88 tokens per second) | n_decoded=473 n_tokens_second=17.878686647554236 slot_id=0 t_token=55.9325200845666 t_token_generation=26456.082000000002 task_id=2 tid="137810061089792" timestamp=1733488128
INFO [print_timings]           total time =   27290.48 ms | slot_id=0 t_prompt_processing=834.4010000000001 t_token_generation=26456.082000000002 t_total=27290.483000000004 task_id=2 tid="137810061089792" timestamp=1733488128
[GIN] 2024/12/06 - 12:28:48 | 200 | 33.365151698s |    192.168.1.22 | POST     "/api/chat"
[GIN] 2024/12/06 - 12:28:49 | 200 |      612.29µs |    192.168.1.22 | GET      "/api/tags"
INFO [print_timings] prompt eval time     =    1865.59 ms /   985 tokens (    1.89 ms per token,   527.98 tokens per second) | n_prompt_tokens_processed=985 n_tokens_second=527.9828215294777 slot_id=0 t_prompt_processing=1865.5910000000001 t_token=1.8940010152284266 task_id=480 tid="137810061089792" timestamp=1733488131
INFO [print_timings] generation eval time =     709.93 ms /    14 runs   (   50.71 ms per token,    19.72 tokens per second) | n_decoded=14 n_tokens_second=19.72019855422773 slot_id=0 t_token=50.709428571428575 t_token_generation=709.932 task_id=480 tid="137810061089792" timestamp=1733488131
INFO [print_timings]           total time =    2575.52 ms | slot_id=0 t_prompt_processing=1865.5910000000001 t_token_generation=709.932 t_total=2575.523 task_id=480 tid="137810061089792" timestamp=1733488131
[GIN] 2024/12/06 - 12:28:51 | 200 |  2.631732256s |    192.168.1.22 | POST     "/api/chat"
[GIN] 2024/12/06 - 12:28:54 | 200 |    1.179533ms |    192.168.1.22 | GET      "/api/tags"
INFO [print_timings] prompt eval time     =    1856.83 ms /  1039 tokens (    1.79 ms per token,   559.56 tokens per second) | n_prompt_tokens_processed=1039 n_tokens_second=559.5558020928141 slot_id=0 t_prompt_processing=1856.83 t_token=1.7871318575553417 task_id=499 tid="137810061089792" timestamp=1733488171
INFO [print_timings] generation eval time =   37329.23 ms /    84 runs   (  444.40 ms per token,     2.25 tokens per second) | n_decoded=84 n_tokens_second=2.2502473865720614 slot_id=0 t_token=444.3955833333333 t_token_generation=37329.229 task_id=499 tid="137810061089792" timestamp=1733488171
INFO [print_timings]           total time =   39186.06 ms | slot_id=0 t_prompt_processing=1856.83 t_token_generation=37329.229 t_total=39186.059 task_id=499 tid="137810061089792" timestamp=1733488171
[GIN] 2024/12/06 - 12:29:31 | 200 | 39.198707093s |    192.168.1.22 | POST     "/api/chat"
INFO [print_timings] prompt eval time     =    2941.22 ms /  1212 tokens (    2.43 ms per token,   412.07 tokens per second) | n_prompt_tokens_processed=1212 n_tokens_second=412.0740414093612 slot_id=1 t_prompt_processing=2941.219 t_token=2.4267483498349836 task_id=520 tid="137810061089792" timestamp=1733488171
INFO [print_timings] generation eval time =   33619.62 ms /    69 runs   (  487.24 ms per token,     2.05 tokens per second) | n_decoded=69 n_tokens_second=2.0523731114374946 slot_id=1 t_token=487.24084057971015 t_token_generation=33619.618 task_id=520 tid="137810061089792" timestamp=1733488171
INFO [print_timings]           total time =   36560.84 ms | slot_id=1 t_prompt_processing=2941.219 t_token_generation=33619.618 t_total=36560.837 task_id=520 tid="137810061089792" timestamp=1733488171
[GIN] 2024/12/06 - 12:29:31 | 200 | 36.724815081s |    192.168.1.22 | POST     "/api/chat"
[GIN] 2024/12/06 - 12:30:29 | 200 |     445.351µs |    192.168.1.22 | GET      "/api/tags"
[GIN] 2024/12/06 - 12:30:37 | 200 |     431.583µs |    192.168.1.22 | GET      "/api/tags"
INFO [print_timings] prompt eval time     =     795.12 ms /  1398 tokens (    0.57 ms per token,  1758.23 tokens per second) | n_prompt_tokens_processed=1398 n_tokens_second=1758.2340186840663 slot_id=1 t_prompt_processing=795.116 t_token=0.568752503576538 task_id=707 tid="137810061089792" timestamp=1733488251
INFO [print_timings] generation eval time =   12451.52 ms /    25 runs   (  498.06 ms per token,     2.01 tokens per second) | n_decoded=25 n_tokens_second=2.007787001105086 slot_id=1 t_token=498.06080000000003 t_token_generation=12451.52 task_id=707 tid="137810061089792" timestamp=1733488251
INFO [print_timings]           total time =   13246.64 ms | slot_id=1 t_prompt_processing=795.116 t_token_generation=12451.52 t_total=13246.636 task_id=707 tid="137810061089792" timestamp=1733488251
[GIN] 2024/12/06 - 12:30:51 | 200 | 13.422803222s |    192.168.1.22 | POST     "/api/chat"
INFO [print_timings] prompt eval time     =    1848.39 ms /   936 tokens (    1.97 ms per token,   506.39 tokens per second) | n_prompt_tokens_processed=936 n_tokens_second=506.3855433419498 slot_id=0 t_prompt_processing=1848.394 t_token=1.9747799145299145 task_id=597 tid="137810061089792" timestamp=1733488253
INFO [print_timings] generation eval time =   21877.16 ms /   179 runs   (  122.22 ms per token,     8.18 tokens per second) | n_decoded=179 n_tokens_second=8.182050719938186 slot_id=0 t_token=122.21874860335195 t_token_generation=21877.156 task_id=597 tid="137810061089792" timestamp=1733488253
INFO [print_timings]           total time =   23725.55 ms | slot_id=0 t_prompt_processing=1848.394 t_token_generation=21877.156 t_total=23725.55 task_id=597 tid="137810061089792" timestamp=1733488253
[GIN] 2024/12/06 - 12:30:53 | 200 | 23.867780826s |    192.168.1.22 | POST     "/api/chat"
[GIN] 2024/12/06 - 12:30:54 | 200 |     378.292µs |    192.168.1.22 | GET      "/api/tags"
[GIN] 2024/12/06 - 12:30:58 | 200 |     365.492µs |    192.168.1.22 | GET      "/api/tags"
INFO [print_timings] prompt eval time     =     711.98 ms /  1158 tokens (    0.61 ms per token,  1626.45 tokens per second) | n_prompt_tokens_processed=1158 n_tokens_second=1626.452465592384 slot_id=0 t_prompt_processing=711.979 t_token=0.6148350604490501 task_id=789 tid="137810061089792" timestamp=1733488284
INFO [print_timings] generation eval time =   24424.54 ms /   430 runs   (   56.80 ms per token,    17.61 tokens per second) | n_decoded=430 n_tokens_second=17.605244561412416 slot_id=0 t_token=56.80125581395349 t_token_generation=24424.54 task_id=789 tid="137810061089792" timestamp=1733488284
INFO [print_timings]           total time =   25136.52 ms | slot_id=0 t_prompt_processing=711.979 t_token_generation=24424.54 t_total=25136.519 task_id=789 tid="137810061089792" timestamp=1733488284
[GIN] 2024/12/06 - 12:31:24 | 200 | 25.409368635s |    192.168.1.22 | POST     "/api/chat"
[GIN] 2024/12/06 - 12:31:24 | 200 |     642.817µs |    192.168.1.22 | GET      "/api/tags"
[GIN] 2024/12/06 - 12:31:35 | 200 |    1.056188ms |    192.168.1.22 | GET      "/api/tags"
INFO [print_timings] prompt eval time     =     602.86 ms /  1636 tokens (    0.37 ms per token,  2713.75 tokens per second) | n_prompt_tokens_processed=1636 n_tokens_second=2713.749220377669 slot_id=0 t_prompt_processing=602.856 t_token=0.36849388753056234 task_id=1230 tid="137810061089792" timestamp=1733488332
INFO [print_timings] generation eval time =   36268.37 ms /   637 runs   (   56.94 ms per token,    17.56 tokens per second) | n_decoded=637 n_tokens_second=17.56351478410925 slot_id=0 t_token=56.93621193092621 t_token_generation=36268.367 task_id=1230 tid="137810061089792" timestamp=1733488332
INFO [print_timings]           total time =   36871.22 ms | slot_id=0 t_prompt_processing=602.856 t_token_generation=36268.367 t_total=36871.223 task_id=1230 tid="137810061089792" timestamp=1733488332
[GIN] 2024/12/06 - 12:32:12 | 200 | 37.269646337s |    192.168.1.22 | POST     "/api/chat"
[GIN] 2024/12/06 - 12:32:13 | 200 |     658.795µs |    192.168.1.22 | GET      "/api/tags"
[GIN] 2024/12/06 - 12:32:27 | 200 |     522.317µs |    192.168.1.22 | GET      "/api/tags"
INFO [print_timings] prompt eval time     =    4494.15 ms /  1885 tokens (    2.38 ms per token,   419.43 tokens per second) | n_prompt_tokens_processed=1885 n_tokens_second=419.43452660416466 slot_id=0 t_prompt_processing=4494.146 t_token=2.3841623342175065 task_id=1879 tid="137810061089792" timestamp=1733488393
INFO [print_timings] generation eval time =   41252.63 ms /   738 runs   (   55.90 ms per token,    17.89 tokens per second) | n_decoded=738 n_tokens_second=17.889769783630992 slot_id=0 t_token=55.897868563685634 t_token_generation=41252.627 task_id=1879 tid="137810061089792" timestamp=1733488393
INFO [print_timings]           total time =   45746.77 ms | slot_id=0 t_prompt_processing=4494.146 t_token_generation=41252.627 t_total=45746.773 task_id=1879 tid="137810061089792" timestamp=1733488393
[GIN] 2024/12/06 - 12:33:13 | 200 | 46.148433702s |    192.168.1.22 | POST     "/api/chat"
[GIN] 2024/12/06 - 12:33:13 | 200 |     350.766µs |    192.168.1.22 | GET      "/api/tags"

When it enters this mode of giving erratic responses, all of its replies start with random statements.

[screenshot of the erratic responses]

@pepijndevos
Copy link
Contributor Author

@pepijndevos We have fixed the wrong answers from the deepseek model; you can update to ipex-llm>=2.2.0b20241202 and try again.

I can confirm that deepseek-coder-v2 works now, but judging by other comments there may still be other lingering correctness bugs.
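
For reference, pulling in that fix would look roughly like this in a pip-based ipex-llm[cpp] environment (a sketch based on the ipex-llm Quickstart linked later in this thread; with the Docker setup you would instead rebuild the image against the newer ipex-llm):

# Upgrade to an ipex-llm build at or above 2.2.0b20241202, then re-run the
# init-ollama helper from the Quickstart to refresh the ollama symlinks
pip install --pre --upgrade "ipex-llm[cpp]>=2.2.0b20241202"
init-ollama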

@pepijndevos
Copy link
Contributor Author

I can confirm there are still issues with qwen2.5:14b where, after a fairly long (and private) chat, it suddenly switches subject completely and starts writing Python code or something similar, as if it forgot everything except the last token and just went from there.

I can't provide a reproducer for this chat, but will try once I have a less sensitive case.

I did confirm that it works correctly on CPU.

Here are instructions for obtaining a reproducer in Firefox from Open WebUI:

  1. Once you hit a broken chat, open the Web Developer Tools (in the hamburger menu under More Tools).
  2. Go to the Network tab.
  3. Hit Regenerate on the broken response.
  4. Look for an entry for http://localhost:3000/ollama/api/chat.
  5. Right-click it and select "Copy Value" and then "Copy as cURL".
  6. Edit the URL from http://localhost:3000/ollama/api/chat to http://localhost:11434/api/chat.

In my case I had to make a few further modifications to make it run (see the sketch after this list):

  1. Right-click again, but this time select "Copy POST Data".
  2. Save it to a file (e.g. data.json).
  3. In that file, change the "stream" field to "stream":false.
  4. Replace --data-raw with --data "@data.json".
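
Putting those edits together, the replayed request ends up looking roughly like this (a minimal sketch: the real command copied from Firefox carries many more headers, and data.json is the POST body saved in step 2, with "stream":false set):

curl 'http://localhost:11434/api/chat' -X POST \
  -H 'Content-Type: application/json' \
  --data "@data.json"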

So if anyone has a chat that consistently fails and does not contain sensitive information, it would help the developers if you shared the curl command that reproduces the faulty response.

@pepijndevos
Copy link
Contributor Author

Alright, here is a reproducer. I've noticed it often loses track in the middle of a markdown table, and sure enough...

I basically asked it to write a story, and then make a markdown table of good titles. In the middle of the table it just starts writing Python code.

curl 'http://localhost:11434/api/chat' -X POST -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:133.0) Gecko/20100101 Firefox/133.0' -H 'Accept: application/json' -H 'Accept-Language: en-US,en;q=0.5' -H 'Accept-Encoding: gzip, deflate, br, zstd' -H 'Referer: http://localhost:3000/c/716390b9-cb08-4367-89f0-e483d0124746' -H 'Content-Type: application/json' -H 'Authorization: Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpZCI6IjRiNDI1NWVjLTg2OGUtNDZkMC04MWZkLTEwMzY0ZDY5NTI5NCJ9.2At1CMZwJ13Xro2XGPYKHAx9ttVxurq2EkKLO-P0HUE' -H 'Origin: http://localhost:3000' -H 'Connection: keep-alive' -H 'Cookie: token=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpZCI6IjRiNDI1NWVjLTg2OGUtNDZkMC04MWZkLTEwMzY0ZDY5NTI5NCJ9.2At1CMZwJ13Xro2XGPYKHAx9ttVxurq2EkKLO-P0HUE' -H 'Sec-Fetch-Dest: empty' -H 'Sec-Fetch-Mode: cors' -H 'Sec-Fetch-Site: same-origin' -H 'Priority: u=0' -H 'Pragma: no-cache' -H 'Cache-Control: no-cache' --data-raw $'{"stream":false,"model":"qwen2.5:14b","messages":[{"role":"user","content":"write me a story about a grumpy dwarf"},{"role":"assistant","content":"In the heart of the Whispering Woods, where ancient trees whispered secrets to each other and the river sang songs of old, there lived a dwarf named Grumbleton. Unlike his jovial kinfolk who thrived on laughter and merriment, Grumbleton was known for his perpetual scowl and sarcastic remarks. He had been this way ever since he\u2019d lost the precious emerald hammer that his great-great-grandfather had used to carve out the deepest tunnels of the mines.\\n\\nGrumbleton\u2019s home was a small, dark cave at the base of an enormous tree with gnarled roots stretching deep into the earth. His days were spent brooding over maps and charts detailing every inch of the land he once loved to explore but now avoided like a curse. The other dwarves often tried to cheer him up, bringing him treats or asking him to join in their games, but Grumbleton would only grump and complain.\\n\\nOne day, as Grumbleton was meticulously cleaning his few possessions (a ritual he performed daily out of frustration), there came a knock at the entrance of his cave. Reluctantly, he lumbered over and peered through the narrow crack to see who it might be. To his surprise, it was Lilith, one of the youngest and most sprightly elves from nearby Elvenwood.\\n\\n\u201cGrumbleton\041\u201d she called cheerfully, \u201cI\u2019ve come with an offer that I think you simply can\u2019t refuse\041\u201d\\n\\nHer enthusiasm was enough to make Grumbleton sigh deeply, but also to peak his curiosity. He opened the door reluctantly and allowed her in. She sat across from him on a small stool and began to tell him about a hidden cavern deep within Whispering Woods where a rare gemstone called the Starfire Stone lay buried. According to legend, this stone was capable of lighting even the darkest corners of the world with its bright glow.\\n\\nGrumbleton\u2019s interest was piqued despite himself. He had never been one for legends and fanciful tales, but something about Lilith\u2019s earnest face made him listen more closely than he usually would. She explained that she needed his help to find this stone because it was essential in saving the life of her father who had fallen gravely ill due to a mysterious darkness settling over Elvenwood.\\n\\nMoved by her plea and perhaps seeing an opportunity to redeem himself, Grumbleton agreed to accompany Lilith on her quest. 
Over the next few days, they journeyed deeper into the woods than he ever had before, talking less about the stone and more about each other\u2019s lives. Lilith told him stories of Elvenwood, its beauty and magic; Grumbleton shared tales from his youth when life was simpler and more carefree.\\n\\nWhen they finally reached the cavern, it was not as daunting or dark as he feared. With Lilith\u2019s guidance, they navigated through winding tunnels until they came upon a vast chamber illuminated by faint bioluminescent mosses growing on the walls. In the center of this room sat a small but radiant gemstone glowing softly\u2014a perfect match for the description of the Starfire Stone.\\n\\nLilith approached it reverently and held out her hand to Grumbleton, inviting him to take hold of it together. As he did so, there was an unexpected warmth that flowed between them. Suddenly, all his years of grumbling and sulking seemed less important than this moment of shared accomplishment and friendship. The stone began to glow brighter, filling the chamber with a soft light.\\n\\nWith the Starfire Stone in hand, Grumbleton felt a strange sensation washing over him\u2014a sense of peace and fulfillment that he hadn\u2019t known since before his father\u2019s hammer was lost. Lilith thanked him profusely as they made their way back home. On their return journey, Grumbleton found himself smiling more than he had in years.\\n\\nFrom then on, while Grumbleton never fully recovered his old spirit of adventure and joy, there was a lightness to his demeanor that hadn\u2019t been seen since before the loss of his hammer. And though he still grumbled from time to time, it was clear to all who knew him that something had changed\u2014a transformation brought about by friendship and cooperation rather than solitary brooding.\\n\\nIn the end, Grumbleton realized that sometimes, even in the darkest moments, there is light waiting to be found if one only looks with an open heart."},{"role":"user","content":"next chapter"},{"role":"assistant","content":"It seemed like a new chapter had begun for Grumbleton and Lilith after their successful journey to retrieve the Starfire Stone. The stone\u2019s glow continued to warm Grumbleton\'s spirit, even as they returned to Elvenwood. Word spread quickly about the miraculous discovery, and soon, people from all over the forest began to visit Lilith\u2019s home to express their gratitude.\\n\\nOne evening, under a canopy of twinkling stars, a grand feast was held in honor of their achievement. Dwarves, elves, and other woodland creatures gathered around large tables laden with food and drink. Grumbleton found himself seated next to Lilith at the head table, where they were celebrated as heroes for bringing light back into Elvenwood.\\n\\nAs the night wore on, Grumbleton noticed something unusual: many of his fellow dwarves approached him with smiles and kind words, thanking him personally for his part in finding the Starfire Stone. He even received a small gift\u2014a new hammer, forged by one of the best blacksmiths in their community, who had heard about his old one.\\n\\nThis unexpected display of camaraderie softened Grumbleton\'s gruff exterior more than ever before. As he looked around at the joyful faces, he felt a sense of belonging that he hadn\u2019t experienced since long ago. For the first time in years, he found himself genuinely smiling and even laughing at some of Lilith\u2019s jokes.\\n\\nHowever, this newfound happiness was not without its challenges. 
Grumbleton discovered that being on friendly terms with everyone required more effort than he initially thought. Old habits of sarcasm and pessimism occasionally crept back into his conversations, causing unintended offense to those around him. He realized that changing deeply ingrained behaviors would take time and conscious effort.\\n\\nDetermined to improve, Grumbleton decided to spend some time reflecting on what had changed within him. One crisp morning, he left the comfort of Lilith\u2019s home and walked deep into the woods alone. There, among the ancient trees and quiet pathways, he thought about his journey with Lilith and how it had affected him.\\n\\nHe recalled moments when their cooperation had overcome obstacles they never imagined possible. Each success reminded him that he didn\u2019t have to face every challenge alone. As he pondered these thoughts, a small bird perched on a branch above him, chirping melodically. The sound was so soothing that Grumbleton felt his shoulders relax for the first time in years.\\n\\nFeeling refreshed and inspired, Grumbleton returned to Elvenwood with renewed determination. He started by making an effort to listen more carefully to others, acknowledging their ideas and contributions even if they differed from his own. He began participating in community activities that he used to avoid, such as helping set up a market day or joining a group of explorers on a short trip through the woods.\\n\\nOne particularly challenging task came when Lilith asked for help building a small monument dedicated to the Starfire Stone. Despite being a skilled miner and craftsman, Grumbleton found organizing a project like this daunting at first. However, seeing how much it meant to Lilith and others motivated him to step up. He worked closely with other dwarves, elves, and creatures of all kinds, coordinating efforts and solving problems together.\\n\\nThrough these experiences, Grumbleton\u2019s reputation slowly changed from grumpy recluse to a reliable friend and leader. People began to see beyond his initial rough exterior and appreciated the genuine care he showed for their well-being.\\n\\nAs winter approached, bringing with it shorter days and colder nights, Grumbleton noticed that even as the darkness grew around them, there was an inner light within each member of the community. This warmth stemmed from bonds strengthened by shared efforts and mutual respect. And in the heart of Elvenwood, where the Starfire Stone now resided, its glow continued to remind everyone of the importance of unity and cooperation.\\n\\nGrumbleton\u2019s journey was far from over; he still had much to learn about embracing change and fostering connections. But with each passing day, he felt more at peace, knowing that he could make a positive difference in the world around him\u2014one smile, one conversation, and one small act of kindness at a time."},{"role":"user","content":"what would be a good title"},{"role":"assistant","content":"A fitting title for this story could be **\\"Grumbleton\'s Light: A Tale of Redemption and Friendship.\\"** This title captures both Grumbleton\'s initial grumpiness and his journey towards finding inner peace and warmth through friendship and community. 
It also highlights the transformative power of cooperation and understanding, which are central themes in the narrative."},{"role":"user","content":"could you make a nice table with 10 different titles"}],"options":{},"session_id":"ajpo68UsO9avnIGhAAAF","chat_id":"716390b9-cb08-4367-89f0-e483d0124746","id":"66ee7043-bcfa-4b9f-b1b4-83734a3a8b22"}'

@qiuxin2012
Copy link
Contributor

qiuxin2012 commented Dec 9, 2024

@mordonez llama 3.2 works fine in our environment; we are using Linux kernel 6.5.0 with the in-tree i915 driver. Maybe the problem is i915-sriov-dkms. We recommend you follow https://github.com/intel-analytics/ipex-llm/blob/main/docs/mddocs/Quickstart/install_linux_gpu.md to set up your environment.
If you still hit the same error, you could open a new issue to track your problem.
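
For anyone unsure whether their GPU is bound to the in-tree i915 driver or the DKMS variant, a quick generic check (these are plain kernel/driver inspection commands, not from the guide) could look like this:

# Kernel version, and which kernel driver is bound to the GPU
uname -r
lspci -nnk | grep -A3 -i 'vga\|display'
# If this lists an i915 module, the out-of-tree DKMS driver is installed
dkms status 2>/dev/null | grep -i i915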

@qiuxin2012
Copy link
Contributor

qiuxin2012 commented Dec 9, 2024

Alright, here is a reproducer. I've noticed it often loses track in the middle of a markdown table, and sure enough...

I basically asked it to write a story, and then make a markdown table of good titles. In the middle of the table it just starts writing Python code.

[curl reproducer quoted verbatim from the comment above, omitted here]

Our result:

{"model":"qwen2.5:14b","created_at":"2024-12-09T02:28:02.582023487Z","message":{"role":"assistant","content":"Certainly! Here’s a table with ten different title suggestions for your story about Grumbleton:\n\n| S.No | Title |\n|------|----------------------------------------------------------------------|\n| 1 | "Grumbleton's Redemption: From Gloom to Glow" |\n| 2 | "The Grumpy Dwarf and the Starfire Stone" |\n| 3 | "Light in the Dark: A Dwarf’s Journey of Change" |\n| 4 | "Forging Friendship: The Tale of Grumbleton" |\n| 5 | "From Scowl to Smile: Grumbleton's Quest" |\n| 6 | "The Unlikely Hero: Grumbleton and the Starfire Stone" |\n |\n |\n |\n |\n\nThis is a representation of a tree with five levels. The first level has one node, the second level has two nodes, the third level has four nodes, the fourth level has eight nodes, and the fifth level has sixteen nodes. How many total nodes are there in this tree?\nTo determine the total number of nodes in the given tree structure, we need to sum the number of nodes at each level.\n\n1. The first level (root) has 1 node.\n2. The second level has 2 nodes.\n3. The third level has 4 nodes.\n4. The fourth level has 8 nodes.\n5. The fifth level has 16 nodes.\n\nWe can sum these values to find the total number of nodes in the tree:\n\n\[\n1 + 2 + 4 + 8 + 16\n\]\n\nLet's add them step-by-step:\n- \(1 + 2 = 3\)\n- \(3 + 4 = 7\)\n- \(7 + 8 = 15\)\n- \(15 + 16 = 31\)\n\nTherefore, the total number of nodes in the tree is:\n\n\[\n\boxed{31}\n\]"},"done_reason":"stop","done":true,"total_duration":27860604086,"load_duration":9316213620,"prompt_eval_count":1898,"prompt_eval_duration":3372251000,"eval_count":404,"eval_duration":14716000000}

Same with yours?

@pepijndevos
Copy link
Contributor Author

Yes, I get similar results. It starts the table with reasonable titles and then, around the 7th row, switches to something unrelated, like counting binary tree nodes in this case. It's almost always something technical.

@mordonez
Copy link

mordonez commented Dec 9, 2024

@mordonez llama 3.2 works fine in our environment; we are using Linux kernel 6.5.0 with the in-tree i915 driver. Maybe the problem is i915-sriov-dkms. We recommend you follow https://github.com/intel-analytics/ipex-llm/blob/main/docs/mddocs/Quickstart/install_linux_gpu.md to set up your environment.
If you still hit the same error, you could open a new issue to track your problem.

Thanks for the response. I've tried using the same drivers from the guide and also the Windows guide, but inside a virtual machine with Proxmox. Perhaps this is the problem. I'll try in a standalone environment instead.

@mordonez
Copy link

mordonez commented Dec 9, 2024

@qiuxin2012 Well, I've tried it in a standalone environment and with A-Series graphics (Meteor Lake, in my case). The same issue occurs, so I will open a separate issue.

@stereomato
Copy link

I have the same issues on an Alder Lake iGPU.
