Several GPU models behave erratically compared to CPU execution #12374
Comments
Hi @pepijndevos, we have reproduced your issue and are working on finding a solution. We will inform you ASAP.
I ran into similar but less obvious problems where
I was able to reproduce the issue. I have a burning suspicion that this has to do with the way memory is being shared. I am running an Arc A750 with the iGPU disabled. Since the card only has 8GB of GDDR6, I can realistically only load one 8b parameter model reliably. When loading multiple models (where total memory >8GB) I see similar behavior. My speculation is that something is going wrong when accessing models that share GPU and system memory.
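One quick way to sanity-check that theory is `ollama ps`, which reports whether a loaded model sits entirely in GPU memory or is split between CPU and GPU (the exact column layout may differ between Ollama versions); a minimal sketch:

```bash
# List the currently loaded models and where their weights live.
# The PROCESSOR column shows e.g. "100% GPU", or a split such as
# "38%/62% CPU/GPU" when part of the model spills into system RAM.
ollama ps
```

If a misbehaving model shows a CPU/GPU split while a well-behaved one shows 100% GPU, that would support the shared-memory theory.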
We may have fixed this two weeks ago; could you update your ipex-llm and try again?
I don't have an easily reproducible testcase for the repeating patterns, but the original issue persists:
This is after a docker pull, and running
Here are a couple of examples of wrong output I was able to produce. I ran these queries one after the other.
Access ollama instance: OK
Output #1
OK
Output #2
Issue starts here
Here's my ollama log:
@pepijndevos The endless answer of
@rynprrk To be short, you can put
I believe this may have something to do with the prompt having reached the
Take llama.cpp as an example:
It will repeat words at the end:
It gets it correct. If there is a more reasonable approach, I will update here.
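For anyone who wants to try the same workaround through Ollama rather than llama.cpp directly, here is a minimal sketch, assuming the limit being hit is Ollama's default context window (num_ctx) and that the server is on its default port; the model name, prompt, and 8192 value are placeholders:

```bash
# Sketch: raise the context window for a single request via the Ollama API.
# "num_ctx" is passed per request through the "options" field; the default
# is fairly small, so long chats can silently truncate the prompt.
curl http://localhost:11434/api/generate -d '{
  "model": "qwen2.5:14b",
  "prompt": "Summarize the following long document ...",
  "stream": false,
  "options": { "num_ctx": 8192 }
}'
```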
@pepijndevos We have fixed the wrong answers from the deepseek model; you can update to ipex-llm>=2.2.0b20241202 and try again.
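For reference, a sketch of the upgrade step, assuming a pip-based install along the lines of the ipex-llm Ollama quickstart (Docker users would instead pull a newer image):

```bash
# Upgrade ipex-llm to a build that includes the fix; --pre is needed
# because 2.2.0b20241202 is a pre-release version. The [cpp] extra is
# what the llama.cpp/Ollama quickstart uses; adjust it if your install
# uses a different extra.
pip install --pre --upgrade "ipex-llm[cpp]>=2.2.0b20241202"
```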
When using Ollama with llama models, it is very easy to get erratic responses on the GPU, while the CPU works fine. Responses like
Can you give us your GPU info, system info, and model name? Then we can try to reproduce your case.
GEEKOM GT1: Intel® Core™ Ultra 7 155H with Intel® Arc™ Graphics, Linux i915 driver (dkms module) with SR-IOV: https://github.com/strongtz/i915-sriov-dkms
When it enters a mode of giving erratic responses, all its replies start with random statements.
I can confirm that deepseek-coder-v2 works now, but judging by other comments there may still be other lingering correctness bugs.
I can confirm there are still issues with qwen2.5:14b: after a fairly long and private chat it suddenly switches subject completely and starts writing Python code or whatever, as if it forgot everything except the last token and just went from there. I can't provide a reproducer for this chat, but will try once I have a less sensitive case. I did confirm that it works correctly on CPU. Here are instructions for obtaining a reproducer in Firefox from Open WebUI:
In my case I had to make two further modifications to make it run
So if anyone has a chat that consistently fails and does not contain sensitive information, it would be helpful for the developers if they shared the curl command that reproduces the faulty response.
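Something along these lines would do; the model name and messages below are placeholders rather than an actual failing chat:

```bash
# Sketch of a shareable reproducer against Ollama's chat endpoint.
# Replace the model and messages with the actual failing conversation;
# "stream": false keeps the whole (possibly derailed) answer in one JSON blob.
curl http://localhost:11434/api/chat -d '{
  "model": "qwen2.5:14b",
  "stream": false,
  "messages": [
    {"role": "user", "content": "Write a short story about a grumpy dwarf."},
    {"role": "assistant", "content": "<previous model reply goes here>"},
    {"role": "user", "content": "Now list ten possible titles in a markdown table."}
  ]
}'
```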
Alright, here is a reproducer. I've noticed it often loses track in the middle of a markdown table, and sure enough... I basically asked it to write a story, and then to make a markdown table of good titles. In the middle of the table it just starts writing Python code.
@mordonez llama 3.2 works fine in our environment; we are using Linux kernel 6.5.0 with the in-tree i915 driver. Maybe the problem is with i915-sriov-dkms. We recommend you follow https://github.com/intel-analytics/ipex-llm/blob/main/docs/mddocs/Quickstart/install_linux_gpu.md to set up your environment.
Our result:
{"model":"qwen2.5:14b","created_at":"2024-12-09T02:28:02.582023487Z","message":{"role":"assistant","content":"Certainly! Here’s a table with ten different title suggestions for your story about Grumbleton:\n\n| S.No | Title |\n|------|----------------------------------------------------------------------|\n| 1 | "Grumbleton's Redemption: From Gloom to Glow" |\n| 2 | "The Grumpy Dwarf and the Starfire Stone" |\n| 3 | "Light in the Dark: A Dwarf’s Journey of Change" |\n| 4 | "Forging Friendship: The Tale of Grumbleton" |\n| 5 | "From Scowl to Smile: Grumbleton's Quest" |\n| 6 | "The Unlikely Hero: Grumbleton and the Starfire Stone" |\n |\n |\n |\n |\n\nThis is a representation of a tree with five levels. The first level has one node, the second level has two nodes, the third level has four nodes, the fourth level has eight nodes, and the fifth level has sixteen nodes. How many total nodes are there in this tree?\nTo determine the total number of nodes in the given tree structure, we need to sum the number of nodes at each level.\n\n1. The first level (root) has 1 node.\n2. The second level has 2 nodes.\n3. The third level has 4 nodes.\n4. The fourth level has 8 nodes.\n5. The fifth level has 16 nodes.\n\nWe can sum these values to find the total number of nodes in the tree:\n\n\[\n1 + 2 + 4 + 8 + 16\n\]\n\nLet's add them step-by-step:\n- \(1 + 2 = 3\)\n- \(3 + 4 = 7\)\n- \(7 + 8 = 15\)\n- \(15 + 16 = 31\)\n\nTherefore, the total number of nodes in the tree is:\n\n\[\n\boxed{31}\n\]"},"done_reason":"stop","done":true,"total_duration":27860604086,"load_duration":9316213620,"prompt_eval_count":1898,"prompt_eval_duration":3372251000,"eval_count":404,"eval_duration":14716000000}
Same with yours?
Yes, I get similar results. It starts a table with reasonable titles and then around the 7th row switches to something unrelated, like counting binary trees in this case. It's almost always something technical.
Thanks for the response. I've tried using the same drivers from the guide and also the Windows guide, but inside a virtual machine with Proxmox. Perhaps this is the problem. I'll try in a standalone environment instead.
@qiuxin2012 Well, I’ve tried it in a standalone environment and with Series A Graphics (Meteor Lake, in my case). The same issue occurs, so I will open an issue, then.
I have the same issues on an Alder Lake iGPU.
Here is a trace from my Intel Arc A770 via Docker:
And here is a trace from Arch Linux running on the CPU:
For Docker I'm using https://github.com/mattcurf/ollama-intel-gpu due to #12372
ollama logs: